Abhishek Kumar
I'm a technologist, excited about research in machine learning, innovative software development, and entrepreneurship. I currently work at Google on the Google Social team. I recently graduated from UC San Diego (UCSD) with a Master's degree in Computer Science, specializing in machine learning.

While at UCSD, I worked with Prof. Charles Elkan on novel multilabel learning approaches (PCCs, beam search algorithms, neural network models). I also worked with Prof. Wendy Chapman at the Division of Biomedical Informatics (DBMI) on developing TextVect, a tool for processing clinical text documents. With the NLP group at DBMI, I worked on analyzing text classification methods for the database of Genotypes and Phenotypes (dbGaP) and on developing a recreational drug lexical taxonomy ontology by identifying new terms via crawling the Internet. For more information, see my resume and selected publications below.

During the summer of 2012, I worked with the Spam and Abuse team at Google in Mountain View, CA, developing algorithms and systems for detecting misuse of Google's services. Prior to 2012, I helped build parts of the upstream trading infrastructure at Morgan Stanley.
Selected Publications and Conference Proceedings
Feature Engineering for Classification of Clinical Text.
Abhishek Kumar
Technical report, UC San Diego pdf
Submitted in support of candidature for the Master of Science degree in Computer Science.
Assigning labels to clinical text documents is challenging because it requires sophisticated feature engineering, and practically important because of the wide adoption of electronic health record (EHR) systems. In this work, we introduce TextVect, a high-throughput, modularized tool that facilitates feature engineering of unstructured clinical text. Empirical evaluation of various feature representation choices on benchmark clinical text datasets suggests that the term-frequency (tf) and binary encoding methods are best suited for document-level classification. Reducing the number of features through feature selection is also helpful, and the BestFirst method outperforms other popular techniques. The most helpful features are those belonging to controlled vocabularies, the UMLS Metathesaurus in particular. For datasets with multiple labels, we empirically evaluate the performance of the state-of-the-art probabilistic classifier chains (PCC) method. Results indicate that PCC outperforms the binary relevance method, in which a separate classifier is trained for each label. We discuss these results and demonstrate that TextVect is a valuable tool for feature engineering when applied to the classification of unstructured clinical text.
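As a rough illustration of the tf vs. binary encoding choice the abstract compares (the function and names below are mine, not part of TextVect), a document can be mapped onto a fixed vocabulary either by raw term counts or by presence/absence:

```python
from collections import Counter

def vectorize(tokens, vocabulary, encoding="tf"):
    """Map a tokenized document onto a fixed vocabulary.

    encoding="tf" stores raw term frequencies; encoding="binary" records
    only presence/absence, discarding within-document repetition.
    """
    counts = Counter(tokens)
    if encoding == "tf":
        return [counts[term] for term in vocabulary]
    if encoding == "binary":
        return [1 if counts[term] else 0 for term in vocabulary]
    raise ValueError("unknown encoding: " + encoding)
```

For example, with vocabulary `["cough", "fever", "pain"]` the document "fever fever cough" becomes `[1, 2, 0]` under tf and `[1, 1, 0]` under binary encoding.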
Neural Network Models for Multilabel Learning.
Abhishek Kumar, Aditya Menon, Charles Elkan
Under Review
Multilabel learning is an extension of standard binary classification where the goal is to predict a set of labels (we call an individual label a tag) for each input example. The recent probabilistic classifier chain (PCC) method learns a series of probabilistic models that capture tag correlations. In this paper, we show how the PCC model may be viewed as a neural network with connections between output nodes. We then explore the benefits of using a shared hidden layer in the neural network, instead of connections between output nodes. This brings advantages that include tractable test-time inference and removing the need to select a fixed tag ordering.
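A minimal sketch of the architecture the abstract describes (a shared hidden layer with one sigmoid output per tag; class and parameter names are my own, and training is omitted):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SharedHiddenMultilabelNet:
    """Multilabel net with one shared hidden layer and a sigmoid output per tag.

    Tag correlations are captured through the shared hidden units rather than
    through output-to-output connections, so prediction is a single forward
    pass per example: no chaining and no fixed tag ordering.
    """

    def __init__(self, n_in, n_hidden, n_tags, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_tags)]
        self.b2 = [0.0] * n_tags

    def forward(self, x):
        # Shared hidden representation, then an independent sigmoid per tag.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        return [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
                for row, b in zip(self.W2, self.b2)]

    def predict(self, x, threshold=0.5):
        return [1 if p >= threshold else 0 for p in self.forward(x)]
```

Because the outputs are conditionally independent given the hidden layer, test-time inference is one forward pass rather than a search over tag sequences.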
Learning and Inference in Probabilistic Classifier Chains with Beam Search.
Abhishek Kumar*, Shankar Vembu*, Aditya Menon, Charles Elkan
In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 7523, Springer Berlin Heidelberg, pp 665-680. pdf Springer link
In this paper, we show how to use the classical technique of beam search for multilabel learning (MLL). A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time exponential in the number of tags. Also, PCC accuracy is sensitive to the ordering of the tags during training. In this work, we show how to use beam search to make inference tractable, and how to integrate beam search with training to determine a suitable tag ordering. Experimental results on a range of datasets show that the proposed improvements yield a state-of-the-art method for multilabel learning.
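As a rough sketch of the inference step (the function signature and names below are illustrative, not taken from the paper), beam search replaces the exhaustive enumeration of all 2^n tag vectors with a bounded frontier of partial chains:

```python
from math import log

def pcc_beam_search(cond_prob, n_tags, x, beam_width=3):
    """Approximate MAP inference for a probabilistic classifier chain.

    cond_prob(x, prefix, tag_index) -> P(y_tag = 1 | x, prefix), where
    prefix is the tuple of earlier tag values in the chain. Exhaustive
    inference costs O(2^n_tags); the beam keeps only the top `beam_width`
    partial chains at each step.
    """
    beam = [((), 0.0)]  # (partial tag tuple, log-probability)
    for i in range(n_tags):
        candidates = []
        for prefix, lp in beam:
            p1 = cond_prob(x, prefix, i)
            for value, p in ((1, p1), (0, 1.0 - p1)):
                if p > 0.0:
                    candidates.append((prefix + (value,), lp + log(p)))
        # Prune to the most probable partial chains.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # best complete tag vector and its log-probability
```

With beam width 1 this reduces to greedy chaining; widening the beam trades computation for a better approximation of the exact MAP prediction.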
Text Categorization of Heart, Lung and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing N-grams and Metadata Features.
Mindy K. Ross, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway
Biomedical Informatics Insights, 2013. pdf La Press
The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contributions in genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, meaningful use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques and feature representation to improve study retrieval in the context of the dbGaP database. This study demonstrated that a combination of n-gram features, metadata features, and χ2 feature selection applied to dbGaP studies increased classification accuracy and F-measure when compared to unigram-based feature representation. We demonstrated that PubMed studies can be used effectively as a surrogate identifier for related dbGaP studies.
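A small sketch of the χ2 feature selection step mentioned above (a stdlib-only illustration of the standard 2×2 chi-square statistic; the helper names are mine, and the paper's pipeline is not reproduced here):

```python
def chi2_score(feature_present, labels):
    """One-degree-of-freedom chi-square statistic for a binary feature
    against a binary class label, from the 2x2 contingency table."""
    n = len(labels)
    table = [[0, 0], [0, 0]]
    for f, y in zip(feature_present, labels):
        table[int(bool(f))][int(y)] += 1
    row = [sum(table[0]), sum(table[1])]
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    score = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row[i] * col[j] / n
            if expected:
                score += (table[i][j] - expected) ** 2 / expected
    return score

def top_k_features(X, labels, k):
    """Rank the feature columns of X (rows of binary indicators) by
    chi-square score and keep the k strongest."""
    n_feat = len(X[0])
    scores = [chi2_score([row[j] for row in X], labels) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: scores[j], reverse=True)[:k]
```

Features whose presence is independent of the class score near zero and are dropped, which is how the selection step prunes uninformative n-grams.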
Recreational Drug Slang: Identification of New Terms and Populating a Lexical Taxonomy Ontology.
Mindy K. Ross, Abhishek Kumar, Myoung Lah, Rick Calvo, Mike Conway
To be presented at AMIA Annual Symposium, 2013 (Poster Presentation). pdf AMIA 2013 Posters
In this work, we present a data-driven approach to identifying recreational drug slang. We crawled recreational drug-related Internet forums to build a text corpus and used statistical keyword analysis to identify drug terms. The novelty of this work lies in the application of corpus linguistic techniques to the recreational drug slang domain. We demonstrate the viability of using online sources to mine relevant content and corpus linguistics methodology as a means to discover drug slang terms. In the future, we plan to develop a semi-automated mechanism of discovering drug related slang terms (known and new terms) and structure our taxonomy to include further classes of common slang terms, such as compound drugs and drug paraphernalia.
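One common form of the statistical keyword analysis described above is log-likelihood keyness, which scores how over-represented a term is in a target corpus relative to a reference corpus. The sketch below is my own illustration of that general technique, not the paper's implementation:

```python
import math
from collections import Counter

def keyness_scores(target_tokens, reference_tokens):
    """Log-likelihood keyness: how much more frequent each target-corpus
    term (e.g. from drug-forum text) is than in a reference corpus.
    Higher scores flag candidate domain/slang terms."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = sum(t.values()), sum(r.values())
    scores = {}
    for term in t:
        a, b = t[term], r.get(term, 0)
        # Expected counts under the null hypothesis of equal relative frequency.
        e1 = nt * (a + b) / (nt + nr)
        e2 = nr * (a + b) / (nt + nr)
        ll = 2.0 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[term] = ll
    return scores
```

Terms that appear only in the forum corpus score far higher than function words shared with the reference corpus, surfacing candidate slang for manual review.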
TxtVect: A Tool for Extracting Features from Clinical Documents.
Abhishek Kumar, Wendy Chapman
Poster presentation at the American Medical Informatics Association (AMIA) Annual Symposium 2012. pdf Poster
We present TxtVect, a tool for extracting features from clinical documents. It allows for segmentation of documents into paragraphs, sentences, entities, or tokens; and extraction of lexical, syntactic, and semantic features for each of these segments. These features are useful for various machine-learning tasks such as text classification, assertion classification, and relation identification. TxtVect enables users to access these features without installation of the many necessary text processing and NLP tools.
ProVis: An Anaglyph based Visualization Tool for Protein Molecules
Rajesh Bhasin, Abhishek Kumar
First International Conference on Intelligent Human Computer Interaction (IHCI), 2009. pdf Springer link
Proteins are highly complex and flexible structures. Rendering tools are often used to study their dynamics, functions, and in some cases even deformations that arise due to slow dynamics. In this work, we present ProVis, a visualization tool for 3-dimensional rendering of protein molecules with the ability to create and display anaglyph images. The tool allows for viewing the protein molecules, locating atoms, viewing bonds, viewing the protein backbone, searching for specific bonds, visualizing them, and analyzing various scalar properties of the protein. The use of display lists allows for very fast rendering and real-time animation of complex molecules with negligible latency between scenes. The novelty of the tool is the interface, which combines the power of 3D visualization using anaglyphs with the geometry and graphics of the protein to provide a real-time interaction environment for large amounts of abstract data.
Immediate Mode Scheduling Methods for Open Online Heterogeneous Systems.
Abhishek Kumar, Navneet Chaubey, Sireesha Yakkali
Student Symposium Paper at the 16th Intl. Conf. on High Performance Computing (HiPC), 2009. pdf HiPC 2009
Grid infrastructures and grid based applications are becoming common approaches for solving large scale science and engineering problems. The efficient scheduling of independent computational jobs in a heterogeneous computing (HC) environment is an important problem in domains such as grid computing. In this work, we consider an online scheduling problem in immediate mode, where jobs arrive over time and are allocated to machines as soon as they arrive. All jobs’ characteristics are unknown before their arrival times. We implemented several scheduling algorithms and measured three metrics for comparison: response time, bounded slowdown and system utilization. Our simulation allowed us to identify which of the considered methods perform better for response time, bounded slowdown and utilization at different system loads. We also evaluate the usefulness of the methods if certain grid characteristics such as heterogeneity of jobs and resources are known in advance.
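A minimal sketch of one classic immediate-mode heuristic, Minimum Completion Time (MCT), which assigns each job on arrival to the machine that would finish it earliest (the function names are mine, and this is one representative heuristic rather than the paper's full set):

```python
def mct_assign(machine_ready, run_times):
    """Minimum Completion Time: send the arriving job to the machine that
    would finish it earliest, given each machine's current ready time and
    the job's (possibly heterogeneous) run time on each machine."""
    best = min(range(len(machine_ready)),
               key=lambda m: machine_ready[m] + run_times[m])
    machine_ready[best] += run_times[best]
    return best

def schedule_stream(jobs, n_machines):
    """Assign a stream of jobs immediately, in arrival order.

    jobs: one per-machine run-time list per job (heterogeneous machines).
    Returns the assignment for each job and the final machine ready times.
    """
    ready = [0.0] * n_machines
    return [mct_assign(ready, rt) for rt in jobs], ready
```

For example, with two machines and jobs whose run times are `[2, 5]`, `[2, 5]`, and `[9, 1]`, MCT sends the first two jobs to machine 0 and the third to machine 1, since machine heterogeneity makes machine 1 finish that job far sooner despite machine 0 being the general favorite.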
Artificially Intelligent Grid Assistant.
Roshan Sumbaly, Abhishek Kumar, Shubham Malhotra, Gaurav Paruthi
Student Paper at the 14th Intl. Conf. on High Performance Computing (HiPC), 2007. pdf HiPC 2007
We present a Grid-based application that works in collaboration with natural language processing (NLP) to act as a virtual assistant. The application can answer queries in a conversational manner and is capable of being deployed in various scenarios. Given the prevalence of large data sources in natural language engineering and the need for raw computational power in analyzing such data, the Grid computing paradigm provides efficiency and scalability otherwise unavailable to researchers. In our work we explore the integration of the Grid with NLP to mine relevant answers from these distributed resources. Our system receives queries from various interfaces and then uses NLP to determine the domain of the question. The Grid then routes each query to the correct knowledge farm, depending on the domain found. Knowledge farms are distributed components that hold large annotated domain-specific datasets. We propose a novel method in which the Grid and NLP work in concert to mine relevant information quickly.
Recent Selected Talks
March 2013
UC San Diego, M.S. Oral Exam
Feature Engineering for Classification of Clinical Text (slides)
December 2012
UC San Diego course presentation, MED267
Developing a lexical taxonomy for recreational drug slang terms by crawling the Internet. (slides)
June 2012
Data Mining Cup Competition, Berlin, Germany (link)
Predicting product sales from historical data.

Links to a few more things that I've worked on or experimented with.
Source code of projects that I've worked on.
Most are available under an Apache-like license.
Hobbies and travelogues
Recent Events
2013/04/29 — Joined Google as a software engineer. I'll be developing algorithms and systems to track user reputation based on social signals.
2013/03/20 — Gave a talk on TextVect and feature engineering for clinical text classification. Presented to the CSE faculty committee. slides
2012/09/21 — Completed my summer internship at Google!
2012/06/27 — Gave a talk on predicting product sales from historical data, as part of the Data Mining Cup competition organized by Prudsys.
2012/06/17 — Gave a talk on developing a lexical taxonomy ontology for recreational drugs by discovering new terms via the Internet. Based on a course project for MED267. slides
2012/06/17 — Road trip from San Diego to San Francisco on my motorcycle! pics
1600 Amphitheatre Parkway
Mountain View, CA 94043
(LDAP: abhishekkr)

1200 Dale Ave, Apt 51
Mountain View, CA 94040