Diego Lab : Biomedical Informatics Lab at ASU

A Corpus for Mining Drug-related Knowledge from Twitter Chatter: Data-centric Language Models and their Utilities

Abeed Sarker, Graciela Gonzalez

Department of Biostatistics and Epidemiology, Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania



In this data article, we present to the data science, natural language processing and public health communities an unlabeled corpus and a set of language models. We collected the data from Twitter using drug names as keywords, including their common misspelled forms. Using this data, which is rich in drug-related chatter, we developed language models to aid the development of data mining tools and methods in this domain. We generated several models that capture (i) distributed word representations and (ii) probabilities of n-gram sequences. The data set we are releasing consists of 267,215 Twitter posts made during the four-month period from November 2014 to February 2015. The posts mention over 250 drug-related keywords. The language models encapsulate semantic and sequential properties of the texts.
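As a rough illustration of what "probabilities of n-gram sequences" means, the sketch below estimates bigram probabilities by maximum likelihood from a handful of posts. It is not the procedure used to build the released models: the function name bigram_probabilities and the example posts in toy_posts are hypothetical, invented here for illustration, and whitespace tokenization is an assumed simplification.

```python
from collections import Counter


def bigram_probabilities(posts):
    """Return maximum-likelihood estimates of P(w2 | w1) as a dict keyed by (w1, w2)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for post in posts:
        # Assumed simplification: lowercase and split on whitespace.
        tokens = ["<s>"] + post.lower().split() + ["</s>"]
        # Every token except the last one serves as the context of a bigram.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Hypothetical posts, invented for illustration (not taken from the corpus).
toy_posts = [
    "took ibuprofen for my headache",
    "took ibuprofen again today",
    "took aspirin for my headache",
]
probs = bigram_probabilities(toy_posts)
print(probs[("took", "ibuprofen")])
```

Here "took" is followed by "ibuprofen" in two of its three occurrences, so the estimate for that pair is 2/3. A model trained on a large corpus would normally also apply smoothing so that unseen word pairs do not receive zero probability.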

Last Update: 15 Aug 2016


Quick Download:
Twitter chatter and download script (Nov 2014 - Feb 2015)
Release V3: n-gram distributed model with ~1B Twitter set [*USE THIS*]
Common misspellings used for retrieval
Paper on ScienceDirect


Contact Information:
Abeed Sarker