Twitter Annotated Corpus

In the interest of competitive research, we have made a subset of our hand-annotated corpus publicly available. Two types of annotations are currently available: binary and span. The binary annotations indicate the presence or absence of adverse drug reactions (ADRs) in tweets. The span annotations (containing spans and normalized UMLS IDs) specify the exact mentions of ADRs in tweets. This corpus is freely available to download.
The binary annotations indicate whether or not a tweet mentions an adverse drug reaction. The corpus consists of 7574 tweets crawled from Twitter. These tweets were preprocessed to remove advertisements and then annotated by experts for the presence of adverse drug reactions.
- Tweet Downloading Script
- File containing user IDs, tweet IDs, and the binary annotations

This release set contains 70% of the data described in the following publications. Please cite one of the following papers when using this data:
- Abeed Sarker and Graciela Gonzalez. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics. 53 (2015) 196-207.
- Rachel Ginn, Pranoti Pimpalkhute, Azadeh Nikfarjam, Apurv Patki, Karen O’Connor, Abeed Sarker, Karen Smith and Graciela Gonzalez. Mining Twitter for Adverse Drug Reaction Mentions: A Corpus and Classification Benchmark. In Proceedings of the Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BioTxtM2014). May 2014. Reykjavik, Iceland.
The full annotation corpus consists of 1784 tweets. This corpus was annotated manually for ADR mentions in each tweet. The annotations include the locations (spans) of the mentions and their UMLS IDs. See the link below for downloads and the publication associated with this data set.
- Data set link
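
As a rough illustration of how the two annotation types might be consumed, the Python sketch below reads binary annotation rows and extracts a span from a tweet. The file layout is an assumption, not something this page specifies: it supposes tab-separated columns in the order tweet ID, user ID, label (1 for ADR, 0 otherwise), and span offsets that are 0-based and end-exclusive. The names `BinaryAnnotation`, `read_binary_annotations`, and `span_text` are hypothetical, and the sample rows are made up. Check the downloaded files and adjust accordingly.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class BinaryAnnotation(NamedTuple):
    tweet_id: str
    user_id: str
    has_adr: bool


def read_binary_annotations(handle):
    """Parse rows of (tweet ID, user ID, label).

    Assumed layout: tab-separated, label "1" means the tweet mentions an ADR.
    """
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) < 3:
            continue  # skip blank or malformed lines
        tweet_id, user_id, label = record[0], record[1], record[2]
        rows.append(BinaryAnnotation(tweet_id, user_id, label.strip() == "1"))
    return rows


def span_text(tweet_text, start, end):
    """Return the annotated mention.

    Assumes 0-based, end-exclusive character offsets.
    """
    return tweet_text[start:end]


# Made-up sample rows; the IDs are placeholders, not corpus data.
sample = io.StringIO("1001\t501\t0\n1002\t502\t1\n1003\t503\t0\n")
annotations = read_binary_annotations(sample)
counts = Counter(a.has_adr for a in annotations)
print(counts[True], "ADR tweets;", counts[False], "non-ADR tweets")

# Made-up tweet text with a hypothetical span annotation.
example_tweet = "this drug gives me a terrible headache"
print(span_text(example_tweet, 30, 38))
```

Since the release file contains only IDs and labels, the tweet text itself would still need to be fetched with the downloading script before any span offsets can be applied.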