Diego Lab: Biomedical Informatics Lab at ASU


A Customizable Pipeline for Social Media Text Normalization

Abeed Sarker, Graciela Gonzalez

Department of Biomedical Informatics, Arizona State University, Scottsdale, USA
Arizona State University, Tempe, USA

 

Abstract

The presence of non-standard terms in social media text poses a crucial problem for natural language processing systems. We propose a two-phase, hybrid pipeline for social media text normalization. In the first phase, basic pre-processing and publicly available, social-media-specific vocabularies are used to transform out-of-vocabulary terms into in-vocabulary terms with high precision. In the second phase, a set of language models, generated from the partially normalized data of the first phase, is combined with supervised-learning and rule-based approaches to further improve performance. On a publicly available, domain-independent data set from Twitter, our system obtained an F-score of 0.836. Preliminary extrinsic evaluations suggest that our normalization approach improves social media text classification performance. The proposed pipeline can also be easily customized for domain-specific text, such as health-related text.
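
The first phase described in the abstract can be illustrated with a minimal sketch of dictionary-based normalization: tokens that are already in the vocabulary are kept, known non-standard tokens are replaced via a lookup table, and anything else is passed through unchanged for a later phase. This is not the authors' script. The names `SLANG_MAP`, `VOCABULARY`, and `normalize_phase_one`, the tokenization rule, and all mapping entries are assumptions made for illustration and are not taken from the downloadable resources.

```python
import re

# Hypothetical slang-to-standard mappings, for illustration only.
# A real run would load a published social media vocabulary instead.
SLANG_MAP = {"u": "you", "gr8": "great", "2moro": "tomorrow", "pls": "please"}

# Hypothetical in-vocabulary word list, for illustration only.
VOCABULARY = {"see", "you", "great", "tomorrow", "please", "that", "is"}

# Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def normalize_phase_one(text, slang_map=SLANG_MAP, vocabulary=VOCABULARY):
    """Replace known out-of-vocabulary tokens with in-vocabulary terms.

    Tokens found in neither the vocabulary nor the mapping are returned
    unchanged, so that a later phase could handle them.
    """
    normalized = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in vocabulary:
            normalized.append(token)             # already standard
        elif token in slang_map:
            normalized.append(slang_map[token])  # high-precision lookup
        else:
            normalized.append(token)             # unresolved, left as-is
    return normalized


if __name__ == "__main__":
    print(normalize_phase_one("See u 2moro, pls!"))
```

Restricting this step to exact lookups keeps precision high at the cost of recall, which is why unresolved tokens are left untouched rather than guessed at.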

Last Update: 03 May 2016

 

Quick Download:
NoSlang Modified
1 and 2 Letter Term Mappings
English Spellings
American Spellings
Simple (rule-based) Normalizer Script

 

Contact Information:
Abeed Sarker
Email: abeed.sarker@asu.edu