A Customizable Pipeline for Social Media Text Normalization
Abeed Sarker, Graciela Gonzalez
Department of Biomedical Informatics, Arizona State University, Scottsdale, USA
Abstract
The presence of non-standard terms in social media text poses a crucial problem for natural language processing systems. We propose a two-phase, hybrid pipeline for social media text normalization. In the first phase, basic pre-processing and publicly available, social-media-specific vocabularies are used to transform out-of-vocabulary terms into in-vocabulary terms with high precision. In the second phase, a set of language models, generated from the partially normalized output of the first phase, is combined with supervised learning and rule-based approaches to further improve performance. On a publicly available, domain-independent data set from Twitter, our system obtained an F-score of 0.836. Preliminary extrinsic evaluations suggest that our normalization approach improves social media text classification performance. The proposed pipeline can also be easily customized for domain-specific text, such as health-related text.
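
To make the two-phase idea concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The lexicon, vocabulary, and training messages are hypothetical toy data, and the names phase_one, phase_two, and normalize are invented for illustration. Phase two here substitutes a simple bigram count plus an edit-distance filter for the language models, supervised learning, and rules described above.

```python
from collections import Counter

# Hypothetical stand-ins for the public social media vocabularies (toy data).
SLANG_LEXICON = {"u": "you", "2moro": "tomorrow"}
VOCABULARY = {"see", "you", "tomorrow", "good", "god", "night", "a", "have", "day"}


def phase_one(tokens):
    """Phase 1: lowercase each token and apply high-precision lexicon lookups."""
    return [SLANG_LEXICON.get(t.lower(), t.lower()) for t in tokens]


def build_bigram_counts(sentences):
    """Count bigrams over partially normalized sentences (a very crude language model)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(zip(["<s>"] + sentence, sentence))
    return counts


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def phase_two(tokens, bigram_counts, max_distance=2):
    """Phase 2: replace remaining out-of-vocabulary tokens using context."""
    output, prev = [], "<s>"
    for token in tokens:
        if token not in VOCABULARY:
            candidates = [w for w in sorted(VOCABULARY)
                          if edit_distance(token, w) <= max_distance]
            if candidates:
                # Prefer the candidate seen most often after the previous word,
                # then the one closest in spelling.
                token = max(candidates,
                            key=lambda w: (bigram_counts[(prev, w)],
                                           -edit_distance(token, w)))
        output.append(token)
        prev = token
    return output


def normalize(tokens, bigram_counts):
    """Run both phases in sequence."""
    return phase_two(phase_one(tokens), bigram_counts)


# The language model is built from phase-one output, mirroring the pipeline order.
raw_corpus = [["Have", "a", "good", "night"], ["a", "good", "day"], ["see", "u", "2moro"]]
bigrams = build_bigram_counts([phase_one(s) for s in raw_corpus])

print(normalize(["See", "u", "2moro", "have", "a", "gd", "day"], bigrams))
```

In this toy run, "u" and "2moro" are resolved by the lexicon in the first phase, while "gd" has three spelling-level candidates ("a", "god", "good") and is resolved to "good" only because of the preceding word "a". Swapping in a domain-specific lexicon and vocabulary, for example health-related terms, is one plausible way such a pipeline could be customized.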