Meena Nagarajan

What do People Write

My work on identifying what people talk about on social media centers on named entity recognition, with particular attention to a class of entities called Cultural Entities: those that refer to artifacts of culture, such as the names of movies, TV shows, songs, and books. Cultural entities are particularly hard to extract for two reasons. A single name can refer to multiple real-world entities (e.g. “The Lord of the Rings” can refer to several movies, different video games, and a number of novels), and many names are built from fragments of everyday language.

In recent work with Amir Padovitz at Microsoft Research, I explored a feature-based approach to improving the accuracy of existing named entity classifiers in identifying such cultural entities. We hypothesized that knowing how hard an entity is to extract is useful for learning better entity classifiers. With such a measure, entity extractors become “complexity aware”, i.e. they can learn to respond differently to signals depending on the entity's extraction difficulty. We proposed and developed an unsupervised algorithm that derives this prior using graph-based spreading activation and clustering techniques. In evaluations on identifying movie named entities in informal weblog posts, we found overwhelming evidence that this new prior improves extraction accuracy, supporting our hypothesis about engineering “complexity aware” classifiers.
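For readers unfamiliar with spreading activation, the following is a minimal sketch of the generic technique only. The graph layout, the `decay` and `threshold` parameters, and the function name `spread_activation` are assumptions made for illustration; how the actual work builds its graph, clusters the results, and turns activations into an extraction-difficulty prior is not described here and is not reproduced.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, threshold=0.05, max_steps=3):
    """Generic spreading activation over a weighted directed graph.

    graph: dict mapping node -> dict of neighbor -> edge weight in [0, 1]
    seeds: nodes that start with activation 1.0
    Returns a dict mapping node -> accumulated activation.
    """
    activation = defaultdict(float)
    for seed in seeds:
        activation[seed] = 1.0

    # Energy that each node will pass on in the current step.
    frontier = dict.fromkeys(seeds, 1.0)
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                # Drop contributions too weak to matter.
                if passed >= threshold:
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Hypothetical toy graph: a chain a -> b -> c with full-strength edges.
toy_graph = {"a": {"b": 1.0}, "b": {"c": 1.0}}
scores = spread_activation(toy_graph, ["a"])
```

In this toy run, activation halves at each hop away from the seed, so nodes closer to the seed end up with higher scores.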

My second investigation into identifying cultural entities, conducted with researchers at IBM Almaden, used MusicBrainz, a rich knowledge base of music entities and their relationships (encoded in RDF), to annotate artist and track/album mentions in user-generated content (UGC) from MySpace music forums [ISWC09a]. In this work, we showed that pruning parts of the domain model, using constraints implied by the content together with metadata from the domain model, effectively reduces the number of entity disambiguation scenarios and improves spotting precision. For example, the comment ‘Saw you last night in Denver’ indicates that the artist is still alive, allowing us to rule out parts of the ontology that mention artists who are not. We also showed that simple ML classifiers built over such pruned models and trained on a combination of feature types (natural language features, domain-related words such as music, song, and concert, and sentiment expression features) yielded better results than classifiers using any one feature type alone.
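The pruning idea in the Denver example can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the system described in [ISWC09a]: the `Artist` record, the placeholder artist names and years, the `LIVE_CUES` phrase list, and both function names are hypothetical, and a real domain model would be queried from RDF rather than held in a Python list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Artist:
    name: str
    death_year: Optional[int]  # None means the artist is still living


# Toy stand-in for a domain model; names and years are made up.
ARTISTS = [
    Artist("Artist A", None),
    Artist("Artist B", 1994),
    Artist("Artist C", None),
]

# Hypothetical cue phrases suggesting a recent live appearance.
LIVE_CUES = ("saw you last night", "see you on tour", "great show")


def implies_living_artist(comment: str) -> bool:
    """Return True if the comment contains a cue implying a live artist."""
    text = comment.lower()
    return any(cue in text for cue in LIVE_CUES)


def prune_candidates(comment: str, artists: List[Artist]) -> List[Artist]:
    """Drop candidates that contradict a constraint implied by the comment."""
    if implies_living_artist(comment):
        return [a for a in artists if a.death_year is None]
    # No constraint detected: keep the full candidate set.
    return list(artists)


kept = prune_candidates("Saw you last night in Denver", ARTISTS)
```

With fewer candidates left after pruning, a downstream spotter or classifier has fewer ambiguous matches to resolve.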

In other related efforts, I worked on providing spatio-temporal-thematic summaries of chatter on Twitter using contextual information from the social medium [WISE09]. As part of a targeted content delivery platform, I used an information-theoretic algorithm to eliminate off-topic chatter in user-generated content on MySpace and Facebook forums so that the main topic of discussion could be detected [WI09]. All of these studies highlighted pertinent challenges that informal UGC brings to text analytics.