

Information extraction (IE) [1] is a common sub-area of natural language processing that focuses on extracting structured data from unstructured data. One example of IE is identifying named entities in a text, e.g., "Barack Obama served as the president of the USA". Here, Barack Obama and USA are named entities of types PERSON and LOCATION, respectively. Another example is identifying the sentiment expressed in a text, e.g., "This movie was awesome". Here, the sentiment expressed is positive. A final example is identifying various linguistic aspects of a text, e.g., part-of-speech tags, noun phrases, and dependency parses, which can serve as features for additional IE tasks.
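To make the idea of "structured data from unstructured data" concrete, here is a minimal sketch in Python that applies the two example sentences above to toy lookup tables. The names `GAZETTEER`, `SENTIMENT_LEXICON`, `extract_entities`, and `classify_sentiment` are hypothetical and exist only for this illustration. The lookup approach is a deliberately simple stand-in and is not how the tools covered in this tutorial work.

```python
import re

# Toy lookup tables, invented for this illustration only.
GAZETTEER = {
    "Barack Obama": "PERSON",
    "USA": "LOCATION",
}
SENTIMENT_LEXICON = {
    "awesome": "positive",
    "terrible": "negative",
}


def extract_entities(text):
    """Return (surface form, entity type, start offset) for each gazetteer match."""
    entities = []
    for name, label in GAZETTEER.items():
        # Match the name only on word boundaries, so "USA" does not match inside "USAGE".
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label, match.start()))
    # Report entities in the order they appear in the text.
    return sorted(entities, key=lambda entity: entity[2])


def classify_sentiment(text):
    """Return the label of the first lexicon word found, or 'neutral' if none is found."""
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in SENTIMENT_LEXICON:
            return SENTIMENT_LEXICON[token]
    return "neutral"


if __name__ == "__main__":
    print(extract_entities("Barack Obama served as the president of the USA"))
    print(classify_sentiment("This movie was awesome"))
```

A lookup table cannot handle unseen names, misspellings, or negation ("not awesome"), which is one reason the learned models discussed in this tutorial are used instead.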

This tutorial introduces participants to a) the usage of Python-based, open-source tools that support IE from social media data (mainly Twitter), and b) best practices for ensuring the reproducibility of research. Participants will learn and practice various semantic and syntactic IE techniques that are commonly used for analyzing tweets. Additionally, participants will be familiarized with the landscape of publicly available tweet data, and with methods for collecting and preparing these data for analysis. Finally, participants will be trained to use a suite of open-source tools (SAIL for active learning [2], TwitterNER for named entity recognition [3], and SocialMediaIE for multi-task learning [4]), which utilize advanced machine learning techniques (e.g., deep learning, active learning with human-in-the-loop, and multi-task learning) to perform IE on their own or existing datasets. The tools introduced in the tutorial will focus on the three main stages of IE, namely, collection of data (including annotation), data processing and analytics, and visualization of the extracted information.

Previous and upcoming tutorial details

Date Venue
Oct 20, 2022 CIKM, Austin, TX, USA
Jun 20, 2022 LREC, Marseille, France
Apr 10, 2022 ECIR, Stavanger, Norway
May 11, 2020 LREC, Marseille, France - Cancelled due to Covid
May 11, 2021 TheWebConf, Ljubljana, Slovenia
Jul 17, 2020 IC2S2, Boston, USA
Sep 17, 2019 ACM HyperText, Hof, Germany
Jul 24, 2019 University of Illinois at Urbana-Champaign, Research Park

Expected background and prerequisites of audience

We want to make this tutorial as accessible as possible to the Hypertext community. However, all our tools are implemented in the Python programming language, and we will be using these tools to go through the tutorial. Providing full working knowledge of Python during this tutorial is not possible. Hence, we expect basic proficiency with Python for data analysis use cases. Those who are unfamiliar with Python can prepare themselves before the tutorial by going through the tutorial provided at:

We do not presume knowledge of machine learning concepts or deep neural networks. We will be providing just enough background to help you utilize these concepts for IE.

We expect participants to bring their own laptops with all the tools installed. Details on what tools need to be installed will be provided a week before the tutorial.

Finally, we expect basic familiarity with social media platforms like Twitter and Facebook. Participants who want to collect their own Twitter data for a tutorial exercise need to have a Twitter account, which can be brand new and used just for this tutorial. For participants without a Twitter account, this exercise will still be presented on screen, and documentation will be provided so that participants can follow along and work through the examples later. The same is true for all exercises provided in the tutorial.

Presenter Bios

Shubhanshu Mishra

Content Understanding Research, Twitter, Inc.

Shubhanshu Mishra is a Senior Machine Learning Researcher at Twitter. He earned his Ph.D. in Information Sciences from the University of Illinois at Urbana-Champaign in 2020. His thesis was titled "Information extraction from digital social trace data: applications in social media and scholarly data analysis". His current work is at the intersection of machine learning, information extraction, social network analysis, and visualizations. His research has led to the development of open-source information extraction tools for large-scale social media and scholarly data. He completed his Integrated Bachelor's and Master's degree in Mathematics and Computing at the Indian Institute of Technology, Kharagpur in 2012.

Rezvaneh (Shadi) Rezapour

Department of Information Science at Drexel’s College of Computing and Informatics

Shadi is an Assistant Professor in the Department of Information Science at Drexel’s College of Computing and Informatics. Her research interests lie at the intersection of Computational Social Science and Natural Language Processing (NLP). More specifically, she is interested in bringing computational models and social science theories together to analyze texts and to better understand and explain real-world behaviors, attitudes, and cultures. Her research goal is to develop “socially-aware” NLP models that bring social and cultural contexts into the analysis of (human) language, in order to better capture attributes such as social identities, stances, morals, and power, and to understand real-world communication. Shadi completed her Ph.D. in Information Sciences at the University of Illinois at Urbana-Champaign (UIUC), where she was advised by Dr. Jana Diesner.

Jana Diesner

School of Information Sciences, University of Illinois at Urbana-Champaign

Jana Diesner is an Associate Professor at the School of Information Sciences at the University of Illinois at Urbana-Champaign. Her research in social computing and human-centered data science combines methods from natural language processing, social network analysis, and machine learning with theories from the social sciences, humanities, and linguistics to advance knowledge and discovery about interaction-based and information-based systems. Her lab is currently working on projects related to 1) biases in data, technology, and human decision-making; 2) validating social science theories in contemporary contexts; 3) impact assessment; 4) crisis informatics; and 5) data governance and responsible computing. Diesner has published over 55 refereed articles. She received her PhD (2012) from the School of Computer Science at Carnegie Mellon University. For more information, see