CrowdCorrect: a curation pipeline for social data cleansing and curation

Amin Beheshti*, Kushal Vaghani, Boualem Benatallah, Alireza Tabebordbar

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

15 Citations (Scopus)


Process and data are equally important for business process management. Data-driven approaches in process analytics aim to value decisions that can be backed up with verifiable private and open data. Over the last few years, data-driven analysis of how knowledge workers and customers interact in social contexts, often with data obtained from social networking services such as Twitter and Facebook, has become a vital asset for organizations. For example, governments have started to extract knowledge and derive insights from vastly growing open data to improve their services. A key challenge in analyzing social data is to understand the raw data generated by social actors and prepare it for analytic tasks. In this context, it is important to transform the raw data into contextualized data and knowledge. This task, known as data curation, involves identifying relevant data sources, extracting data and knowledge, and cleansing, maintaining, merging, enriching and linking data and knowledge. In this paper, we present CrowdCorrect, a data curation pipeline that enables analysts to cleanse and curate social data and prepare it for reliable business data analytics. The first step offers automatic feature extraction, correction and enrichment. Next, we design micro-tasks and use the knowledge of the crowd to identify and correct information items that could not be corrected in the first step. Finally, we offer a domain-model mediated method to use the knowledge of domain experts to identify and correct items that could not be corrected in previous steps. We adopt a typical scenario for analyzing Urban Social Issues from Twitter as it relates to the Government Budget, to highlight how CrowdCorrect significantly improves the quality of extracted knowledge compared to the classical curation pipeline in the absence of the knowledge of the crowd and domain experts.
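
The three-step escalation described in the abstract (automatic correction, then crowd micro-tasks, then domain experts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the vocabulary, the lookup tables standing in for crowd and expert answers, and every function name below are hypothetical, chosen only to show how an item that one step cannot correct is handed to the next.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical data standing in for real services (assumptions, not from the paper).
VOCABULARY = {"budget", "health", "hospital"}
AUTO_FIXES = {"budjet": "budget"}        # e.g. output of a spell-checker
CROWD_ANSWERS = {"hlth": "health"}       # e.g. majority vote on a micro-task
EXPERT_ANSWERS = {"hosp": "hospital"}    # e.g. mapping to a domain-model concept


@dataclass
class CuratedItem:
    raw: str
    corrected: Optional[str] = None
    resolved_by: Optional[str] = None    # name of the step that fixed the item


# A step returns the corrected token, or None if it cannot resolve it.
Step = Callable[[str], Optional[str]]


def automatic_step(token: str) -> Optional[str]:
    # Accept tokens already in the vocabulary; otherwise try a known fix.
    if token in VOCABULARY:
        return token
    return AUTO_FIXES.get(token)


def crowd_step(token: str) -> Optional[str]:
    return CROWD_ANSWERS.get(token)


def expert_step(token: str) -> Optional[str]:
    return EXPERT_ANSWERS.get(token)


STEPS: List[Tuple[str, Step]] = [
    ("automatic", automatic_step),
    ("crowd", crowd_step),
    ("expert", expert_step),
]


def curate(tokens: List[str], steps: List[Tuple[str, Step]] = STEPS) -> List[CuratedItem]:
    """Pass each token through the steps in order, stopping at the first fix."""
    results = []
    for token in tokens:
        item = CuratedItem(raw=token)
        for name, step in steps:
            fixed = step(token)
            if fixed is not None:
                item.corrected, item.resolved_by = fixed, name
                break
        results.append(item)
    return results
```

Calling `curate(["budget", "budjet", "hlth", "hosp", "xyz"])` in this sketch resolves the first two tokens automatically, the third through the crowd table and the fourth through the expert table, and leaves the last one uncorrected. The ordering reflects the idea that the more costly human steps only see items the earlier steps could not handle.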

Original language: English
Title of host publication: Information systems in the big data era
Subtitle of host publication: CAiSE Forum 2018, Proceedings
Editors: Jan Mendling, Haralambos Mouratidis
Publisher: Springer-VDI-Verlag GmbH & Co. KG
Number of pages: 15
ISBN (Electronic): 9783319929019
ISBN (Print): 9783319929002
Publication status: Published - 1 Jan 2018
Event: CAiSE Forum 2018 held as part of the 30th International Conference on Advanced Information Systems Engineering, CAiSE 2018 - Tallinn, Estonia
Duration: 11 Jun 2018 to 15 Jun 2018

Publication series

Name: Lecture Notes in Business Information Processing
ISSN (Print): 1865-1348


Conference: CAiSE Forum 2018 held as part of the 30th International Conference on Advanced Information Systems Engineering, CAiSE 2018

