Quality Aspects in Big Data Systems

Foreword by the Co-Chair

Recent studies have shown that poor-quality data is predominant in many Big Data systems, which draw on a variety of sources such as linked data, mobile data, social media data, Internet of Things data, and many others. The fourth “V” of big data (veracity) refers directly to uncertainty and data quality problems. Given the variety of Big Data sources, and the sheer volume and velocity of data, new frameworks and methods are needed for quality assessment, management, and improvement. Although significant progress has been made, mainly in technologies for processing Big Data, several challenges remain, including distributed and streaming discovery of data quality, crowdsourced data cleaning, and tools/data validators.

In this thematic track, the focus is on novel contributions addressing Quality Aspects in Big Data Systems, ranging from conceptual frameworks to case studies, from design to implementation, from data collection to data analytics, and from data cleansing to data integration.

The final program of the Quality Aspects in Big Data Systems Thematic Track 2016 includes papers on topics such as:

  • Big Data Quality and Provenance Control

  • Big Data Quality Management

  • Big Data Quality Metrics

  • Big Data Cleansing and Integration

  • Evaluating Data Validity and Consistency across Databases

  • Algorithms and Approaches for Detecting Outliers, Duplicate Data, and Inconsistent Data

  • Algorithms and Approaches for Deriving Missing Data

  • Big Data Persistence and Preservation

  • Big Data Quality Discovery

  • Efficiency versus Accuracy Trade-off

  • Data Quality in Distributed and Streaming Analytics

The program includes one accepted full paper and one work in progress.

The full paper “Models of integrity assurance in big relational databases”, by Andrey Malikov, Vladimir Voronkin, and Nikolay Shiryaev, studies the problem of integrity assurance in relational databases and discusses the issue of group integrity of tuple subsets with respect to corporate integrity constraints. The authors apply finite-state machine theory to guarantee the group integrity of data: acceptable states are represented explicitly, and tuples that do not conform are rejected, advancing the database towards a valid state. After creating SQL queries to manipulate data and control its integrity for real data domains, the work also studies query performance, determines the appropriate level of transaction isolation, and generates query plans.
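As a minimal sketch of the general idea (not the authors' implementation; the states and the example constraint below are hypothetical), a finite-state machine can gate insertions so that only tuples satisfying a group constraint are accepted, keeping the database in a valid state:

```python
# Hypothetical sketch of FSM-gated insertion (not the paper's code).
# The machine has an accepting state ("valid") and a rejecting state
# ("invalid"); an insert that would move it to "invalid" is refused.

VALID, INVALID = "valid", "invalid"

def step(state, tuple_conforms):
    """One FSM transition per attempted insert."""
    return VALID if (state == VALID and tuple_conforms) else INVALID

def insert_all(tuples, conforms):
    """Accept only tuples that keep the FSM in the accepting state."""
    accepted = []
    for t in tuples:
        if step(VALID, conforms(t)) == VALID:
            accepted.append(t)  # advances the database to a valid state
        # non-conforming tuples are rejected; the group stays valid
    return accepted

rows = [("a", 1), ("b", -2), ("c", 3)]
# Example (made-up) group constraint: the second field must be positive.
kept = insert_all(rows, conforms=lambda t: t[1] > 0)
print(kept)  # [('a', 1), ('c', 3)]
```

In the paper this logic lives inside the database as SQL constraints and triggers rather than application code; the sketch only illustrates the accept/reject semantics.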

The work in progress “On the Development of A Metric for Quality of Information Content over Anonymised Data-Sets”, by Ian Oliver and Yoan Miche, presents a framework for measuring the impact of data anonymisation and obfuscation in information-theoretic and data-mining terms. The authors study the distortion of information caused by privacy functions applied to the data, proposing metrics to find a balance between the privacy functions used and the level of distortion. They propose Mutual Information over non-Euclidean spaces as a means of measuring the distortion induced by a privacy function, and Machine Learning techniques to quantify the impact of obfuscation in terms of further data-mining goals.
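To illustrate the intuition only (the data, the bucketing privacy function, and the plain discrete estimator below are hypothetical; the paper works over non-Euclidean spaces), empirical mutual information between original and obfuscated values measures how much information the privacy function preserves:

```python
# Hedged sketch: discrete empirical mutual information I(X; Y) in bits,
# used to compare a column before and after a (made-up) privacy function.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X; Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Hypothetical privacy function: coarsen ages into decade buckets.
ages = [23, 27, 31, 35, 44, 47, 52, 58]
obfuscated = [a // 10 * 10 for a in ages]

# High MI -> little distortion; MI of 0 -> obfuscation destroyed everything.
print(mutual_information(ages, obfuscated))  # 2.0 bits (of 3 bits in ages)
```

Here each decade bucket retains 2 of the 3 bits carried by the eight distinct ages, so this particular obfuscation costs 1 bit of information.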

Dr. Monica Wachowicz is Associate Professor and the NSERC/Cisco Industrial Research Chair in Real-Time Mobility Analytics at the University of New Brunswick, Canada. She is also the Director of the People in Motion Laboratory, a centre of expertise in the application of the Internet of Things (IoT) to smart cities. Her research is directly related to the vision of a future constellation of inter-connected devices that will contain information about the context and location of things across several geographical and temporal scales. She works at the intersection of (1) Streaming Analytics, for analyzing massive IoT data in search of valuable spatio-temporal patterns in real time; and (2) Art, Cartography, and Representations of Mobility, for making the maps of the future, which will be culturally and linguistically designed to provide a greater “sense of people” in motion. A founding member of the IEEE Big Data Initiative and of the International Journal of Big Data Intelligence, she is also joint Editor-in-Chief of the Cartographica journal. Her pioneering work in multidisciplinary teams from government, industry, and research organizations is fostering the next generation of geospatial data scientists for innovation.

Maribel Yasmina Santos is an Associate Professor at the Department of Information Systems, University of Minho, Portugal. She received the Aggregated title (Habilitation) in Information Systems and Technologies from the University of Minho in 2012 and a Ph.D. in Information Systems and Technologies from the same university in 2001. She holds a degree in Informatics and Systems Engineering and an MSc in Informatics, both from the University of Minho (1991 and 1996, respectively). Her research interests include big data, business intelligence and analytics, (spatial) data warehousing and mining, and (spatio-temporal) data models.

Track Committee

Chair: Monica Wachowicz, University of New Brunswick, Canada

Co-Chair: Maribel Yasmina Santos, University of Minho, Portugal

Program Committee:

Olivier Teste, Institut de Recherche en Informatique de Toulouse, France

Daniel Stamate, Goldsmiths, University of London, UK

Anisa Rula, University of Milano-Bicocca, Italy

Mohamed Mokbel, University of Minnesota, USA

Dimitris Kotzinos, Université de Cergy-Pontoise, France

Ismael Caballero, University of Castilla-La Mancha, Spain

Gerard Heuvelink, Wageningen University, The Netherlands

Ioannis Chrysakis, ICS FORTH, Greece

Davide Tosi, Università degli Studi dell'Insubria, Italy

(list not yet completed)

Call for papers:

Recent studies have shown that poor-quality data is predominant in many Big Data systems, which draw on a variety of sources such as linked data, mobile data, social media data, Internet of Things data, and many others. The fourth “V” of big data (veracity) refers directly to uncertainty and data quality problems. Given the variety of Big Data sources, and the sheer volume and velocity of data, new frameworks and methods are needed for quality assessment, management, and improvement. Although significant progress has been made, mainly in technologies for processing Big Data, several challenges remain, including distributed and streaming discovery of data quality, crowdsourced data cleaning, and tools/data validators.

In this thematic track, we seek novel contributions addressing quality issues in Big Data Systems, ranging from conceptual frameworks to case studies, from design to implementation, from data collection to data analytics, and from data cleansing to data integration. Suggested topics of interest include, but are not restricted to:

  • Big Data Quality and Provenance Control

  • Big Data Quality Management

  • Big Data Quality Metrics

  • Big Data Cleansing and Integration

  • Evaluating Data Validity and Consistency across Databases

  • Algorithms and Approaches for Detecting Outliers, Duplicate Data, and Inconsistent Data

  • Algorithms and Approaches for Deriving Missing Data

  • Big Data Persistence and Preservation

  • Big Data Quality Discovery

  • Efficiency versus Accuracy Trade-off

  • Data Quality in Distributed and Streaming Analytics

Submission process:

Authors should submit a PDF version of their paper via http://www.easychair.org/conferences/?conf=quatic2016. Full Papers must be in CPS format and must not exceed 6 pages, including figures, references, and appendices. Work In Progress (WIP) papers with relevant preliminary results are limited to 3 pages. Submissions must be original and will be reviewed by the Track Program Committee. Accepted papers will be included in the electronic proceedings of QUATIC’2016 published by Conference Publishing Services (CPS), submitted for archiving in Xplore and CSDL, and submitted for indexing in ISI Web of Science, SCOPUS, ACM Portal, DBLP, and the DOI System, subject to one of the authors registering for the conference. The authors of the best papers of this thematic track will be invited to submit extended versions to the main track of the conference.

Important dates:

    • Paper submission: Sunday, May 15, 2016 (extended from April 17)

    • Authors' notifications: Sunday, June 12, 2016 (extended from May 15)

    • Camera-ready submission: Sunday, June 26, 2016 (extended from June 19)