Definition & Analysis of the Information

The initial step of any data mining and analysis process consists in checking data consistency, highlighting outlying data, and correcting it.

“Even the best method in the scientific world cannot produce accurate results if the input data are not accurate.”

In addition to the classical data integration process (format, data type, business rules), ibs Analytics includes a dedicated data analysis module focused on outlier detection and correction.

This step of the complete data mining process is sometimes underestimated, yet it is critical: data integrity is the most important prerequisite for an efficient analysis. The Analytics Business Suite and the ibs Analytics engine deploy the cleaning engine developed by the Soft Solutions Analytics research lab, which relies on advanced data mining techniques to detect outlying data and on local inference to correct it, thereby ensuring data consistency for the subsequent steps of the data mining process (forecast, optimization…).

ON THE IMPACT OF DEVIANT INFORMATION (A.K.A. OUTLIERS)

Outlying data (referred to as outliers) are observations that differ significantly from the rest of the dataset.
This notion has been widely discussed in the literature since the rise of data mining. The definition is admittedly subjective, but the presence of an outlier within a dataset can completely bias the analysis extracted from it. An outlier is not necessarily an error in the data.

It only means that the observation differs significantly from the others and therefore deserves specific treatment.

In the example above, which shows the weekly sales of an item within a store, a single outlier completely biases the sales trend model. The "logical evolution" (excluding the outlier) is a 22% increase over the period; when the outlier is included, the model instead shows a 15% decrease.

A bias of this magnitude can radically change any analysis built on the data, which makes the need for identifying outliers obvious.
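To illustrate this effect, the sketch below uses a hypothetical 12-week sales series (not the data from the figure above): a clean series growing linearly, and the same series with a single injected spike. Fitting a least-squares trend to both shows how one outlier can flip an increasing trend into a decreasing one.

```python
# Illustration only: hypothetical weekly sales, not the data shown in the figure above.
import numpy as np

weeks = np.arange(12)
clean = 100.0 + 2.0 * weeks        # steady growth: 100 -> 122 over the period (+22%)
corrupted = clean.copy()
corrupted[1] = 320.0               # one spurious spike early in the history

def trend_evolution(y):
    """Relative change implied by a least-squares linear trend over the period."""
    slope, intercept = np.polyfit(np.arange(len(y)), y, 1)
    start = intercept
    end = intercept + slope * (len(y) - 1)
    return (end - start) / start

print(f"clean series:      {trend_evolution(clean):+.0%}")
print(f"with one outlier:  {trend_evolution(corrupted):+.0%}")
# The clean series yields a clear increase; the single spike turns the
# fitted trend into a decrease, biasing every analysis built on top of it.
```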

DETECTING AND CORRECTING THE OUTLIERS

To identify outliers efficiently, the ibs Analytics engine implements a dedicated approach. Static statistical techniques do not allow efficient detection, since they cannot adapt to every situation (an item's sales profile can change widely over the historical period). A dynamic approach is therefore mandatory to achieve efficient identification.

The approach implemented by ibs Analytics is based on the density of the neighbourhood of observations, inspired by LOF [Breunig et al. 2000].

It has been adapted to the specificities of sales data and provides an optimised outlier detection process, able to handle situations where an item's sales profile changes over time.
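For readers who want to experiment with the underlying idea, the sketch below applies the generic LOF algorithm (as available in scikit-learn) to a hypothetical univariate sales history. It illustrates the density-based principle only; it is not the adapted detection process embedded in ibs Analytics.

```python
# Generic LOF sketch (Breunig et al. 2000) on hypothetical weekly sales;
# the ibs Analytics adaptation to changing sales profiles is not reproduced here.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

sales = np.array([100, 104, 99, 107, 111, 320, 113, 118, 116, 121, 125, 123], dtype=float)

# LOF compares the density around each observation with the density around its
# neighbours; observations lying in much sparser regions are labelled -1.
lof = LocalOutlierFactor(n_neighbors=4)
labels = lof.fit_predict(sales.reshape(-1, 1))   # single feature: the sales level

print("weeks flagged as outliers:", np.where(labels == -1)[0])
```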

Based on this detection, the correction step increases the quality of the input data. The table aside shows the quality of the corrections obtained under real conditions.
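The correction model itself is not detailed in this document. As a purely illustrative sketch of local inference, one simple scheme is to replace each flagged observation with the median of the surrounding, non-flagged weeks:

```python
# Hypothetical correction sketch: local-median imputation of flagged weeks.
# This illustrates the idea of local inference, not the actual cleaning engine.
import numpy as np

def correct_outliers(series, outlier_idx, window=3):
    """Replace flagged observations with the median of nearby non-flagged weeks."""
    corrected = np.asarray(series, dtype=float).copy()
    flagged = {int(i) for i in outlier_idx}
    for i in flagged:
        lo, hi = max(0, i - window), min(len(corrected), i + window + 1)
        neighbours = [corrected[j] for j in range(lo, hi) if j not in flagged]
        if neighbours:                      # keep the value if no clean neighbour exists
            corrected[i] = float(np.median(neighbours))
    return corrected

# Continuing the detection sketch above:
# corrected_sales = correct_outliers(sales, np.where(labels == -1)[0])
```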

REFERENCES

The papers below are well-known references from the literature that have guided the design of the Soft Solutions data cleaning solutions:

  • LOF: Identifying Density-Based Local Outliers. Breunig et al., 2000.
  • LOCI: Fast Outlier Detection Using the Local Correlation Integral. Papadimitriou et al., 2002.
  • LOADED: Link-based Outlier and Anomaly Detection in Evolving Data Sets. Ghoting et al., 2004.
  • A Comparison of Outlier Detection Algorithms for Machine Learning. Escalante et al., 2005.
  • A Review of Statistical Outlier Methods. Walfish, 2007.