Definition & Analysis of the Information
The first step of any information mining and analysis is to check data consistency, highlight outlying data and correct them.
“the best method in the science world cannot get accurate results if the input data are not”
ibs Analytics includes, in addition to the classical data integration process (format, data type, business rules), a dedicated
module for information analysis focused on outlier detection and correction.
This step of the complete information mining process is sometimes underestimated, yet it is critical: the integrity of
the information is the most important prerequisite for an efficient analysis. The Analytics Business Suite and the ibs
Analytics engine deploy the cleaning engine developed by the Soft Solutions Analytics research lab. It relies on advanced
data mining techniques to detect outlying data and on local inference to correct them, and therefore ensures data
consistency for the further steps of the information mining process (forecast, optimization…).
ON THE IMPACT OF DEVIANT INFORMATION (A.K.A. OUTLIERS)
Outlying data (referred to as outliers) are observations that differ significantly from the others.
This notion has been widely discussed in the literature since the rise of data mining. The definition is
quite subjective, but a single outlier within a dataset can completely bias the analysis extracted from it. An outlier is
not necessarily an error in the data: the term only means that the observation is significantly different from the
others and deserves specific treatment.
In the above example, which shows the weekly sales level of an item within a store, a single outlier completely biases
the sales trend model. From a "logical evolution" (excluding the outlier) of a 22% increase over the period, the model changes
to a 15% decrease when the outlier is considered.
Such a bias can substantially change any analysis based on the model, and the need to identify outliers is therefore obvious.
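The effect described above can be reproduced with a minimal sketch. The weekly figures are hypothetical (they are not the data behind the example), and the trend model is assumed here to be an ordinary least-squares line, which the text does not specify. The names `linear_trend` and `trend_change` are illustrative only.

```python
def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_change(values):
    """Relative change of the fitted line between the first and the last week."""
    slope, intercept = linear_trend(values)
    start = intercept
    end = intercept + slope * (len(values) - 1)
    return (end - start) / start


# Hypothetical weekly sales with a steady upward trend (assumption, for illustration).
clean = [100, 102, 104, 106, 108, 110, 112, 114, 116, 118]

# Same series with a single deviant week early in the period.
with_outlier = list(clean)
with_outlier[1] = 400

print(f"trend without outlier: {trend_change(clean):+.0%}")
print(f"trend with outlier:    {trend_change(with_outlier):+.0%}")
```

With these figures the fitted trend is an increase on the clean series and turns into a decrease once the single deviant week is included, which is the kind of reversal the example illustrates.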
DETECTING AND CORRECTING THE OUTLIERS
In order to identify outliers efficiently, the ibs Analytics engine implements a specific approach. Statistical
techniques do not allow efficient detection, since they cannot adapt to all situations (an item's sales profile can
change widely over the historical period). A dynamic approach is therefore mandatory to achieve efficient identification.
The one implemented by ibs Analytics is based on the density of the neighbourhood of each observation, inspired by [LOF
- Breunig & al. 2000].
It has been adapted to the specifics of sales information and allows an optimised outlier detection process, able to
handle situations in which an item's sales profile changes.
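As an illustration of the density-based idea, the sketch below implements the textbook LOF score of Breunig & al. and compares it with a global z-score. It is not the adapted variant used by ibs Analytics, whose details are not given here. Assumptions: the sales figures are hypothetical, each point uses exactly its k nearest neighbours (ties are cut arbitrarily), only the sales value is used as a feature (week order is ignored), and `lof_scores` is an illustrative name.

```python
import math
from statistics import mean, pstdev


def lof_scores(points, k):
    """Local Outlier Factor of each point; values well above 1 suggest an outlier."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point and the distance to the k-th one.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        # The small floor avoids a division by zero on duplicated points.
        return 1.0 / max(sum(reach) / k, 1e-12)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF = average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical sales: the item moves from about 100 to about 500 units per week,
# with one isolated week at 300 in between.
sales = [100, 102, 98, 101, 99, 103, 300, 500, 505, 495, 502, 498, 510]
suspect = sales.index(300)

# Global z-score: 300 sits near the overall mean, so it looks perfectly normal.
mu, sigma = mean(sales), pstdev(sales)
z_scores = [(v - mu) / sigma for v in sales]

# LOF: 300 lies in a sparse region compared with its neighbours in both regimes.
scores = lof_scores([(v,) for v in sales], k=3)

print("z-score of the suspect week:", z_scores[suspect])
print("LOF of the suspect week:    ", scores[suspect])
print("largest LOF elsewhere:      ", max(s for i, s in enumerate(scores) if i != suspect))
```

Because the score compares each observation with the density around its own neighbours, the isolated week stands out even though a single global threshold on the whole history would miss it.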
Based on this detection, the correction step increases the quality of the input data. The table aside shows the quality
of the corrections in real conditions.
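The correction step can be pictured with a minimal sketch. The text only states that correction relies on local inference. Replacing a flagged value by the median of its nearest unflagged neighbours in time is an assumption made here for illustration, as are the function name `correct_outliers`, the `window` parameter and the sample figures.

```python
from statistics import median


def correct_outliers(series, flagged, window=2):
    """Replace each flagged value by the median of nearby unflagged values."""
    flagged = set(flagged)
    corrected = list(series)
    for i in sorted(flagged):
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        context = [series[j] for j in range(lo, hi) if j not in flagged]
        if context:  # leave the value untouched when no clean neighbour exists
            corrected[i] = median(context)
    return corrected


# Hypothetical weekly sales; week 3 is assumed to have been flagged by the detection step.
weekly_sales = [100, 102, 98, 400, 101, 99, 103]
print(correct_outliers(weekly_sales, flagged=[3]))
```

Using only unflagged neighbours keeps one outlier from contaminating the correction of another that sits close to it.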
REFERENCES
The papers below are well-known references from the literature that have driven the design of the Soft Solutions data cleaning solutions:
- LOF: Identifying Density-Based Local Outliers. Breunig & al. 2000
- LOCI: Fast Outlier Detection Using the Local Correlation Integral. Papadimitriou & al. 2002
- LOADED: Link-based Outlier and Anomaly Detection in Evolving Data Sets. Ghoting & al. 2004
- A Comparison of Outlier Detection Algorithms for Machine Learning. Escalante & al. 2005
- A Review of Statistical Outlier Methods. Walfish. 2007