Abstract
Data preprocessing in general, and data reduction in particular, are key steps in data mining because real-world
datasets are so large that analysis can take a long time to complete. Almost all mining techniques, including
classification, clustering, association, and others, have high time and space complexity owing to both the huge
amount of data and the behavior of the algorithms themselves. For this reason, data reduction is an important phase
of the Knowledge Discovery in Databases (KDD) process. Many researchers have introduced important solutions in this
field. This paper presents a comparative study of about 22 research papers in the data reduction field, covering
techniques such as dimensionality reduction, numerosity reduction, sampling, clustering, data cube aggregation, and
others. The study concludes that the appropriate data reduction technique depends heavily on the data type, the
dataset size, the application goal, the presence of noise and outliers, and the trade-off between how much the data
is reduced and the knowledge required from the analysis.
Keywords
Data Mining
Data Preprocessing
Data Reduction
Dimensionality Reduction