Introduction to Data Mining


Lesson 2.2

Data Preparation

 

    Data preparation is one of the most important tasks in data mining, and also one of the most time-consuming. The time required usually depends on the size of the data involved. A dataset can be large in two respects: high dimensionality (many attributes) or a large number of instances. High dimensionality usually increases preparation time more than a large number of instances does.

Other problems associated with data preparation are:

• Missing data.
• Outliers.
• Erroneous data (inconsistent, misreported or distorted).
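
As a rough illustration of how such problems can be surfaced before mining, the sketch below counts missing and out-of-range values per attribute. The function name `summarize_problems`, the sample rows and the valid ranges are all hypothetical and chosen only for this example; in practice the ranges would come from knowledge of the domain.

```python
import math


def summarize_problems(rows, valid_ranges):
    """Count missing and out-of-range values per attribute.

    rows: list of dicts mapping attribute name -> value (None or NaN = missing).
    valid_ranges: dict mapping attribute name -> (low, high); a hypothetical
    domain constraint supplied by the analyst.
    """
    report = {name: {"missing": 0, "out_of_range": 0} for name in valid_ranges}
    for row in rows:
        for name, (low, high) in valid_ranges.items():
            value = row.get(name)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                report[name]["missing"] += 1
            elif not (low <= value <= high):
                # Present but impossible for this attribute: likely erroneous.
                report[name]["out_of_range"] += 1
    return report


# Made-up sample data, for illustration only.
rows = [
    {"age": 34, "height_cm": 172.0},
    {"age": None, "height_cm": 168.5},
    {"age": 29, "height_cm": float("nan")},
    {"age": -5, "height_cm": 1750.0},
]
report = summarize_problems(rows, {"age": (0, 120), "height_cm": (50.0, 250.0)})
print(report)
```

A report like this only locates the problems. Whether to drop, correct or fill in the affected values remains a separate decision that depends on the dataset.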

    Data preparation is also required when the data to be processed is in a raw format, e.g. pixel data for images. Such data should be converted into a format that the data mining algorithms can process.
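
One simple form of such a conversion, sketched here under the assumption of a small 8-bit grayscale image stored as a list of rows, is to flatten the pixel grid into a single row of numeric features. The function name `image_to_feature_vector` and the sample image are hypothetical; real image mining pipelines often extract richer features than raw intensities.

```python
def image_to_feature_vector(pixels, max_value=255):
    """Flatten a 2-D grid of grayscale pixels into one list of features in [0, 1]."""
    if not pixels or any(len(row) != len(pixels[0]) for row in pixels):
        raise ValueError("expected a non-empty rectangular pixel grid")
    # Row-major order: all of the first row, then all of the second, and so on.
    return [value / max_value for row in pixels for value in row]


# A made-up 2x2 image, for illustration only.
image = [
    [0, 255],
    [51, 102],
]
features = image_to_feature_vector(image)
print(features)
```

Each image then becomes one instance with a fixed number of attributes, which is the tabular shape most data mining algorithms expect.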

    Some of the Data Preparation methods discussed in detail here are:

• Data normalization (e.g. for image mining)
• Methods for sequential/temporal data
• Removing outliers
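
To give a first feel for two of these methods, the sketch below removes outliers and then normalizes what remains. The text above does not say which techniques the lesson uses, so this example assumes Tukey's interquartile-range rule for outliers and min-max scaling for normalization. The function names and sample readings are hypothetical.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, an assumed choice)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up readings with one obvious outlier, for illustration only.
readings = [10, 12, 11, 13, 12, 95, 11, 14]
cleaned = remove_outliers_iqr(readings)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```

The order matters in this sketch: normalizing before removing the value 95 would squeeze all the other readings into a narrow band near 0.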

 
