Microarrays are a technology for simultaneously profiling the expression levels of tens of thousands of genes in a patient sample. It is hoped that better diagnostic methods, better drugs, and a better understanding of disease mechanisms can be derived from a careful analysis of microarray measurements of gene expression profiles.
A single microarray experiment can measure the expression levels of tens of thousands of genes simultaneously. In other words, the microarray record of a patient sample, for example, is a record with tens of thousands of features or dimensions. This extremely high dimensionality causes two problems for many data mining and machine learning methods.
The first problem is efficiency: the running time of most data mining and machine learning methods grows very quickly with the number of dimensions. Most of these methods also suffer from the "curse of dimensionality": as the dimensionality of the samples increases, they typically require exponentially more training samples in order to uncover and learn how the various dimensions relate to the nature of the samples. The second problem is noise: small changes in the distribution of the data can change the end results of the experiments. Many pre-processing techniques can be applied to remove noisy values (usually referred to as outliers) and to deal with missing or inconsistent values.
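The kind of pre-processing mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `impute_missing` and `clip_outliers` are hypothetical, missing values are assumed to be stored as `None`, and mean imputation and clipping at the median plus or minus `k` median absolute deviations (MAD) are just two common choices among many.

```python
from statistics import mean, median


def impute_missing(samples):
    """Replace missing values (None) with the mean of the observed values of that gene."""
    n_genes = len(samples[0])
    filled = [list(row) for row in samples]
    for g in range(n_genes):
        # Assumption: every gene has at least one observed value.
        observed = [row[g] for row in samples if row[g] is not None]
        gene_mean = mean(observed)
        for row in filled:
            if row[g] is None:
                row[g] = gene_mean
    return filled


def clip_outliers(samples, k=3.0):
    """Clip each gene's values to the range median +/- k * MAD (illustrative rule)."""
    n_genes = len(samples[0])
    clipped = [list(row) for row in samples]
    for g in range(n_genes):
        column = [row[g] for row in samples]
        med = median(column)
        mad = median(abs(v - med) for v in column)
        low, high = med - k * mad, med + k * mad
        for row in clipped:
            row[g] = min(max(row[g], low), high)
    return clipped
```

Both functions take a list of samples, where each sample is a list of expression values in the same gene order, and return a new list without modifying the input. Imputation should run before clipping, because `clip_outliers` assumes no value is missing.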
| gene1 | gene2 | gene3 | gene4 | gene5 | ..... | class |
| --- | --- | --- | --- | --- | --- | --- |
| 8.589e+003 | 5.468e+002 | 4.263e+003 | 4.064e+003 | 1.997e+003 | ..... | positive |
| 3.825e+003 | 6.970e+002 | 5.369e+003 | 4.705e+003 | 1.166e+003 | ..... | positive |
| 5.271e+003 | 4.740e+003 | 3.318e+003 | 6.792e+003 | 2.632e+003 | ..... | positive |
| 7.126e+003 | 3.779e+003 | 3.705e+003 | 6.594e+003 | 2.460e+003 | ..... | negative |
| 4.913e+003 | 5.215e+003 | 4.288e+003 | 3.213e+003 | 3.147e+003 | ..... | negative |
Figure 1: A partial example of processed microarray measurement records of patient samples. Each row is one sample, each gene column holds the expression level of one gene, and the last column gives the class of the sample. Different microarray gene chips exist. The Affymetrix U95A Gene Chip can measure more than 12,000 genes (strictly speaking, probes).
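To make the record structure concrete, the following sketch reads the five rows of Figure 1 into feature vectors and class labels. It is a minimal illustration: the whitespace-separated text layout and the name `parse_records` are assumptions made here, and only the five gene columns shown in the figure are included (the elided columns are left out).

```python
# Rows copied from the five visible gene columns of Figure 1.
# The text layout (whitespace-separated, class label last) is an assumption.
FIGURE_ROWS = """\
8.589e+003 5.468e+002 4.263e+003 4.064e+003 1.997e+003 positive
3.825e+003 6.970e+002 5.369e+003 4.705e+003 1.166e+003 positive
5.271e+003 4.740e+003 3.318e+003 6.792e+003 2.632e+003 positive
7.126e+003 3.779e+003 3.705e+003 6.594e+003 2.460e+003 negative
4.913e+003 5.215e+003 4.288e+003 3.213e+003 3.147e+003 negative
"""


def parse_records(text):
    """Split each line into a list of float expression values and a class label."""
    features, labels = [], []
    for line in text.strip().splitlines():
        *values, label = line.split()
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels


features, labels = parse_records(FIGURE_ROWS)
print(len(features), "samples,", len(features[0]), "genes per sample")
print(labels)
```

In a real data set each feature vector would have thousands of entries rather than five, while the number of samples would stay in the tens or hundreds, which is the imbalance that makes this data hard to analyze.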
Typically, microarrays are used to measure gene expression levels in diseased tissues and in normal tissues; a study normally measures between 40 and 1000 tissue samples. Many challenging diseases are currently being studied using microarray gene expression data, including leukemia, colon cancer, and breast cancer.
Project 1: Data Pre-processing
Project 2: Association Mining