Egeria Mining


Project Description:

    This project applies image mining to Egeria detection. The goal is to reliably identify the areas of an image that contain Egeria by means of data mining. Egeria is an exotic submerged aquatic weed that is causing navigation and reservoir problems in the Sacramento-San Joaquin Delta of Northern California. The Egeria detection problem is difficult and has so far resisted automatic approaches. Problems such as sun glint, shadows, dark water, and grainy images make automatic detection difficult. The images under study are scan-digitized aerial imagery taken under various atmospheric conditions, which makes effective recognition and detection a hard task. Moreover, because occurrences of Egeria vary in nature, good generalization is difficult. Egeria Mining is part of a larger project conducted by Dr. Patricia Foschi, Professor of Geography at San Francisco State University, and her group at the Romberg Tiburon Center for Environmental Studies (RTC) at San Francisco State University. They are conducting research to study and estimate the areal extent of Egeria. The details of the project can be found at their project web site.

Project Members at Computer Science & Engineering, Arizona State University :

Dr. Huan Liu   
Narasimha Deepak Kolippakkam
Amit Mandvikar
Jigar Mody


Project Tasks:

    To detect Egeria automatically, we have to accomplish the following tasks: feature selection, feature extraction, evaluation of the performance of these features using certain evaluation criteria, and finally learning and classification using data mining techniques. These tasks can be briefly described as follows:

 

Task 1 - Feature selection

1.      Study the images and identify the problem regions.

2.      Select, from the many possible features, those which describe Egeria.

3.      Focus on a few features, such as color and texture, with specific definitions.

 

Task 2 - Feature extraction

1.      Extract features automatically from images.

2.      For color feature extraction, we try a range of intensities that corresponds to Egeria's spectrum.

3.      For texture features, we take texture templates of 10x10 block size in our initial trials. The whole image is then divided into blocks of the same size, and template matching is performed by comparing the histograms of the blocks with those of the templates.

4.      An interactive program is run for resolving the dubious regions of Egeria after combining the features. 
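
The color and texture steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the image is assumed to be a grayscale grid of 8-bit intensities stored as nested lists, the function names, the number of histogram bins, the histogram-intersection similarity, and the 0.8 threshold are all hypothetical choices made for the example.

```python
def color_mask(image, low, high):
    """Mark pixels whose intensity lies in an assumed Egeria range [low, high]."""
    return [[low <= px <= high for px in row] for row in image]


def block_histogram(image, top, left, size, bins=8, max_value=256):
    """Normalized intensity histogram of one size x size block."""
    counts = [0] * bins
    for r in range(top, top + size):
        for c in range(left, left + size):
            counts[image[r][c] * bins // max_value] += 1
    total = size * size
    return [n / total for n in counts]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means the normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def match_blocks(image, template_hist, size=10, threshold=0.8):
    """Return (top, left) corners of non-overlapping blocks that resemble the template."""
    rows, cols = len(image), len(image[0])
    matches = []
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            hist = block_histogram(image, top, left, size)
            if histogram_intersection(hist, template_hist) >= threshold:
                matches.append((top, left))
    return matches


if __name__ == "__main__":
    # Synthetic 20x20 image: one 10x10 block of intensity 100, the rest 200.
    image = [[100 if r < 10 and c < 10 else 200 for c in range(20)]
             for r in range(20)]
    template = block_histogram(image, 0, 0, 10)   # template taken from a known block
    print(match_blocks(image, template))
    print(sum(map(sum, color_mask(image, 90, 110))), "pixels in the color range")
```

Comparing histograms rather than raw pixels makes the match insensitive to where the texture sits inside a block, which is presumably why histograms are used for the template match.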

 

Task 3 - Measuring these features (goodness of the extraction) by Evaluation criteria

    The confusion matrix is a table which describes all the possible outcomes of a prediction. These are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN), where

·         TP are those retrieved values which are correct,

·         TN are those values which are incorrect and have correctly not been retrieved,

·         FP are values which are actually incorrect, but have been retrieved (these are false alarms),

·         FN are values which were supposed to be picked up, but were missed.
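
As a small illustration of these four outcomes, the counts can be tallied by comparing a predicted labeling with the ground truth. The sketch below assumes each block (or pixel) carries a boolean label, True meaning "Egeria"; the function name and the data are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP and FN for two equal-length sequences of booleans."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["TP"] += 1      # retrieved and correct
        elif not p and not a:
            counts["TN"] += 1      # correctly not retrieved
        elif p and not a:
            counts["FP"] += 1      # false alarm
        else:
            counts["FN"] += 1      # missed
    return counts


if __name__ == "__main__":
    predicted = [True, True, False, False, True]
    actual = [True, False, False, True, True]
    print(confusion_counts(predicted, actual))
```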

    Evaluation criteria are defined as follows:

1. Precision (P): It is defined as the fraction of the retrieved information which is relevant.

                P = (TP) / (TP + FP)

2. Recall (R): It is defined as the fraction of all relevant information which is retrieved.

                R = (TP) / (TP + FN)

3. F Measure (F): It is defined as the harmonic mean of precision and recall.

                F = (2*P*R) / (P + R)

4. Generality: It is defined as the ability of the system to learn from the training images and later be applied to unseen images without significant loss in classification performance.

5. Scalability: This is a measure which tells whether a procedure can be scaled up for larger images as efficiently as for smaller ones.
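
The three numeric criteria (P, R and F) follow directly from the confusion-matrix counts. A minimal sketch is given below; the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example, not something specified by the project.

```python
def precision_recall_f(tp, fp, fn):
    """Return (P, R, F) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # P = TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0      # R = TP / (TP + FN)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Example: 8 correct detections, 2 false alarms, 8 misses.
    p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because F is a harmonic mean, it stays low whenever either precision or recall is low, so a detector cannot score well by retrieving everything or almost nothing.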

 

Task 4 - Learning using Decision trees and Naive Bayesian Classifier

1.      Extract all features from a "training" image and then use the results as the basis for the classifiers.

2.      Test the learned classifiers on a new image; its different areas are classified automatically by the classifiers.

3.      Combine the results given by these classifiers and repeat.

4.      An interactive program is run for resolving the dubious regions of Egeria after combining the features.
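
The learn, test, and combine loop above might look roughly like the sketch below. It rests on several assumptions that are not taken from the project: each block is described by two hypothetical features (mean intensity and a texture match score), a one-level decision stump stands in for a full decision tree, the Naive Bayesian classifier uses Gaussian likelihoods, and blocks on which the two classifiers disagree are marked "dubious" so that they can be passed to the interactive step.

```python
import math


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class (an assumed variant)."""

    def fit(self, X, y):
        self.priors, self.stats = {}, {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.priors[label] = len(rows) / len(X)
            self.stats[label] = []
            for col in zip(*rows):
                mean = sum(col) / len(col)
                # Small constant keeps the variance strictly positive.
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
                self.stats[label].append((mean, var))
        return self

    def predict(self, x):
        def log_posterior(label):
            score = math.log(self.priors[label])
            for v, (mean, var) in zip(x, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            return score
        return max(self.stats, key=log_posterior)


class DecisionStump:
    """One-level decision tree: a single threshold on a single feature."""

    def fit(self, X, y):
        best = None
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for above in (True, False):
                    preds = [(x[f] > thr) == above for x in X]
                    errors = sum(p != t for p, t in zip(preds, y))
                    if best is None or errors < best[0]:
                        best = (errors, f, thr, above)
        _, self.feature, self.threshold, self.above = best
        return self

    def predict(self, x):
        return (x[self.feature] > self.threshold) == self.above


def classify_blocks(classifiers, blocks):
    """Return True/False where all classifiers agree, otherwise 'dubious'."""
    results = []
    for x in blocks:
        votes = {clf.predict(x) for clf in classifiers}
        results.append(votes.pop() if len(votes) == 1 else "dubious")
    return results


if __name__ == "__main__":
    # Hypothetical training blocks: [mean intensity, texture match score].
    X = [[100, 0.9], [105, 0.8], [110, 0.85], [200, 0.2], [190, 0.1], [210, 0.15]]
    y = [True, True, True, False, False, False]   # True = Egeria
    models = [GaussianNaiveBayes().fit(X, y), DecisionStump().fit(X, y)]
    print(classify_blocks(models, [[102, 0.88], [205, 0.12], [108, 0.15]]))
```

Keeping disagreements as a separate "dubious" label, rather than forcing a vote, leaves the uncertain regions for a person to resolve interactively.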

 


Experimental results:

Note: All the images in the results pages are in JPEG format; the actual experiments were carried out on the original TIF images. JPEG is used on the web pages because of space constraints.


    Incremental Active Learning Package: (submitted as addendum with Amit Mandvikar's Thesis) (Updated on 1st November 2003)

The zip file is available here.


    Incremental Active Learning results with new training image (1000-b05-2m-gc-sub.tif): (Updated on 1st May 2003)

The result images are here, and the tabular results are here.


    Results obtained by ICLASS by using the new training image (1000-b05-2m-gc-sub.tif): (Updated on 1st May 2003)

            The result images (JPEG) for ICLASS are here.


    Results for improved ICLASS: (Updated October 31st 2002)

Tabular results for Non-overlapping blocks Vs Sliding windows: Expt A.

Tabular results for Sliding Windows Vs Sliding windows with Rule1 from Association rules: Expt B.

Tabular results for Sliding Windows Vs Sliding windows with Rule2 from Association rules: Expt C.


    Results for the new set of images: (Updated July 25th 2002)

            Tabular results for Experiments #4 and #5.


    Results for preliminary stages of Active Learning: (Updated April 15th 2002)

            Tabular results for Experiment #3.


    Results for combined features for training image : (Updated April 7th 2002)

Tabular results for Experiment #2.


    Results for combined features for training image : (Updated April 3rd 2002)

           Tabular results for Experiment #1.


Publications: (Updated December 25, 2003)

  1. Class-Specific Ensembles for Active Learning in Digital Imagery (pdf) - Amit Mandvikar and Huan Liu. Accepted as a student paper for the SIAM International Conference on Data Mining. Florida, 2004.
  2. An Active Learning Approach to Egeria densa detection in Digital Imagery (pdf) - Huan Liu, Amit Mandvikar, and Patricia Foschi. Accepted as a book chapter in the book New Generation of Data Mining Applications, Editors: J. Zurada and M. Kantardzic, Wiley Publishers, 2003.
  3. Feature Selection via Learning for Image Data, accepted for IMMCN 2003, Cary, NC, by Deepak Kolippakkam, Huan Liu, Patricia Foschi.
  4. Active Learning with Ensembles for Image Classification (ps) - (Dr. Huan Liu, Amit Mandvikar, Dr. Patricia Foschi and Dr. Kari Torkkola). Accepted as poster presentation for the International Joint Conference on Artificial Intelligence (IJCAI 2003) to be held in Acapulco, Mexico. August 9 to 15, 2003. The entire paper can be found here.
  5. Feature Extraction for Image Mining (pdf) (doc) - (Dr. Patricia G. Foschi, Deepak Kolippakkam, Dr. Huan Liu and Amit Mandvikar) Accepted as a short paper for the International Workshop on Multimedia Information Systems (MIS 2002), held at Tempe, Arizona. Oct 30 to Nov 1, 2002. Presentation slides are here.
  6. Active learning for classifying a spectrally variable subject - (Dr. Patricia G. Foschi and Dr. Huan Liu), 2002. In Proceedings of the 2nd Pattern Recognition for Remote Sensing Workshop (PRRS2002), 16 August 2002, Niagara Falls, Canada, pp. 115-124.

Presentations and Reports:

Simulation for Image Mining system: (Updated January 28th 2003) 

PPT slides are here; the slide show is here.

Project reports: (Updated May 22nd 2002)

A Report which explains the work done for the project for the period Jan 2002 - May 2002 (Spring) is given here. This deals with the Image Processing and the Feature Selection and Extraction aspects of the project.

A Report explaining the Active Learning part of the project by Amit Mandvikar is given here.


Related research papers:

The following is a list of papers consulted for this project. The list will be updated as more papers are found.

Pattern Recognition and Feature Extraction related papers:

1.      M. Antonie, O. Zaiane, and A. Coman. Application of data mining techniques for medical image classification. In Proceedings of Second International Workshop for Multimedia Data Mining (MDM/KDD'2001) in conjunction with ACM SIGKDD conference, pages 94-101, 2001.

2.      A. Natsev, R. Rastogi, and K. Shim. Walrus: A similarity retrieval algorithm for image databases. In SIGMOD, pages 394–406, 1999.

3.      Y. Rui, T. S. Huang, and S. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10(4):39–62, 1999.

Active Learning related papers:

1.      N. Abe and H. Mamitsuka. Query learning using boosting and bagging. In Proceeding of the 15th International Conference on Machine Learning, pages 1–10, 1998.

2.      A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning, pages 19–26. Morgan Kaufmann, San Francisco, CA, 2001.

3.      A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the workshop on Computational Learning Theory. Morgan Kaufmann, 1998.

4.      L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.

5.      L. Breiman. Random forests. Technical report, Statistics Department, University of California Berkeley, 2001.

6.      I. Cohen, F. Cozman, and A. Bronstein. The effect of unlabeled data on generative classifiers, with application to model selection. Technical report, Beckman Institute, University of Illinois at Urbana Champaign, 2002.

7.      D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

8.      D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

9.      R. Duin and D. Tax. Experiments with classifier combining rules. In Multiple Classifier Systems, First International Workshop, MCS 2000, volume 1857, pages 16–29. Springer, 2000.

10.  Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.

11.  Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

12.  S. Goldman and Y. Zhou. Enhancing supervised learning with unlabeled data. In Proceedings of the 17th International Conference on Machine Learning, pages 327–334, 2000.

13.  R. Greiner, A. Grove, and D. Roth. Learning active classifiers. In International Conference on Machine Learning, pages 207–215, 1996.

14.  D. Hakkani-Tur, G. Riccardi, and A. Gorin. Active learning for automatic speech recognition. In International Conference on Acoustics Speech and Signal Processing 2002.

15.  V. Iyengar, C. Apte, and T. Zhang. Active learning using adaptive re-sampling. In KDD, pages 92–98, 2000.

16.  D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th International Conference on Machine Learning (ICML-94), pages 148–156, 1994.

17.  D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of ACM SIGIR, International Conference on Information Retrieval, pages 13–19, 1994.

18.  A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, pages 359–367, 1998.

19.  T. Mitchell. The role of unlabeled data in supervised learning. In Proceedings of the Sixth International Colloquium on Cognitive Science, Spain. 1999.

20.  I. Muslea, S. Minton, and C. Knoblock. Selective sampling with redundant views. In Proceedings of the National Conference on Artificial Intelligence, pages 621–626, 2000.

21.  I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.

22.  K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of Conference on Information and Knowledge Management, pages 86–93, 2000.

23.  K. Nigam and A. McCallum. Pool-based active learning for text classification. In Proceedings of Conference on Automated Learning and Discovery, 1998.

24.  K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

25.  L. Paletta and A. Pinz. Active object recognition by view integration and reinforcement learning. Robotics and Autonomous Systems, 1/2(31):1–18, 2000.

26.  T. RayChaudhari and L. Hamey. An algorithm for active data collection for learning - feasibility study with neural networks. Technical report, Macquarie University, Department of Computing, 1995.

27.  M. Saar-Tsechansky and F. Provost. Active learning for class probability estimation. In Proceedings of International Joint Conference on AI, pages 911–920, 2001.

28.  M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation. In Proceedings of Machine Learning, 2002.

29.  G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, pages 839–846. Morgan Kaufmann, San Francisco, 2000.

30.  H. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th ACM workshop on Computational Learning Theory (COLT-92), pages 287–294, 1992.



Web masters : Narasimha Deepak Kolippakkam , Amit Mandvikar (ASU)