Project title: Learning from imbalanced data using supervised classification methods
Dr. Cen Li

Much of the data collected from scientific and cognitive experiments, and from objects encountered in daily life (e.g., items for sale, preferred electronic games, popular websites), is labeled: each object or experiment has a class label attached. The class label may indicate that an experiment succeeded or failed, that a website is popular or not, that an email is junk or not, or that the sales volume of a computer game falls in the 200K-300K range. Given this type of data, supervised classification methods may be applied to learn patterns and rules automatically from the data. The learned rules describe the characteristics of a given class in terms of object features. Once these rules are learned, they may be used to classify future objects, i.e., to determine the class to which each object belongs.

Much research has been done on developing supervised classification approaches, including decision tree classification, naïve Bayes classification, support vector machines, and neural networks. The goal of these approaches is to derive rules that maximally differentiate objects of different classes, and to assign new objects to their classes with high accuracy.

One problem associated with supervised classification is that these methods often perform poorly when the available data is imbalanced, i.e., when objects with one class label dominate the data. With this type of data, classification accuracy, particularly on the under-represented class, can decrease dramatically.
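
One way to see the effect is to consider a trivial classifier that always predicts the dominating class. The sketch below uses a hypothetical data set of 95 non-junk and 5 junk emails (the counts and the helper names `majority_baseline` and `per_class_accuracy` are illustrative assumptions, not part of the project). It shows that overall accuracy can hide a complete failure on the under-represented class.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label, i.e. what a trivial classifier always predicts."""
    return Counter(labels).most_common(1)[0][0]

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified objects, computed separately for each class."""
    correct = Counter()
    total = Counter()
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}

# Hypothetical imbalanced data: 95 non-junk emails, 5 junk emails.
labels = ["non-junk"] * 95 + ["junk"] * 5
predicted = [majority_baseline(labels)] * len(labels)

overall = sum(t == p for t, p in zip(labels, predicted)) / len(labels)
by_class = per_class_accuracy(labels, predicted)

print(overall)   # 0.95
print(by_class)  # {'non-junk': 1.0, 'junk': 0.0}
```

For this reason, results on imbalanced data are usually easier to interpret when accuracy is reported per class rather than as a single overall figure.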

Some research has been done to remedy this problem, for example by over-sampling the under-represented class, by under-sampling the dominating class, or by synthetically creating data for the under-represented class.
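
The three remedies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular published approach: each sample is assumed to be a `(features, label)` pair with numeric features, the function names are hypothetical, and the synthetic step simply interpolates between two randomly chosen objects of the under-represented class (published techniques such as SMOTE interpolate toward nearest neighbors instead).

```python
import random
from collections import defaultdict

def group_by_class(samples):
    """Group (features, label) pairs by their class label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups

def over_sample(samples, rng):
    """Duplicate random objects of smaller classes until all classes match the largest."""
    groups = group_by_class(samples)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced

def under_sample(samples, rng):
    """Keep a random subset of larger classes so all classes match the smallest."""
    groups = group_by_class(samples)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))
    return balanced

def synthesize(minority_features, n_new, rng):
    """Create n_new feature vectors by interpolating between random pairs.

    Assumes at least two objects in minority_features.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_features, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical data: 8 objects of the dominating class, 2 of the under-represented class.
rng = random.Random(0)
data = [([float(i), 0.0], "majority") for i in range(8)]
data += [([0.5, 1.0], "minority"), ([1.5, 2.0], "minority")]

over = over_sample(data, rng)    # 8 majority + 8 minority objects
under = under_sample(data, rng)  # 2 majority + 2 minority objects
new_points = synthesize([[0.5, 1.0], [1.5, 2.0]], 3, rng)
```

Over-sampling keeps all of the original data but repeats objects, under-sampling discards data from the dominating class, and synthetic generation adds new objects that were never observed; which trade-off is acceptable depends on the data set.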

This research project will investigate new approaches that incorporate ideas from classifier ensembles to improve classification accuracy when learning from imbalanced data.

Programming requirement: C++ and any scripting language.