This package contains the data sets and code used in the experiments of the following paper:

Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han, "Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models". Advances in Neural Information Processing Systems (NIPS), 22, 2009, 585-593.

USAGE LICENSE:

Please note that the data and programs are released as is. The authors assume no responsibility for any potential problems. Permission to use the data and programs is granted for academic and educational uses only. Please acknowledge the use of this package by citing the above paper if you publish materials based on the data and/or programs obtained from this package. The research areas of the authors (in folder data/dblp) were manually labeled by Jing Gao (jinggao3@gmail.com) and Yizhou Sun (yizhou.sun@gmail.com). The user may not redistribute any part of the package without separate permission.

CONTENT:

1) Folder "data":
   contains the three groups of data sets used in the experiments: Newsgroup, Cora and DBLP. There are six tasks in the Newsgroup data set, four tasks in Cora and one in DBLP. These tasks are summarized in Table 2 of the paper. For each task, we provide two data files: mbase?.csv and label?.csv, where ? should be replaced by the task ID.
   mbase?.csv contains the output of four models on the target set, which is the input to the BGCM algorithm. Each row corresponds to one data record, and each column holds the results of one model. The first two columns are the results of two classification models, whose outputs are predicted class labels. The remaining two columns are the results of two clustering models, whose outputs are cluster ids (not class labels!). Columns are separated by commas. The class labels are numbers from 1 to c, where c is the number of classes. The meaning of the classes in each learning task can be found in Table 2 of the paper, where the classes are numbered in order. The two classification models and two clustering models vary among data sets. The details of the models used in the experiments can be found in Section 5.1 of the paper.
   The corresponding true class labels are in the file label?.csv. The BGCM algorithm can work in an unsupervised manner, in which case the label file is used only for evaluation. BGCM can also work in a semi-supervised manner; a MATLAB program that randomly selects a small portion of the objects as labeled objects is included in the folder "code" and described in the following part. In this scenario, we suggest evaluating the algorithm on the remaining unlabeled objects.
   Under the folder "dblp", we also include the author names and our manual labeling of their research areas in the file author_label, listed in the same order as in the data file. The authors are randomly sampled from the authors having entries in DBLP. We labeled their research areas by looking through their homepages (if available) and/or their publication lists. Because the four areas overlap, and some authors' research information cannot be found on the Internet, our judgment may be incorrect in some cases, and thus the labels may contain some errors.
   More details about the 20 Newsgroups and Cora data sets can be found at:
   http://people.csail.mit.edu/jrennie/20Newsgroups/
   http://www.cs.umass.edu/~mccallum/data/cora-classify.tar.gz 
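
   The file layout described above can be illustrated with a short sketch. It is written in Python purely for illustration (the package's own code is MATLAB) and is not part of the package. The helper names read_int_rows, split_models and accuracy are hypothetical, the sample values are invented toy data, and the sketch assumes that label?.csv holds one label per row. It shows that only the two classifier columns can be compared directly with the true labels, because the cluster ids in the last two columns are not class labels.

```python
import csv
import io


def read_int_rows(handle):
    """Read comma-separated rows of integer ids from an open text file."""
    return [[int(float(v)) for v in row] for row in csv.reader(handle) if row]


def split_models(rows):
    """Split each 4-column row into (classifier outputs, clustering outputs)."""
    classifiers = [row[:2] for row in rows]   # predicted class labels, 1..c
    clusterings = [row[2:4] for row in rows]  # cluster ids, NOT class labels
    return classifiers, clusterings


def accuracy(predicted, truth):
    """Fraction of records whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Toy stand-ins for mbase?.csv and label?.csv (invented values, not package data).
mbase_text = "1,1,2,1\n1,2,2,1\n2,2,1,2\n2,2,1,2\n"
label_text = "1\n1\n2\n2\n"

rows = read_int_rows(io.StringIO(mbase_text))
# Assumption: label?.csv has a single column with one true label per record.
labels = [r[0] for r in read_int_rows(io.StringIO(label_text))]
classifiers, clusterings = split_models(rows)

# Only the two classifier columns can be scored directly against the labels.
acc_model1 = accuracy([c[0] for c in classifiers], labels)
acc_model2 = accuracy([c[1] for c in classifiers], labels)
print(acc_model1, acc_model2)
```

   To use real files, replace io.StringIO(...) with open("mbase1.csv", newline="") and open("label1.csv", newline=""), substituting the task ID of interest.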

2) Folder "code":
   contains the MATLAB code of the proposed algorithm (BGCM) and the stratified sampling code that assists semi-supervised consensus maximization.
   bgcm.m is the function that conducts consensus maximization. How to run the program is shown in the comments of bgcm.m. Note that the data sets we provide all have two classification models and two clustering models, but the program is general and can handle other numbers of models.
   ransel.m is the function that selects a subset of objects as labeled objects. It is NOT part of the proposed consensus maximization algorithm. It is only used when you wish to evaluate the semi-supervised version of the proposed algorithm.
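
   The semi-supervised evaluation protocol suggested above (select a small labeled subset, then evaluate on the remaining unlabeled objects) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a port of ransel.m and not part of the package: the function names stratified_select and accuracy_on_unlabeled, the fraction and seed parameters, and the rule of keeping at least one object per class are all hypothetical choices, and the values are invented toy data.

```python
import random
from collections import defaultdict


def stratified_select(labels, fraction, seed=0):
    """Pick roughly `fraction` of the indices from each class at random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        k = max(1, round(fraction * len(members)))  # assumption: at least one per class
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


def accuracy_on_unlabeled(predicted, truth, labeled_indices):
    """Accuracy computed only over objects that were not given as labeled."""
    labeled = set(labeled_indices)
    rest = [i for i in range(len(truth)) if i not in labeled]
    return sum(predicted[i] == truth[i] for i in rest) / len(rest)


# Toy labels and predictions (invented values, not package data).
truth = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
predicted = [1, 1, 2, 1, 2, 2, 2, 1, 2, 2]

labeled = stratified_select(truth, fraction=0.25, seed=0)
score = accuracy_on_unlabeled(predicted, truth, labeled)
print(labeled, score)
```

   Here `predicted` stands in for whatever consensus labels the algorithm produces; excluding the labeled indices keeps the objects whose labels were supplied as input from inflating the reported accuracy.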

For questions regarding the data and programs in this package, please contact Jing Gao (jinggao3@gmail.com).
             
