This package contains the data sets and code used in the experiments of the following paper:

Jing Gao, Wei Fan, Yizhou Sun, Jiawei Han, Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009, 339-347.

USAGE LICENSE:

Please note that the data and programs are released as is. The authors assume no responsibility for any potential problems. Permission to use the data and programs is granted for academic and educational use only. Please acknowledge the use of this package by citing the above paper if you publish materials based on the data and/or programs obtained from this package. The research areas of the authors (in folder data/dblp) were manually labeled by Jing Gao (jinggao3@gmail.com) and Yizhou Sun (yizhou.sun@gmail.com). The user may not redistribute any part of the package without separate permission.

CONTENT:

1) Folder "data":
   contains the three collections of data sets used in the experiments: Newsgroup, Cora and DBLP. There are six tasks in the Newsgroup collection, four tasks in Cora and one in DBLP. These tasks are summarized in Table 2 of the paper. For each task, we have two data files: mbase?.csv and label?.csv, where ? should be replaced by the task ID.
   In mbase?.csv, we have the outputs of four models on the target set, which provide the input to the CLSU algorithm. Each row corresponds to one data record, and each column represents the results of one model. The first two columns are results from two classification models, where the outputs are predicted class labels. The remaining two columns are results from two clustering models, where the outputs are cluster ids (not class labels!). Columns are separated by commas. The class labels are numbers from 1 to c, where c is the number of classes. The real meaning of the classes in each learning task can be found in Table 2 of the paper, where the classes are numbered in order. The two classification and two clustering models vary among data sets. The details of the models used in the experiments can be found in Section 4.1 of the paper.
   The corresponding true class labels are in file label?.csv. The CLSU algorithm is completely unsupervised, and thus the label file is only used for evaluation. 
   Under folder "dblp", we also include the author names and our manual labeling of their research areas in the file author_label, listed in the same order as in the data file. The authors are randomly sampled from the authors having entries in DBLP. We labeled their research areas by looking through their homepages (if available) and their publication lists. Because the four areas overlap, and research information for some authors cannot be found on the Internet, our judgement may be incorrect in some cases, and thus the labels may contain some errors.
   More details about the 20 Newsgroups and Cora data sets can be found at:
   http://people.csail.mit.edu/jrennie/20Newsgroups/
   http://www.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
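
   The file layout described above can be read with a few lines of code. The following is a minimal Python sketch, not part of the package (the released code is MATLAB). It assumes that the files have no header row, that label?.csv holds one integer label per row in its first column, and that the task ID is substituted directly into the file name (for example mbase1.csv). The function names load_task and classifier_accuracy are hypothetical and introduced here only for illustration.

```python
import csv
from pathlib import Path


def load_task(data_dir, task_id):
    """Load mbase<task_id>.csv and label<task_id>.csv from data_dir.

    Assumptions (not confirmed by the package): no header row, integer
    values, and one label per row in the first column of the label file.
    """
    data_dir = Path(data_dir)
    with open(data_dir / f"mbase{task_id}.csv", newline="") as f:
        # One row per record, one column per model.
        outputs = [[int(float(v)) for v in row] for row in csv.reader(f) if row]
    with open(data_dir / f"label{task_id}.csv", newline="") as f:
        labels = [int(float(row[0])) for row in csv.reader(f) if row]
    if len(outputs) != len(labels):
        raise ValueError("mbase and label files have different numbers of rows")
    return outputs, labels


def classifier_accuracy(outputs, labels, column):
    """Fraction of records where one classifier column equals the true label.

    Only meaningful for the classification columns (0 and 1 in the provided
    files). The clustering columns hold cluster ids, which cannot be
    compared with class labels directly.
    """
    correct = sum(1 for row, y in zip(outputs, labels) if row[column] == y)
    return correct / len(labels)
```

   For example, outputs, labels = load_task("data/dblp", 1) followed by classifier_accuracy(outputs, labels, 0) would give the accuracy of the first classification model, where both the folder path and the task ID are placeholders to be adjusted to the actual files.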

2) Folder "code":
   contains the MATLAB code for the proposed algorithm, CLSU. The main function is clsu in clsu.m, and the comments in clsu.m explain how to run the program. Note that the data sets we provide all have two classification models and two clustering models, but the program is general and can handle other numbers of models.
   
For questions regarding the data and programs in this package, please contact Jing Gao (jinggao3@gmail.com).
             
