This package contains the data sets and part of the code used in the experiments of the following paper:

Jing Gao, Wei Fan, Jing Jiang, Jiawei Han, Knowledge Transfer via Multiple Model Local Structure Mapping. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'08), Las Vegas, NV, Aug. 2008, 283-291.

Please note that the data and programs are released as they are; the authors assume no responsibility for any potential problems. Permission to use the data and programs is granted for academic and educational use only. Please cite the above paper if you find the package useful.

CONTENT:

1) Folder "dataset":
   contains five groups of data sets used in the experiments. The data sets are all in SVM (libsvm sparse) format. All of the data sets are binary classification problems, i.e., there are two classes: positive (1) and negative (0). Please refer to http://www.csie.ntu.edu.tw/~cjlin/libsvm/ if you are not familiar with this data format.
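If you want to read this sparse format directly, each line is "<label> <index>:<value> ...". A minimal Python reader is sketched below; the function name is hypothetical, and integer labels are assumed based on the 0/1 classes described above:

```python
def parse_svm_line(line):
    """Parse one line of libsvm/SVM-light sparse format:
    '<label> <index>:<value> <index>:<value> ...'
    Returns the integer label and a dict of feature index -> value."""
    parts = line.strip().split()
    label = int(parts[0])
    features = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        features[int(idx)] = float(val)
    return label, features

# Example: a positive example with two non-zero features.
label, feats = parse_svm_line("1 3:0.5 7:1.2")
```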
   a) intrusion:
     dos.svm, probing.svm and r2l.svm are three sets of negative examples, representing three types of intrusions: DOS, Probing and R2L. normal.svm contains only positive examples, which are normal traffic data. The four data sets are extracted from the KDDCUP'99 intrusion detection data set, available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. We eliminate the discrete features and keep only the continuous features in the four data sets.
     To create transfer learning problems, you can generate three data sets as follows: 1) partition the normal data into three disjoint parts, and 2) merge each part of the normal data with one type of intrusion data. To make the data sets balanced and equal-sized, under-sampling can be conducted. The three generated data sets represent data from three domains. Three transfer learning tasks can then be created by training on two domains and testing on the remaining one.
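The construction above can be sketched as follows. The function and variable names are hypothetical, and under-sampling is implemented here as plain random sampling of the larger side:

```python
import random

def make_domains(normal, intrusions, seed=0):
    """Sketch of the domain construction described above (details assumed):
    partition the normal (positive) examples into len(intrusions) disjoint
    parts, pair each part with one intrusion (negative) set, and
    under-sample the larger side so each domain is balanced."""
    rng = random.Random(seed)
    normal = normal[:]
    rng.shuffle(normal)
    k = len(intrusions)
    # Disjoint, roughly equal-sized parts of the normal data.
    parts = [normal[i::k] for i in range(k)]
    domains = []
    for pos, neg in zip(parts, intrusions):
        n = min(len(pos), len(neg))          # balance via under-sampling
        domains.append((rng.sample(pos, n), rng.sample(neg, n)))
    return domains
```

Training on two of the returned domains and testing on the third then yields one transfer learning task.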
   b) newsgroup:
     We generate six cross-domain learning tasks from the 20 newsgroup data as described in reference [9] of the paper. Each task involves a top-category classification problem (e.g., comp vs. sci), and the training and test data are drawn from different sub-categories. We use the Matlab version of the newsgroup data obtained from http://people.csail.mit.edu/jrennie/20Newsgroups/. The data sets in this folder have binary feature values; in our experiments, we further process them into tf-idf representations.
   c) reuters:
     Similar to the newsgroup data sets, the reuters transfer learning data sets involve top-category classification problems, but the training and test data are drawn from different sub-categories. Based on the data sets obtained from the authors of reference [9], we transform the binary feature values into tf-idf representations using the Okapi BM25 formula.
   d) spam:
     The email spam data sets were released by the ECML/PKDD 2006 Discovery Challenge, available at http://www.ecmlpkdd2006.org/challenge.html#download. Publicly available messages act as the training data, and three sets of email messages from individual users are used for testing. Again, the data are transformed into tf-idf representations using the Okapi BM25 formula.
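The Okapi BM25 weighting used for the reuters and spam data can be written as below. This is the standard formula with common default parameters (k1 = 1.2, b = 0.75); the exact variant and parameter values used in the paper are not specified here, so treat the details as assumptions:

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term weight: an inverse-document-frequency factor
    scaled by a saturated, length-normalized term frequency.
    tf: term frequency in the document, df: document frequency of the
    term, n_docs: collection size, doc_len/avg_doc_len: document length
    and average document length."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

For the binary-feature data sets, tf is simply 0 or 1 before this transformation is applied.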
   e) synthetic:
     This folder contains synthetic data sets we generated as shown in Figure 1 in the paper. We did some scaling to make the plots fit in the paper.

2) Folder "weighting":
   contains the Matlab file for computing local model weights and an example illustrating how to use this function.
   
   a) weighting.m
    This function computes the local model weights given the cluster membership and the classification results on the test set. A description of the input and output parameters, together with an example, can be found in the file. This function works for binary classification problems.

   b) folder "results":
     This folder contains the cluster memberships and the classification results of three models on the "C vs T" problem of the 20 Newsgroup data sets (train6.svm and test6.svm in the released data sets). "cluster" is the clustering output from the CLUTO package (http://glaros.dtc.umn.edu/gkhome/views/cluto). "predict1", "predict2" and "predict3" are the outputs from Winnow (SNoW learning package: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=SNOW), Logistic Regression (BMR software: http://www.stat.rutgers.edu/~madigan/BMR/), and Support Vector Machine (LibSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/), respectively. Each file contains an n-by-2 matrix, where n is the number of test examples. The first column is the probability that each test example belongs to class 0, and the second column is the probability that it belongs to class 1. We call this the "Prob_Matrix" file format. The scripts that transform the original outputs of the learning packages into this format are in the folder "scripts".
     If the example code is executed (it computes the local model weights for the three classifiers described above), two files will be written into this folder: "ensemble" contains the weighted predictions on the test examples, and "weight_matrix" contains an n-by-k matrix in which each column holds the local model weights of one of the k classifiers (k equals 3 here). Based on these two files, local-structure-based adjustment can then be conducted as described in the paper.
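Given the three "Prob_Matrix" files and "weight_matrix", the weighted prediction described above is a per-example weighted average of the models' class probabilities. A minimal sketch follows; the function name is hypothetical, and whether the weights are already normalized per example is an assumption (the sketch normalizes defensively):

```python
def weighted_ensemble(prob_matrices, weight_matrix):
    """Combine k models' Prob_Matrix outputs (each a list of n rows
    [p_class0, p_class1]) with per-example local weights (n rows of
    k weights each) into one weighted prediction per test example."""
    combined = []
    for i, w in enumerate(weight_matrix):
        total = sum(w) or 1.0  # guard against an all-zero weight row
        p0 = sum(w[k] * prob_matrices[k][i][0] for k in range(len(w))) / total
        p1 = sum(w[k] * prob_matrices[k][i][1] for k in range(len(w))) / total
        combined.append([p0, p1])
    return combined
```

Each output row plays the role of one line of the "ensemble" file: the final class-0 and class-1 probabilities for that test example.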

3) Folder "scripts":
   contains Perl scripts that transform result files from the learning packages into the "Prob_Matrix" format described above. They work for binary classification problems.

   a) getPreds_BMR.pl:
     It works for BMR learning package.

   b) getPreds_LibSVM.pl:
     It works for libsvm-2.84.

   c) getPreds_SNoW.pl:
     It works for Snow_v3.1.

   All three Perl scripts take the same two parameters:
   argv0: the name of the original output file from the learning package to be transformed.
   argv1: the name of the output file in "Prob_Matrix" format.
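As a rough illustration of what these converters do, here is a Python sketch for the libsvm case. With "svm-predict -b 1", libsvm writes a header line "labels <a> <b>" followed by one line "<predicted_label> <prob_a> <prob_b>" per test example, with probability columns ordered as in the header. The actual getPreds_LibSVM.pl may handle details differently, so treat this as an assumption-based sketch, not a drop-in replacement:

```python
def libsvm_to_prob_matrix(in_path, out_path):
    """Convert libsvm probability output ('svm-predict -b 1') into the
    Prob_Matrix format: one line per test example, with the class-0
    probability first and the class-1 probability second."""
    with open(in_path) as fin:
        header = fin.readline().split()       # e.g. ['labels', '1', '0']
        labels = [int(x) for x in header[1:]]
        col0 = labels.index(0) + 1            # probability column for class 0
        col1 = labels.index(1) + 1            # probability column for class 1
        rows = []
        for line in fin:
            parts = line.split()
            rows.append((parts[col0], parts[col1]))
    with open(out_path, "w") as fout:
        for p0, p1 in rows:
            fout.write(f"{p0} {p1}\n")
```

The reordering step matters because libsvm lists the probability columns in the order the labels were first seen during training, which need not be "0 1".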

For questions regarding the data and programs in this package, please contact Jing Gao (jinggao3@gmail.com). I am afraid I may not be able to answer questions about how to use the clustering and learning packages employed in the experiments; for such questions, please refer to their manuals and websites.
             
