-------------------
General Information
-------------------
This directory contains the data set and code used in the following paper:
Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han, "Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation". In Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD'14), 2014

If you have any problems, please email: qli22@buffalo.edu


-------------------
Usage License
-------------------
Please note that the data and code are released as is. The authors assume no responsibility for any potential problems. Permission to use the data and code is granted for academic and educational use only. Please acknowledge the use of this package by citing the above paper if you publish materials based on the data and/or code obtained from this package. The weather data set was crawled by Xin Liu (forkloop@gmail.com), Yaliang Li (yaliangl@buffalo.edu), Qi Li (qli22@buffalo.edu), and Jing Gao (jing@buffalo.edu). Users may not redistribute any part of the package without separate permission.


--------------------
Directory structure
--------------------

    CRH
    |-- code              # Matlab code for the CRH method
    |-- data              # the weather data set and the corresponding ground truths
    |-- README            # this file


---------------------------------------------------
Data Directory
---------------------------------------------------
1) data set file:
1.1) file name: weather_data_set.txt
1.2) format of each line: "entry-id entry-value source-id\n". The delimiters are tabs.
                          Rows with the same entry-id describe the same entry.
                          If the entry-value starts with the prefix "w-", it is categorical. Otherwise, it is continuous.
                          Rows with the same source-id come from the same source.
1.3) Example: the first lines of the file "weather_data_set.txt":
10	w-7	1
11	61	1
12	45	1
13	w-1	2
14	63	2
15	41	2
16	w-1	3
17	63	3
18	41	3
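
The line format above can be read with a few lines of code. The following is a minimal Python sketch, not part of this package (the package itself uses importfile_data.m); the function names parse_claim and group_by_entry are hypothetical, and it assumes that entry-ids and source-ids are always integers, as in the example lines.

```python
from collections import defaultdict


def parse_claim(line):
    """Split one tab-delimited line into (entry_id, value, source_id)."""
    entry_id, raw_value, source_id = line.rstrip("\n").split("\t")
    # Values starting with "w-" are categorical and stay as strings;
    # everything else is treated as a continuous number.
    value = raw_value if raw_value.startswith("w-") else float(raw_value)
    return int(entry_id), value, int(source_id)


def group_by_entry(lines):
    """Map each entry-id to the list of (source_id, value) claims about it."""
    claims = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            entry_id, value, source_id = parse_claim(line)
            claims[entry_id].append((source_id, value))
    return dict(claims)


# A few lines in the documented format (taken from the example above).
sample = ["10\tw-7\t1\n", "11\t61\t1\n", "12\t45\t1\n", "13\tw-1\t2\n"]
claims = group_by_entry(sample)
print(claims[10])  # [(1, 'w-7')]
print(claims[11])  # [(1, 61.0)]
```

Grouping by entry-id puts all claims about one entry next to each other, which is the shape a per-entry aggregation needs.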

2) ground truth file:
2.1) file name: weather_ground_truth.txt
2.2) format of each line: "entry-id entry-value\n". The delimiters are tabs.
                          Each row has a unique entry-id.
                          If the entry-value starts with the prefix "w-", it is categorical. Otherwise, it is continuous.
2.3) Example: the first lines of the file "weather_ground_truth.txt":
1	w-1
2	70.0
3	30
22	w-2
23	21.5
24	12
43	w-2
44	79
45	49
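
Because categorical and continuous ground truths are evaluated differently, it can help to separate them while reading the file. The following is a minimal Python sketch under that assumption; parse_ground_truth is a hypothetical helper, not a function from this package.

```python
def parse_ground_truth(lines):
    """Return (categorical, continuous) dicts keyed by entry-id."""
    categorical, continuous = {}, {}
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        entry_id, raw_value = line.rstrip("\n").split("\t")
        if raw_value.startswith("w-"):
            # The "w-" prefix marks a categorical ground truth.
            categorical[int(entry_id)] = raw_value
        else:
            continuous[int(entry_id)] = float(raw_value)
    return categorical, continuous


# Lines in the documented format (taken from the example above).
sample = ["1\tw-1\n", "2\t70.0\n", "3\t30\n", "22\tw-2\n"]
categorical, continuous = parse_ground_truth(sample)
print(categorical)  # {1: 'w-1', 22: 'w-2'}
print(continuous)   # {2: 70.0, 3: 30.0}
```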


---------------------------------------------------
Code Directory 
---------------------------------------------------
1) Matlab script:
1.1) file name: run.m
1.2) description:
     Run this script to apply the CRH method to the data set. The experimental results are saved as "result_CRH.txt" in the current directory.

2) Matlab functions:
2.1) importfile_data.m
2.1.1) Input: The path and name of the weather data file.
2.1.2) Output: A cell array of the weather data. 
               Format of each line: "entry-id(number) | entry-value(string) | source-id(number)". 

2.2) importfile_gt.m
2.2.1) Input: The path and name of the weather ground truth file.
2.2.2) Output: A cell array of the ground truths, sorted in ascending order by entry-id.
               Format of each line: "entry-id(number) | entry-value(string/number)". If an entry-value is a string, it is the ground truth for a categorical type entry. Otherwise, it is the ground truth for a continuous type entry.

2.3) CRH_weather.m
2.3.1) Input: 
            a. A cell array of the weather data. 
            b. Number of iterations.
2.3.2) Output:
            a. A cell array of aggregation results from the CRH method. Format of each line: "entry-id(number) | entry-value(string/number)". If an entry-value is a string, it is the estimated truth for a categorical entry. Otherwise, it is the estimated truth for a continuous entry.
            b. A matrix of source weights. Format of each line: "source-id | source-weight".
            c. A cell array of aggregation results from the voting/median method. Same format as (a).
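
Output (c) comes from a voting/median method. One plausible reading is majority voting for categorical entries and the plain median for continuous entries. The Python sketch below illustrates that reading only; it is not the Matlab implementation, the function name vote_or_median and the sample values are made up for illustration, and the tie-breaking rule (first value seen wins) is an assumption.

```python
from collections import Counter
from statistics import median


def vote_or_median(values):
    """Aggregate one entry's claims: majority vote for strings, median for numbers."""
    if all(isinstance(v, str) for v in values):
        # most_common orders equal counts by first occurrence,
        # so ties go to the value that was seen first.
        return Counter(values).most_common(1)[0][0]
    return median(values)


# Illustrative claims per entry-id (not taken from the data set).
claims = {
    1: ["w-1", "w-1", "w-7"],
    2: [61.0, 63.0, 70.0],
}
baseline = {entry_id: vote_or_median(vals) for entry_id, vals in claims.items()}
print(baseline)  # {1: 'w-1', 2: 63.0}
```

Unlike output (a), this kind of aggregation treats every source equally and does not use the source weights in output (b).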

2.4) weightedMedian.m
2.4.1) Input: 
            a. A matrix of values.
            b. A matrix of weights. Same dimensions as (a).
2.4.2) Output:
            a. A matrix of weighted median results. Same dimension as input.
2.4.3) Origin URL: http://www.mathworks.com/matlabcentral/fileexchange/23077-weighted-median  
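
For readers unfamiliar with the weighted median, the Python sketch below shows one common definition on a single list of values: sort the values and return the first one at which the cumulative weight reaches half of the total weight. This is an illustration only; the conventions of weightedMedian.m (for example, how ties or matrix inputs are handled) may differ, and the name weighted_median is hypothetical.

```python
def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and the same length")
    pairs = sorted(zip(values, weights))  # sort by value, keeping each weight attached
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]  # guard against floating-point rounding


print(weighted_median([61, 63, 70], [1, 1, 1]))        # 63
print(weighted_median([61, 63, 70], [0.1, 0.2, 0.7]))  # 70
```

With equal weights the result matches the ordinary median of an odd-length list; a heavily weighted value pulls the result toward itself.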

2.5) evaluate_weather.m
2.5.1) Input: 
            a. A cell array of aggregation results to be evaluated.
            b. A cell array of ground truths.
2.5.2) Output:
            a. Error rate (for categorical entries).
            b. MNAD (mean normalized absolute distance, for continuous entries).
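
The two measures can be sketched as follows. This is a minimal Python illustration, not the code in evaluate_weather.m. The function names are hypothetical, and the per-entry normalization constants passed as "scales" are an assumption: this README does not state how distances are normalized, so the caller supplies them here.

```python
def error_rate(estimates, truths):
    """Fraction of categorical entries whose estimate differs from the truth."""
    shared = [e for e in truths if e in estimates]
    if not shared:
        raise ValueError("no entries to evaluate")
    wrong = sum(1 for e in shared if estimates[e] != truths[e])
    return wrong / len(shared)


def mnad(estimates, truths, scales):
    """Mean of |estimate - truth| / scale over the evaluated entries."""
    shared = [e for e in truths if e in estimates]
    if not shared:
        raise ValueError("no entries to evaluate")
    total = sum(abs(estimates[e] - truths[e]) / scales[e] for e in shared)
    return total / len(shared)


# Illustrative values keyed by entry-id (not taken from the data set).
cat_est, cat_gt = {1: "w-1", 22: "w-7"}, {1: "w-1", 22: "w-2"}
num_est, num_gt = {2: 68.0, 3: 30.0}, {2: 70.0, 3: 30.0}
scales = {2: 4.0, 3: 2.0}  # hypothetical normalization constants

print(error_rate(cat_est, cat_gt))    # 0.5
print(mnad(num_est, num_gt, scales))  # 0.25
```

Only entries that appear in both the ground truth and the estimates are scored; lower is better for both measures.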
--------------------------------- END --------------------------------
