Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery

NSF IIS-1319973
Principal Investigator

Students
  • Qi Li, PhD Student.
  • Yaliang Li, PhD Student.
  • Houping Xiao, PhD Student.
  • Chuishi Meng, PhD Student.
  • Fenglong Ma, PhD Student.
  • Yaqing Wang, PhD Student.
  • Ben Reid, Undergrad Student.
  • Stephanie Richter, Undergrad Student.
  • Tri Nguyen, Undergrad Student.
  • DeSean Abraham, Undergrad Student.
  • Wenxuan Shi, Undergrad Student.
  • Brian Green, Undergrad Student.
  • Jalil Sarwari, Undergrad Student.

Award Information

This website is based upon work supported by the National Science Foundation under Grant No. IIS-1319973, collaborative with NSF IIS-1320617. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Project Outcomes

Truth discovery is an emerging field in the data management and mining community. When conflicting information is collected from multiple sources, it is important to find the reliable sources and identify the true facts. The traditional approach to conflict resolution, majority voting, often fails because sources differ in reliability. Truth discovery can detect the truth even when it is held by only a few sources, provided those few are reliable. This significantly improves data aggregation performance by exploiting the wisdom of the minority.
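
The contrast between majority voting and truth discovery can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of any particular paper from this project: the toy `claims` data, the function names, and the log-ratio weight update are all chosen here for illustration. Source weights and truth estimates are refined in alternation, so sources that agree with the current estimates gain influence.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: source -> {object: claimed value}. The true value is "x"
# everywhere. A and B are always right; C, D, E err often and agree on "y" for o5.
claims = {
    "A": {"o1": "x", "o2": "x", "o3": "x", "o4": "x", "o5": "x"},
    "B": {"o1": "x", "o2": "x", "o3": "x", "o4": "x", "o5": "x"},
    "C": {"o1": "y", "o2": "x", "o3": "y", "o4": "y", "o5": "y"},
    "D": {"o1": "x", "o2": "y", "o3": "z", "o4": "z", "o5": "y"},
    "E": {"o1": "z", "o2": "z", "o3": "x", "o4": "w", "o5": "y"},
}

def majority_vote(claims):
    """Pick the most frequently claimed value for each object."""
    votes = defaultdict(Counter)
    for source_claims in claims.values():
        for obj, value in source_claims.items():
            votes[obj][value] += 1
    return {obj: counter.most_common(1)[0][0] for obj, counter in votes.items()}

def weighted_vote(claims, weights):
    """Pick, for each object, the value with the largest total source weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for source, source_claims in claims.items():
        for obj, value in source_claims.items():
            scores[obj][value] += weights[source]
    return {obj: max(vals, key=vals.get) for obj, vals in scores.items()}

def discover_truths(claims, iterations=10, eps=1e-6):
    """Alternate between estimating truths and re-weighting sources."""
    weights = {source: 1.0 for source in claims}
    truths = weighted_vote(claims, weights)  # first pass equals majority voting
    for _ in range(iterations):
        # Error rate of each source against the current truth estimates;
        # eps keeps the logarithm below finite for error-free sources.
        errors = {
            source: sum(v != truths[o] for o, v in sc.items()) / len(sc) + eps
            for source, sc in claims.items()
        }
        total = sum(errors.values())
        # Assumed update rule: a lower share of the total error gives a higher weight.
        weights = {source: -math.log(err / total) for source, err in errors.items()}
        truths = weighted_vote(claims, weights)
    return truths, weights

baseline = majority_vote(claims)
truths, weights = discover_truths(claims)
print(baseline["o5"])  # "y": three unreliable sources outvote two reliable ones
print(truths["o5"])    # "x": the reliable minority prevails after re-weighting
```

In this toy setting, A and B earn higher weights because they agree with the consensus on the other objects, which is what lets them overturn the majority on o5.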

This project contributes to the development of this emerging field by developing truth discovery and information integration techniques that tackle unsolved challenges in this task. Specifically, we developed novel truth discovery methods for data of heterogeneous types, data with long-tail distributions, streaming and time series data, distributed data, and textual data. We also modeled correlations among sources and objects, derived fine-grained reliability degrees of sources and confidence degrees of the truths, modeled whether a true claim exists in the data set, and provided geometric interpretations of the truth discovery approach.

The effectiveness of the developed approaches was demonstrated on a variety of datasets drawn from multiple application scenarios, including crowdsourcing question answering, Internet information fusion, weather forecast integration, drug side-effect discovery, air quality monitoring, and indoor floorplan construction. These approaches can potentially benefit any other application in which decisions have to be made based on the reliable information extracted from diverse and heterogeneous sources.

Research results from this project were presented at top conferences in the data science field, including KDD, VLDB, SIGMOD, SDM, ICDM, and CIKM. In this project, we conducted an extensive survey on truth discovery, which was published in SIGKDD Explorations, and we presented an overview of the truth discovery field in several tutorials at the VLDB, KDD, SDM, and CIKM conferences. The PI gave invited talks at workshops, industrial labs, and universities to present the research results of this project. The PI also discussed this research with high school and undergraduate students at various outreach activities that promote "Women in STEM" at UB.

This project trained six PhD students, one master's student, and seven undergraduate students, including three female students and one African American student. Their research skills improved greatly through this project, as demonstrated by their publications in top conferences and journals. In particular, two PhD students received the "Best PhD Dissertation Award" in the Department of Computer Science and Engineering, University at Buffalo, in 2017 and 2018, respectively. The majority of the research results in their dissertations came from this project. Research from this project has been integrated into the PI's data mining courses and seminars at UB through course projects and lectures.


Publications

KDD18

Qi Li, Meng Jiang, Xikun Zhang, Meng Qu, Timothy Hanratty, Jing Gao, Jiawei Han. TruePIE: Discovering Reliable Patterns in Pattern-Based Information Extraction. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1675-1684. Acceptance Rate: 107/983 = 10.89%.

KDD18

Hengtong Zhang, Yaliang Li, Fenglong Ma, Jing Gao, Lu Su. TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 2729-2737. Acceptance Rate: 107/983 = 10.89%.

SDM18

Liuyi Yao, Lu Su, Qi Li, Yaliang Li, Fenglong Ma, Jing Gao, Aidong Zhang. Online Truth Discovery on Time Series Data. SIAM International Conference on Data Mining, San Diego, CA, May 2018, 162-170. Acceptance Rate: 23.2%.

ICDM17

Yaqing Wang, Fenglong Ma, Lu Su, Jing Gao. Discovering Truths from Distributed Data. IEEE International Conference on Data Mining, New Orleans, LA, November 2017, 505-514. Acceptance Rate: 72/778 = 9.25%.

IJCAI17

Guangxu Xun, Yaliang Li, Xin Zhao, Jing Gao, Aidong Zhang. A Correlated Topic Model Using Word Embeddings. International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017, 4207-4213. Acceptance Rate: 660/2540 = 26%.

KDD17

Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, Jing Gao. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 1903-1911. Acceptance Rate: 85/396 = 21.5%.

KDD17

Fenglong Ma, Chuishi Meng, Houping Xiao, Qi Li, Jing Gao, Lu Su, Aidong Zhang. Unsupervised Discovery of Drug Side-Effects From Heterogeneous Data Sources. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 967-976. Acceptance Rate: 131/784 = 16.7%.

KDD17

Guangxu Xun, Yaliang Li, Jing Gao, Aidong Zhang. Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 535-543. Acceptance Rate: 131/784 = 16.7%.

SDM17

Houping Xiao, Jing Gao, Long Vu, Deepak Turaga. Detecting Malicious Behavior in Computer Networks via Cost-Sensitive and Connectivity Constrained Classification. SIAM International Conference on Data Mining, Houston, TX, April 2017, 117-125. Acceptance Rate: 26%.

ICDM16

Guangxu Xun, Vishrawas Gopalakrishnan, Fenglong Ma, Yaliang Li, Jing Gao, Aidong Zhang. Topic Discovery for Short Texts Using Word Embeddings. IEEE International Conference on Data Mining, Barcelona, Spain, December 2016, 1299-1304. Acceptance Rate: 178/904 = 19.6%.

CIKM16

Hengtong Zhang, Qi Li, Fenglong Ma, Houping Xiao, Yaliang Li, Jing Gao, Lu Su. Influence-Aware Truth Discovery. ACM International Conference on Information and Knowledge Management, Indianapolis, IN, October 2016, 851-860. Acceptance Rate: 165/935 = 17.6%.

KDD16

Houping Xiao, Jing Gao, Zhaoran Wang, Shiyu Wang, Lu Su, Han Liu. A Truth Discovery Approach with Theoretical Guarantee. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1925-1934. Acceptance Rate: 142/784 = 18.1%. [Paper in PDF]

KDD16

Houping Xiao, Jing Gao, Qi Li, Fenglong Ma, Lu Su, Yunlong Feng, Aidong Zhang. Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1935-1944. [Paper in PDF]

KDD16

Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao. From Truth Discovery to Trustworthy Opinion Discovery: An Uncertainty-Aware Quantitative Modeling Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1885-1894.

TKDE

Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, Jiawei Han. Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery. IEEE Transactions on Knowledge and Data Engineering, 28(8): 1986-1999, August 2016.

SoCG16

Hu Ding, Jing Gao, Jinhui Xu. Finding Global Optimum for Truth Discovery: Entropy Based Geometric Variance. International Symposium on Computational Geometry, Boston, MA, June 2016, 34:1-34:16.

SenSys15

Chuishi Meng, Wenjun Jiang, Yaliang Li, Jing Gao, Lu Su, Hu Ding, Yun Cheng. Truth Discovery on Crowd Sensing of Correlated Entities. ACM International Conference on Embedded Networked Sensor Systems, Seoul, South Korea, November 2015, 169-182. Acceptance Rate: 27/132 = 20.5%.

SenSys15

Chenglin Miao, Wenjun Jiang, Lu Su, Yaliang Li, Suxin Guo, Zhan Qin, Houping Xiao, Jing Gao, Kui Ren. Cloud-Enabled Privacy-Preserving Truth Discovery in Crowd Sensing Systems. ACM International Conference on Embedded Networked Sensor Systems, Seoul, South Korea, November 2015, 183-196. Acceptance Rate: 27/132 = 20.5%.

VLDB15

Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, Jiawei Han. A Confidence-Aware Approach for Truth Discovery on Long-Tail Data. International Conference on Very Large Data Bases, Kohala Coast, HI, August 2015, 8(4): 425-436.

KDD15

Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, Jiawei Han. On the Discovery of Evolving Truth. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 675-684.

KDD15

Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, Jiawei Han. FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 745-754.

KDD15

Shi Zhi, Bo Zhao, Wenzhu Tong, Jing Gao, Dian Yu, Heng Ji, Jiawei Han. Modeling Truth Existence in Truth Discovery. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 1543-1552.

SDM15

Houping Xiao, Yaliang Li, Jing Gao, Fei Wang, Liang Ge, Wei Fan, Long Vu, Deepak Turaga. Believe It Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data. SIAM International Conference on Data Mining, Vancouver, Canada, April 2015, 397-405.

SDM15

Bowen Dong, Sihong Xie, Jing Gao, Wei Fan, Philip S. Yu. OnlineCM: Real-time Consensus Classification with Missing Values. SIAM International Conference on Data Mining, Vancouver, Canada, April 2015, 685-693.

KDD14

Sihong Xie, Jing Gao, Wei Fan, Deepak Turaga, Philip S. Yu. Class-Distribution Regularized Consensus Maximization for Alleviating Overfitting in Model Combination. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, August 2014, 303-312. Acceptance Rate: 151/1036 = 14.6%. [Paper in PDF] [BIBTEX]

IAAI14

Bahadir Aydin, Yavuz Yilmaz, Yaliang Li, Qi Li, Jing Gao, Murat Demirbas. Crowdsourcing for Multiple-Choice Question Answering. Annual Conference on Innovative Applications of Artificial Intelligence, Quebec City, Canada, July 2014, 2946-2953. [Paper in PDF]

SIGMOD14

Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, Jiawei Han. Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation. ACM SIGMOD International Conference on Management of Data, Snowbird, UT, June 2014, 1187-1198. [Paper in PDF] [Code&Data in ZIP] [More Information] [BIBTEX]


Survey & Tutorials

Survey

Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, Jiawei Han. A Survey on Truth Discovery. SIGKDD Explorations, 17(2): 1-16, December 2015. [Paper]

Tutorial

Jing Gao, Qi Li, Wei Fan. Truth Discovery for Passive and Active Crowdsourcing. ACM International Conference on Information and Knowledge Management (CIKM'16), Indianapolis, IN, October 2016. [Slides]

Tutorial

Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han. Enabling the Discovery of Reliable Information from Passively and Actively Crowdsourced Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, August 2016.

Tutorial

Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han. Towards Veracity Challenge in Big Data. SIAM International Conference on Data Mining (SDM'16), Miami, FL, May 2016. [Slides]

Tutorial

Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han. Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective. International Conference on Very Large Data Bases (VLDB'15), Kohala Coast, HI, August 2015. [Slides]


Courses

Code & Datasets Resulting from This Project
  • CATD: "A Confidence-Aware Truth Discovery Approach" in [VLDB15]

Resources

Last updated: November 2018.