Mining Reliable Information from Crowdsourced Data

NSF IIS-1553411
Principal Investigator

Current Students
  • Yaqing Wang. PhD Student.
  • Tianqi Wang. PhD Student.
  • Rui Li. PhD Student.
Alumni
  • Qi Li. Assistant Professor, Iowa State University.
  • Yaliang Li. Senior Engineer, Alibaba.
  • Houping Xiao. Assistant Professor, Georgia State University.
  • Fenglong Ma. Assistant Professor, Penn State University.
  • Wendy Shi. Graduate Student, University of Illinois.
Award Information

This website is based upon work supported by the National Science Foundation under Grant No. IIS-1553411. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Project Summary

With the proliferation of mobile devices and social media platforms, any person can publicize observations about any activity, event or object anywhere and at any time. Taken together, this enormous volume of crowdsourced data can support an inexpensive, sustainable and large-scale decision system that has never been possible before. Such a system could vastly improve the efficiency and reduce the cost of transportation, healthcare, and many other applications. The main obstacle in building such a system lies in the problem of information veracity, i.e., individual users might provide unreliable or even misleading information. This project identifies important research questions in the task of mining reliable information from noisy and unreliable crowdsourced data, and pursues an integrated research and education plan to address these questions. By integrating data from various sources, this project addresses information veracity, which will benefit the many applications where crowdsourced data are ubiquitous but veracity can be suspect.

In particular, this project develops novel methods to mine reliable information by taking into consideration various properties of crowdsourcing: 1) Crowdsourcing platforms collect users' observations about certain objects. Other valuable information sources, such as spatial-temporal, user influence, and textual data, are leveraged to effectively detect reliable information from these observations. 2) Effective privacy protection and budget allocation mechanisms are designed to better motivate active crowdsourcing. These investigations are integrated with the exploration of both theoretical and practical aspects of the proposed methods. From the theoretical perspective, fundamental questions regarding the confidence in the estimated reliability and the convergence of the proposed methods are explored. From the practical perspective, the proposed methods are adapted to tackle challenging problems in various applications such as transportation, healthcare and education to enable new insights into these domains. In addition to the research advances, this project contributes to educational innovation, as the proposed methods are applied to educational methodologies such as peer assessment and question answering.


Publications

ICDCS20

Yaliang Li, Houping Xiao, Zhan Qin, Chenglin Miao, Lu Su, Jing Gao, Kui Ren, Bolin Ding. Towards Differentially Private Truth Discovery for Crowd Sensing Systems. International Conference on Distributed Computing Systems, Singapore, June 2020, to appear.

SDM20

Fenglong Ma, Yaqing Wang, Jing Gao, Houping Xiao, Jing Zhou. Rare Disease Prediction by Generating Quality-Assured Electronic Health Records. SIAM Conference on Data Mining, Cincinnati, OH, May 2020, to appear.

INS

Chen Ye, Hongzhi Wang, Kangjie Zheng, Jing Gao, Jianzhong Li. Multi-source Data Repairing Powered by Integrity Constraints and Source Reliability. Information Sciences, Vol.507, pp.386-403, January 2020.

BigData19

Tianqi Wang, Houping Xiao, Fenglong Ma, Jing Gao. IProWA: A Novel Probabilistic Graphical Model for Crowdsourcing Aggregation. IEEE International Conference on Big Data, Los Angeles, CA, December 2019, 677-682.

BigData19

Rui Li, Fenglong Ma, Wenjun Jiang, Jing Gao. Online Federated Multitask Learning. IEEE International Conference on Big Data, Los Angeles, CA, December 2019, 215-220.

IJCAI19

Qiuling Suo, Weida Zhong, Fenglong Ma, Ye Yuan, Jing Gao, Aidong Zhang. Metric Learning on Healthcare Data with Incomplete Modalities. International Joint Conference on Artificial Intelligence, Macao, China, August 2019, 3534-3540.

EDM19

Tianqi Wang, Fenglong Ma, Jing Gao. Deep Hierarchical Knowledge Tracing. International Conference on Educational Data Mining, Montreal, Canada, July 2019, 671-674.

EDM19

Tianqi Wang, Qi Li, Jing Gao, Xia Jing, Jie Tang. Improving Peer Assessment Accuracy by Incorporating Relative Peer Grades. International Conference on Educational Data Mining, Montreal, Canada, July 2019, 450-455.

KBS

Chen Ye, Hongzhi Wang, Tingting Ma, Jing Gao, Hengtong Zhang, Jianzhong Li. PatternFinder: Pattern discovery for truth discovery. Knowledge-Based Systems, Vol.176, pp.97-109, July 2019.

SDM19

Liuyi Yao, Yaliang Li, Yezheng Li, Hengtong Zhang, Mengdi Huai, Jing Gao, Aidong Zhang. DTEC: Distance Transformation Based Early Time Series Classification. SIAM Conference on Data Mining, Calgary, Canada, May 2019, 486-494.

WWW19

Fenglong Ma, Yaliang Li, Chenwei Zhang, Jing Gao, Nan Du, Wei Fan. MCVAE: Margin-based Conditional Variational Autoencoder for Relation Classification and Pattern Generation. The Web Conference, San Francisco, CA, May 2019, 3041-3048.

TKDE

Houping Xiao, Jing Gao, Qi Li, Fenglong Ma, Yunlong Feng, Aidong Zhang. Towards Confidence Interval Estimation in Truth Discovery. IEEE Transactions on Knowledge and Data Engineering, 31(3): 575-588, March 2019.

CIKM18

Fenglong Ma, Quanzeng You, Houping Xiao, Radha Chitta, Jing Zhou, Jing Gao. KAME: Knowledge-based Attention Model for Diagnosis Prediction in Healthcare. ACM International Conference on Information and Knowledge Management, Turin, Italy, October 2018, 743-752.

ASONAM18

Hengtong Zhang, Fenglong Ma, Yaliang Li, Chao Zhang, Tianqi Wang, Yaqing Wang, Jing Gao, Lu Su. Leveraging the Power of Informative Users for Local Event Detection. IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, Barcelona, Spain, August 2018, 429-436.

KDD18

Yaliang Li, Chenglin Miao, Lu Su, Jing Gao, Qi Li, Bolin Ding, Zhan Qin, Kui Ren. An Efficient Two-Layer Mechanism for Privacy-Preserving Truth Discovery. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1705-1714.

KDD18

Fenglong Ma, Jing Gao, Qiuling Suo, Quanzeng You, Jing Zhou, Aidong Zhang. Risk Prediction on Electronic Health Records with Prior Medical Knowledge. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1910-1919.

KDD18

Hengtong Zhang, Yaliang Li, Fenglong Ma, Jing Gao, Lu Su. TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 2729-2737.

SDM18

Mengdi Huai, Chenglin Miao, Qiuling Suo, Yaliang Li, Jing Gao, Aidong Zhang. Uncorrelated Patient Similarity Learning. SIAM Conference on Data Mining, San Diego, CA, May 2018, 270-278.

WWW18

Chenglin Miao, Qi Li, Lu Su, Mengdi Huai, Wenjun Jiang, Jing Gao. Attack under Disguise: An Intelligent Data Poisoning Attack Mechanism in Crowdsourcing. The Web Conference, Lyon, France, April 2018, 13-22.

BigData17

Chuishi Meng, Yu Cui, Qing He, Lu Su, Jing Gao. Travel Purpose Inference with GPS Trajectories, POIs, and Geotagged Social Media Data. IEEE International Conference on Big Data, Boston, MA, December 2017, 1319-1324.

SIGSPATIAL17

Chuishi Meng, Xiuwen Yi, Lu Su, Jing Gao, Yu Zheng. City-wide Traffic Volume Inference with Loop Detector Data and Taxi Trajectories. ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, November 2017, 1:1-1:10.

KDD17

Fenglong Ma, Chuishi Meng, Houping Xiao, Qi Li, Jing Gao, Lu Su, Aidong Zhang. Unsupervised Discovery of Drug Side-Effects from Heterogeneous Data Sources. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 967-976.

KDD17

Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, Jing Gao. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 1903-1911.

WSDM17

Yaliang Li, Nan Du, Chaochun Liu, Yusheng Xie, Wei Fan, Qi Li, Jing Gao, Huan Sun. Reliable Medical Diagnosis from Crowdsourcing: Discover Trustworthy Answers from Non-Experts. ACM International Conference on Web Search and Data Mining, Cambridge, UK, February 2017, 253-261.

CIKM16

Hengtong Zhang, Qi Li, Fenglong Ma, Houping Xiao, Yaliang Li, Jing Gao, Lu Su. Influence-Aware Truth Discovery. ACM International Conference on Information and Knowledge Management, Indianapolis, IN, October 2016, 851-860.

TBD

Yaliang Li, Chaochun Liu, Nan Du, Wei Fan, Qi Li, Jing Gao, Chenwei Zhang, Hao Wu. Extracting Medical Knowledge from Crowdsourced Question Answering Website. IEEE Transactions on Big Data, accepted, September 2016.

KDD16

Houping Xiao, Jing Gao, Zhaoran Wang, Shiyu Wang, Lu Su, Han Liu. A Truth Discovery Approach with Theoretical Guarantee. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1925-1934.

KDD16

Houping Xiao, Jing Gao, Qi Li, Fenglong Ma, Lu Su, Yunlong Feng, Aidong Zhang. Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1935-1944.


Courses
  • UB CSE 469: Introduction to Data Mining
  • UB CSE 601: Data Mining and Bioinformatics
  • UB CSE 706: Selected Topics in Data Mining

Tutorial

KDD19

Yao Zhou, Fenglong Ma, Jing Gao, Jingrui He. Optimize the Wisdom of the Crowd: Inference, Learning, and Teaching. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, August 2019.


Code & Datasets Resulting from This Project
  • PRIME: "Risk Prediction on Electronic Health Records with Prior Medical Knowledge" in [KDD18]

Last updated: May 2020.