Research Summary (Under Construction)
My research centers on exploring the power of multiple heterogeneous information
sources. Recent years have witnessed a dramatic increase in our ability to extract and collect data from
the physical world. An important feature of these data collections is their wide variety: data about the same
object can be obtained from various sources. For example, customer information can be found in multiple
databases within a company; a patient's medical records may be scattered across different hospitals; a news event can
be characterized by text, images, and video; and an activity can be captured by multiple surveillance cameras
and live video feeds. Many interesting patterns cannot be extracted from a single data collection, but have
to be discovered from the integrative analysis of all heterogeneous data sources available. My solution to
the problem of learning from multiple sources is to extract trustworthy information from different sources
and integrate their complementary perspectives to reach a more accurate and robust decision. My major
research topics are summarized as follows.
Truth Discovery from Multi-Source Data
An ongoing project is to detect true facts from multiple
conflicting data sources. A huge amount of information is generated every day, and a fundamental
difficulty is that freely created information, while massive in volume, is usually of low quality. Facing
the daunting scale of the data, it is unrealistic to expect humans to label or tell which data source is more
reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating
source reliability estimation and truth finding. We developed a series of innovative techniques that
are widely recognized as the state-of-the-art solutions to the truth discovery problem. These methods can
successfully resolve conflicts among multiple sources of heterogeneous, dynamic, and long-tail data [SIGMOD14, VLDB15, KDD15, KDD16, SoCG16, TKDE16]. We have shown
that the proposed approaches outperform not only naive majority voting/averaging schemes but also
the other truth discovery schemes. Our research in this direction can benefit numerous applications where
critical decisions must be made based on correct information extracted from noisy input sources. We have presented tutorials at recent KDD,
VLDB, and SDM conferences and published an extensive survey on this topic [KDDEXP16].
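The interplay between source reliability estimation and truth finding can be illustrated with a minimal iterative sketch (hypothetical data and function names; a simplification of, not the published, algorithms): weighted voting estimates each object's truth, and every source is then re-weighted by how often it agrees with the current truths.

```python
from collections import Counter, defaultdict

def discover_truths(claims, iters=10):
    """claims: {source: {object: value}}. Returns (truths, weights).

    Alternates between (1) weighted voting to estimate each object's
    truth and (2) re-weighting each source by how often its claims
    match the current truth estimates (with add-one smoothing)."""
    weights = {s: 1.0 for s in claims}
    truths = {}
    for _ in range(iters):
        # Step 1: weighted vote per object.
        votes = defaultdict(Counter)
        for s, facts in claims.items():
            for obj, val in facts.items():
                votes[obj][val] += weights[s]
        truths = {obj: c.most_common(1)[0][0] for obj, c in votes.items()}
        # Step 2: source weight = smoothed fraction of matching claims.
        for s, facts in claims.items():
            hits = sum(truths[obj] == val for obj, val in facts.items())
            weights[s] = (hits + 1) / (len(facts) + 2)
    return truths, weights

claims = {
    "s1": {"capital_of_fr": "Paris", "capital_of_it": "Rome"},
    "s2": {"capital_of_fr": "Paris", "capital_of_it": "Rome"},
    "s3": {"capital_of_fr": "Lyon",  "capital_of_it": "Milan"},
}
truths, weights = discover_truths(claims)
```

Starting from uniform weights, the two steps reinforce each other until reliable sources dominate the vote.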
Crowdsourcing Data Aggregation
Recent years have witnessed an astonishing growth of crowd-contributed
data, which has become a powerful information source that covers almost every aspect of our
lives, including traffic conditions, environmental conditions, health, public events, and many others. With
the proliferation of mobile devices and social media platforms, anyone can now publicize their observations
about any activities, events, or objects anywhere and at any time. The confluence of this enormous
crowdsourced data can contribute to an inexpensive, sustainable, and large-scale decision system that has
never been possible before. In this research direction, we propose to extract "crowd wisdom" by wisely aggregating
massive crowdsourced data. We adapted the truth discovery technique to the task of crowdsourced
data aggregation in which each participating user's weight is estimated based on the user's ability of giving
correct answers [IAAI16]. Experimental results on aggregating answers for the "Who Wants to Be a Millionaire"
game demonstrate the significant accuracy improvement achieved by the proposed weighted aggregation
approach. In [WSDM16], we designed an effective budget allocation strategy that adjusts the allocation
policy to meet aggregation-accuracy requirements under a fixed, tight crowdsourcing budget.
We also developed effective ways to model the expertise of crowd workers during crowdsourced
data aggregation [KDD15, SDM16]. The proposed research can potentially benefit numerous
applications where crowdsourced data are ubiquitous. In particular, we have investigated challenges in crowd sensing systems and developed novel approaches for data aggregation and privacy preservation in these systems [SenSys15].
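A toy sketch of weighted crowdsourced aggregation (all names and data hypothetical, not from the cited papers): worker weights are estimated from a few questions with known answers, and a weighted vote can then overrule an unreliable majority.

```python
from collections import Counter

def worker_weights(answers, gold):
    """Estimate each worker's weight from accuracy on gold questions
    (add-one smoothing keeps weights strictly positive)."""
    w = {}
    for worker, ans in answers.items():
        hits = sum(ans.get(q) == v for q, v in gold.items())
        w[worker] = (hits + 1) / (len(gold) + 2)
    return w

def weighted_answer(answers, weights, question):
    """Aggregate one question's answers, counting each worker's
    weight as their vote."""
    votes = Counter()
    for worker, ans in answers.items():
        if question in ans:
            votes[ans[question]] += weights[worker]
    return votes.most_common(1)[0][0]

answers = {
    "alice": {"q1": "A", "q2": "B", "q3": "C"},
    "bob":   {"q1": "X", "q2": "Y", "q3": "D"},
    "carol": {"q1": "X", "q2": "Y", "q3": "D"},
}
gold = {"q1": "A", "q2": "B"}             # questions with known answers
w = worker_weights(answers, gold)
best = weighted_answer(answers, w, "q3")  # reliable worker outvotes the majority
```

Here an unweighted majority on "q3" would pick the answer given by the two unreliable workers, while the weighted vote follows the worker who was correct on the gold questions.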
Multi-source Information Trustworthiness Analysis
In recent years, information trustworthiness has become a
serious issue when user-generated contents prevail in our
information world. We investigated the important problem of estimating information trustworthiness
from the perspective of correlating and comparing multiple
data sources. To a certain extent, the consistency degree
is an indicator of information reliability: information unanimously agreed upon by all sources is more likely to be reliable.
Based on this principle, we developed effective computational approaches to identify consistent information from multiple data sources [KDD13,ICDM12]. The idea was applied to network traffic anomaly detection and showed
its advantages in detecting meaningful anomalies [SDM15, WWW15]. In a more general setting,
we proposed to detect unusual, suspicious, and anomalous behavior across multiple heterogeneous sources [ICDM11]. Our approach links the various sources through the common knowledge hidden in the data and detects
the anomalies that do not follow the multi-source correlation patterns. By
exploring the discrepancies across heterogeneous sources, our approaches can detect anomalies that cannot be
found by traditional anomaly detection techniques and provide new
insights into the application area.
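The consistency principle above can be sketched in a few lines (hypothetical data and names; a simplification of the published methods): items on which independent sources disagree most are the least trustworthy and the natural anomaly candidates.

```python
from statistics import pstdev

def consistency_scores(reports):
    """reports: {source: {item: value}}. Returns {item: spread};
    a larger spread across sources means lower consistency and
    hence less trustworthy (or anomalous) information."""
    items = set().union(*(r.keys() for r in reports.values()))
    scores = {}
    for item in items:
        vals = [r[item] for r in reports.values() if item in r]
        scores[item] = pstdev(vals) if len(vals) > 1 else 0.0
    return scores

# Three monitors report link delays (ms); link "b" is inconsistent.
reports = {
    "monitor1": {"a": 10.0, "b": 10.0},
    "monitor2": {"a": 10.5, "b": 55.0},
    "monitor3": {"a":  9.8, "b": 12.0},
}
scores = consistency_scores(reports)
flagged = max(scores, key=scores.get)
```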
Consensus Maximization among Multiple Sources
In the task of classification, we need a training set consisting of labeled data to infer the correlations between feature values and class labels for future label prediction. Although unsupervised information sources do not directly
generate classification results, they provide useful constraints for
the task. To fully utilize all possible types of knowledge to
facilitate classification, we proposed to calculate a consolidated
solution for a set of objects
by maximizing the consensus among both supervised predictions and unsupervised grouping constraints [NIPS09,TKDE13].
This consensus maximization approach crosses
the boundary between supervised and unsupervised learning. With this framework, labeled data are no longer a prerequisite for
successful classification; instead, existing labeling
efforts are maximized by integrating knowledge from relevant
domains and unlabeled information sources. This framework's power has been demonstrated in many real-world problems. In particular, it was used to solve the problems of video categorization [MM09], network traffic anomaly detection [INFOCOM11], classification in sensor networks [SenSys11, RTSS12], and informative gene discovery [BCB12]. Results showed the advantages of the proposed method in combining
heterogeneous channels of information to provide a robust and
accurate solution. Based on these results, we held a well-received tutorial at the SDM'10 conference and published a survey on ensemble approaches in supervised and unsupervised learning [SDM10]. Recently, we began developing model combination techniques for alleviating overfitting and conducting multi-label classification [KDD14].
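The core idea of blending supervised predictions with unsupervised grouping constraints can be sketched as follows (a toy stand-in, not the published consensus maximization formulation): each object's label estimate is iteratively pulled toward the average estimate of its cluster, so grouped objects receive similar labels.

```python
def consensus_maximization(preds, clusters, alpha=0.5, iters=20):
    """preds: per-object probability vectors averaged from supervised
    models; clusters: per-object cluster ids from an unsupervised
    source. Iteratively blends each object's estimate with its
    cluster's average estimate so grouped objects agree."""
    est = [p[:] for p in preds]
    k = len(preds[0])
    for _ in range(iters):
        # Cluster-level averages of the current estimates.
        sums, cnts = {}, {}
        for c, e in zip(clusters, est):
            s = sums.setdefault(c, [0.0] * k)
            for j in range(k):
                s[j] += e[j]
            cnts[c] = cnts.get(c, 0) + 1
        avg = {c: [v / cnts[c] for v in s] for c, s in sums.items()}
        # Blend the supervised prediction with the grouping constraint.
        est = [[alpha * p[j] + (1 - alpha) * avg[c][j] for j in range(k)]
               for p, c in zip(preds, clusters)]
    return [max(range(k), key=e.__getitem__) for e in est]

# Object 2's classifiers are unsure, but it clusters with objects 0-1,
# so the consensus pulls it to their label.
preds = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.1, 0.9]]
clusters = [0, 0, 0, 1]
labels = consensus_maximization(preds, clusters)
```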
Ensemble Learning on Stream Data
Ensemble methods, which combine competing models learnt from labeled
data, have proven effective in many disciplines. My
contribution is to utilize ensemble methods to improve
classification performance on stream data, i.e., continuously
arriving data. In fact, many real applications, such as chronic
disease monitoring, traffic flow, and electricity meter readings,
generate such stream data, and our goal is to correctly classify an
incoming data record based on the model learnt from historical
labeled data. The challenge lies in distribution
evolution, or concept drift: one may never know
how or when the distribution changes. We proposed robust
model averaging frameworks that combine multiple supervised models, and
demonstrated both formally and empirically that they can reduce
generalization errors and outperform single models on stream data [SDM07, ICDM07, IEEEIC]. This work drew people's attention to
the inevitable concept drifts in data streams, showed how the
traditional approaches become inapplicable when data distributions
evolve continuously, and most importantly, demonstrated the power of
ensemble methods in stream data classification. A series of follow-up works tackled other challenges in stream data classification [ICDM08, PKDD10, ICDM10, ICDM11, KAIS11, TKDE11, TKDE12, SDM14], such as novel classes, time constraints, and dynamic feature spaces.
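A minimal sketch of model averaging under concept drift (hypothetical names and toy 1-D data, not the published frameworks): one model is kept per historical chunk, each is weighted by its accuracy on the most recent labeled chunk, and predictions are a weighted vote, so models trained on an outdated concept lose influence.

```python
class CentroidModel:
    """Tiny per-chunk classifier: predict the label of the nearest
    class centroid (1-D features for simplicity)."""
    def __init__(self, xs, ys):
        self.cent = {}
        for label in set(ys):
            pts = [x for x, y in zip(xs, ys) if y == label]
            self.cent[label] = sum(pts) / len(pts)

    def predict(self, x):
        return min(self.cent, key=lambda c: abs(x - self.cent[c]))

def chunk_weights(models, xs, ys):
    """Weight each model by its accuracy on the latest labeled chunk."""
    return [sum(m.predict(x) == y for x, y in zip(xs, ys)) / len(xs)
            for m in models]

def ensemble_predict(models, weights, x):
    """Weighted vote across all historical models."""
    votes = {}
    for m, w in zip(models, weights):
        y = m.predict(x)
        votes[y] = votes.get(y, 0.0) + w
    return max(votes, key=votes.get)

# Two historical chunks; the decision boundary drifts between them.
chunk1 = ([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1])   # boundary near 5
chunk2 = ([0.0, 1.0, 3.0, 4.0],  [0, 0, 1, 1])   # boundary near 2
models  = [CentroidModel(*chunk1), CentroidModel(*chunk2)]
weights = chunk_weights(models, *chunk2)          # favor the recent concept
pred = ensemble_predict(models, weights, 3.5)
```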
Multi-view Clustering and Fusion
Many real-world datasets consist of different representations, or views, which often provide information
complementary to each other. To integrate information
from multiple views in the unsupervised setting, multi-view clustering algorithms have been developed to cluster multiple views simultaneously to derive a solution
which uncovers the common latent structure shared by
multiple views. We proposed a novel NMF-based multi-view clustering algorithm by searching for a
factorization that gives compatible clustering solutions
across multiple views [SDM13]. We also proposed a multimodal feature fusion framework to construct meaningful feature sets from image and text views [CIKM13]. This framework is trained
as a combination of multi-modal deep networks with two
integral components: an ensemble of image descriptors and
a recursive bigram encoder with a fixed-length output feature
vector. The proposed framework can not only model
the unique characteristics of images or texts, but also take
into account their correlations at the semantic level.
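The benefit of fusing views before clustering can be illustrated with a pure-Python sketch (toy k-means as a stand-in for the NMF-based formulation; all data hypothetical): each view only weakly separates the objects, but clustering their concatenation recovers the shared latent grouping.

```python
def dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k=2, iters=20):
    """Toy k-means (init assumes the first and last points fall in
    different clusters; k is fixed at 2 in this sketch)."""
    cents = [points[0], points[-1]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, cents[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Two views of five objects; fusing the views sharpens the grouping.
view_text  = [[0.1], [0.2], [0.15], [0.9], [1.0]]
view_image = [[0.0], [0.1], [0.05], [0.8], [0.9]]
fused = [t + i for t, i in zip(view_text, view_image)]
labels = kmeans(fused)
```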
Multiple Source Transfer Learning
In many applications, it is
expensive or impossible to collect enough labeled data for accurate
classification in the domain of interest (the target domain); however,
abundant labeled data exist in some relevant domains (source
domains). For example, when recognizing gene names in biomedical
literature, we may want to recognize gene names related to a new
organism (e.g., honey bee), but we only have labeled data for some
well-studied old organisms (e.g., fly and yeast). The challenge is
that the data from the source domains may be in a different feature
space or follow a different data distribution compared with that in
the target domain. To solve this problem, we proposed a locally
weighted ensemble framework to adapt useful knowledge from multiple
source domains to the target domain [KDD08].
We developed and analyzed a new paradigm to effectively transfer knowledge from multiple sources when facing the challenges of imbalanced distributions and discrepancies between source and target label distributions [SDM13].
We proposed online transfer learning methods that transfer knowledge from multiple domains as data continuously arrive [CIKM13].
We also proposed transfer learning methods for the task of denoising biological networks [BIBM12] and mobile device based arrhythmia detection [BCB13].
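The idea of weighting source models by how well they fit the target domain's structure can be sketched as follows (a simplified stand-in for the locally weighted ensemble of [KDD08]; models and data are hypothetical): a source model whose predictions respect the target data's natural clusters receives high weight, while one that cuts across them is down-weighted.

```python
from collections import Counter

def structure_weight(model, points, clusters):
    """Score a source model on unlabeled target data: the fraction
    of points whose predicted label matches the majority prediction
    in their cluster. A model that cuts across the target domain's
    natural grouping receives low weight."""
    preds = [model(p) for p in points]
    majority = {}
    for c in set(clusters):
        majority[c] = Counter(p for p, cc in zip(preds, clusters)
                              if cc == c).most_common(1)[0][0]
    return sum(p == majority[c]
               for p, c in zip(preds, clusters)) / len(points)

# Two source-domain models; target data clusters around 0 and 10.
model_a = lambda x: int(x > 5)   # respects the target structure
model_b = lambda x: int(x % 2)   # cuts across both clusters
target   = [0, 1, 2, 9, 10, 11]
clusters = [0, 0, 0, 1, 1, 1]
w_a = structure_weight(model_a, target, clusters)
w_b = structure_weight(model_b, target, clusters)
```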
Outlier Detection in Networked Data
Representative work: [KDD10]
Networked data consists of node attribute values and relationships between nodes. For example, we can collect people's profiles and friends networks from online social networks. In networked data, closely
related objects that share the same properties or interests form a
community. For example, a community in the blogosphere could be users
mostly interested in cell phone reviews and news. Outlier detection
in networked data can reveal important anomalous and interesting
behavior that is not obvious if community information is ignored.
An example could be a low-income person who is friends with many rich
people, even though their income is not anomalously low when considered
over the entire population. To automatically detect such outliers,
we proposed probabilistic approaches [KDD10] to characterize the normal patterns deeply embedded in networked data and uncover abnormal behavior. We further proposed to analyze the evolutionary behavior of temporal networks, taking the time factor into consideration when identifying outliers [KDD12, ECML/PKDD12]. We later tackled outlier detection in heterogeneous information networks, in which nodes and links possess a variety of types [ASONAM13, ECML/PKDD13]. The critical points, events, or activities detected by the proposed approaches can greatly benefit the security and safety of the cyber and physical worlds.
Some of the research results are integrated into our book and conference tutorials.
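The community outlier example above can be sketched in a few lines (hypothetical data; a simplification of the probabilistic models in [KDD10]): a node is flagged when its attribute deviates strongly from its own community's distribution, even if the value is unremarkable for the population as a whole.

```python
from statistics import mean, pstdev

def community_outliers(values, communities, threshold=1.5):
    """Flag nodes whose attribute deviates from their community's
    mean by more than `threshold` standard deviations, even if the
    value is unremarkable for the population as a whole."""
    outliers = []
    for c in set(communities):
        vals = [v for v, cc in zip(values, communities) if cc == c]
        m, s = mean(vals), pstdev(vals)
        for i, (v, cc) in enumerate(zip(values, communities)):
            if cc == c and s > 0 and abs(v - m) / s > threshold:
                outliers.append(i)
    return sorted(outliers)

# Incomes (k$): node 4 earns 40k inside a high-income community; its
# income is normal globally but anomalous within its community.
incomes     = [35, 38, 37, 36, 40, 200, 210, 190, 205]
communities = [0,  0,  0,  0,  1,  1,   1,   1,   1]
flagged = community_outliers(incomes, communities)
```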
Information Network Analysis
Information networks refer to networks that are formed by individual components and their interactions. Examples include communication and computer systems, the Internet, biological networks, transportation systems, epidemic networks, criminal rings, and hidden terrorist networks.
Due to the prevalence and importance of these networks, information network analysis has received considerable attention recently. My research in this area focuses on the following three major analytical tasks:
1) Topic modeling: Identifying a set of popular topics discussed in information networks [ICDM09]; 2) Classification: Predicting the role of each node in an information network by learning from nodes with their roles labeled [ICDM13,ASONAM14];
3) Evolution analysis: Detecting and analyzing the evolution of dynamic networks [BCB12,ICDM13]. These challenging research problems were tackled by conducting integrative analysis of whole networks and analyzing both node and link behavior.
Discriminative Pattern Mining
Representative work: [KDD08]
Frequent patterns provide feature representations for datasets that lack well-structured feature vectors.
Traditional frequent pattern mining is performed in two sequential steps: enumerating a set of frequent patterns, followed by feature selection.
We proposed a novel one-step pattern mining approach that outputs highly compact and discriminative patterns [KDD08]. It builds a decision tree that partitions the data across different nodes. At each node, it directly discovers a discriminative pattern to further divide the node's examples into purer subsets. Since the number of examples toward the leaf level is relatively small, the new approach can examine patterns with extremely low global support that could not be enumerated on the whole dataset by the two-step method.
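The one-step idea can be sketched as follows (patterns restricted to single items for brevity, whereas the published approach searches larger itemsets; all data hypothetical): at each tree node, the pattern with the highest information gain on that node's examples is discovered directly, rather than enumerated in advance.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_pattern(rows, labels, items):
    """Directly pick the pattern (a single item here, for brevity)
    whose presence/absence split yields the highest information gain."""
    best, best_gain = None, 0.0
    for it in items:
        yes = [l for r, l in zip(rows, labels) if it in r]
        no  = [l for r, l in zip(rows, labels) if it not in r]
        if not yes or not no:
            continue
        gain = entropy(labels) - (len(yes) * entropy(yes) +
                                  len(no) * entropy(no)) / len(labels)
        if gain > best_gain:
            best, best_gain = it, gain
    return best

def mine_tree(rows, labels, items, depth=3):
    """Grow a decision tree, discovering one discriminative pattern
    per node; near the leaves each node holds few examples, so even
    patterns with tiny global support can still be found."""
    labels = list(labels)
    if depth == 0 or len(set(labels)) == 1:
        return sorted(set(labels))
    it = best_pattern(rows, labels, items)
    if it is None:
        return sorted(set(labels))
    yes = [(r, l) for r, l in zip(rows, labels) if it in r]
    no  = [(r, l) for r, l in zip(rows, labels) if it not in r]
    return {it: {"present": mine_tree(*zip(*yes), items, depth - 1),
                 "absent":  mine_tree(*zip(*no),  items, depth - 1)}}

# Transactions (item sets) with binary class labels.
rows   = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
labels = [1, 1, 0, 0]
tree = mine_tree(rows, labels, {"a", "b", "c"})
```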