Research
Projects during MS Study
|
|
Smart Miner: A New Framework for Mining Large Scale Web
Usage Data Problem: Web usage mining can be
defined as the application of data mining techniques to web log data in order
to discover user access patterns. It has various applications such as link
prediction, site reorganization and web personalization. The success of all
of these applications is significantly related to the outcomes of web usage
mining process which includes session construction and frequent navigation
pattern discovery phases. Among these phases, session reconstruction is the
most important part of web usage mining since the performance of any
application using frequent patterns or sessions are significantly affected by
session reconstruction phase. Previous approaches for session construction
didn’t consider forward link information between pages which is very
precious. The problem we attack here is to develop web usage mining framework
with session construction and pattern discovery phases which considers link
information between pages during the web user navigation. Another issue is
the scalability, popular web sites may have significant amount of web page
access daily which can only be handled by distributed systems. Approach: In order to solve the
problem above, we propose a novel framework called Smart-Miner which uses
link information for producing accurate user sessions and frequent navigation
patterns. Unlike the simple session concepts in previous methods, where
sessions are sequences of web pages requested from the server or viewed in
the browser, Smart Miner sessions are set of paths traversed in the web graph
that corresponds to users’ navigations among web pages. We have modeled
session construction as a new graph problem and utilized a new algorithm,
Smart-SRA to solve this problem efficiently. For the pattern discovery phase,
we have developed an efficient version of the Apriori-All technique which
uses the structure of web graph to increase the performance. In order to
solve the scalability issues, we have implemented distributed version of the
Smart Miner framework by employing Map/Reduce Paradigm on Hadoop framework. Contributions: Using both simulated and
real data, we have showed that in terms of our accuracy measure the maximal
frequent patterns discovered by Smart-Miner framework is at least 30% much better
than the ones obtained by web usage mining techniques utilizing previous
session construction methods. We also showed that frequent patterns generated
by Smart Miner framework has better performance for web usage mining
applications such as link prediction and decision support systems. Our
Experimental results imply that the run time performance of distributed
Smart-Miner increases linearly and it can be scaled-up easily to process any
size of data. We conclude that we can efficiently process terabytes of web
server logs belonging to multiple web sites by our scalable framework. Publications: ·
Murat
Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar, Guven Fidan: Smart Miner: a new
framework for mining large scale web usage data. WWW 2009: 161-170. ·
Murat
Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar: A New Approach for Reactive Web
Usage Data Processing. ICDE Workshops 2006: 44 |
|
|
Integration of topological measures for eliminating
non-specific interactions in protein interaction networks Problem: High-throughput protein
interaction assays aim to provide a comprehensive list of interactions that
govern the biological processes in a cell. These large-scale sets of
interactions, represented as protein-protein interaction networks, are often
analyzed by computational methods for detailed biological interpretation.
However, as a result of the tradeoff between speed and accuracy, the
interactions reported by high-throughput techniques occasionally include
false-positive interactions. Unfortunately, many computational methods are
sensitive to noise in protein interaction networks; and therefore they are
not able to make biologically accurate inferences. Approach: In this work, we propose a
novel technique based on integration of topological measures for removing
non-specific interactions in a large-scale protein-protein interaction
network. After transforming a given protein interaction network using line
graph transformation, we compute clustering coefficient and betweenness
centrality measures for all the edges in the network. Motivated by the
modular organization of specific protein interactions in a cell, we remove
edges with low clustering coefficient and high betweenness centrality values.
We also utilize confidence estimates that are provided by probabilistic
interaction prediction techniques. We validate our proposed method by
comparing the results of a molecular complex detection algorithm (MCODE) to a
ground truth set of known Saccharomyces cerevisiae complexes in the MIPS
complex cataloguedatabase. Contributions: Our results show that by
removing false-positive links in the protein interaction networks, we can
significantly increase the biological accuracy of the complexes reported by
MCODE. Publications: ·
Murat
Ali Bayir, Tacettin Dogacan Guney, Tolga Can: Integration of topological
measures for eliminating non-specific interactions in protein interaction
networks. Discrete Applied Mathematics 157(10): 2416-2424 (2009) |
|
|
Genetic
Algorithm for Multiple Query Optimization Problem Problem:
Producing answers to a set of queries with common tasks efficiently is known as
the multiple-query optimization (MQO) problem. Each query can have several
alternative evaluation plans, each with a different set of tasks. Therefore,
the goal of MQO is to choose the right set of plans for queries which
minimizes the total execution time by performing common tasks only once.
Since MQO is an NP-hard problem, several, mostly heuristics based, solutions
have been proposed for solving it. However previous approaches such as A*
algorithm for MQO problem has scalability problems. Approach:
One of the most popular meta heuristic techniques used for solving complex
optimization problems is the genetic algorithms (GA). Since its introduction
in early 80s, the GA has attracted a lot of interest from the computer
science community, and it has been successfully used for developing efficient
solutions too many NP-complete problems. Here we propose GA approach to the
MQO problem since MQO can easily be modeled for a GA. In the context of MQO,
a chromosome corresponds to a solution instance for the set of queries of the
MQO problem. In a chromosome, each gene of a chromosome represents a plan to
the corresponding query. Each gene of a chromosome corresponds to a query.
The value of the gene is the plan selected for the evaluation of the
corresponding query. To select the chromosomes for the next generation, the
quality of the solution represented by the chromosome is used. This quality
is represented by the fitness function, which is simply the inverse of the
total execution time of all the tasks in the selected plans for the queries.
Under this modeling, MQO is also very suitable for genetic operations to
solve the MQO problem. Contributions:
Our work presents evolutionary techniques for solving the MQO problem.
Successful A* heuristics have been used for finding optimal solutions to the
moderately small (up to ten queries) sized MQO problems. Our work shows that
GA is scalable and it is very good candidate for solving bigger MQO problems
including more than ten queries. We also showed that it is not practical to
solve complex MQO problems including large amount of queries with NP-Complete
approaches like A*. Publications: ·
Murat
Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar: Genetic Algorithm for the
Multiple-Query Optimization Problem. IEEE Transactions on Systems, Man, and
Cybernetics, Part C 37(1): 147-153 (2007) |