Research Projects during MS Study

[Back to Homepage]

 

SMiner.jpg

 

SMiner-2.jpg

 

Smart Miner: A New Framework for Mining Large Scale Web Usage Data

Problem: Web usage mining can be defined as the application of data mining techniques to web log data in order to discover user access patterns. It has various applications such as link prediction, site reorganization and web personalization. The success of all of these applications is significantly related to the outcomes of web usage mining process which includes session construction and frequent navigation pattern discovery phases. Among these phases, session reconstruction is the most important part of web usage mining since the performance of any application using frequent patterns or sessions are significantly affected by session reconstruction phase. Previous approaches for session construction didn’t consider forward link information between pages which is very precious. The problem we attack here is to develop web usage mining framework with session construction and pattern discovery phases which considers link information between pages during the web user navigation. Another issue is the scalability, popular web sites may have significant amount of web page access daily which can only be handled by distributed systems.

 

Approach: In order to solve the problem above, we propose a novel framework called Smart-Miner which uses link information for producing accurate user sessions and frequent navigation patterns. Unlike the simple session concepts in previous methods, where sessions are sequences of web pages requested from the server or viewed in the browser, Smart Miner sessions are set of paths traversed in the web graph that corresponds to users’ navigations among web pages. We have modeled session construction as a new graph problem and utilized a new algorithm, Smart-SRA to solve this problem efficiently. For the pattern discovery phase, we have developed an efficient version of the Apriori-All technique which uses the structure of web graph to increase the performance. In order to solve the scalability issues, we have implemented distributed version of the Smart Miner framework by employing Map/Reduce Paradigm on Hadoop framework.

 

Contributions: Using both simulated and real data, we have showed that in terms of our accuracy measure the maximal frequent patterns discovered by Smart-Miner framework is at least 30% much better than the ones obtained by web usage mining techniques utilizing previous session construction methods. We also showed that frequent patterns generated by Smart Miner framework has better performance for web usage mining applications such as link prediction and decision support systems. Our Experimental results imply that the run time performance of distributed Smart-Miner increases linearly and it can be scaled-up easily to process any size of data. We conclude that we can efficiently process terabytes of web server logs belonging to multiple web sites by our scalable framework.

 

Publications:

 

·         Murat Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar, Guven Fidan: Smart Miner: a new framework for mining large scale web usage data. WWW 2009: 161-170.

 

·         Murat Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar: A New Approach for Reactive Web Usage Data Processing. ICDE Workshops 2006: 44

 

 

 

PPINetwork.jpg

graphBio.jpg

Integration of topological measures for eliminating non-specific interactions in protein interaction networks

Problem: High-throughput protein interaction assays aim to provide a comprehensive list of interactions that govern the biological processes in a cell. These large-scale sets of interactions, represented as protein-protein interaction networks, are often analyzed by computational methods for detailed biological interpretation. However, as a result of the tradeoff between speed and accuracy, the interactions reported by high-throughput techniques occasionally include false-positive interactions. Unfortunately, many computational methods are sensitive to noise in protein interaction networks; and therefore they are not able to make biologically accurate inferences.

 

Approach: In this work, we propose a novel technique based on integration of topological measures for removing non-specific interactions in a large-scale protein-protein interaction network. After transforming a given protein interaction network using line graph transformation, we compute clustering coefficient and betweenness centrality measures for all the edges in the network. Motivated by the modular organization of specific protein interactions in a cell, we remove edges with low clustering coefficient and high betweenness centrality values. We also utilize confidence estimates that are provided by probabilistic interaction prediction techniques. We validate our proposed method by comparing the results of a molecular complex detection algorithm (MCODE) to a ground truth set of known Saccharomyces cerevisiae complexes in the MIPS complex cataloguedatabase.

 

Contributions: Our results show that by removing false-positive links in the protein interaction networks, we can significantly increase the biological accuracy of the complexes reported by MCODE.

 

Publications:

 

·         Murat Ali Bayir, Tacettin Dogacan Guney, Tolga Can: Integration of topological measures for eliminating non-specific interactions in protein interaction networks. Discrete Applied Mathematics 157(10): 2416-2424 (2009)

 

Genetic-1.jpg

 

Genetic-2.jpg

Genetic Algorithm for Multiple Query Optimization Problem

Problem: Producing answers to a set of queries with common tasks efficiently is known as the multiple-query optimization (MQO) problem. Each query can have several alternative evaluation plans, each with a different set of tasks. Therefore, the goal of MQO is to choose the right set of plans for queries which minimizes the total execution time by performing common tasks only once. Since MQO is an NP-hard problem, several, mostly heuristics based, solutions have been proposed for solving it. However previous approaches such as A* algorithm for MQO problem has scalability problems.

Approach: One of the most popular meta heuristic techniques used for solving complex optimization problems is the genetic algorithms (GA). Since its introduction in early 80s, the GA has attracted a lot of interest from the computer science community, and it has been successfully used for developing efficient solutions too many NP-complete problems. Here we propose GA approach to the MQO problem since MQO can easily be modeled for a GA. In the context of MQO, a chromosome corresponds to a solution instance for the set of queries of the MQO problem. In a chromosome, each gene of a chromosome represents a plan to the corresponding query. Each gene of a chromosome corresponds to a query. The value of the gene is the plan selected for the evaluation of the corresponding query. To select the chromosomes for the next generation, the quality of the solution represented by the chromosome is used. This quality is represented by the fitness function, which is simply the inverse of the total execution time of all the tasks in the selected plans for the queries. Under this modeling, MQO is also very suitable for genetic operations to solve the MQO problem.

Contributions: Our work presents evolutionary techniques for solving the MQO problem. Successful A* heuristics have been used for finding optimal solutions to the moderately small (up to ten queries) sized MQO problems. Our work shows that GA is scalable and it is very good candidate for solving bigger MQO problems including more than ten queries. We also showed that it is not practical to solve complex MQO problems including large amount of queries with NP-Complete approaches like A*.

 

Publications:

 

·         Murat Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar: Genetic Algorithm for the Multiple-Query Optimization Problem. IEEE Transactions on Systems, Man, and Cybernetics, Part C 37(1): 147-153 (2007)

 

[Back to Homepage]