My Ph.D. dissertation deals with the design and implementation of a comprehensive checkpointing and recovery scheme for distributed systems. It covers three major areas: distributed systems, software engineering, and computer architecture. The error detection aspects of the dissertation involve architectural design. The checkpointing and recovery component deals with distributed systems. The object-oriented testbed, designed and implemented to evaluate the performance of my design, applies software engineering principles.
Dissertation Title: Hardware-Assisted Rollback Recovery in Distributed Systems.
The dissertation addresses issues related to error detection and rollback recovery in distributed systems. A comprehensive scheme consisting of a novel error detection technique, an innovative message validation method, and an efficient rollback recovery technique has been designed, implemented and evaluated.
A low-latency, time- and space-efficient technique for error detection based on signature analysis is presented. A known signature function is applied to the instruction stream during compilation, and whenever the accumulated signature forms an $m$-out-of-$n$ code, the corresponding instruction is tagged. Error checking is done at runtime by monitoring the signatures accumulated at the tagged locations to determine whether they form $m$-out-of-$n$ codes. The proposed technique eliminates reference signatures, which are the main source of memory and time overhead in concurrent monitoring.
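As a rough illustration of the tagging and checking steps described above, the following Python sketch uses a hypothetical 8-bit rotate-and-XOR signature with $m = 4$. The signature function, the signature width, and all names (`step`, `tag_locations`, `first_error`) are assumptions made for illustration; they are not the function or the hardware mechanism used in the dissertation.

```python
from typing import List, Optional

N_BITS = 8   # assumed signature width n (hypothetical)
M_ONES = 4   # assumed code weight m (hypothetical)
MASK = (1 << N_BITS) - 1


def step(sig: int, instr: int) -> int:
    """Hypothetical signature function: rotate left by one bit, then XOR in the instruction word."""
    rotated = ((sig << 1) | (sig >> (N_BITS - 1))) & MASK
    return rotated ^ (instr & MASK)


def is_m_out_of_n(sig: int) -> bool:
    """True when exactly M_ONES of the N_BITS signature bits are set."""
    return bin(sig).count("1") == M_ONES


def tag_locations(instrs: List[int]) -> List[int]:
    """'Compile time': tag every index where the accumulated signature is an m-out-of-n code."""
    sig = 0
    tags = []
    for i, instr in enumerate(instrs):
        sig = step(sig, instr)
        if is_m_out_of_n(sig):
            tags.append(i)
    return tags


def first_error(instrs: List[int], tags: List[int]) -> Optional[int]:
    """'Run time': re-accumulate the signature and check it only at tagged locations.

    Returns the first tagged index whose signature is not an m-out-of-n code,
    or None if every check passes. No reference signatures are stored.
    """
    tagged = set(tags)
    sig = 0
    for i, instr in enumerate(instrs):
        sig = step(sig, instr)
        if i in tagged and not is_m_out_of_n(sig):
            return i
    return None
```

With this particular linear signature, a single flipped bit in an instruction word changes the weight of every later signature by exactly one, so the error is caught at the next tagged location. Other signature functions would give different coverage and latency.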
Rollback recovery in a distributed system involves rolling back one or more processes to a consistent state and resuming execution from that state. For responsive computation, both the rollback distance and the number of processes involved in the rollback must be contained. The rollback distance of the process in error is reduced with the aid of the low-latency error detection scheme. The number of processes rolled back is limited by a novel message validation mechanism. The error detection and validation schemes are integrated into the recovery process to form a Comprehensive Recovery Protocol (CReP) that can be readily used to implement recovery in distributed systems.
An object-oriented testbed (OTEC) for simulation and analysis of CReP has been designed and implemented. OTEC is an extension of DEPEND, an integrated design and fault-injection environment. Unlike existing simulation tools, OTEC allows system-level dependability analysis under realistic fault conditions.
I have diversified into applied research involving design with the Open Software Foundation's Distributed Computing Environment (OSF's DCE). More specifically, I am interested in evaluating the viability and performance of three-tier client-server models built on OSF's DCE. The issues to be addressed in this research are the three-tier model itself, the security provided by DCE, the use of multiple threads for service, cross-platform service (currently only Solaris and Windows NT), and plugging in a realistic database server (such as Oracle). I am guiding two undergraduates on this project. Two DCE vendors, Transarc and Gradient, have provided free evaluation licenses for their versions of DCE. Both students working with me are graduating this semester: one is going on to graduate school with plans to continue research in the same area (DCE), and the other is going to work at IBM in a related area (CORBA). These two undergraduates chose me over the other faculty for their independent study research. I plan to continue my research in this area to explore building model systems using OSF's DCE. Many campus-wide computing organizations are interested in this project, and we recently presented one of our pilot projects to a campus-wide DCE group. This research has excellent potential for attracting grants from industry.
The checkpointing and recovery scheme designed for my dissertation provides the basis for interesting theoretical work. An important contribution of this work is the notion of {\em safety} of the error detection points of the various processes involved in a distributed computation. An error detection point of a process is safe if the process need not be rolled back beyond this point during recovery. A global state representing the safe detection points of the various processes is maintained using a matrix structure. Checkpointing during normal system operation, the safety criteria, and recovery are formally explained in terms of this matrix. I am currently working on a journal paper on this topic. This work provides a theoretical framework for analyzing similar schemes in distributed systems.
The object-oriented testbed (OTEC) designed for my dissertation is a useful tool for practitioners and designers in the area of checkpointing and recovery. I am refining the design and adding more facilities and a graphical user interface so that it can be made publicly available. I hope to gain experience in designing robust software packages using software engineering principles. The package is written in C++, and we plan to use Xlib for the graphical user interface. A future graduate student of the Computer Science department is working on this project under my guidance. A technical paper on this topic has been accepted for presentation at the International Symposium on Fault-Tolerant Computing (FTCS-27).
My special research interest is in computer architecture. I have taught RISC architecture in all my hardware-based courses since its introduction. I am particularly interested in superscalar architectures and architectures for multithreading. With the proliferation of client-server computing and the Internet, applications are becoming increasingly multithreaded. However, the underlying architectures are designed for a single thread of execution. I have ideas for exploring compiler technology and superscalar architectures that support multiple threads. Most solutions for control and data dependencies in superscalar architectures assume a single thread of control. By presenting multiple independent threads to the instruction dispatcher, with additional logic for keeping track of the various threads, it should be possible to increase the issue rate as well as processor utilization. I recently taught a graduate course in computer architecture that helped me gain useful insights into recent trends in this area. I will be seeking funding from industry as well as federal sources such as DARPA (see DARPA BAA97-03) and NSF for research projects in this area.
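The expected gain from presenting several independent threads to the dispatcher can be illustrated with a toy model. The Python sketch below is only an assumption-laden illustration: it treats every instruction as dependent on its predecessor in the same thread (so a thread can issue at most one instruction per cycle), uses a fixed-priority dispatcher, and ignores caches, branches, and functional-unit limits. The names `cycles_to_finish` and `utilization` are hypothetical.

```python
from typing import List


def cycles_to_finish(thread_lengths: List[int], issue_width: int) -> int:
    """Cycles needed to issue all instructions under the toy model.

    Assumption: each instruction depends on the previous one in its own thread,
    so a thread contributes at most one instruction per cycle. Threads are
    independent of each other and are served in fixed index order.
    """
    remaining = list(thread_lengths)
    cycles = 0
    while any(remaining):
        slots = issue_width
        for t in range(len(remaining)):
            if slots == 0:
                break
            if remaining[t] > 0:
                remaining[t] -= 1   # issue one instruction from thread t
                slots -= 1
        cycles += 1
    return cycles


def utilization(thread_lengths: List[int], issue_width: int) -> float:
    """Fraction of issue slots actually used over the whole run."""
    cycles = cycles_to_finish(thread_lengths, issue_width)
    if cycles == 0:
        return 0.0
    return sum(thread_lengths) / (cycles * issue_width)
```

In this model, a single 8-instruction thread on a 4-wide machine uses one issue slot in four, while four such threads fill every slot. Real workloads have some instruction-level parallelism within each thread, so the gap would be smaller in practice.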
In summary, my research interests cover three very interesting areas that have many open problems and offer ample opportunity for applied research and for attracting research grants.