CSE 250 Spring 2004

Programming Project #2

Performance comparison: linear recursive structure
vs.
STL list

Important note

It goes without saying that you must not alter the code provided to you in any way. You can include the code given code along with your submission, but be prepared that we will replace it with our copy of the code and recompile.

Introduction

Congratulations! You have been hired by an upstart Hollywood North studio. It turns out that when they do the casting for a new movie it is important to find names of actors in certain age groups to encourage them to audition for parts. You have just joined their software development team. Your job is to help their crackerjack software developers test two alternate data structures empirically to determine which will provide the best response time to queries.

The company is using data about actors stored in a simple (but large) text file. They wish to read this data into a data structure so that they can extract information from it. For the purposes of your performance testing you need to store the name of the actor as a string, their date of birth, and their date of death. You can choose whatever representation you wish for the date as long as your chosen representation will support the operations outlined below. If an actor is still living they do not have a date of death. You may choose to represent this however you choose in your structure, with the same restriction as above: you must support the operations outlined below.

The actor's name will serve as a key into data structure, and the dates for the actor are the associated data value. If you find more than one entry in the data file with the same name string, ignore any after the first one. Doing things this way you can assume that in your data structure the keys are unique. The key-value pairs must be stored in the data structure ordered by their keys (i.e. in alphabetical order by the actors' full name).

The company wants to be sure that the time required to search for actors which match a given set of requirements (based on date of birth and date of death in your empirical testing project) is reasonable. The management, against the advice of their software developers, decided some time ago to invested heavily in linked-list technology at the expense of any other data structures, so you are required to use a linked-list. They are interested in knowing the run-time performance tradeoffs between the C++ Standard Template Library list implementation and the state-based object-oriented Linear Recursive Structure implementation (see Note 1).

Main tasks

You must define two implementations of the following interface (since we're working in C++, this is specified in an abstract class named Container):
void insert(KT *, VT *);
void remove(KT *);
pair * find(KT *);
Container * filterOnKey(Predicate * p);
Container * filterOnValue(Predicate * p);
One implementation must be based on the LRStruct, and one on the Standard Template Library (STL) list.

Your LRStruct-based structure ABSOLUTELY MUST BE NAMED OrderedLRS, and one on the Standard Template Library (STL) list ABSOLUTELY MUST BE NAMED Orderedlist. Make careful note of the spelling and capitalization. Note that each of these classes must subclass Container and must define the methods of Container. If you do not follow these requirements the code which the TAs will use to grade the functionality of your code will fail and you will receive no credit for that part of the project.

The methods must provide the following functionality:

Gathering empirical data

In this project you will carry out an experiment which compares the behavior of two different list implementations. You will populate each data structure with data (actor names as keys and dates as values) and measure the actual time required to perform certain sequences of operations on the data structures, for various problem sizes. The data file you are using is discussed in the File reading section below.

In order to perform the experiment, you will perform the same sequence of operations on both data structures, and record the total time required to complete the sequence of operations, from which you can compute an average time per operation. You will repeat the experiment with data structures of different sizes.

In order to test the lists, you will use a class named ContainerCommandSequencewhich holds a sequence of ContainerCommands. The ContainerCommandSequence class provides two execute methods (executeTimeEach and executeTimeAll), both of which execute each command in the sequence on a given Container and report an execution time. The two methods measure the time slightly differently. executeTimeEach times each command individually, while executeTimeAll measures the overall time.

You will build a variety of ContainerCommandSequence objects, and execute each on both of your data structures, measuring the time in each case. Note that I am not specifying explicitly what performance tests you must run. Part of your task is to decide what performance tests are useful to run. You will be graded in part on the quality of your tests. You must provide a short write-up on the testing you have done, comparing the empirical results for each list implementation to each other and also to the theoretical runtime analysis (big-oh worst-case analysis).

ContainerCommand and its subclasses

The ContainerCommand class and most of its subclasses are provided to you. ContainerCommand's subclasses are FindCommand, InsertCommand, RemoveCommand, KeyFilterCommand and ValueFilterCommand. Each overrides the execute method ContainerCommand. Each also provides a constructor which provides relevant data for the command. See the code for details.

ContainerCommandSequence

A ContainerCommandSequence is a simple wrapper around an STL vector of ContainerCommand objects. This class is provided to you.

Partial Class Diagram

Here is a partial class diagram, showing you the basic relationships between the classes in the code I am providing to you. Note that not all details are shown. In particular, some details of the templating are not shown. Not all classes are represented in the diagram. Use this as a rough guide only to help you navigate the code. The LRSList code is provided mostly as a model of how you can build your list structures, and is not intended to be used directly.

UML class diagram

Base code

I am providing a base of code for you to work from. Copy the PP2.zip file from /projects/CSE250.

Advice

General advice

Make sure you read the project description carefully and thoroughly before starting the project. Also, be sure that you analyse what you need to do, and make a design long before you start coding. Not every aspect of this project is spelled out in this description!

Leverage what you know about object oriented design in doing your design. The time you invest in design up-front will pay off in significantly decreased time required to code and debug. Notice too, that I have done part of the design for you.

Design, implement and test your project incrementally!

C++ advice

Pointers

Main advice: make sure you initialize them!

Strings

Use the C++ <string> class, not C-style arrays. C-style arrays are unsafe.

Overloading the << operator

Overload this operator as appropriate so you can stream objects of classes you have defined to an output stream, such as cout.

#include and forward declarations

Have a look at the #include directives and forward declarations in LRStruct.h and IAlgo.h.

separating interface from implementation using .h and .cc files

You need to do this to be able to properly include files.

function prototypes

Functions must be declared before they are used. You can either place a function's defintion before its use, or declare the function before you use it (with a promise to provide a definition later) by providing a function prototype.

Virtual methods

To get polymorphic dispatch to work properly (i.e. to get the method definition associated with the actual type of an object rather than the declared type of the variable used to invoke the method) you must declare methods virtual.

Abstract methods and = 0

= 0 is the C++ way to express what Java expresses through abstract in a method header. For example, in Java:
   abstract int foo(double x);
in C++:
   int foo(double x) = 0;

Makefiles

To be discussed in more detail recitation. Make sure you compile with -template=wholeclass if you're using CC.

File reading

You will read the data from the file /projects/alphonce/IMDB/bioShort.list This file is a truncated form of the complete current biographies.list, and contains data from www.imdb.com. See this site for allowable uses of their data. I truncated the biographies.list file because it is so large (148 Mb) that standard editing tools cannot even process it (!).

You must read the comments at the top of this file to learn the structure of it. Your program must read directly from this file, not a "processed" version. Your program will read in only those records for which both the NM and DB fields are available. If they are not, skip the record and go on to the next one. The DD field may or may not be present. If it is not your program should assume that they are still alive.

Here's some information about how to read data from a file in C++. To read from a file you will need to use ifsteams. You can read about them in some on-line documentation at:

http://www-local.cse.buffalo.edu/Consulting/UNIX/Developer-6/manuals/stdlib/user_guide/loc_io/9.htm
Here's an example of using ifstreams:
#include 
#include 
#include 

using namespace std;

// File reading modified from: C++ Primer, 3rd edition Lippman and Lajoie
// page 1099.

int main() {
  cout << "Please enter a filename: ";
  string fileName;
  cin >> fileName;

  // Check to make sure filename is not empty
  if (fileName.compare("") == 0) {
    cout << "Bailing! \n";
    return -1;
  }

  // Filename is not empty - let's try to open the stream
  // Note the conversion of the filename from a C++ string
  // to a C-style string.
  ifstream inFile( fileName.c_str() ) ;

  // Check whether the file was opened successfully
  if (!inFile) {
    cerr << "unable to open input file: "
         << fileName << " -- bailing out!\n";
    return 1;
  } else {
    char ch;

    // Here we read a character from the input file and echo it to cout.
    while (inFile.get( ch )){
      cout << ch;
    }
    return 0;
  }
}

Design patterns

The code you are using and writing for this lab employs several design patterns, including most but not all listed below. The patterns listed below are linked to pages which discuss them. You can find more pages by doing a web search.

Questions?

If you have questions, please ask! If you need clarification or would like more information, please ask! Send e-mail to cse-250-alphonce@cse.buffalo.edu

Submission

Place all and only the files you intend to submit in a directory named PP2. You must name your directory PP2. If you do not you risk having points taken off.

The PP2 directory must include only your source code (declaration and implementation) files as well as your makefile and your runtime report (which must be a plain text file, such as you would produce writing in Emacs). Do NOT submit reports in any other format.

Your submission must not include object files, executable files or any files whose filenames begin with `#' or end in `~'. You also must not include the SunWS_cache directory and its contents in your submission.

Use the zip utility to create a file named `PP2.zip'. From the directory containing the `PP2' directory, issue the command

    zip -r PP2 PP2
Submit only the resultant `PP2.zip' file.

Submit only plain text (ASCII) files. Do not submit files in any other format, such as Microsoft Word. They will not be graded.

Use the submit_cse250 command to submit your work.

This project is due on or before 11:59:59 PM on March 1, 2004.


Notes


Carl G. Alphonce
Last modified: Wed Feb 11 01:00:41 EST 2004