Listed below are some papers that are suitable for presentations (unless mentioned otherwise), along with some background information. If you prefer to see a proper "list", you can take a look here (however, note that we will cover some of them in the lecture part, and so you will not be able to present them in class). Another good resource for papers is to browse the bibliographies of the papers mentioned below.

This section talks about papers that are arguably the classics for data streams. We will be covering most of the papers during the lectures (except for [MP78] and [GM99], which can be used for student presentations).

Perhaps the most influential of them all is the one due to Alon, Matias and Szegedy [AMS96]. This paper received the Gödel prize in 2005. Here is the citation.
This paper is about approximating frequency moments of items in a data stream.
It was perhaps the first paper to use communication complexity lower bounds to prove
limits on how well the frequency moments can be approximated.
The algorithms for computing frequency moments also fall into two broad
algorithmic techniques that we will see a lot of in the course:
the algorithm for *F _{k}* is based on sampling, while the algorithm
for *F _{2}* is based on sketching.
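
To make the sketching side concrete, here is a minimal, idealized version of the [AMS96] "tug-of-war" estimator for *F _{2}*. Note that the paper uses 4-wise independent hash functions for the random signs; this sketch simply draws (and stores) fully random signs per item for readability, which of course defeats the small-space guarantee.

```python
import random
from collections import Counter

def ams_f2_estimate(stream, num_estimators=400, seed=0):
    """Estimate F_2 = sum_i f_i^2 with the AMS tug-of-war sketch.

    Each estimator keeps one counter Z = sum_i s(i) * f_i, where
    s(i) in {-1, +1} is a random sign for item i.  Then E[Z^2] = F_2
    and Var[Z^2] <= 2 * F_2^2, so averaging Z^2 over independent
    estimators concentrates around F_2.
    """
    rng = random.Random(seed)
    signs = [{} for _ in range(num_estimators)]  # lazily drawn random signs
    z = [0] * num_estimators
    for item in stream:
        for j in range(num_estimators):
            if item not in signs[j]:
                signs[j][item] = rng.choice((-1, 1))
            z[j] += signs[j][item]
    return sum(zj * zj for zj in z) / num_estimators

stream = [1, 2, 1, 3, 1, 2] * 50  # frequencies 150, 100, 50
true_f2 = sum(c * c for c in Counter(stream).values())  # 35000
estimate = ams_f2_estimate(stream)
```

The point of the 4-wise independent hash functions in the paper is precisely to generate the signs in *O(log n)* space instead of the per-item dictionary used above.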

The Henzinger et al. paper ([HRR98]) introduced the more "general" form of data stream where the algorithm can take multiple passes. The paper proves lower bounds for many problems, including graph-theoretic ones (such problems apparently have not gotten much attention in the literature and might have network monitoring applications), using the connection to communication complexity. We might or might not have time to go through this paper. If not, it can be used for a student presentation.

The other two papers mentioned in the historical section of Muthu's paper, which I have not read at all (but which would be good presentation papers), are as follows. The Munro-Paterson paper ([MP78]), which I could not find online, gave, among other things, algorithms for estimating quantiles. The Gibbons-Matias paper ([GM99]) introduced the concept of a synopsis data structure.
I will also put Indyk's paper on stable distributions ([I00]) here, which we will cover in the lectures.
The paper gives an algorithm to compute *F _{p}* for any *p* in the range *(0,2]*.
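
Here is an idealized sketch of the stable-distribution idea for the *p = 1* case (the paper itself uses bounded-precision pseudorandom stable variables; this version draws, and stores, exact Cauchy coefficients per item purely for illustration):

```python
import math
import random
from collections import defaultdict

def l1_norm_estimate(updates, num_projections=1001, seed=0):
    """Estimate ||f||_1 of a turnstile stream of (item, delta) updates.

    Each counter c_j = sum_i f_i * X_{j,i} with X_{j,i} standard Cauchy
    (1-stable), so c_j is distributed as ||f||_1 times a standard Cauchy
    variable.  Since the median of |standard Cauchy| is 1, the median of
    |c_j| over many projections estimates ||f||_1.
    """
    rng = random.Random(seed)
    coeffs = defaultdict(dict)  # projection j -> item -> Cauchy coefficient
    c = [0.0] * num_projections
    for item, delta in updates:
        for j in range(num_projections):
            if item not in coeffs[j]:
                u = rng.random()
                coeffs[j][item] = math.tan(math.pi * (u - 0.5))  # Cauchy sample
            c[j] += delta * coeffs[j][item]
    return sorted(abs(x) for x in c)[num_projections // 2]

# frequency vector a: +10, b: -5, c: +3  =>  ||f||_1 = 18
updates = [("a", 10), ("b", -5), ("c", 3)]
estimate = l1_norm_estimate(updates)
```

Note that deletions (negative deltas) are handled for free, which is one of the attractions of the sketching approach.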

The "usual" data stream model is where one has a *read-only* data
stream on which the algorithm can make multiple passes (with the usual
restrictions on memory usage; this small space requirement will be common to
all models that we'll talk about in this seminar). During the lectures,
we will mostly focus on the *single-pass* model. Technically speaking, I guess
the multiple-pass model was first defined in [HRR98].

In many applications, data "decays" over time, i.e., the more recent data
is more relevant for computation. There have been models proposed to
extend the data stream model to these situations. One model is where there
is a strict *sliding window* such that the computation needs to be done
only on the items that lie within that window. The model was proposed in
[DGIM].
Interestingly, the
lower bound arguments do *not* use communication complexity.
A less "abrupt" way to handle this decay is to specify a decay function
that reduces the "importance" of an item in the data stream, the "older"
it gets ([CM06],
[CKT08]).
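
As a toy illustration of the decay-function idea (this is not the algorithms of [CM06] or [CKT08]), exponentially decayed counts can be maintained in *O(1)* time per arrival by timestamping each counter lazily:

```python
def decayed_counts(stream, decay=0.99):
    """Exponentially decayed frequencies: conceptually, every count is
    multiplied by `decay` at each time step, so older occurrences weigh
    less.  Implemented lazily: each counter stores the time it was last
    touched and is "caught up" only when it is next needed."""
    counts = {}  # item -> (decayed count as of last_t, last_t)
    t = 0
    for item in stream:
        t += 1
        c, last_t = counts.get(item, (0.0, t))
        counts[item] = (c * decay ** (t - last_t) + 1.0, t)
    # bring every counter up to the current time t
    return {i: c * decay ** (t - lt) for i, (c, lt) in counts.items()}
```

For example, with `decay=0.5` the stream `['a', 'a', 'b']` yields a decayed count of 0.75 for `a` (the first occurrence has been halved twice, the second once) and 1.0 for the just-arrived `b`.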

The paper [ADRR04] (also see Chapter 6 in the thesis [R03]) introduces a model where sorting of the entire data stream has "unit cost." Note that in this model, computing frequency moments becomes trivial. However, the paper considers problems that are still interesting (and non-trivial) in this model. The thesis chapter also considers some other streaming models where there are multiple streams (and one stream can be an input to another).

The paper [FMSSS06] introduces a data stream model (called "MUD") that models the MapReduce framework ([DG04]) used at Google.

All the models discussed till now deal with *read-only* streams. This is
somewhat of a modeling "problem" when dealing with database applications, where
the data is actually stored in slow disk memory (on which random accesses are
expensive but sequential accesses, i.e. *passes*, are cheap) and the
algorithm can also *write* to disk. The paper [GKS05] introduced a data
stream model where the stream is also writable (though the number of passes
is still limited). In this paper, however, there is only *one* stream. The
paper uses communication complexity to prove lower bounds. The model was
extended to *multiple* read/write streams in [GS05].
This paper proved lower bounds for deterministic algorithms. More interestingly,
the usual communication complexity argument fails miserably here, so the authors
used combinatorial arguments built from scratch. The results
were extended to randomized algorithms with one-sided error ([GHS06])
and two-sided error ([BJR07]). All
these papers consider some very database-specific problems (e.g., determining
whether the join of two tables is empty or not).

The main technique used to obtain lower bounds for data stream algorithms is the use of communication complexity lower bounds. We will see this connection in the lectures when we talk about the [AMS96] paper.

As mentioned earlier [DGIM] uses a non-communication complexity argument for lower bounds.

The paper [BJKS02] uses tools from information theory to derive results in communication complexity. The connection to data streams is that communication complexity lower bounds translate into space lower bounds for data stream algorithms. So I think this might not be that suitable a paper for presentation unless someone is really interested in communication complexity.

There are some new models where communication complexity does not work. See the section on alternate models.

For approximating *F _{k}* for *k > 2*, the best known space lower bounds (of the form *n^{1-2/k}*, up to lower-order terms) come from this information-theoretic approach.

Counting the number of distinct elements in a column of a relational
table is a fundamental problem in databases. On the networking side,
the number of distinct source addresses/subnets (or destination
addresses for that matter) is a useful statistic for net admin. For approximating *F _{0}*, i.e. the number of distinct elements,
the best upper bound is in the paper [BJKST02]
while the best lower bound is in the paper [IW03] (also see the thesis [W07]).
For this problem, the more interesting question is the dependence of the space
usage on the error parameter *e* rather than on *n*.
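
To make the *F _{0}* problem concrete, here is a small k-minimum-values sketch (in the spirit of, but not taken from, the papers above): hash each item pseudo-uniformly into *[0,1)* and keep only the *k* smallest distinct hash values.

```python
import hashlib

def kmv_distinct_estimate(stream, k=64):
    """k-minimum-values (KMV) sketch for distinct counting.

    Hash each item to a pseudo-uniform value in [0, 1) and keep the k
    smallest *distinct* hash values.  With D distinct items, the k-th
    smallest of D uniform samples sits around k/D, so if that value is
    v, then D is estimated as (k - 1) / v.
    """
    smallest = set()
    threshold = 1.0
    for item in stream:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 16 ** 40
        if h < threshold and h not in smallest:
            smallest.add(h)
            if len(smallest) > k:
                smallest.remove(max(smallest))
                threshold = max(smallest)
    if len(smallest) < k:  # fewer than k distinct items seen: exact count
        return len(smallest)
    return (k - 1) / max(smallest)
```

The sketch uses *O(k)* space regardless of the stream length, and duplicates have no effect since a repeated item hashes to the same value.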

The paper [W04] gives a lower bound
for approximating any frequency moment to within a *(1 ± e)*
factor in terms of *e*.

In some (most?) applications, instead of the frequency moments, what is needed are the
elements that occur "more frequently." More precisely, the *heavy hitters*
problem is to output all elements that occur in more than some fixed fraction of
the elements in the stream. A related problem is identifying the
*top-k* elements (which is what you would think it should be).

The paper [MAA05] has results for both the
problems. This paper seems to have the current best bounds. The work in this
space seems to be divided into two worlds: those that maintain *counters*
([MG82], [DLM02], [KSP03],
[MM02]) and those that maintain *sketches*
([CCF02], [CM03]).
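
The counter-based approach is simple enough to state in full; here is a short version of the [MG82] algorithm (the variant with *k - 1* counters), which guarantees that every element occurring more than *m/k* times in a stream of length *m* survives in the counter set, with each stored count undercounting the true frequency by at most *m/k*:

```python
def misra_gries(stream, k):
    """Misra-Gries heavy hitters with at most k - 1 counters.

    A seen item increments its counter; an unseen item either claims a
    free counter or, if none is free, decrements *every* counter (items
    hitting zero are evicted).  Each decrement step destroys k distinct
    occurrences, so it can happen at most m/k times.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, on the stream `['a']*6 + ['b', 'c']` with `k=2` (a single counter), `a` survives with stored count 4 against true frequency 6, consistent with the *m/k = 4* undercount bound.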

Histograms are used heavily in database engines to help query evaluation. Thus, the study of data stream algorithms to compute (near-optimal) histograms is well motivated. There are different types/definitions of histograms. Two popular ones are as follows.

The so-called *V-optimal* histograms try to minimize the
"sum of squared errors". The paper [GGIKMS02] deals with this type of histogram.
Say we are interested in building a histogram with *B* buckets:
[GGIKMS02] first builds a histogram with *poly(B, log n)*
buckets and then from it builds a histogram with the required *B* buckets.
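
For intuition about what "V-optimal" means, here is the standard offline dynamic program (not the streaming algorithm of [GGIKMS02], which approximates it in small space): it finds the *B*-bucket contiguous partition minimizing the total sum of squared errors, where each bucket is summarized by its mean.

```python
def v_optimal_histogram(values, num_buckets):
    """Exact V-optimal histogram via O(n^2 * B) dynamic programming."""
    n = len(values)
    # prefix sums of values and their squares, for O(1) interval SSE
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def sse(i, j):  # sum of squared deviations from the mean on values[i:j]
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # cost[b][j] = best SSE for values[:j] using b buckets
    cost = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):  # last bucket is values[i:j]
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i
    # recover the bucket boundaries by walking the cuts backwards
    bounds, j = [], n
    for b in range(num_buckets, 0, -1):
        bounds.append((cut[b][j], j))
        j = cut[b][j]
    return cost[num_buckets][n], bounds[::-1]
```

For instance, `[1, 1, 1, 5, 5, 5]` with two buckets is split exactly at the jump, with total squared error 0.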

Another class of histograms tries to maintain the *quantiles*. [GK01] gives data stream algorithms to maintain quantiles.

Kevin Chang's thesis and SODA paper [CK06] are more heavily machine-learning oriented. The main question is: "how can good clustering be attained with few passes?". The main tradeoff is between the number of passes and the memory space used.

Another relevant paper is [GMMMO03].

Estimating the empirical entropy and the (closely related measure) entropy norm helps in applications such as network monitoring and traffic classification. The paper [LSOXZ06] proposed two algorithms for estimating the entropy norm, along with an evaluation of the algorithms on real network traffic traces. The basic ideas come from the classic [AMS96]. The paper [CCM07] gives near-optimal upper and lower bounds on estimating the empirical entropy of the data stream. A student with interests in network traffic classification should read and present both [LSOXZ06] and [CCM07], since the former has empirical data and the latter has the best theoretical bounds.
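
For reference, the quantity these papers estimate is the empirical entropy of the stream's frequency distribution; computing it exactly (with linear space) is straightforward, and the streaming question is how well it can be approximated in sublinear space:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Empirical entropy H = sum_i (f_i / m) * log2(m / f_i),
    where f_i is the frequency of item i and m the stream length.
    This exact computation stores every frequency; the streaming
    algorithms above approximate H in much smaller space."""
    m = len(stream)
    freqs = Counter(stream)
    return sum((f / m) * math.log2(m / f) for f in freqs.values())
```

A uniform stream over two items has entropy exactly 1 bit, while a constant stream has entropy 0.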

The paper [IM08] considers
the following problem. Suppose the items are of the form *(i,j)*,
where (say) both *i* and *j* take values from *{1,...,n}*.
Further, the frequencies of the pairs define a joint distribution
on a pair of random variables *(X,Y)*. The paper deals with the problem
of deciding whether the random variables *X* and *Y* are
correlated or not. The proofs are very cute extensions of techniques from
[AMS96] (and [I00]).
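
The quantity at stake can be made concrete offline (an illustrative computation, not the streaming algorithm of [IM08]): measure how far the empirical joint distribution of *(X,Y)* is from the product of its marginals, here in the *L _{2}* sense, with distance 0 meaning the stream looks independent.

```python
from collections import Counter

def independence_distance(pairs):
    """L2 distance between the empirical joint distribution of (x, y)
    pairs and the product of its empirical marginals."""
    m = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    support = {(x, y) for x in px for y in py}
    return sum((joint.get((x, y), 0) / m - px[x] * py[y] / m ** 2) ** 2
               for x, y in support) ** 0.5
```

For instance, the stream `[('a',1), ('a',2), ('b',1), ('b',2)]` has distance 0 (joint equals product of marginals), while the perfectly correlated `[('a',1), ('a',1), ('b',2), ('b',2)]` has distance 0.5.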

The paper [BFRSW00] gives a property tester to check if two input distributions are close.