Papers for presentation

Introduction

Listed below are some papers that are suitable for presentations (unless mentioned otherwise) along with some background information. If you prefer to see a proper "list", you can take a look here (however, note that some of them we will cover in the lecture part and so you will not be able to present them in class). Another good resource for papers is to browse through the bibliography of the papers mentioned below.

The "Classics"

This section talks about papers that are arguably the classics for data streams. We will be covering most of the papers during the lectures (except for [MP78] and [GM99], which can be used for student presentations).

Perhaps the most influential of them all is the one due to Alon,Matias and Szegedy [AMS96]. This paper received the Gödel prize in 2005. Here is the citation. This paper is about approximating frequency moments of items in a data stream. It was perhaps the first paper to use communication complexity lower bounds to prove limits on how well the frequency moments can be approximated. The algorithms for computing frequency moments also fall into two broad algorithmic techniques that we will see a lot of in the course: the algorithm for F_k for k>2 uses the sampling technique while the algorithm for F₂ uses the sketching technique.

The Henzinger et al. paper ([GRR98]) introduced the more "general" form of data stream where the algorithm can take multiple passes. The paper proves lower bounds for many problems, including graph theoretic ones (such problems apparently have not gotten much attention in the literature and might have network monitoring applications), using the connection to communication complexity. We might or might not have time to go through this paper. If not, this can be used for a student presentation.

The other two papers that are mentioned in the historical section in Muthu's paper, which I have not read at all (but will be good presentation papers) are as follows. The Munro-Paterson paper ([MP78]), which I could not find online, among others gave algorithms for estimating quantiles. The the Gibbons-Matias paper ([GM99]) introduced the concept of synopsis data structure.

I will also put in Indyk's paper on stable distributions ([I00]) here, which we will cover in the lectures. The paper gives an algorithm to compute F_p, where 0<p≤2 is a real number. The algorithm also works when items in the data stream can be deleted. The final algorithm uses fun stuff like PRGs and such like but we will just cover the basic algorithm to give an idea of how stable distributions are used.

Alternate Data Stream Models

The "usual" data stream model is where one has a read-only data stream on which the algorithm can make multiple passes (with the usual restrictions on memory usage-- this small space requirement will be common to all models that we'll talk about in this seminar). During the lectures, we will mostly focus on single pass model. Technically speaking, I guess the multiple-pass model was first defined in [HRR98].

In many applications, data "decays" over time, i.e., the more recent data is more relevant for computation. There have been models proposed to extend the data stream model to these situations. One model is where there is a strict sliding window such that the computation needs to be done only on the items that lie within that window. The model was proposed in [DGIM]. Interestingly, the lower bound arguments do not use communication complexity. A less "abrupt" way to handle this decay is to specify a decay function that reduces the "importance" of an item in the data stream, the "older" it gets ([CM06], [CKT08]).

The paper [ADRR04] (also see Chapter 6 in the thesis [R03]) introduces a model where sorting of the entire data stream has "unit cost." Note that with this model, computing frequency moments become trivial. However, the paper considers problems that are still interesting (and non-trivial) in this model. The thesis chapter also consider some other streaming models where there are multiple streams (and one stream can be an input to another stream).

The paper [FMSSS06] introduces a data stream model (called "MUD") that models the Map-Reduce model ([DG04]), which is the model of computation for Google.

All the models discussed till now deal with read-only streams. This is somewhat of an modeling "problem" when dealing with database applications where the data is actually stored in slow disk memory (on which it is expensive to make random accesses but is cheap to make sequential access, i.e. passes on disk memory is expensive) on which the algorithm can write. The paper [GKS05] introduced the data stream model is also writable (though the number of passes are still limited). In this paper however, there is only one stream. The paper uses communication complexity to prove lower bounds. The model was extended to multiple read/write streams in [GS05]. This paper proved lower bounds for deterministic algorithms. More interestingly, the usual communication complexity argument fails miserably so the authors used some combinatorial arguments (which they built from scratch). The results were extended to randomized algorithms with 1-sided error ([GHS06]) and 2-sided error ([BJR07]). All these papers consider some very database specific problems (e.g., determining if the join of two tables is empty or not).

All the models above assume that the stream is generated by an adversary. The paper [JKV07] looks at a data stream model where the data are generated by some random process. The results have applications in probabilistic databases, which is one of the hottest topics in databases these days. Also I attended the talk for this paper and this one has a very cute application of generating functions. The recent paper [CJP08] has a lower bound for computing the median for a probabilistic data stream model considered in the Munro-Paterson paper.

Lower Bounds

The main technique used to obtain lower bounds for data stream algorithms is the use of communication complexity lower bounds. We will see this connection in the lectures when we talk about the [AMS] paper.

As mentioned earlier [DGIM] uses a non-communication complexity argument for lower bounds.

The paper [BJKS02] uses tools from information theory to derive results in communication complexity. The connection to data streams is that communication complexity lower bounds in data stream. So I think this might not be that suitable a paper for presentation unless someone is really interested in communication complexity.

There are some new models where communication complexity does not work. See the section on alternate models.

Computing statistics

This is easily the most common topic for data stream algorithms: part of the reason being that maintaining statistics about the input stream is very important for both network and database applications.

Approximating Frequency Moments

For approximating F_k for k>2 the best upper bound is in the paper by Indyk and Woodruff ([IW05]-- also see Woodruff's thesis [W07]) while the best lower bound is due to Chakrabarti, Khot and Sun [CKS03].

Counting the number of distinct elements in a column of a relational table is a fundamental problem in databases. On the networking side, the number of distinct source addresses/subnets (or destination addresses for that matter) is a useful statistic for net admin. For approximating F₀, i.e. the number of distinct elements, the best upper bound is in the paper [BJKST02] while the best lower bound is in the paper [IW03] (also see the thesis [W07]). For this problem the more interesting question is the dependence of the space usage on e (where we are aiming for a 1 ±e approximation).

The paper [W04] gives a lower bound for approximating any frequency moment to within an 1 ±e factor in terms of e.

Heavy Hitters

In some (nost?) applications, instead of the frequency moments what is needed are the elements that occur "more frequently." More preceisely the heavy hitters problem is to output all elements that occur in more than some fixed fraction of the elements in the stream. A related problem is that is identifying the top-k elements (which is what you would think it should be).

The paper [MAA05] has results for both the problems. This paper seems to have the current best bounds. The work in this space seems to be divided into two worlds: those that maintain counters ([MG82], [DLM02], [KSP03], [MM02]) and those that maintain skteches ([CCF02], [CM03]).

Histograms

Histograms are used heavily in database engines to help query evaluations. Thus, the study of data stream algorithms to compute (near optimal) histograms is well motivated. There are different types/ definitions of histograms. Two such popular ones are as follows.

The so called V-optimal histograms try to minimize the "sum of squared error". The paper [GGIKMS02] deals with this type of histograms. Say we are interested in building a histogram with B buckets: this problem is important since database engines maintain histograms to help in query evaluations. [GGIKMS02] first builds a histogram with poly(B, logn) buckets and then from it builds a histogram with the required B buckets.

Another class of histograms try to main the quantiles. [GK01] gives data stream algorithms to maintain quantiles.

Clustering algorithms

Clustering is a very common practical problem which is an ever present problem in applications such as data mining. [COR03] first a k logn-median solution is obtained from which a k-median solution is obtained.

Kevin Chang's thesis and SODA paper [CK06] are more heavily machine-learning oriented. The main question is: "how can good clustering be attained with few passes?". The main tradeoff is between the number of passes and the memory space used.

Another relevant paper is [GMMMO03].

Entropy and Distributions

Estimating the empirical entropy and the (closely related meausre) entropy norm helps in applications such as network monitoring and traffic classification. The paper [LSOXZ06] proposed two algorithms for estimating the entropy norm, along with an evaluation of the algorithms on real network traffic traces. Basic ideas come from the classic [AMS96]. The paper [CCM07] gives near optimal upper and lower bounds on estimating the empirical entropy of the data stream. A student with interests in network traffic classification should read and present both [LSOXZ06] and [CCM07] since the former has empirical data and the latter has the best theoretical bounds.

The paper [IM08] considers the following problem. Suppose the items are of the form (i,j), where (say) both i and j takes values from {1,...,n}. Further, the frequencies of both positions define a joint distribution on pair of random variables (X,Y). The paper deals with the problem of deciding whether the random variables X and Y are correlated or not. The proofs are very cute extensions of techniques from [AMS96] (and [I00]).

The paper [BFRSW00] gives a property tester to check if two input distributions are close.

Graph Streams

A passive network monitor (e.g. a sniffer) is presented with a stream of (source IP, dest IP) pairs. These are edges of a multigraph. What can we say about the structure of such graph? If the current structure deviates too much from the "typical structure" we've learned from the past, then that might indicate some security problem. For instance, if a source is generating too much traffic (than it usually does), then its degree in the graph is too high compared to normal. A streaming algorithm for detecting these "superspreaders" was proposed and experimentally tested in [VSGB05]. Algorithms for estimating degree frequency moments, range sums, and heavy hitters can be found in [CM05]. [FKMSZ04] and [FKMSZ05] have some results on graph spanners and diameters under the semi-streaming model where space is propotional to the number of nodes and sublinear in the number of edges. For lowerbounds (which justifies the semi-streaming model) see [HRR98]. For an application in DDoS detection, see this paper.