Paul Beame, T. S. Jayram and Atri Rudra

Lower Bounds for Randomized Read/Write Stream Algorithms

Abstract: The use of massive data in many application areas has led to an interest in computational models that adequately capture the notion of efficient computation over such data sets. Perhaps the most important of these is the data stream model, in which access to the input is restricted to be sequential. Motivated by the capabilities of modern storage architectures, we consider the following generalization of the data stream model, where the algorithm has sequential access to multiple streams. Unlike the data stream model, in which the stream is read-only, in this new model, introduced by Grohe and Schweikardt (PODS05), the algorithms can also write onto streams. There is no limit on the size of the streams, but the number of passes made over the streams is restricted. On the other hand, the amount of internal memory used by the algorithm is scarce, as in the data stream model.

We prove lower bounds in this model for algorithms that are allowed to have 2-sided error. Previously, such lower bounds were shown only for deterministic and 1-sided error randomized algorithms. We consider the classical set disjointness problem, where the goal is to decide whether the two input sets have an empty intersection. For this problem, we show a near-linear lower bound on the size of the internal memory used by a randomized algorithm with 2-sided error that is allowed a sub-logarithmic number of passes. Applications include near-linear lower bounds on the internal memory for well-known problems in the literature: (1) approximately counting the number of distinct elements in the input; (2) approximating the frequency of the mode of an input sequence; (3) computing the join of two relations; and (4) deciding whether some node of an XML document matches an XQuery (or XPath) query.
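To make the memory cost concrete, here is an illustrative (hypothetical, not from the paper) naive set-disjointness check: it memorizes one entire input set in internal memory and then probes it with the other. The function name and interface are my own. The paper's lower bound says this linear memory usage is essentially unavoidable: even randomized 2-sided error algorithms with multiple read/write streams and a sub-logarithmic number of passes need near-linear internal memory.

```python
def disjoint(stream_a, stream_b):
    """Return True iff the two input streams share no element.

    Naive baseline: stores all of stream_a in internal memory,
    so memory grows linearly with the input size.
    """
    seen = set()
    for x in stream_a:          # first pass: memorize set A
        seen.add(x)
    # second pass: probe each element of B against the stored set
    return all(x not in seen for x in stream_b)
```

For example, `disjoint([1, 2, 3], [4, 5])` returns `True`, while `disjoint([1, 2, 3], [3, 4])` returns `False`.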

Our results asymptotically improve the previously known bounds for any problem, even in the deterministic and 1-sided error models of computation.