Testing Self-Similarity of Sequences

Funda Ergun, Simon Fraser University

A sequence S of length n is said to be p-periodic if it consists of at least two repetitions of a block b of size p. The periodicity of a sequence can be expressed in a multitude of equivalent ways involving comparing substrings to each other, as well as comparing the entire sequence to a shifted version of itself. Any of these definitions can be used to check in linear time whether S is p-periodic.

In this talk, we are interested in what can be determined regarding the self-similarity (or p-periodicity) of a sequence given access to o(n) characters from the sequence, and o(n) space. Within the property testing context, our goal is to test with sublinear resources whether a given sequence S is p-periodic, or "far" from being p-periodic. To do this, we need a notion of self-distance. We use the several equivalent notions of periodicity to define several notions of such distance and call S approximately-p-periodic if its self-distance is small.

While it is readily seen that if S has zero self-distance it must be p-periodic regardless of the particular distance notion used, if the distance is positive, the various notions of self-distance, and their implied notions of approximate-p-periodicity are not equivalent. We analyze these notions to show that they are constant approximations of one another. We then show how one can test whether S is approximately p-periodic with respect to any of these notions using O(sqrt(n) polylog(n)) characters from the input. Our techniques involve re-using samples that we have seen, and a two-stage sampling approach in order to keep the sample complexity low. The resulting algorithms can also be used to find, with similar resources, the smallest period, the largest period, all periods of S, etc. We also show how to improve our techniques so that p-periodicity can be tested not only using sublinear space and samples, but also sublinear time.