# Testing Self-Similarity of Sequences

## Funda Ergun, Simon Fraser University

A sequence S of length n is said to be p-periodic if it consists of
at least two repetitions of a block b of size p.
The periodicity of a sequence can be expressed in a multitude of
equivalent ways involving comparing substrings to each other, as well
as comparing the entire sequence to a shifted version of itself. Any of
these definitions can be used to check in linear time whether S
is p-periodic.

In this talk, we are interested in what can be determined
regarding the self-similarity (or p-periodicity) of a sequence given
access to o(n) characters from the sequence, and
o(n) space. Within the property testing context, our goal is to
test with sublinear resources whether a given sequence S is
p-periodic, or "far" from being p-periodic.
To do this, we need a notion of self-distance.
We use the several equivalent notions of periodicity to define several
notions of such distance and call S approximately-p-periodic if its
self-distance is small.

While it is readily seen that if S has zero self-distance it must be
p-periodic regardless of the particular distance notion used, if
the distance is positive, the various notions of self-distance,
and their implied notions of approximate-p-periodicity are not equivalent.
We analyze these notions to show that they are constant approximations
of one another. We then show how one can test whether S is approximately
p-periodic with respect to any of these notions using O(sqrt(n) polylog(n))
characters from the input. Our techniques involve re-using samples
that we have seen, and a two-stage sampling approach in order to keep
the sample complexity low. The resulting algorithms can also be used to find,
with similar resources, the smallest period, the largest period,
all periods of S, etc. We also show how to improve our techniques so that
p-periodicity can be tested not only using sublinear space and samples,
but also sublinear time.