With the explosion of the Internet, LDAP directories and XML, there is an ever greater
need to evaluate queries involving (sub)string matching. In many cases, matches need to be
on multiple attributes/dimensions, with correlations between the dimensions. Effective
query optimization in this context requires good selectivity estimates. In this paper, we
use multi-dimensional suffix trees as the basic framework for substring selectivity
estimation. Given the enormous size of these trees for large databases, we develop a space
and time efficient probabilistic algorithm to construct multi-dimensional pruned suffix
trees directly. We then present two techniques to obtain good estimates for a given
multi-dimensional substring matching query, using a pruned suffix tree. The first one,
called GNO (for Greedy Non-Overlap), generalizes the greedy parsing suggested by
Krishnan et al., 1996 for one-dimensional substring selectivity estimation. The second one,
called MO (for Maximal Overlap), uses all maximal multi-dimensional substrings of the query
for estimation; these multi-dimensional substrings help to capture the correlation that
may exist between strings in the multiple dimensions. We demonstrate experimentally, using
real data sets, that MO is substantially superior to GNO in the quality of the estimate.