# Instances for Learning the Parity Function

The following description is based on the original description of James Crawford which is available here from the DIMACS ftp-site.

These instances are a propositional versions of parity learning problems. This problem is believed to be quite difficult because it is self-reducible (i.e. any instance of the problem can be reduced to a collection of randomly distributed instances of the problem). The problem was suggested by Mike Kearns. Rob Schapire and Haym Hirsh helped with building the generator.

The files are named:

parn-i.cnf
where n is the number of variables in the original parity learning problem and i is just the instance number. There are five instances each with n=8, 16, and 32. All instances are satisfiable.

There are also files named:

parn-i-c.cnf
These are just compressed versions of the other files (using a simple polynomial time simplifier). Some algorithms work better on the compressed versions and some may work better on the uncompressed versions.

#### Informal Problem Statement

The general problem is to identify an unknown Boolean function given (potentially noisy) I/O samples of the function (and given the class of the function). A successful solution to this problem need not actually be the function that gave rise to the I/O samples, as long as the function that is returned as a solution satisfies some suitably low bound on the number of mistakes it makes on the samples.

The class of Boolean functions considered here is the set of all parity functions over a set of Boolean variables and subsets of those variables. Stated more precisely:

PROBLEM STATEMENT:

Given:
m vectors: X_1,...,X_m of length n [these are the sample inputs]
m bits: y_1,...,y_m [these are the sample outputs]
An error tolerance 0 <= E < 1.

Find:
n bits: a_1,...,a_n such that there are at most mE numbers t for which

parity( a_1*X_t^1,..., a_n*X_t^n) NOT= y_t
EXPLANATION:

The X_i represent m sample inputs over n Boolean variables, the y_i the (potentially incorrect) value of the unknown parity function on each of the m inputs. E represents a bound on the acceptable error rate.

The hard part of this learning problem is to identify the subset of the n Boolean variables over which the parity function is computed. This is represented by the n bits a_1,...,a_n: for those variables i that are part of the parity function, a_i=1 (and for those that are not a_i=0). Again, learning need not identify the correct set of variables, as long as the parity function over the set that is identified gets a sufficient number of the samples (all but mE of them) correct.

GENERATING PROBLEMS:

One can generate guaranteed solvable instances by randomly generating the X's and the a's, calculating the parities, and then randomly corrupting Em (or less) of them.

GENERATING HARD PROBLEMS:

For E=0 a polynomial algorithm is known. For E=.5 almost any random assignment of the a's is likely to work. The problem is believed to be quite difficult for m=2n, E ~= 1/4. We have found empirically that for small n, m=2n, E = 1/8 seems to work well (since for small n there tend to be too many solutions for E ~= 1/4).

PROPOSITIONALIZATION:

We generate a propositional wff saying "a_1,...,a_n is a solution to this parity learning problem". This formula will contain propositional variables for a_1,...,a_n (thus one can read out the "answer" by looking at the values assigned to a_1,...,a_n by a model of the formula).

0. Setup: Randomly choose the X's. Randomly choose bits s_1,...,s_n (this will be the unknown target that the a's will try to "learn"). Compute y's by taking parity and corrupting the appropriate number of them.

1. Generate propositional formula that "calculate" the parities of a*X_t. We do this with a matrix R of propositional variables. We generate propositional formula equivalent to the following:

For j=1 to m: R^j_0 = y^j
For i=1 to n: R^j_i = R^j_(i-1) XOR a_i * x^j_i

For all t, these formula force R^t_n = 0 iff y^t is the parity of the vector x^t.

2. Generate propositional formula to sum the R^t_n. We do this using matrices S and c (carry) of propositional variables. The invariant here is that for i = 0 to log_2(m): S^t_i = ith bit of sum of first t terms.

For i=0 to log_2(m): S^0_i = 0.

For j=1 to m:
c^j_0 = R^j_n
For i = 0 to log_2(m): S^j_i = S^(j-1)_i XOR c^j_i
For i = 1 to log_2(m): c^j_i = S^(j-1)_(i-1) AND c^j_(i-1)

These formula force S^m to be a base two representation of the number of disagreements between the parity of X*a and y.

3. We generate propositional formula saying "S^m is at least m*E". We currently set n to a power of two, E to 7/8, and m to 2n. This reduces this last check to just a check that the high order bits of S^m are not set.

#### Instance hardness

These instances have been shown to be rather hard for systematic as well as local search algorithms. Only very recently an algorithm for solving the n=32 instances has been developped and they were shown to be satisfiable [WM99]. Standard state-of-the-art systematic algorithms solve the instances upto n=16 [LA97].
Local search algorithms perform worse on these instances than the systematic algorithms when, for example, comparing CPU-time [HS99]. The best performing local search algorithms for these instances is the R-Novelty, a variant of the WalkSAT family of algorithms. It still can solve the compressed instances for n=16 in reasonable time (ca. 40 Mio. variable flips). The preprocessing (compression) of the instances (the parn-i-c.cnf instances are preprocessed by a polynomial simplifier) largely improves the performance of the local search algorithms, as can be noted when comparing local search performance on the n=8 instances (We observed a factor of roughly 35 of the mean search effort for R-Novelty).

#### Acknowledgements

The instances have originally been contributed to the DIMACS benchmark set by James Crawford.

#### Bibliography

 [HS99] Holger H. Hoos and Thomas Stützle. Systematic vs. Local Search for SAT. Technical Report TR-99-06. Department of Computer Science, University of BC, Canada, 1999. [LA97] C.M. Li and Anbulagan. Look-Ahead Versus Lock-Back for Satisfiability Problems. In Proceedings of CP'97. LNCS, pages 341--355, 1997. [WM99] Joost P. Warners and Hans van Maaren. A two phase algorithm for solving a class of hard satisfiability problems. Operations Research Letters. Vol. 23(3-5), pages 81-88, 1999.