CpSc 538G: Industrial Strength Distributed Explicit State Model Checking

October 3, 2016

CpSc 513
October 5, 2016


  0.  The big picture
    A.  Even though OBDDs and SAT can produce impressive results, brute-force,
    	    explicit-state model checking sometimes works quite well
      1.  Easy to distribute
      2.  Easy to implement symmetry reduction
      3.  Some boolean functions don't have succinct symbolic representations.

    B.  Software engineering issues in implementing a distributed model checker
      1.  Needs to be simple
      2.  Needs to be robust
      3.  Should work with existing design methods
      4.  Must exploit the combined memory of all the machines
      5.  Nice if it runs faster as well

    C.  Proposed solution:
      1.  Erlang communication layer on top of legacy C++ code for murphi
      2.  The Erlang code is small and simple
        a.   Robust -- "obviously" correct
        b.   Flexible platform for experiments
      3.  The C++ code is highly optimized and tested by 25 years of use

    D.  Validation
      1.  Run it on real verification problems
      2.  Set the largest explicit-state model record
      3.  Measure effect of performance optimizations.



  I.  Explicit state model checking: murphi
    0.  See "Protocol Verification as a Hardware Design Aid"
          It's on the reading list for "optional reading", Oct. 3.
    A.  Focus on safety properties:  AG p
      1.  Find all reachable states
      2.  For each state, verify that p holds
      3.  murphi models: a collection of "rules"
        a.  syntax:  Rule guard => action
	  *   guard: a boolean valued expression
	  *   action: a sequence of assignment statements
        b.  semantics (very informally):
	      Let s be a state that satisfies guard.
	      Let s' be the state reached by performing the assignments
	        from state s.
	      Then s' is a successor of s.
        c.  example: Euclid's gcd algorithm
	      initially done == false;
	      rule (n == 0) OR (m == 0) =>
	        gcd := n+m;
		done := true;
	      rule (n >= m) =>
	        n := n-m;
	      rule (n < m) =>
	        m := m-n;
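The guarded-rule semantics above can be sketched in runnable form. This is our own Python illustration (not murphi syntax), using the gcd rules as the model; murphi fires enabled rules nondeterministically, while run_to_quiescence below just follows one path by rule order:

```python
def make_rules():
    # Each rule is a (guard, action) pair over a state dict with keys
    # n, m, gcd, done -- mirroring the gcd example above.
    return [
        (lambda s: s["n"] == 0 or s["m"] == 0,
         lambda s: {**s, "gcd": s["n"] + s["m"], "done": True}),
        (lambda s: s["n"] >= s["m"],
         lambda s: {**s, "n": s["n"] - s["m"]}),
        (lambda s: s["n"] < s["m"],
         lambda s: {**s, "m": s["m"] - s["n"]}),
    ]

def successors(state, rules):
    """All states reachable in one step: fire every enabled rule."""
    return [action(state) for guard, action in rules if guard(state)]

def run_to_quiescence(state, rules):
    """Follow one execution path: repeatedly fire the first enabled rule.
    A model checker would instead explore *all* successors."""
    while not state["done"]:
        state = successors(state, rules)[0]
    return state
```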

    B.  Pseudo-code
      1.  The code
	    alreadySeen = new Set(); // initially empty
	    workQueue = new Set(initialStates);
	    while (workQueue != {}) {
	      s = workQueue.take();
	      newStates = expand(s);  // find all successors of s
	      for newState in newStates {
	        if not (newState in alreadySeen) {
		  if not p(newState) {
		    reportError(newState);
		    // even better -- print a counterexample path
		    return(fail);
		  } // if not p
		  alreadySeen.put(newState);
		  workQueue.put(newState);
		} // if not alreadySeen
	      } // for newState
	    } // while WorkQueue
	    return(success);
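The pseudo-code above translates almost directly into Python. This is a minimal sketch (our names, not murphi's), with expand and the property p passed in as functions:

```python
from collections import deque

def check_safety(initial_states, expand, p):
    """Explicit-state reachability: visit every reachable state and check
    that the safety property p holds in each.  Returns (ok, bad_state)."""
    # Check p on the initial states too -- the pseudo-code above only
    # checks newly generated successors.
    for s in initial_states:
        if not p(s):
            return (False, s)
    already_seen = set(initial_states)
    work_queue = deque(initial_states)
    while work_queue:
        s = work_queue.popleft()
        for new_state in expand(s):        # all successors of s
            if new_state not in already_seen:
                if not p(new_state):
                    # even better: reconstruct a counterexample path
                    return (False, new_state)
                already_seen.add(new_state)
                work_queue.append(new_state)
    return (True, None)
```

For example, a counter that wraps modulo 5 satisfies "always < 5", while an unbounded counter violates "always < 7" at state 7.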
      2.  Implementation notes
        a.  memory usage is a big concern
        b.  workQueue can be stored on disk
	  *   Note that workQueue.take() can return states in an arbitrary order
	  *   This lets us keep a small buffer of workQueue states in memory and
	        read and write large blocks of states to disk.
	  *   Execution can overlap the reads and writes.
        c.  alreadySeen needs to be in memory (or the algorithm becomes very
	        slow).
	  *   Hash compaction helps -- details under "Optimizations" below.
        d.  expand() computes the successors of each state
	  *   murphi compiles the model description into C++ which is then
	        compiled to executable code.
	  *   this makes expand run very fast (but it still tends to be
	        the bottleneck).

    C.  Optimizations:
      1.  Hash compaction
        a.  see Stern and Dill, "Improved probabilistic verification by hash
	      compaction", citation [5] in the PReach paper.
        b.  alreadySeen is a hash-table that stores a hash of the states
	c.  Verification becomes probabilistic
	  *   a small probability of failing to visit one or more states
	        if two states have the same hash_index and hash_signature
		  values
        d.  Failing to visit a state could result in a "false positive":
	        reporting success when there's really an error.
        e.  We can compute a lower bound on this probability -- in practice
	        it's often in the 10^-5 to 10^-3 range.
        f.  Run the model checker a few times to make this probability
	        arbitrarily small -- use different random parameters for
		the hash functions each time.
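A toy Python illustration of the idea (ours, not Stern & Dill's exact scheme): the seen set stores only a small signature per state, so two distinct states that collide on the signature cause an omission. SIG_BITS is deliberately tiny here to make collisions easy to demonstrate; real hash compaction uses larger signatures.

```python
import zlib

SIG_BITS = 16  # deliberately tiny; real schemes use larger signatures

def signature(state, seed=0):
    """Map a state (here: bytes) to a SIG_BITS-bit signature.
    The seed plays the role of the random hash parameters above."""
    return zlib.crc32(state, seed) & ((1 << SIG_BITS) - 1)

class CompactedSeenSet:
    """alreadySeen storing only signatures: SIG_BITS per state instead of
    the full state vector."""
    def __init__(self, seed=0):
        self.seed = seed
        self.sigs = set()

    def add(self, state):
        """Returns True if the state looks new, False if (probably) seen.
        A False on a genuinely new state is an omitted state."""
        sig = signature(state, self.seed)
        if sig in self.sigs:
            return False
        self.sigs.add(sig)
        return True
```

Rerunning with a different seed corresponds to rerunning the checker with fresh random hash parameters, driving the omission probability down.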
      2.  Symmetry reduction
        a.  Explicit state model checking is often used for protocol verification
	b.  Typically, there are N clients, all executing the same protocol
	c.  Each state appears in N! versions (for the permutations of the N
	      clients).
	d.  Create a "canonical" version of each state
	  *   For example, each state is a bit-vector
	  *   We can treat this bit-vector as an integer
	  *   The "canonical" version is the smallest integer among the
	        permuted states.
	  *   We can compute the canonical version by sorting the components
	        of the state.
	e.  Symmetry reduction offers a factor of N! reduction in the number
	      of states we need to store.
	f.  There aren't good symbolic schemes for symmetry reduction.
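The sorting trick above can be sketched as follows (our own illustration, with a state modelled as a tuple of per-client components):

```python
from itertools import permutations

def canonical(state):
    """Canonical representative of a state under client permutation:
    sort the per-client components."""
    return tuple(sorted(state))

def orbit(state):
    """All permuted versions of a state -- for illustration only; the
    whole point of canonicalization is never to enumerate these."""
    return {tuple(p) for p in permutations(state)}
```

Sorting suffices only when the clients are fully interchangeable and components don't mention client indices; in general, canonicalization must also rename the indices consistently, which is what murphi's built-in symmetry support handles.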

    D.  The distributed version
      0.  see Stern & Dill, optional reading for Oct. 5.
      1.  Each state has an "owner" -- computed by yet another hash function
            that maps states to processes
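A minimal sketch of such an owner function (ours, purely illustrative): any hash works, as long as it is deterministic across processes -- which is why something like a CRC is preferable to a per-run salted hash:

```python
import zlib

NUM_PROCS = 8  # illustrative process count

def owner(state):
    """Deterministically map a state (bytes) to a process id in
    [0, NUM_PROCS).  Every process must compute the same answer."""
    return zlib.crc32(state) % NUM_PROCS
```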
      2.  The pseudo-code
            if self() == rootProcess() {
	      for s0 in initialStates {
	        send(owner(s0), s0)
	      }
	    }
	    alreadySeen = new Set(); // initially empty
	    newlyReceived = new Set();
	    workQueue = new Set();
	    while (not done()) {
	      while(not inbox.empty()) { // fetch the mail.
		newState = inbox.take();
		if not (newState in alreadySeen) {
		  if not p(newState) {
		    reportError(newState);
		    fail();  // a little more involved than
		             // the single-processor version 
		  } // if not p
		  alreadySeen.put(newState)
		  workQueue.put(newState)
		} // if not alreadySeen
	      }
	      if(not workQueue.empty()) {
		s = workQueue.take();
		newStates = expand(s);  // find all successors of s
		for newState in newStates {
		  send(owner(newState), newState);
		} // for newState
	      } // if not workQueue.empty()
	    } // while not done()

      3.  How do we know if we're done?
        a.  it's not sufficient just to wait until each process reports
	      that it has reached a point where its inbox and workQueue
	      are empty
        b.  there could be messages in flight
	c.  the basic approach
	  *   The root periodically queries the workers.
	  *   A worker responds when its inbox and workQueue are empty.
	    +   The first response is "tryAgain"
	    +   Subsequently, if the worker has received any messages since
	          the last query, it replies "tryAgain"
	    +   If the worker has not received any messages since the last
	          query, it responds with a tuple:
		    the total number of messages it has sent,
		    the total number of messages it has received
	    +   The root computes the total of the number of messages sent
	          and the total of the number of messages received.
	    +   If no worker replied with "tryAgain" and the sent
	          message count matches the received message count, then
		  we're done.
	    +   Otherwise, the root tries again.
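The root's test above can be sketched as a single function (names are ours):

```python
def check_done(responses):
    """responses: one entry per worker, either the string "tryAgain" or a
    (messages_sent, messages_received) tuple.  Done iff nobody asked to
    try again and every sent message has been received (none in flight)."""
    if any(r == "tryAgain" for r in responses):
        return False
    total_sent = sum(sent for sent, _ in responses)
    total_received = sum(received for _, received in responses)
    return total_sent == total_received
```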



 II.  PReach
    A.  Main idea: an "industrial strength" implementation of Stern & Dill's
            distributed approach
      1.  Use legacy C++ code for murphi
        a.  state expansion
	b.  hash table for tracking states we've already seen
      2.  Use Erlang for the communication layer
        a.  sending states to owners
	b.  receive states
	c.  termination detection
      3.  Why?
        a.  the existing murphi code is well-optimized and stable
        b.  using the existing murphi front-end makes PReach compatible with
	        previous work with murphi
	  ->  easy to test and to make performance measurements

    B.  Making PReach "industrial strength"
      1.  Erlang
        a.  writing parallel programs is hard
	b.  Erlang provides support to avoid common pitfalls
	  *   Erlang is functional: model-checking is easily written in a
	        functional style
	  *   Erlang uses message passing:
	        no memory races
		communication is explicit
	  *   Erlang uses pattern matching, including for messages
	        easy to get the message you intended
	  *   Erlang is slow
	    +   perhaps 5-10x slower than C
	    +   even so, most of the time is spent in state expansion,
	          which is highly optimized C++ code -- so it matters
		  little that Erlang is slow

      2.  Flow control
        a.  The Erlang inbox is a list
	  *  Incoming messages are prepended to the list
	  *  A linear search is done to find the oldest message that matches
	       the receive expression's pattern
	  *  This is harmless if the inbox only holds a few messages
	  *  This becomes a serious performance issue if there are thousands of
	       pending messages
        b.  Positive-feedback and failures
	  *   If process P gets slightly behind:
	    +   The inbox for P becomes larger than those of its peers
	    +   It takes P longer to take messages out of its inbox
	    +   P gets further behind its peers, and P's inbox grows
	    +   P runs out of memory for its inbox and crashes
	    +   :(
        c.  Solution
	  *   Use a crediting scheme
	  *   Let P and P' be processes
	  *   P can have at most C messages that it has sent to P' and that P'
	        has not yet acknowledged.
	  *   Later in the paper, they describe combining crediting with message
	        batching

      3.  Load balancing
        a.  Some nodes are outliers on execution time
	b.  Key idea: once a state is in the workQueue,
	      *any* process can perform the state expansion
	c.  When P sends a batch of states to P', P also tells P' how many
	     states are in P's workQueue
	d.  If P' sees that P'.workQueue.size() > LBFactor * P.workQueue.size(),
	     then P' sends some of its states (e.g. half) to P.
	e.  This is a very coarse-grained approach to load balancing
	  *   easy to implement
	  *   works well in practice
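The rebalancing rule above, sketched in Python (our illustration; the LBFactor value is arbitrary):

```python
LBFactor = 2  # imbalance threshold (illustrative value)

def rebalance(my_queue, peer_queue_size):
    """Called when a peer reports its workQueue size alongside a batch of
    states.  If we have LBFactor times more work than the peer, ship half
    of our queue back to it.  Returns the list of states to send."""
    if len(my_queue) > LBFactor * peer_queue_size:
        half = len(my_queue) // 2
        to_send, my_queue[:] = my_queue[:half], my_queue[half:]
        return to_send
    return []
```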

      4.  State batching
        a.  Sending a message between machines has a fairly large amount of
	      overhead
        b.  Save up states and send a few hundred or a few thousand states
	      at a time
        c.  Include a "time-out" mechanism to avoid starving another process
	  *   If this process has an empty workQueue, send all pending states
	  *   If more than a second has elapsed since P sent states to P',
	        then P sends its pending states to P'.

    C.  Evaluation
      1.  Run PReach on
        a.  Standard benchmarks
	  *   Many references to "German"
	  *   This is a simple cache-coherence protocol proposed by
	        Steven German at IBM
	  *   Commonly used benchmark for model checkers
	  *   GermanK means the German protocol with K processors.
	b.  Examples included in the murphi distribution
	c.  Real designs at Intel
      2.  Capacity
        a.  PReach model-checked models with >30 billion states
	      the largest published explicit-state model checking result
        b.  Since then, PReach has verified models with several hundred
	      billion states
      3.  Robustness
        a.  PReach doesn't crash
        b.  Extensively used at Intel
        c.  Used in other industrial design teams and by some academic
	      computer architecture groups.
      4.  Speed
        a.  linear speed-up for large models
        b.  reported about half the throughput (state expansions/second) of
	      native, single processor murphi
        c.  performance tuning since then has reduced the overhead of the
	      Erlang code.
        d.  there may be some room for improvement, but
	  *   The people we interact with in industry don't want more
	        performance tuning
	  *   Simple, robust, easy to maintain code is mandatory
	  *   Using a few more machines is *way* cheaper than any
	        sophisticated optimizations
	 
	    
	        
III.  More PReach
    A.  Brad Bingham continued his work with Intel verification teams to
          explore explicit state model checking for his Ph.D. thesis
    B.  Focus on liveness properties
      1.  quiescence: AG EF q -- there is a path to a quiescent state, e.g.
            one where all pending requests have been granted
      2.  parameterized quiescence:
        a.  show that verifying a small number of nodes verifies a symmetric
	      protocol for any number of nodes
        b.  a very nice use of abstraction
	  *   needs both over-approximate abstraction --
	        to find all reachable states
	  *   and under-approximate abstraction --
	        to show each reachable state has a path to a q state
      3.  response: AG (p -> AF q) -- every request eventually gets granted
      4.  Stronger properties require simpler models and more compute
              resources
	a.  simple, quickly checked properties are helpful in finding bugs
	      even if they aren't guaranteed to find *all* bugs
	b.  the stronger properties are valuable once the "easy" bugs have
	      been eliminated.
      5.  Comparing PReach with other model checkers
        a.  best scaling to large problems of any explicit-state model checker
	b.  compared with state-of-the-art symbolic checkers
	  *   The symbolic methods nearly always win for safety properties
	  *   PReach nearly always wins for liveness properties

    C.  Valerie Ishida did an M.Sc. thesis where she implemented a
            fault-tolerant version of PReach: PReachDB
      1.  Used Erlang's persistent store features
      2.  PReachDB is slow, but it points to how we could make a 
            fault-tolerant version.

