CpSc 538G: Bebop: A Symbolic Model Checker for Boolean Programs

October 12, 2016
  0.  The big picture
    A.  What's a boolean program, and why should we care?
      1.  A boolean program is a program where all variables are
              boolean-valued
        a.  Note that this means no pointers
	b.  By extension, all functions are call-by-value --
	      the callee can't modify local variables of the caller.
      2.  Boolean programs are useful abstractions of regular programs
        a.  consider a "regular" program with assertions --
	  *   we want to show that the assertions never fail
	  *   this is equivalent to showing that the "Error" statement is
	        unreachable
        b.  consider all execution paths
	  *  replace conditions for if, while, and assert with boolean
	       variables
	  *  model check this *boolean program*
	  *  if the model-checker finds a path to the error statement,
	       check whether that path is feasible in the original program
	         (e.g. with SMT)
	    +  If the counter-example is feasible, report the error
	    +  If the counter-example is infeasible,
	      -  add a predicate to exclude this path and others that are
		   infeasible for the same reason
	      -  this may require adding new boolean variables.
	  *  continue until the property is verified or a counter-example
	       is found.
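The refinement loop above can be sketched schematically. This is a minimal sketch, not Bebop's actual API: the helpers `model_check`, `is_feasible`, and `refine` are stand-ins supplied by the caller, where a real tool would invoke a model checker and an SMT solver.

```python
# Schematic counterexample-guided abstraction refinement (CEGAR) loop.
# The three helper functions are hypothetical stand-ins passed in by
# the caller; a real tool would call a model checker and SMT solver.

def verify(program, predicates, model_check, is_feasible, refine,
           max_iters=100):
    for _ in range(max_iters):
        cex = model_check(program, predicates)   # path to "Error", or None
        if cex is None:
            return "verified"                    # no path to Error exists
        if is_feasible(program, cex):            # e.g. checked with SMT
            return ("error", cex)                # genuine counter-example
        predicates = refine(predicates, cex)     # exclude the spurious path
    return "unknown"                             # gave up refining
```

The loop always terminates with a verdict or gives up after a bounded number of refinements.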

    B.  Key insight
      1.  State explosion is a big problem
        a.  If the model checker keeps track of the variables for all
	      functions that might be live on the call stack, this can
	      be huge.
        b.  Depth of recursive calls can be unbounded.
      2.  Bebop handles each function separately
        a.  Observes the values for parameters and global variables at all
	      calls
	b.  Computes the possible outcomes for each call.
	c.  The Bebop algorithm composes the results for all of the functions
	      to verify a program.
      3.  How many states can Bebop encounter?
        a.  Assume a program with E edges in the call graph, g global
	      variables, and at most l local variables in any function.
        b.  The runtime is O(E * 2^(g+l))
	c.  Note that E is proportional to the number of lines of code --
	      Verification time is *linear* in source code size.  YAY!
	d.  l is per function, and can be of moderate size, regardless of the
	      number of functions.
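To get a feel for the O(E * 2^(g+l)) bound, plug in some hypothetical sizes (not taken from the paper):

```python
# Hypothetical sizes: 10,000 call-graph edges, 10 globals, 10 locals.
E, g, l = 10_000, 10, 10
states = E * 2 ** (g + l)
print(states)        # 10485760000 -- linear in E, exponential in g + l
```

Doubling the program size only doubles the bound, but each extra global or local variable doubles it too, so keeping g and l small is what makes the bound practical.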

    C.  Ideas in the paper
      1.  Control-flow-graph
      2.  Valuations
      3.  Context-free grammars
      4.  Model checking
      5.  Path and summary edges
      6.  And some more that are used implicitly
        a.  Partial order reduction
	b.  Probably more
	   

  I.  Languages and Automata (a.k.a. CpSc 421/501 in 20 minutes)
    A.  Definitions
      1.  An alphabet is a finite set of symbols
        a.  a symbol just means "a unique element" of the set
	b.  we will usually write Sigma for the alphabet
	      (the Greek letter if this were in latex or on the whiteboard)
	c.  E.g. Sigma = {a, b, c, d, potato, _}
      2.  A string is a sequence of zero or more symbols from the alphabet
        a.  I'll write s1 . s2 to denote the concatenation of s1 with s2
	      (a "centered dot" if this were latex or the whiteboard)
        b.  epsilon indicates the "empty string" (the string of length 0).
	c.  example: a._.b.a.d._.potato
      3.  A language is a set of strings
        a.  The set can be infinitely large
	b.  All kinds of computational complexity theory about how hard it is
	      to determine if a string is a member of a language.
	c.  For various classes of languages, we can define a "recognizer"
	  *   The recognizer is a "machine" that reads the string and outputs
	    +   "Accept" if the string is in the language, and outputs
	    +   "Reject" if the string is not in the language.
	  *   For the classes of languages described below
	    +   There's a way to describe the language as a mathematical
	          expression
	    +   There's also a way to describe the language with a recognizer
	          machine
	    +   This correspondence of expressions and machines is at the
	          heart of computer science:
		    "Computer science is mathematics that is executable."
		  OK, there's more than just that, but this is a pretty good
		  10 word or less description of the field.

    B.  Regular languages
      1.  Recognized by finite automata
            A finite automaton consists of an alphabet, a set of states, an
	       initial state, a state transition relation, and a set of
	       accepting states
	a.  alphabet: Sigma, as defined above
	b.  set of states: another finite set of symbols.
	  *  I'll usually write Q for the set of automaton states.
	c.  a state transition relation: Delta
	  *  Let q1, q2 be elements of Q, and c be an element of Sigma
	  *  (q1, c, q2) in Delta means that the machine can move from
	       state q1 to state q2 if c is the next symbol in the input
	       string.  c is consumed.
	  *  If for each pair (q1, c) there is exactly one q2 such that
		 (q1, c, q2) in Delta
	    +  Then we say that the machine is deterministic.
	    +  We can write delta(q1, c) to denote the unique q2 such that
		 (q1, c, q2) in Delta.
	  *  If there are choices of (q1, c) such that there is more than
	       one q2 with (q1, c, q2) in Delta,
	    +  Then we say that the machine is non-deterministic.
	    +  It can make an arbitrary choice
	d.  a set of accepting states
	  *  I'll usually write F for this set.
	  *  A string s is "recognized" by the automaton if it reaches a
	      state in F after reading the symbols in s.
	  *  If the automaton is non-deterministic, then we say that s is
	       in the language if there is some choice of moves that leads
	       to an accepting state.  We don't require that all choices
	       lead to an accepting state.
      2.  Generated by regular expressions
        a.  epsilon is a regular expression
	b.  If c in Sigma, then c is a regular expression
	c.  If e1 and e2 are regular expressions, then
	  *   e1 . e2 is a regular expression (concatenation)
	  *   e1 INTERSECT e2 is a regular expression
	        s matches e1 INTERSECT e2 iff s matches e1 and s matches e2
	  *   e1 UNION e2 is a regular expression
	        s matches e1 UNION e2 iff s matches e1 or s matches e2
	d.  If e is a regular expression, then
	  *  ~e is a regular expression, negation
	       s matches ~e iff s does not match e
	  *  e* is a regular expression, 0 or more occurrences of e
	       The occurrences can be different strings that match e.
	  *  e+ is a regular expression, 1 or more occurrences of e
	       The occurrences can be different strings that match e.
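Python's re module implements concatenation, UNION (written `|`), `*`, and `+` directly; INTERSECT and `~` are not built in, even though regular languages are closed under both:

```python
import re

# concatenation is adjacency; UNION is "|"; * and + are as defined above
star = re.compile(r"(ab|cd)*")       # zero or more of (ab UNION cd)
plus = re.compile(r"(ab)+")          # one or more occurrences of "ab"

print(bool(star.fullmatch("abcdab")))   # True: ab . cd . ab
print(bool(star.fullmatch("")))         # True: zero occurrences
print(bool(plus.fullmatch("")))         # False: + requires at least one
```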
      3.  Notes
        a.  a language can be recognized by a deterministic finite automaton
	      iff the language can be recognized by a non-deterministic
	        finite automaton
	      iff the language can be generated by a regular expression
        b.  regular languages are handy for
	  *   string searches
	  *   network protocols
	  *   lexical analysis in compilers
	  *   formal verification
	  *   many other applications
        c.  The equivalence of regular expressions and finite automata means
	  *  we can compile a regular expression into an automaton that
	       recognizes the same language.
	  *  typically, this automaton is a simple, table-driven piece of
	       code and the compiler generates the table.
	  *  this produces efficient recognizers for regular expressions.
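As a sketch of what such compiled output looks like, here is a hand-built table for a tiny language (strings over {a, b} with an even number of a's); a real regex compiler would generate the table automatically:

```python
# Table-driven DFA: state 0 = even number of a's seen, state 1 = odd.
table = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 0, (1, "b"): 1,
}

def dfa_accepts(s, start=0, accepting=frozenset({0})):
    q = start
    for c in s:
        q = table[(q, c)]        # one table lookup per input symbol
    return q in accepting

print(dfa_accepts("abba"))   # True: two a's
print(dfa_accepts("ab"))     # False: one a
```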
	       
    C.  Context free languages
      0.  Not all languages are regular
        a.  Example: 0^n 1^n -- n 0s followed by n 1s, a simplified form of
	      "balanced parentheses" with 0 as "(" and 1 as ")"
	b.  Why isn't this regular?  A finite automaton has only finitely
	      many states, so it can't count an unbounded number of 0s.
      1.  Pushdown automata
        a.  Start with a finite automaton, and add
	  *   a stack alphabet: Gamma
	  *   a stack
	b.  Now, when we read a symbol from the input string, based on
	      that symbol and the symbol at the top of the stack we:
	  *  Move to a new state
	  *  Pop the symbol off of the top of the stack
	  *  Push zero or one symbols onto the stack
	c.  We also allow stack operations (push and pop, based on the
	       top-of-stack symbol) to be performed without consuming
	       any input.
	  *  This allows the machine to make a move based on any fixed number
	       of symbols on the top of the stack.
	  *  It allows the machine to pop any fixed number of symbols off the
	       stack for each symbol read.
	  *  It allows the machine to push any fixed number of symbols onto
	       the stack for each symbol read.
	d.  The pushdown automaton recognizes a string if it can reach a
	      state with an empty stack at the end of reading the string.
	e.  example 0^n 1^n
	  *  Start in state q0
	  *  while in q0,
	    +  if the next input symbol is 0, push a 0 onto the stack, and
		 stay in state q0
	    +  if the next input symbol is 1, and the stack is empty,
		 reject the string
	    +  if the next input symbol is 1, and the stack is not-empty,
		 pop a 0 off of the stack and transition to state q1
	  *  while in q1,
	    +  if the next input symbol is 0, reject the string
	    +  if the next input symbol is 1, and the stack is empty,
		 reject the string
	    +  if the next input symbol is 1, and the stack is not-empty,
		 pop a 0 off of the stack and stay in state q1
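The machine above translates almost line for line into code (a direct simulation, with "reject" expressed as an early return):

```python
# Direct simulation of the two-state PDA for 0^n 1^n described above.
def pda_accepts(s):
    state, stack = "q0", []
    for c in s:
        if state == "q0" and c == "0":
            stack.append("0")               # push a 0, stay in q0
        elif state == "q0" and c == "1":
            if not stack:
                return False                # a 1 with an empty stack
            stack.pop()
            state = "q1"                    # first 1: move to q1
        elif state == "q1" and c == "1":
            if not stack:
                return False                # more 1s than 0s
            stack.pop()                     # pop one 0 per 1
        else:
            return False                    # a 0 in state q1
    return not stack                        # accept iff the stack is empty

print(pda_accepts("0011"))   # True
print(pda_accepts("0101"))   # False
```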
	     
      2.  Context-free grammars (CFGs)
        a.  A set of terminals, Sigma, and non-terminals, Theta,
	     and a special non-terminal, S0, the start-symbol.
        b.  Productions: if S is non-terminal, we can have productions for S
	      S -> string of terminals and non-terminals
	         | another string of terminals and non-terminals
	         | another string of terminals and non-terminals
        c.  A string is generated by a grammar iff there is some application
	        of productions starting at S0 that produces the string
	  *   Each application of a production chooses a non-terminal in the
	        current string and replaces it with the right-hand side of a
		production for that non-terminal.
	  *   The derivation is complete when it produces a string that
	        consists entirely of terminals.
        d.  Example: a grammar for 0^n 1^n
	      S0 -> epsilon
	          | 0 S0 1
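Because this grammar is so simple, it can be read directly as a recursive recognizer -- each call undoes one application of S0 -> 0 S0 1:

```python
# Recognizer read straight off the grammar S0 -> epsilon | 0 S0 1.
# (This works here only because the grammar is so simple.)
def derives(s):
    if s == "":
        return True                          # S0 -> epsilon
    return (len(s) >= 2 and s[0] == "0" and s[-1] == "1"
            and derives(s[1:-1]))            # S0 -> 0 S0 1

print(derives("0011"))   # True
print(derives("0101"))   # False
```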
	        
      3.  Remarks:
        a.  non-deterministic pushdown-automata (NPDAs) are more powerful than
	        deterministic ones (DPDAs)
	  *  Example: the language (0^n 1^n) | (0^n 1^(2n)) can be recognized
	        by an NPDA but not by a DPDA.
        b.  the languages recognized by NPDAs and the languages generated by
	       context-free grammars are the same.
	  *  We write PDA to mean NPDA -- DPDAs are the "special case".
        c.  CFLs (context free languages) are used in
	  *   programming language syntax
	  *   an approximation of natural language syntax --
	        "sentence diagramming" from an English writing class is
		using a CFL for English.
        d.  We can automatically generate parsers from CFGs
	  *   For arbitrary CFGs, the time complexity is O(n^3) where n is
	        the length of the input string.
	  *   If we make some restrictions that are reasonable for real
	         programming languages, there is an O(n) algorithm.
	    +  Humans do much better reading code written in languages
	         that have O(n) parsers.
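The O(n^3) algorithm referred to here is CYK (Cocke-Younger-Kasami). A minimal sketch, using the 0^n 1^n grammar hand-converted to Chomsky normal form (S -> A T | A B, T -> S B, A -> 0, B -> 1, with S0 -> epsilon handled separately):

```python
# Minimal CYK recognizer: the classic O(n^3) CFG membership algorithm.
unit = {"0": {"A"}, "1": {"B"}}
binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}

def cyk(s):
    n = len(s)
    if n == 0:
        return True                      # the epsilon production
    # table[(i, j)]: nonterminals that derive the substring s[i:j]
    table = {(i, i + 1): set(unit.get(s[i], set())) for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):    # try every split point
                for left in table[(i, k)]:
                    for right in table[(k, j)]:
                        cell |= binary.get((left, right), set())
            table[(i, j)] = cell
    return "S" in table[(0, n)]

print(cyk("000111"))   # True
print(cyk("0101"))     # False
```

The three nested loops over length, start position, and split point are where the O(n^3) comes from.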

    D.  The rest of CpSc 421/501
      1.  General languages
        a.  Recognized by a Turing machine (TM)
	  *   Replace the stack of a PDA with an infinite "tape" that can
	        be read and written.
	  *   The tape head can move left or right at each step.
        b.  Any non-trivial property of the language recognized by a TM is
	      undecidable (Rice's theorem).
      2.  One more case: "context-sensitive languages"
        a.  Recognized by linear bounded automata
        b.  The tape is restricted to the cells that hold the input
        c.  But the tape alphabet can be larger than the input alphabet
        d.  The Chomsky hierarchy:
	      Regular Languages StrictSubSet ContextFreeLanguages
	                        StrictSubSet ContextSensitiveLanguages
	                        StrictSubSet GeneralLanguages
      3.  For more information
        a.  "Introduction to the Theory of Computation", M. Sipser.
        b.  "Introduction to Automata Theory, Languages, and Computation",
	      J.E. Hopcroft, J.D. Ullman -- not a "gentle" introduction.
        c.  CpSc 421/501

