- The range of behaviors is expanded: the agent can do more.
- The accuracy on tasks is improved: the agent can do things better.
- The speed is improved: the agent can do things faster.

**task** The behavior or task that's being improved. For example: classification, acting in an environment.

**data** The experiences that are being used to improve performance in the task.

**measure of improvement** How can the improvement be measured? For example: increasing accuracy in prediction, new skills that were not present initially, improved speed.

- The richer the representation, the more useful it is for subsequent problem solving.
- The richer the representation, the more difficult it is to learn.

**Supervised classification** Given a set of pre-classified training examples, classify a new instance.

**Unsupervised learning** Find natural classes for examples.

**Reinforcement learning** Determine what to do based on rewards and punishments.

**Analytic learning** Reason faster using experience.

**Inductive logic programming** Build richer models in terms of logic programs.

| Example | Action | Author | Thread | Length | Where |
|---------|--------|--------|--------|--------|-------|
| e1 | skips | known | new | long | home |
| e2 | reads | unknown | new | short | work |
| e3 | skips | unknown | old | long | work |
| e4 | skips | known | old | long | home |
| e5 | reads | known | new | short | home |
| e6 | skips | known | old | long | work |

**Supervised learning** What has to be learned is specified for each example.

**Unsupervised learning** No classifications are given; the learner has to discover categories and regularities in the data.

**Reinforcement learning** Feedback occurs after a sequence of actions.

- The measure of success is not how well the agent performs on the training examples, but how well it performs on new examples.
- Consider two agents:
  - *P* claims the negative examples seen are the only negative examples. Every other instance is positive.
  - *N* claims the positive examples seen are the only positive examples. Every other instance is negative.

- Both agents correctly classify every training example, but disagree on every other example.

- The tendency to prefer one hypothesis over another is called a **bias**.
- Saying a hypothesis is better than *N*'s or *P*'s hypothesis isn't something that's obtained from the data.
- To have any inductive process make predictions on unseen data, you need a bias.
- What constitutes a good bias is an empirical question about which biases work best in practice.

- Given a representation and a bias, the problem of learning can be reduced to one of search.
- Learning is search through the space of possible representations, looking for the representation or representations that best fit the data, given the bias.
- These search spaces are typically prohibitively large for systematic search. Use **hill climbing** (a sketch follows this list).
- A learning algorithm is made up of a search space, an evaluation function, and a search method.
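To make the decomposition concrete, here is a minimal, generic hill-climbing loop in Python. The `neighbors` and `score` functions are hypothetical stand-ins for the search space and evaluation function of a particular learner, not anything from the text.

```python
def hill_climb(initial, neighbors, score):
    """Greedy local search: move to the best-scoring neighbor
    until no neighbor improves on the current hypothesis.
    neighbors(h) yields candidate representations near h;
    score(h) is the evaluation function (higher is better)."""
    current = initial
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current  # local optimum: no improving neighbor
        current = best
```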

- Data isn't perfect:
- some of the attributes are assigned the wrong value
- the attributes given are inadequate to predict the classification
- there are examples with missing attributes

**Overfitting** occurs when a distinction appears in the data but doesn't appear in unseen examples. This occurs because of random correlations in the training set.

- Find the best representation given the data.
- Delineate the class of consistent representations given the data.
- Find a probability distribution of the representations given the data.

- Representation is a decision tree.
- Bias is towards simple decision trees.
- Search through the space of decision trees, from simple decision trees to more complex ones.

- The nonleaf nodes are labeled with attributes.
- The arcs out of a node labeled with attribute *A* are labeled with each of the possible values of the attribute *A*.
- The leaves of the tree are labeled with classifications.

```
prop(Obj,user_action,skips) <-
   prop(Obj,length,long).

prop(Obj,user_action,reads) <-
   prop(Obj,length,short) & prop(Obj,thread,new).

prop(Obj,user_action,reads) <-
   prop(Obj,length,short) & prop(Obj,thread,old) &
   prop(Obj,author,known).

prop(Obj,user_action,skips) <-
   prop(Obj,length,short) & prop(Obj,thread,old) &
   prop(Obj,author,unknown).
```
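The same tree can be read as a procedure. Here is a sketch in Python, assuming examples are dictionaries keyed by the attribute names in the table above (a hypothetical rendering, not the book's code):

```python
def user_action(obj):
    """Walk the decision tree: length first, then thread, then author.
    obj is a dict such as {"author": "known", "thread": "new",
    "length": "long", "where": "home"}."""
    if obj["length"] == "long":
        return "skips"
    if obj["thread"] == "new":  # length is short
        return "reads"
    # length is short and thread is old: decide by author
    return "reads" if obj["author"] == "known" else "skips"

# Example e1 from the table: a long article is skipped.
print(user_action({"author": "known", "thread": "new",
                   "length": "long", "where": "home"}))  # skips
```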

- Given some data, which decision tree should be generated? A decision tree can represent any discrete function of the inputs.
- You need a **bias**. For example, prefer the smallest tree. Least depth? Fewest nodes? Which trees are the best predictors of unseen data?
- How should you go about building a decision tree? The space of decision trees is too big for a systematic search for the smallest decision tree.

- The input is a target attribute (the *Goal*), a set of examples, and a set of attributes.
- Stop if all examples have the same classification.
- Otherwise, choose an attribute to split on; for each value of this attribute, build a subtree for those examples with this attribute value.

```
dtlearn(Goal, Exs, Atts, Val) <-
   all_examples_agree(Goal, Exs, Val).

dtlearn(Goal, Exs, Atts, if(Cond,YT,NT)) <-
   examples_disagree(Goal, Exs) &
   select_split(Goal, Exs, Atts, Cond, Rem_Atts) &
   split(Exs, Cond, Yes, No) &
   dtlearn(Goal, Yes, Rem_Atts, YT) &
   dtlearn(Goal, No, Rem_Atts, NT).
```
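The same recursion is compact in Python. This sketch assumes binary conditions (each example maps the chosen condition to True or False) and takes the splitting heuristic as a parameter; the names are illustrative, not from the text.

```python
def dtlearn(goal, exs, atts, select_split):
    """Return a leaf value if all examples agree on `goal`;
    otherwise split and recurse, mirroring the clauses above.
    Assumes the attributes are adequate to classify the examples."""
    values = {e[goal] for e in exs}
    if len(values) == 1:  # all_examples_agree
        return values.pop()
    cond, rem_atts = select_split(goal, exs, atts)
    yes = [e for e in exs if e[cond]]       # split
    no = [e for e in exs if not e[cond]]
    return ("if", cond,
            dtlearn(goal, yes, rem_atts, select_split),
            dtlearn(goal, no, rem_atts, select_split))
```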

- Attributes can have more than two values. This complicates the trees.
- This assumes attributes are adequate to represent the concept. You can return probabilities at leaves.
- Which attribute to select to split on isn't defined. You want to choose the attribute that results in the smallest tree. Often we use information theory as an evaluation function in hill climbing (see the sketch after this list).
- Overfitting is a problem.
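As one instance of that information-theoretic evaluation function, here is a sketch of entropy and information gain over example dictionaries; this is a standard formulation, not code from the text.

```python
import math
from collections import Counter

def entropy(exs, goal):
    """Average bits needed to encode an example's value for `goal`."""
    counts = Counter(e[goal] for e in exs)
    n = len(exs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(exs, att, goal):
    """Expected reduction in entropy from splitting on `att`;
    a greedy splitter picks the attribute maximizing this."""
    n = len(exs)
    remainder = 0.0
    for val in {e[att] for e in exs}:
        sub = [e for e in exs if e[att] == val]
        remainder += len(sub) / n * entropy(sub, goal)
    return entropy(exs, goal) - remainder
```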

- This algorithm gets into trouble overfitting the data. This occurs with noise and correlations in the training set that are not reflected in the data as a whole.
- To handle overfitting:
- You can restrict the splitting, so that you split only when the split is useful.
- You can allow unrestricted splitting and prune the resulting tree where it makes unwarranted distinctions.

- These representations are inspired by neurons and their connections in the brain.
- Artificial neurons, or **units**, have inputs and an output. The output can be connected to the inputs of other units.
- The output of a unit is a parameterized non-linear function of its inputs.
- Learning occurs by adjusting parameters to fit data.
- Neural networks can represent an approximation to any function.

- As part of neuroscience, in order to understand real neural systems, researchers are simulating the neural systems of simple animals such as worms.
- It seems reasonable to try to build the functionality of the brain via the mechanism of the brain (suitably abstracted).
- The brain inspires new ways to think about computation.
- Neural networks provide a different measure of simplicity as a learning bias.

- Feed-forward neural networks are the most common models.
- These are directed acyclic graphs:

```
prop(Obj,output,V) <-
   prop(Obj,in_1,I_1) &
   prop(Obj,in_2,I_2) &
   ··· &
   prop(Obj,in_k,I_k) &
   V is f(w_0 + w_1×I_1 + w_2×I_2 + ··· + w_k×I_k).
```

- *I*_{j} are real-valued inputs.
- *w*_{j} are adjustable real parameters.
- *f* is an activation function.

f(x) = 1/(1+e^{-x})        f'(x) = f(x)(1-f(x))
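A quick, illustrative numeric check in Python that the analytic derivative matches a finite-difference estimate:

```python
import math

def f(x):
    """Sigmoid activation: maps any real input into (0, 1)."""
    return 1 / (1 + math.exp(-x))

x, d = 0.3, 1e-6
print(f(x) * (1 - f(x)))       # analytic: f(x)(1 - f(x)) ~ 0.2445
print((f(x + d) - f(x)) / d)   # numeric finite difference ~ 0.2445
```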

- The values of the attributes are real numbers.
- Thirteen parameters *w*_{0}, ..., *w*_{12} are real numbers.
- The attributes *h*_{1} and *h*_{2} correspond to the values of hidden units.
- There are 13 real numbers to be learned. The hypothesis space is thus a 13-dimensional real space.
- Each point in this 13-dimensional space corresponds to a particular logic program that predicts a value for *reads* given *known*, *new*, *short*, and *home*.

```
predicted_prop(Obj,reads,V) <-
   prop(Obj,h_1,I_1) & prop(Obj,h_2,I_2) &
   V is f(w_0 + w_1×I_1 + w_2×I_2).

prop(Obj,h_1,V) <-
   prop(Obj,known,I_1) & prop(Obj,new,I_2) &
   prop(Obj,short,I_3) & prop(Obj,home,I_4) &
   V is f(w_3 + w_4×I_1 + w_5×I_2 + w_6×I_3 + w_7×I_4).

prop(Obj,h_2,V) <-
   prop(Obj,known,I_1) & prop(Obj,new,I_2) &
   prop(Obj,short,I_3) & prop(Obj,home,I_4) &
   V is f(w_8 + w_9×I_1 + w_10×I_2 + w_11×I_3 + w_12×I_4).
```
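The same 4-2-1 network as a Python forward pass, with the 13 weights in a list indexed exactly as in the clauses above (an illustrative sketch; the weight values themselves are what learning must find):

```python
import math

def f(x):
    return 1 / (1 + math.exp(-x))  # sigmoid activation

def predict_reads(w, known, new, short, home):
    """Forward pass: two hidden units feed one output unit.
    w is a list of 13 parameters, w[0]..w[12], as in the clauses."""
    h1 = f(w[3] + w[4]*known + w[5]*new + w[6]*short + w[7]*home)
    h2 = f(w[8] + w[9]*known + w[10]*new + w[11]*short + w[12]*home)
    return f(w[0] + w[1]*h1 + w[2]*h2)
```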

- For particular values **w** = w_{0}, ..., w_{m} for the parameters and a set *E* of examples, the **sum-of-squares error** is

  Error_{E}(**w**) = SUM_{e in E} (p_{e}^{**w**} - o_{e})^{2}

  where p_{e}^{**w**} is the output predicted by the neural network with parameter values **w** for example *e*, and o_{e} is the observed output for example *e*.
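In code, this error is a one-liner. Here `predict` stands for the function computing p_{e}^{**w**} (for instance, `predict_reads` above), and each example pairs its inputs with the observed output o_{e}:

```python
def sum_of_squares_error(w, examples, predict):
    """Error_E(w): sum over examples of (predicted - observed)^2.
    Each example is (inputs, observed); predict(w, *inputs) is p_e^w."""
    return sum((predict(w, *inputs) - observed) ** 2
               for inputs, observed in examples)
```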

- The aim of neural network learning is, given a set of examples, to find parameter settings that minimize the error.
- **Back-propagation learning** is gradient descent search through the parameter space to minimize the sum-of-squares error.

*Inputs:*
- A network, including all units and their connections
- Stopping criteria
- Learning rate (constant of proportionality of gradient descent search)
- Initial values for the parameters
- A set of classified training data

*Output:* Updated values for the parameters

- Repeat:
  - evaluate the network on each example given the current parameter settings
  - determine the derivative of the error for each parameter
  - change each parameter in proportion to its derivative
- until the stopping criterion is met

- At each iteration, update each parameter *w*_{i}:

  w_{i} <- w_{i} - eta × ∂error(**w**)/∂w_{i}

  where *eta* is the learning rate.
- You can compute the partial derivative:
  - numerically: for small *Delta*, (error(w_{i}+Delta) - error(w_{i}))/Delta, as in the sketch below
  - analytically: using *f'(x) = f(x)(1-f(x))* and the chain rule
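A sketch of the numerical variant: plain gradient descent with finite-difference derivatives. It is far slower than true back-propagation (which obtains the same derivatives analytically) but shows the update rule directly; `error` is a function of the parameter list, such as `sum_of_squares_error` bound to a fixed training set.

```python
def gradient_descent(w, error, eta=0.5, delta=1e-6, steps=1000):
    """Minimize error(w) by moving each parameter against a
    finite-difference estimate of its partial derivative."""
    w = list(w)
    for _ in range(steps):          # stopping criterion: fixed step count
        base = error(w)
        grad = []
        for i in range(len(w)):
            w[i] += delta
            grad.append((error(w) - base) / delta)  # ~ d(error)/dw_i
            w[i] -= delta
        for i in range(len(w)):
            w[i] -= eta * grad[i]   # w_i <- w_i - eta * derivative
    return w
```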

A trace of parameter values, derivatives, and error across iterations of gradient descent:

| Parameter | Iter. 0 Value | Iter. 0 Deriv | Iter. 1 Value | Iter. 80 Value |
|-----------|---------------|---------------|---------------|----------------|
| w_{0} | 0.2 | 0.768 | -0.18 | -2.98 |
| w_{1} | 0.12 | 0.373 | -0.07 | 6.88 |
| w_{2} | 0.112 | 0.425 | -0.10 | -2.10 |
| w_{3} | 0.22 | 0.0262 | 0.21 | -5.25 |
| w_{4} | 0.23 | 0.0179 | 0.22 | 1.98 |
| Error: | 4.6121 | | 4.6128 | 0.178 |

Parameter settings for which a single unit computes a logical function of two inputs:

| w_{0} | w_{1} | w_{2} | Logic |
|-------|-------|-------|-------|
| -15 | 10 | 10 | and |
| -5 | 10 | 10 | or |
| 5 | -10 | -10 | nor |

A single unit can't represent *xor*.
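A small illustrative script confirming the table with a single sigmoid unit, thresholding its output at 0.5 (no choice of w0, w1, w2 makes this unit output *xor*'s pattern 0, 1, 1, 0, since a single unit can only separate the inputs with a line):

```python
import math

def unit(w0, w1, w2, i1, i2):
    """One sigmoid unit over two boolean (0/1) inputs."""
    return 1 / (1 + math.exp(-(w0 + w1*i1 + w2*i2)))

for name, (w0, w1, w2) in [("and", (-15, 10, 10)),
                           ("or", (-5, 10, 10)),
                           ("nor", (5, -10, -10))]:
    outs = [round(unit(w0, w1, w2, i1, i2))
            for i1 in (0, 1) for i2 in (0, 1)]
    print(name, outs)  # and [0,0,0,1], or [0,1,1,1], nor [1,0,0,0]
```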

- It's easy for a neural network to represent "at least two of *I*_{1}, ..., *I*_{k} are true":

  | w_{0} | w_{1} | ··· | w_{k} |
  |-------|-------|-----|-------|
  | -15 | 10 | ··· | 10 |

  This concept forms a large decision tree.
- Consider representing a conditional: "if *c* then *a* else *b*":
  - Simple in a decision tree.
  - Needs a complicated neural network to represent *(c & a) or (¬c & b)*.

- Meaning is attached to the input and output units.
- There is no a priori meaning associated with the hidden units.
- What the hidden units actually represent is something that's learned.

©David Poole, Alan Mackworth, Randy Goebel and Oxford University Press, 1998-2002