Give the weights and structure of a neural network with a sigmoid output activation and one hidden layer with a ReLU activation, that can represent the exclusive-or (xor) function of two Booleans, which is true when the inputs have different truth values; see Figure 7.13. Assume true is represented as 1, and false as 0. [Hint: Write exclusive-or in terms of other logical operators. See Exercise 7.9 and Example 8.1. You need to think about how many units need to be in the hidden layer.]
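One way to check a candidate answer is to evaluate it on all four input combinations. The sketch below is a minimal NumPy check, not part of the exercise itself: the names W1, b1, W2, b2 are placeholders to be filled in with your candidate weights, and a network represents xor when the printed output is above 0.5 exactly on the rows where the expected value is True.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Candidate parameters (placeholders): two inputs -> two hidden ReLU units -> sigmoid output.
W1 = np.zeros((2, 2))   # hidden-layer weights
b1 = np.zeros(2)        # hidden-layer biases
W2 = np.zeros(2)        # output-layer weights
b2 = 0.0                # output-layer bias

for x1 in (0, 1):
    for x2 in (0, 1):
        h = relu(np.array([x1, x2]) @ W1 + b1)
        y = sigmoid(h @ W2 + b2)
        print(x1, x2, "->", round(float(y), 3), "expected", x1 != x2)
```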
Run the AIPython (aipython.org) neural network code or another learner on the “Mail reading” data of Figure 7.1 with a single hidden layer with two hidden units.
Suppose that you decide to use any predicted value from the neural network greater than 0.5 as true, and any value less than 0.5 as false. How many examples are misclassified initially? How many examples are misclassified after 40 iterations? How many examples are misclassified after 80 iterations?
Try the same example and the same initial values, with different step sizes for the gradient descent. Try at least three different step sizes, spanning small, moderate, and large values. Comment on the relationship between step size and convergence (a gradient-descent sketch follows this exercise).
Given the final parameter values you found, give a logical formula for what each of the units is computing. [Hint: As a brute-force method, for each of the units, build the truth tables for the input values and determine the output for each combination, then simplify the resulting formula.] Is it always possible to find such a formula?
All of the parameters were set to different initial values. What happens if the parameter values are all set to the same (random) value? Test it out for this example, and hypothesize what occurs in general.
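The step-size and initialization experiments above can be scripted directly. The sketch below is a minimal NumPy implementation rather than the AIPython code; it uses a tiny xor dataset and illustrative step sizes (0.1, 1.0, 5.0) as stand-ins for the “Mail reading” data and whatever values you choose. It trains a one-hidden-layer network of sigmoid units by batch gradient descent on sum-of-squares error, reports misclassifications for each step size, and shows what happens when every parameter starts at the same value.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, step_size, same_init=False, hidden=2, iters=1000, seed=0):
    """Batch gradient descent on a one-hidden-layer sigmoid network (sum-of-squares error)."""
    rng = np.random.default_rng(seed)
    if same_init:
        # every parameter gets the same random value
        v = rng.normal()
        W1 = np.full((X.shape[1], hidden), v); b1 = np.full(hidden, v)
        W2 = np.full(hidden, v); b2 = v
    else:
        W1 = rng.normal(size=(X.shape[1], hidden)); b1 = rng.normal(size=hidden)
        W2 = rng.normal(size=hidden); b2 = rng.normal()
    for _ in range(iters):
        H = sigmoid(X @ W1 + b1)           # hidden activations
        p = sigmoid(H @ W2 + b2)           # predictions
        d_out = (p - y) * p * (1 - p)      # error signal at the output
        d_hid = np.outer(d_out, W2) * H * (1 - H)
        W2 -= step_size * H.T @ d_out; b2 -= step_size * d_out.sum()
        W1 -= step_size * X.T @ d_hid; b1 -= step_size * d_hid.sum(axis=0)
    return ((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5) != y).sum()

# Toy stand-in for a small Boolean dataset; the "Mail reading" data would be loaded instead.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]); y = np.array([0, 1, 1, 0])
for eta in (0.1, 1.0, 5.0):
    print("step size", eta, "misclassified:", train(X, y, eta))
print("identical initial values, misclassified:", train(X, y, 1.0, same_init=True))
```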
For the neural network algorithm, comment on the following stopping criteria.
Learn for a limited number of iterations, where the limit is set initially.
Stop when the squared error is less than some threshold close to zero.
Stop when the derivatives all become within some ε of zero.
Split the data into training data and validation data, train on the training data, and stop when the error on the validation data increases (this criterion is sketched in the code after this exercise).
Which would you expect to better handle overfitting? Which criteria guarantee the gradient descent will stop? Which criteria would guarantee that, if it stops, the network can be used to predict the test data accurately?
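The four criteria can be compared concretely inside a training loop. The skeleton below is a generic sketch, not AIPython code: update and error are hypothetical callbacks (one gradient step that returns the gradient, and the error on a dataset), and the threshold names and defaults are placeholders.

```python
import numpy as np

def train_with_stopping(update, error, params, train_data, valid_data,
                        max_iters=10_000, error_threshold=0.01, grad_epsilon=1e-6):
    """Gradient-descent loop illustrating the four stopping criteria above.
    `update(params, data)` is assumed to take one gradient step in place and
    return the gradient; `error(params, data)` returns the error on a dataset."""
    best_valid = float("inf")
    for _ in range(max_iters):                              # (i) fixed iteration limit
        grad = update(params, train_data)
        if error(params, train_data) < error_threshold:     # (ii) training error near zero
            return params, "training error below threshold"
        if np.all(np.abs(grad) < grad_epsilon):             # (iii) all derivatives within epsilon of zero
            return params, "gradients approximately zero"
        valid_err = error(params, valid_data)               # (iv) validation error increases
        if valid_err > best_valid:
            return params, "validation error increased"
        best_valid = valid_err
    return params, "iteration limit reached"
```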
Adam was described as a combination of momentum and RMS-Prop. Using AIPython (aipython.org), Keras, or PyTorch (see Appendix B.2), find two datasets and compare the following (a sketch of the Adam update follows this exercise):
How does Adam with β₁ = 0 and β₂ = 0 differ from plain stochastic gradient descent without momentum? [Hint: How does setting β₁ = β₂ = 0 simplify Adam, considering first the case where ε = 0?] Which works better on the datasets selected?
How does Adam with β₂ = 0 differ from stochastic gradient descent with momentum, when the momentum parameter is equal to β₁ in Adam? [Hint: How does setting β₂ = 0 simplify Adam, considering first the case where ε = 0?] Which works better on the datasets selected?
How does Adam with β₁ = 0 differ from RMS-Prop, where the decay parameter in RMS-Prop is equal to β₂ in Adam? Which works better on the datasets selected?
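To see which terms each comparison removes, here is a minimal sketch of one Adam update in plain NumPy (the function name and default hyperparameters are illustrative, not from the book). Setting β₁ and/or β₂ to zero in this code shows how Adam degenerates toward sign-of-gradient descent, momentum, or an RMS-Prop-style update.

```python
import numpy as np

def adam_step(w, g, state, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector w with gradient g.
    beta1 = beta2 = 0 (with eps near 0) gives a step of roughly eta * sign(g);
    beta2 = 0 keeps only the momentum average of the (normalized) gradient;
    beta1 = 0 gives an RMS-Prop-style update with bias correction."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g         # first moment: the momentum part
    v = beta2 * v + (1 - beta2) * g * g     # second moment: the RMS-Prop part
    m_hat = m / (1 - beta1 ** t)            # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Example: a few updates on a 3-parameter vector with a fixed gradient.
w, state = np.zeros(3), (np.zeros(3), np.zeros(3), 0)
for _ in range(5):
    w, state = adam_step(w, np.array([0.1, -2.0, 0.5]), state)
print(w)
```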
The Conv2D code of Figure 8.9 does not include a stride. Show how a stride can be incorporated into the pseudocode, where the stride is a pair of numbers, one for each dimension. Implement it in AIPython (aipython.org).
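As a starting point, the sketch below is a NumPy version of a strided 2D convolution forward pass for a single channel and a single kernel; Figure 8.9 also covers learning, so this is not a replacement for the exercise. The stride is a pair (stride height, stride width), and the function name is chosen here for illustration.

```python
import numpy as np

def conv2d(image, kernel, stride=(1, 1)):
    """Valid 2D convolution (cross-correlation, as in most deep-learning code)
    of a single-channel image with a single kernel and a per-dimension stride."""
    sh, sw = stride
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // sh + 1          # output height
    ow = (iw - kw) // sw + 1          # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * sh:i * sh + kh, j * sw:j * sw + kw]
            out[i, j] = np.sum(patch * kernel)
    return out
```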
Give the pseudocode for Conv1D, for one-dimensional convolutions (the one-dimensional version of Figure 8.9). What hyperparameters are required? Your pseudocode will not include all of the hyperparameters of Keras or PyTorch; choose two of the hyperparameters of one of these and show how the pseudocode can be extended to include them.
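In the same spirit, here is a single-channel 1D sketch with stride and zero-padding chosen as two example hyperparameters (Keras and PyTorch have more, such as dilation and the numbers of input and output channels); the function name and defaults are illustrative.

```python
import numpy as np

def conv1d(x, kernel, stride=1, padding=0):
    """Valid 1D cross-correlation of a sequence x with a single kernel,
    with stride and zero-padding as two example hyperparameters."""
    if padding:
        x = np.concatenate([np.zeros(padding), x, np.zeros(padding)])
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n_out)])
```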
In Example 8.11, the LSTM was character-based, and there were about 3.5 million parameters. (A parameter-counting sketch follows this exercise.)
How many parameters would there be in an LSTM if it was word-based with a vocabulary of 1000 words and a hidden state of size 1000?
How many parameters would there be if the vocabulary had 10,000 words and the hidden state was of size 10,000?
Consider a simple character-based transformer with a single attention mechanism that performs self-attention to predict the next character in a text. Suppose the window size is 100, the embedding size is 1000, and there are 64 characters. Suppose as part of the transformer there are dense functions for the query, key, and value as inputs to the attention mechanism, and the output of the attention goes directly into a softmax. How many parameters are there?
Suppose instead of the character-based transformer of the previous part, the transformer was word-based, with a vocabulary of 10,000 words. How many parameters are there?
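The bookkeeping can be scripted. The sketch below uses one possible counting convention (four LSTM gates, each with input weights, recurrent weights, and a bias, plus a dense output layer; for the transformer, an embedding table, dense query/key/value maps, and a softmax output layer, ignoring positional encodings and any normalization), so the totals it prints are illustrative and may differ from the counts the exercise intends.

```python
def lstm_params(input_size, hidden_size, output_size):
    """Rough LSTM parameter count: four gates, each with input weights,
    recurrent weights, and a bias, plus a dense output layer."""
    gates = 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)
    output = hidden_size * output_size + output_size
    return gates + output

def attention_params(vocab_size, embed_size):
    """Rough count for a single self-attention head: an embedding table,
    dense query/key/value maps, and a softmax output layer over the vocabulary."""
    embedding = vocab_size * embed_size
    qkv = 3 * (embed_size * embed_size + embed_size)
    out = embed_size * vocab_size + vocab_size
    return embedding + qkv + out

print(lstm_params(1000, 1000, 1000))        # word-based LSTM: vocabulary and hidden state of 1000
print(lstm_params(10_000, 10_000, 10_000))  # vocabulary and hidden state of 10,000
print(attention_params(64, 1000))           # character-based transformer
print(attention_params(10_000, 1000))       # word-based transformer
```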
Take the text of some classic work, such as one available on gutenberg.org. Repeat the experiment of Example 8.11 with that text. Increase the number of hidden nodes from 512 to 2048 and double the number of epochs. Is the performance better? What evidence can you provide to show it is better?
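One possible setup is sketched in Keras below; it is not necessarily the preprocessing of Example 8.11, and the file name, embedding size, sequence length, batch size, and epoch count are placeholders. The model predicts the next character at every position, and the loss and accuracy on the held-out split give one kind of evidence for whether the larger network is better.

```python
import numpy as np
import tensorflow as tf

# Load the text (the path is a placeholder) and map characters to integers.
text = open("book.txt", encoding="utf-8").read()
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
ids = np.array([char_to_id[c] for c in text])

# Build input/target windows: the target is the input shifted by one character.
seq_len = 100
starts = range(0, len(ids) - seq_len - 1, seq_len)
X = np.stack([ids[i:i + seq_len] for i in starts])
y = np.stack([ids[i + 1:i + seq_len + 1] for i in starts])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), 64),
    tf.keras.layers.LSTM(2048, return_sequences=True),
    tf.keras.layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["sparse_categorical_accuracy"])
model.fit(X, y, batch_size=64, epochs=20, validation_split=0.1)
```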