Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.
11.1.3 EM for Soft Clustering
The EM algorithm can be used for soft clustering. Intuitively, for clustering, EM is like the k-means algorithm, but examples are probabilistically in classes, and probabilities define the distance metric. We assume here that the features are discrete.
As in the k-means algorithm, the training examples and the number of classes, k, are given as input.
When clustering, the role of the categorization is to be able to predict the values of the features. To use EM for soft clustering, we can use a naive Bayesian classifier, where the input features probabilistically depend on the class and are independent of each other given the class. The class variable has k values, which can be just {1,...,k}.
[Figure 11.2: Model + Data → Probabilities]
Given the naive Bayesian model and the data, the EM algorithm produces the probabilities needed for the classifier, as shown in Figure 11.2. The class variable is C in this figure. The probability distribution of the class and the probabilities of the features given the class are enough to classify any new example.
To initialize the EM algorithm, augment the data with a class feature, C, and a count column. Each original tuple gets mapped into k tuples, one for each class. The counts for these tuples are assigned randomly so that they sum to 1. For example, for four features and three classes, we could have the following:
X_{1}  X_{2}  X_{3}  X_{4}
 ...    ...    ...    ...
  t      f      t      t
 ...    ...    ...    ...

    →

X_{1}  X_{2}  X_{3}  X_{4}   C   Count
 ...    ...    ...    ...   ...   ...
  t      f      t      t     1    0.4
  t      f      t      t     2    0.1
  t      f      t      t     3    0.5
 ...    ...    ...    ...   ...   ...
If the set of training examples contains multiple tuples with the same values on the input features, these can be grouped together in the augmented data. If there are m tuples in the set of training examples with the same assignment of values to the input features, the sum of the counts in the augmented data with those feature values is equal to m.
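As a concrete sketch of this initialization in Python (the function name, the random seed, and the example tuples are illustrative assumptions, not part of the text):

```python
import random
from collections import Counter

def augment(data, k, seed=0):
    """Map each distinct tuple of feature values to k augmented tuples,
    with random counts that sum to the tuple's multiplicity m in the data."""
    rng = random.Random(seed)
    augmented = {}                               # (tuple, class) -> count
    for tup, m in Counter(data).items():         # group identical tuples
        w = [rng.random() for _ in range(k)]     # random positive weights
        total = sum(w)
        for c in range(1, k + 1):
            augmented[(tup, c)] = m * w[c - 1] / total
    return augmented

# Example: two copies of one tuple, one copy of another, three classes
data = [(True, False, True, True),
        (True, False, True, True),
        (False, True, False, True)]
aug = augment(data, k=3)
```

The k counts for the duplicated tuple sum to 2, matching the grouping described above.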
The EM algorithm, illustrated in Figure 11.3, maintains both the probability tables and the augmented data. In the E step, it updates the counts, and in the M step it updates the probabilities.
1: procedure EM(X,D,k)
2: Inputs
3: X set of features X={X_{1},...,X_{n}}
4: D data set on features {X_{1},...,X_{n}}
5: k number of classes
Output
6: P(C), P(X_{i}∣C) for each i∈{1:n}, where C={1,...,k}
Local
7: real array A[X_{1},...,X_{n},C]
8: real array P[C]
9: real arrays M_{i}[X_{i},C] for each i∈{1:n}
10: real arrays P_{i}[X_{i},C] for each i∈{1:n}
11: s← number of tuples in D
12: Assign P[C], P_{i}[X_{i},C] arbitrarily
13: repeat
14: // E Step
15: for each assignment ⟨X_{1}=v_{1},...,X_{n}=v_{n}⟩∈D do
16: let m ← number of copies of ⟨X_{1}=v_{1},...,X_{n}=v_{n}⟩ in D
17: for each c ∈{1:k} do
18: A[v_{1},...,v_{n},c]←m×P(C=c∣X_{1}=v_{1},...,X_{n}=v_{n})
19:
20:
21: // M Step
22: for each i∈{1:n} do
23: M_{i}[X_{i},C]=∑_{X_{1},...,X_{i-1},X_{i+1},...,X_{n}} A[X_{1},...,X_{n},C]
24: P_{i}[X_{i},C]=(M_{i}[X_{i},C])/(∑_{X_{i}} M_{i}[X_{i},C])
25:
26: P[C]=∑_{X_{1},...,X_{n}} A[X_{1},...,X_{n},C]/s
27: until termination
The algorithm is presented in Figure 11.4. In this figure, A[X_{1},...,X_{n},C] contains the augmented data; M_{i}[X_{i},C] is the marginal probability, P(X_{i},C), derived from A; and P_{i}[X_{i},C] is the conditional probability P(X_{i}∣C). It repeats two steps:
• E step: Update the augmented data based on the probability distribution. Suppose there are m copies of the tuple ⟨X_{1}=v_{1},...,X_{n}=v_{n}⟩ in the original data. In the augmented data, the count associated with class c, stored in A[v_{1},...,v_{n},c], is updated to m×P(C=c∣X_{1}=v_{1},...,X_{n}=v_{n}). Note that this step involves probabilistic inference, as shown below.
• M step: Infer the maximum-likelihood probabilities for the model from the augmented data. This is the same problem as learning probabilities from data.
The EM algorithm presented starts with made-up probabilities; it could equally have started with made-up counts. EM will converge to a local maximum of the likelihood of the data. The algorithm can terminate when the changes are small enough.
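The two steps can be sketched in runnable Python for Boolean features. This is a minimal sketch, not the book's code: the function name, the random initialization, the iteration cap, and the convergence threshold are assumptions, and classes are indexed 0..k-1 here rather than {1,...,k}.

```python
import random
from collections import Counter

def em_soft_cluster(data, k, n_feats, tol=1e-6, max_iters=1000, seed=0):
    """EM soft clustering with a naive Bayesian model over Boolean features.
    Returns pc (the table P[C]) and pxc (the tables P_i[X_i,C]) as dicts."""
    rng = random.Random(seed)
    counts = Counter(data)            # multiplicity m of each distinct tuple
    s = len(data)                     # s <- number of tuples in D
    # assign P[C] and P_i[X_i,C] arbitrarily (but positive)
    pc = {c: 1.0 / k for c in range(k)}
    pxc = {}
    for i in range(n_feats):
        for c in range(k):
            p = rng.uniform(0.3, 0.7)
            pxc[(i, True, c)] = p
            pxc[(i, False, c)] = 1.0 - p
    for _ in range(max_iters):
        # E step: A[v_1,...,v_n,c] = m * P(C=c | X_1=v_1,...,X_n=v_n)
        A = {}
        for tup, m in counts.items():
            joint = []
            for c in range(k):
                p = pc[c]
                for i, v in enumerate(tup):
                    p *= pxc[(i, v, c)]
                joint.append(p)
            z = sum(joint)
            for c in range(k):
                A[(tup, c)] = m * joint[c] / z
        # M step: maximum-likelihood probabilities from the augmented data
        new_pc = {c: sum(A[(t, c)] for t in counts) / s for c in range(k)}
        for i in range(n_feats):
            for c in range(k):
                m_true = sum(A[(t, c)] for t in counts if t[i])
                m_false = sum(A[(t, c)] for t in counts if not t[i])
                z = m_true + m_false
                pxc[(i, True, c)] = m_true / z
                pxc[(i, False, c)] = m_false / z
        change = max(abs(new_pc[c] - pc[c]) for c in range(k))
        pc = new_pc
        if change < tol:              # terminate when changes are small
            break
    return pc, pxc
```

With data drawn from well-separated Boolean patterns, the learned P(X_{i}∣C) tables tend toward 0 or 1 for some permutation of the class labels; any permutation is an equally good local maximum.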
This algorithm returns a probabilistic model, which can be used to classify an existing or new example. The way to classify a new example, and the way to evaluate line 18, is to use the following:
P(C=c∣X_{1}=v_{1},...,X_{n}=v_{n}) = (P(C=c) ×∏_{i=1}^{n} P(X_{i}=v_{i}∣C=c))/(∑_{c'}P(C=c') ×∏_{i=1}^{n} P(X_{i}=v_{i}∣C=c')) .
The probabilities in this equation are specified as part of the model learned.
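This equation might be implemented as follows (a sketch; the dictionary representation of the learned tables and the tiny illustrative model are assumptions):

```python
def class_posterior(pc, pxc, example):
    """P(C=c | X_1=v_1,...,X_n=v_n) for every class c, by the equation above.
    pc maps c -> P(C=c); pxc maps (i, v, c) -> P(X_i=v | C=c)."""
    joint = {}
    for c in pc:
        p = pc[c]
        for i, v in enumerate(example):
            p *= pxc[(i, v, c)]
        joint[c] = p
    z = sum(joint.values())           # the normalizing denominator
    return {c: p / z for c, p in joint.items()}

# Tiny illustrative model: two classes, one Boolean feature
pc = {1: 0.5, 2: 0.5}
pxc = {(0, True, 1): 0.9, (0, False, 1): 0.1,
       (0, True, 2): 0.2, (0, False, 2): 0.8}
post = class_posterior(pc, pxc, (True,))
# P(C=1 | X_1=t) = 0.5*0.9 / (0.5*0.9 + 0.5*0.2) = 0.45/0.55
```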
Notice the similarity with the k-means algorithm. The E step (probabilistically) assigns examples to classes, and the M step determines what the classes predict.
In the M step, P(C=i) is set to the proportion of the counts with C=i, which is
(∑_{X_{1},...,X_{n}} A[X_{1},...,X_{n},C=i])/(s) ,
where s is the number of tuples in the training set; this can be computed with one pass through the data.
M_{1}[X_{1},C], for example, becomes
∑_{X_{2},X_{3},X_{4}} A[X_{1},...,X_{4},C].
It is possible to update all of the M_{i}[X_{i},C] arrays with one pass through the data. See Exercise 11.3. The conditional probabilities represented by the P_{i} arrays can be computed from the M_{i} arrays by normalizing.
The E step updates the counts in the augmented data. It will replace the 0.4 in Figure 11.3 with
P(C=1∣x_{1} ∧¬x_{2} ∧x_{3} ∧x_{4}) = (P(x_{1}∣C=1) P(¬x_{2}∣C=1)P(x_{3}∣C=1)P(x_{4}∣C=1)P(C=1))/(∑_{i=1}^{3} P(x_{1}∣C=i) P(¬x_{2}∣C=i)P(x_{3}∣C=i)P(x_{4}∣C=i)P(C=i)) .
Each of these probabilities is provided as part of the learned model.
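To make the computation concrete, here is the same calculation with made-up probability values (the numbers below are illustrative assumptions, not values from the text):

```python
# Assumed learned probabilities for three classes (illustrative only).
# For each class i, the list holds P(x1|C=i), P(not x2|C=i), P(x3|C=i), P(x4|C=i),
# i.e., the probabilities of the observed values x1=t, x2=f, x3=t, x4=t.
p_feats = {1: [0.7, 0.4, 0.6, 0.8],
           2: [0.2, 0.5, 0.3, 0.4],
           3: [0.6, 0.6, 0.5, 0.7]}
p_class = {1: 0.3, 2: 0.3, 3: 0.4}

# Numerator for each class, then normalize over the three classes
joint = {i: p_class[i] * p_feats[i][0] * p_feats[i][1]
            * p_feats[i][2] * p_feats[i][3]
         for i in (1, 2, 3)}
z = sum(joint.values())
posterior_1 = joint[1] / z    # this value would replace the 0.4 count
```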
Note that, as long as k>1, EM virtually always has multiple local maxima. In particular, any permutation of the class labels of a local maximum will also be a local maximum.