TrainCRF UGM Demo

The MRF model we used in the last demo was fairly naive, and it can be improved in various ways. As one example, in this demo we will take into account the month of the year. One way to do this would be to add a 12-state 'month' variable to the model that is connected to every node. We could then do parameter estimation in the model by cutset conditioning (i.e., we condition on the month variable so that the remaining variables form a chain). We could subsequently do conditional decoding/inference/sampling in the model, where we condition on the month.

If we are only interested in conditional queries, then a conditional random field (CRF) might be a better option. In CRFs, we have two types of variables: (i) the 'features' (also known as 'covariates') X are treated as fixed non-random variables, and (ii) the 'labels' y are treated as random variables in a UGM, where the parameters of the UGM depend on the features. In our case, the 'labels' will be the 28 variables representing days of the month, and the features will consist of a 'bias' feature and a set of indicator variables that will represent the month. That is, each possible value that the features can take leads to a different UGM model of the labels.

Vancouver Rain Data, with Months

In addition to the matrix of binarized daily precipitation values, the file 'rain.mat' also contains a vector months that gives, for each training example, a number in the range 1-12 representing the month of the year. We might hope to get a better model by using this data, since common sense suggests that it is more likely to rain in December than in July. Of course, since we have the relevant data we don't need to just assume that our common sense is correct; we can use the data to find a good way to incorporate the month information. In particular, we are going to use a CRF where the months are used to define the features, and the binarized daily precipitation values represent the labels. As before, the first step will be loading the data and making the edgeStruct:
load rain.mat
y = int32(X+1);
[nInstances,nNodes] = size(y);
nStates = max(y);
adj = zeros(nNodes);
for i = 1:nNodes-1
    adj(i,i+1) = 1;
end
adj = adj+adj';
edgeStruct = UGM_makeEdgeStruct(adj,nStates);
maxState = max(nStates);
Note that we have used y to denote the samples of the random variables in this demo, since X is typically used to denote the features in a CRF model.
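Before building the model, we can quickly check the common-sense claim above by computing the empirical probability of rain in each month. This is just an optional sanity check, not part of the demo proper; a minimal sketch using the X and months variables loaded from rain.mat:

for m = 1:12
    % fraction of rainy days among all instances belonging to month m
    fprintf('Month %2d: P(rain) = %.3f\n',m,mean(mean(X(months==m,:))));
end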

Representing the Features

To train CRFs, we need to make the arrays Xnode and Xedge. Xnode represents features that affect the node potentials, while Xedge represents features that affect the edge potentials. Note that unlike the labels y, UGM assumes that Xnode and Xedge are stored in Matlab's default double format.

For our first CRF model, we will only use a single 'bias' feature for the node and edge potentials. A 'bias' feature is a feature that is always set to one, and it is included to reflect any effects on the states that are independent of the features. MRFs correspond to the special case where we only have a bias feature. In general, the features in CRFs can be arbitrary real-valued numbers, so we do not necessarily need to include a bias feature. However, unless the states are perfectly balanced in the training data, it usually makes sense to include one. After training this simple model (and verifying that it gives the same results as the previous demo), we will move on to using node and edge features that incorporate the month information to give a better model.

If our CRF has nInstances (conditional) samples of nNodes random variables, and we have nNodeFeatures features for each node, then the Xnode variable will be an nInstances by nNodeFeatures by nNodes array, each element containing the relevant feature value for the appropriate node in the particular instance. Therefore, for the simple case where we only have a single unit-valued feature, we would make Xnode as follows:

Xnode = ones(nInstances,1,nNodes);
Similarly, the Xedge variable will be an nInstances by nEdgeFeatures by nEdges array. To make the simple unit-valued edge features we use:
nEdges = edgeStruct.nEdges;
Xedge = ones(nInstances,1,nEdges);

Making the nodeMap and edgeMap

For training CRFs, the nodeMap and edgeMap have an extra dimension, since they must take into account how the features are parameterized. As in the case of MRFs, UGM uses a log-linear parameterization of CRFs. This again allows us to state maximum likelihood estimation as a convex optimization problem. In particular, for training example i, UGM uses the convention that each potential is the exponential of a weighted sum of the features, where the maps give the index of the weight applied to each feature:

nodePot(n,s) = exp( sum_f w(nodeMap(n,s,f)) * Xnode(i,f,n) )
edgePot(s1,s2,e) = exp( sum_f w(edgeMap(s1,s2,e,f)) * Xedge(i,f,e) )

(map entries of zero indicate that the corresponding feature is not parameterized, so it contributes nothing to the sum). The following image (due to Yu Zhou) illustrates how the nodeMap/edgeMap interact with the Xnode/Xedge and w to give the potentials:
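As a concrete sketch of this convention (using the w, Xnode, and nodeMap constructed below), the node potentials for a single training instance could be formed by hand as follows; UGM_CRF_makePotentials, introduced later in this demo, does essentially this computation (plus the analogous one for the edges):

i = 1;                                % training instance to build potentials for
nodePot = zeros(nNodes,maxState);
for n = 1:nNodes
    for s = 1:maxState
        logPot = 0;
        for f = 1:size(Xnode,2)
            if nodeMap(n,s,f) ~= 0    % zero map entries are not parameterized
                logPot = logPot + w(nodeMap(n,s,f))*Xnode(i,f,n);
            end
        end
        nodePot(n,s) = exp(logPot);   % potential is exp of weighted feature sum
    end
end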

Maximum Likelihood Training of Log-Linear Conditional Random Fields

Training CRFs with UGM is very similar to the case of MRFs. The main differences are that the nodeMap and edgeMap have the extra dimension that takes into account how each feature is parameterized, and that we pass the features as opposed to the sufficient statistics to the negative log-likelihood function.

In the case where we have only one feature, we don't use the extra dimension of the nodeMap and edgeMap variables, so they can be constructed as before:

nodeMap = zeros(nNodes,maxState,'int32');
nodeMap(:,1) = 1;

edgeMap = zeros(maxState,maxState,nEdges,'int32');
edgeMap(1,1,:) = 2;
edgeMap(2,1,:) = 3;
edgeMap(1,2,:) = 4;
To initialize the parameters, we proceed as before:
nParams = max([nodeMap(:);edgeMap(:)]);
w = zeros(nParams,1);
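Before running the optimizer, we can optionally sanity-check the setup by evaluating the objective once at the initial parameters. This is a minimal sketch of directly calling the UGM_CRF_NLL function described next, which returns the negative log-likelihood and its gradient:

% Optional check: evaluate the objective and gradient at the initial w
[nll,g] = UGM_CRF_NLL(w,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain);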
We will again consider minimizing the negative log-likelihood in order to maximize the likelihood. The negative log-likelihood (and its gradient) are computed by UGM_CRF_NLL. With log-linear CRFs, the negative log-likelihood is still differentiable and jointly convex in the node and edge parameters, so we can again compute its optimal value using minFunc. The negative log-likelihood for CRFs takes similar arguments to the MRF case, but we pass in the features instead of the sufficient statistics:
w = minFunc(@UGM_CRF_NLL,w,[],Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain)

 Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     3.19683e-05     2.83032e+04     9.07391e+03
         2          3     1.00000e+00     2.42379e+04     7.08784e+03
         3          4     1.00000e+00     2.13785e+04     3.39832e+03
         4          5     1.00000e+00     1.95148e+04     2.92734e+03
         5          6     1.00000e+00     1.79881e+04     5.36254e+02
         6          7     1.00000e+00     1.79304e+04     6.91542e+02
         7          8     1.00000e+00     1.79114e+04     1.85252e+02
         8         10     1.00000e+01     1.78885e+04     7.54564e+02
         9         11     1.00000e+00     1.78582e+04     6.71792e+02
        10         12     1.00000e+00     1.78035e+04     5.24928e+01
        11         13     1.00000e+00     1.78032e+04     1.59129e+01
        12         14     1.00000e+00     1.78031e+04     2.55631e+01
        13         15     1.00000e+00     1.78028e+04     5.82554e+01
        14         16     1.00000e+00     1.78021e+04     9.26333e+01
        15         17     1.00000e+00     1.78007e+04     1.23620e+02
        16         18     1.00000e+00     1.77990e+04     1.09300e+02
        17         19     1.00000e+00     1.77979e+04     3.57499e+01
        18         20     1.00000e+00     1.77977e+04     4.54387e+00
        19         21     1.00000e+00     1.77977e+04     1.01720e-01
        20         22     1.00000e+00     1.77977e+04     3.55628e-02
        21         23     1.00000e+00     1.77977e+04     1.99919e-03
Directional Derivative below progTol

w =

As we expected, this degenerate CRF gives us exactly the model we had in the last MRF demo. However, note that training was a bit slower. This is because each instance in a CRF is potentially a different UGM, so the loss function needs to call inferFunc for every sample.

In CRFs, the potentials depend on the features. So to make a set of potentials, we need to choose a set of features. The function UGM_CRF_makePotentials takes a training instance number and returns the corresponding potentials. For example, to return the potentials associated with the first training example we would use

instance = 1;
[nodePot,edgePot] = UGM_CRF_makePotentials(w,Xnode,Xedge,nodeMap,edgeMap,edgeStruct,instance);

nodePot =

    1.8163    1.0000

edgePot =

    0.6377    0.3668
    0.3165    1.0000
As expected, these are the same potentials we had before.
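Since these are ordinary UGM potentials, we can pass them to any of the standard decoding, inference, or sampling routines from the earlier demos. For example, a sketch of decoding and inference for this instance:

optimalDecoding = UGM_Decode_Chain(nodePot,edgePot,edgeStruct);
[nodeBel,edgeBel,logZ] = UGM_Infer_Chain(nodePot,edgePot,edgeStruct);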

Using Node Features

We now want to make a set of node features that incorporate the months of the year. Since the potentials will be formed by exponentiating a linear function of the features, we will represent the months as a set of 12 binary indicator variables:
nFeatures = 12;
Xnode = zeros(nInstances,nFeatures,nNodes);
for m = 1:nFeatures
    Xnode(months==m,m,:) = 1;
end
For example, Xnode(3,4,1) is set to 1 while Xnode(3,[1:3 5:12],1) is 0 because the month variable for the third sample is set to 4 (representing April). We will also want a bias in the model (since we expect there to be some baseline probability of rain that is independent of the month), so we add a feature that is always set to 1:
Xnode = [ones(nInstances,1,nNodes) Xnode];
nNodeFeatures = size(Xnode,2);
We now make the nNodes by maxState by nNodeFeatures nodeMap, which tells UGM, for each variable and state, which parameter multiplies each feature before the results are summed and exponentiated. In our case, all variables will use all 13 features but share the same parameters, so we use:
nodeMap = zeros(nNodes,maxState,nNodeFeatures,'int32');
for f = 1:nNodeFeatures
    nodeMap(:,1,f) = f;
end
We also need to update the edgeMap, so that the nodeMap and edgeMap numbers don't overlap (it rarely makes sense to share node and edge parameters):
edgeMap = zeros(maxState,maxState,nEdges,'int32');
edgeMap(1,1,:) = nNodeFeatures+1;
edgeMap(2,1,:) = nNodeFeatures+2;
edgeMap(1,2,:) = nNodeFeatures+3;
With the updated nodeMap and edgeMap, training proceeds exactly as before:
nParams = max([nodeMap(:);edgeMap(:)]);
w = zeros(nParams,1);
w = minFunc(@UGM_CRF_NLL,w,[],Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain)

 Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     5.50957e-05     1.87895e+04     1.71063e+03
         2          3     1.00000e+00     1.83557e+04     1.71053e+03
         3          4     1.00000e+00     1.76199e+04     3.06222e+03
         4          5     1.00000e+00     1.73411e+04     7.91786e+02
         5          6     1.00000e+00     1.73097e+04     2.19752e+02
         6          7     1.00000e+00     1.72916e+04     2.66328e+02
         7          8     1.00000e+00     1.72552e+04     5.10685e+02
         8          9     1.00000e+00     1.72132e+04     4.40088e+02
         9         10     1.00000e+00     1.71942e+04     8.79763e+01
        10         11     1.00000e+00     1.71925e+04     5.99177e+01
        11         12     1.00000e+00     1.71918e+04     8.47407e+01
        12         13     1.00000e+00     1.71893e+04     1.21715e+02
        13         14     1.00000e+00     1.71855e+04     1.22600e+02
        14         15     1.00000e+00     1.71756e+04     2.08079e+02
        15         16     1.00000e+00     1.71695e+04     1.12069e+02
        16         17     1.00000e+00     1.71670e+04     1.95784e+01
        17         18     1.00000e+00     1.71668e+04     1.46669e+01
        18         19     1.00000e+00     1.71665e+04     1.99287e+01
        19         20     1.00000e+00     1.71661e+04     1.94539e+01
        20         21     1.00000e+00     1.71659e+04     3.94953e+00
        21         22     1.00000e+00     1.71658e+04     1.08766e+00
        22         23     1.00000e+00     1.71658e+04     1.80154e-01
        23         24     1.00000e+00     1.71658e+04     8.30007e-02
        24         25     1.00000e+00     1.71658e+04     2.70300e-02
        25         26     1.00000e+00     1.71658e+04     2.17654e-03
        26         27     1.00000e+00     1.71658e+04     2.90511e-04
Directional Derivative below progTol

w =

As expected, conditioning on the month information leads to a better fit. In fact, the improvement in likelihood from the MRF model to this CRF model is much bigger than the improvement we obtained by using more expressive edge potentials in the previous demo.

Using Edge Features

In the above CRF model, the node potentials depend on the features but the edge potentials do not. That is, the model doesn't take into account that the months could also affect the chances of transitioning to/from rain. To make our edge potentials also depend on the months, we will make a new Xedge that incorporates the month information. One way to make the edge features given the node features is with the function UGM_makeEdgeFeatures. This function looks at the node features for each node on the edge and does one of two things: (i) if the feature is 'shared', it uses the node feature from the first node of the edge (as defined by the edgeEnds structure); (ii) otherwise, it uses the features from both edge ends. In our case, it makes sense to treat all the node features as shared, since all nodes have the same features (for a given training example):
sharedFeatures = ones(13,1);
Xedge = UGM_makeEdgeFeatures(Xnode,edgeStruct.edgeEnds,sharedFeatures);
nEdgeFeatures = size(Xedge,2);
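For contrast, if we had instead marked no features as shared, each node feature would contribute two edge features (one per edge end). A hypothetical sketch (Xedge2 is a made-up name):

% Hypothetical: treat no node features as shared
Xedge2 = UGM_makeEdgeFeatures(Xnode,edgeStruct.edgeEnds,zeros(13,1));
size(Xedge2,2)   % 26 edge features (2 per node feature) instead of 13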
We now make the maxState by maxState by nEdges by nEdgeFeatures edgeMap:
edgeMap = zeros(maxState,maxState,nEdges,nEdgeFeatures,'int32');
f = max(nodeMap(:)); % start the edge parameters after the node parameters
for edgeFeat = 1:nEdgeFeatures
    for s1 = 1:2
        for s2 = 1:2
            f = f+1;
            edgeMap(s1,s2,:,edgeFeat) = f;
        end
    end
end
With the updated Xedge and edgeMap, we proceed as before:
nParams = max([nodeMap(:);edgeMap(:)]);
w = zeros(nParams,1);
w = minFunc(@UGM_CRF_NLL,w,[],Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain)

 Iteration   FunEvals     Step Length    Function Val        Opt Cond

         1          2     2.99873e-05     1.90436e+04     2.55653e+03
         2          3     1.00000e+00     1.76400e+04     2.34032e+03
         3          4     1.00000e+00     1.72738e+04     5.65236e+02
         4          5     1.00000e+00     1.72512e+04     1.96187e+02
         5          6     1.00000e+00     1.72162e+04     4.73894e+02
         6          7     1.00000e+00     1.71816e+04     4.29853e+02
         7          8     1.00000e+00     1.71676e+04     9.36649e+01
         8          9     1.00000e+00     1.71657e+04     8.10747e+01
         9         10     1.00000e+00     1.71645e+04     1.35856e+02
        10         11     1.00000e+00     1.71614e+04     2.06251e+02
        11         12     1.00000e+00     1.71575e+04     2.03290e+02
        12         13     1.00000e+00     1.71545e+04     1.92202e+02
        13         14     1.00000e+00     1.71492e+04     4.64233e+01
        14         15     1.00000e+00     1.71475e+04     7.23352e+01
        15         16     1.00000e+00     1.71461e+04     1.38713e+02
        16         17     1.00000e+00     1.71440e+04     1.75369e+02
        17         18     1.00000e+00     1.71414e+04     1.30053e+02
        18         19     1.00000e+00     1.71397e+04     2.82008e+01
        19         20     1.00000e+00     1.71390e+04     5.28906e+01
        20         21     1.00000e+00     1.71386e+04     9.04927e+01
        21         22     1.00000e+00     1.71378e+04     1.27399e+02
        22         23     1.00000e+00     1.71366e+04     1.29590e+02
        23         24     1.00000e+00     1.71350e+04     8.02345e+01
        24         25     1.00000e+00     1.71338e+04     2.14189e+01
        25         26     1.00000e+00     1.71328e+04     6.68356e+01
        26         27     1.00000e+00     1.71323e+04     8.78730e+01
        27         28     1.00000e+00     1.71310e+04     1.01143e+02
        28         29     1.00000e+00     1.71288e+04     9.07294e+01
        29         30     1.00000e+00     1.71255e+04     2.70413e+01
        30         31     1.00000e+00     1.71234e+04     2.33995e+01
        31         32     1.00000e+00     1.71226e+04     3.42237e+01
        32         33     1.00000e+00     1.71223e+04     2.23663e+01
        33         34     1.00000e+00     1.71220e+04     5.62033e+00
        34         35     1.00000e+00     1.71220e+04     1.69890e+01
        35         36     1.00000e+00     1.71218e+04     3.02428e+01
        36         37     1.00000e+00     1.71217e+04     2.99272e+01
        37         38     1.00000e+00     1.71211e+04     3.03047e+01
        38         39     1.00000e+00     1.71202e+04     4.71917e+01
        39         40     1.00000e+00     1.71186e+04     1.46650e+01
        40         41     1.00000e+00     1.71177e+04     1.22074e+01
        41         42     1.00000e+00     1.71172e+04     8.57846e+00
        42         43     1.00000e+00     1.71171e+04     1.83532e+01
        43         44     1.00000e+00     1.71169e+04     6.84019e+00
        44         45     1.00000e+00     1.71168e+04     2.20643e+00
        45         46     1.00000e+00     1.71167e+04     3.27873e+00
        46         47     1.00000e+00     1.71167e+04     1.20115e+00
        47         48     1.00000e+00     1.71167e+04     2.22905e+00
        48         49     1.00000e+00     1.71167e+04     5.42830e+00
        49         50     1.00000e+00     1.71167e+04     9.00362e+00
        50         51     1.00000e+00     1.71166e+04     9.82641e+00
        51         52     1.00000e+00     1.71165e+04     1.27488e+01
        52         54     4.72543e-01     1.71165e+04     8.33743e+00
        53         55     1.00000e+00     1.71164e+04     4.89122e+00
        54         56     1.00000e+00     1.71164e+04     6.44481e-01
        55         57     1.00000e+00     1.71164e+04     5.46661e-01
        56         58     1.00000e+00     1.71164e+04     1.36614e+00
        57         59     1.00000e+00     1.71164e+04     2.46083e+00
        58         60     1.00000e+00     1.71164e+04     2.87980e+00
        59         61     1.00000e+00     1.71164e+04     2.40811e+00
        60         62     1.00000e+00     1.71164e+04     3.14707e+00
        61         63     1.00000e+00     1.71163e+04     2.01836e+00
        62         64     1.00000e+00     1.71163e+04     2.46767e+00
        63         65     1.00000e+00     1.71163e+04     2.55408e+00
        64         67     4.04260e-01     1.71163e+04     1.56098e+00
        65         68     1.00000e+00     1.71163e+04     3.12589e-01
        66         69     1.00000e+00     1.71163e+04     2.40309e-01
        67         70     1.00000e+00     1.71163e+04     3.51821e-01
        68         71     1.00000e+00     1.71163e+04     6.98811e-01
        69         72     1.00000e+00     1.71163e+04     1.25232e+00
        70         73     1.00000e+00     1.71163e+04     1.74637e+00
        71         74     1.00000e+00     1.71163e+04     1.82086e+00
        72         76     4.32633e-01     1.71163e+04     1.38697e+00
        73         77     1.00000e+00     1.71163e+04     5.39311e-01
        74         78     1.00000e+00     1.71163e+04     1.37194e-01
        75         79     1.00000e+00     1.71163e+04     4.85571e-02
        76         80     1.00000e+00     1.71163e+04     3.82990e-02
        77         81     1.00000e+00     1.71163e+04     2.54560e-02
        78         82     1.00000e+00     1.71163e+04     4.69787e-02
        79         83     1.00000e+00     1.71163e+04     8.66219e-02
        80         84     1.00000e+00     1.71163e+04     1.35306e-01
        81         85     1.00000e+00     1.71163e+04     8.59283e-02
        82         86     1.00000e+00     1.71163e+04     1.67533e-01
        83         87     1.00000e+00     1.71163e+04     1.30266e-02
        84         88     1.00000e+00     1.71163e+04     1.01122e-02
        85         90     1.00000e+01     1.71163e+04     2.79276e-02
        86         91     1.00000e+00     1.71163e+04     2.97327e-02
        87         92     1.00000e+00     1.71163e+04     4.27361e-02
        88         93     1.00000e+00     1.71163e+04     2.10481e-02
        89         94     1.00000e+00     1.71163e+04     4.44461e-03
        90         95     1.00000e+00     1.71163e+04     4.32501e-03
        91         96     1.00000e+00     1.71163e+04     5.82381e-03
        92         97     1.00000e+00     1.71163e+04     8.82892e-03
        93         98     1.00000e+00     1.71163e+04     1.32052e-02
        94         99     1.00000e+00     1.71163e+04     1.62503e-02
        95        100     1.00000e+00     1.71163e+04     1.40167e-02
        96        101     1.00000e+00     1.71163e+04     3.80935e-03
        97        102     1.00000e+00     1.71163e+04     2.07595e-03
        98        103     1.00000e+00     1.71163e+04     5.51650e-04
        99        104     1.00000e+00     1.71163e+04     8.77187e-04
       100        105     1.00000e+00     1.71163e+04     1.56245e-03
       101        106     1.00000e+00     1.71163e+04     2.42554e-03
       102        107     1.00000e+00     1.71163e+04     2.95997e-03
       103        108     1.00000e+00     1.71163e+04     2.76296e-03
       104        110     2.44377e-01     1.71163e+04     4.14080e-03
       105        111     1.00000e+00     1.71163e+04     2.18171e-03
       106        112     1.00000e+00     1.71163e+04     4.59363e-04
       107        113     1.00000e+00     1.71163e+04     6.28469e-04
       108        114     1.00000e+00     1.71163e+04     1.78548e-03
       109        115     1.00000e+00     1.71163e+04     3.70297e-03
       110        116     1.00000e+00     1.71163e+04     5.67089e-03
       111        117     1.00000e+00     1.71163e+04     6.25322e-03
       112        119     2.20033e-02     1.71163e+04     6.23359e-03
       113        120     1.00000e+00     1.71163e+04     3.85092e-03
       114        121     1.00000e+00     1.71163e+04     8.83878e-04
Directional Derivative below progTol

w =

Although not as drastic as the improvement obtained by adding the node features, adding the edge features leads to a further improvement in the training set negative log-likelihood.

The function UGM_makeEdgeFeatures only represents one possible way to construct the edge features (by simply using the corresponding node features). Provided that Xedge has the right size and is real-valued, you can use any edge features you like. For example, when the node features are continuous, edge features that depend on the differences between the node features might make sense. You could also consider edge features that do not depend on the node features at all.
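As an illustration, here is a hypothetical sketch of difference-based edge features (XedgeDiff is a made-up name; with the binary features of this demo this is not very interesting, but the same pattern applies to continuous features):

% Hypothetical: edge features given by absolute differences of node features
nEdges = edgeStruct.nEdges;
XedgeDiff = zeros(nInstances,nNodeFeatures,nEdges);
for e = 1:nEdges
    n1 = edgeStruct.edgeEnds(e,1);
    n2 = edgeStruct.edgeEnds(e,2);
    XedgeDiff(:,:,e) = abs(Xnode(:,:,n1) - Xnode(:,:,n2));
end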

Using the Estimated Parameters

As an example of using the learned model, we will generate samples when the binary 'December' feature is on. The first example of a December in the training data is instance 11, so we can generate samples of December using:
i = 11;
[nodePot,edgePot] = UGM_CRF_makePotentials(w,Xnode,Xedge,nodeMap,edgeMap,edgeStruct,i);
samples = UGM_Sample_Chain(nodePot,edgePot,edgeStruct);
Below is an example:

As expected, when we let the model know that it should generate samples of December, it predicts a lot more rain (red).

Next, we will repeat our scenario from the previous demo, where we want to know what the rest of the month will look like given that it rained on the first two days. However, we will now ask the question conditioned on the information that the month is December:

clamped = zeros(nNodes,1);
clamped(1:2) = 2;
condSamples = UGM_Sample_Conditional(nodePot,edgePot,edgeStruct,clamped,@UGM_Sample_Chain);
Here is the result:

With a trained CRF, we don't necessarily have to use the Xnode and Xedge from the training data. In principle, we could make up whatever node and edge features we want, and see what the model predicts for that scenario. For example, instead of looking for a training example in July, we could just make the relevant node and edge features. Below, we make the node/edge features for July, and then generate 100 samples of July (note that UGM_CRF_makePotentials makes the potential for the first training example by default):
XtestNode = [1 0 0 0 0 0 0 1 0 0 0 0 0]; % Turn on bias and indicator variable for July
XtestNode = repmat(XtestNode,[1 1 nNodes]);
XtestEdge = UGM_makeEdgeFeatures(XtestNode,edgeStruct.edgeEnds,sharedFeatures);
[nodePot,edgePot] = UGM_CRF_makePotentials(w,XtestNode,XtestEdge,nodeMap,edgeMap,edgeStruct);
samples = UGM_Sample_Chain(nodePot,edgePot,edgeStruct);
Here is the result:

These samples seem to confirm our intuition that it rains more in December than July.

L2-Regularized Parameter Estimation

Above, we have seen that the addition of features can lead to substantially better models. However, this is still a very simple model. In practical applications of CRFs, it is not uncommon to use millions of features. In these cases, it is possible to get low values of the negative log-likelihood without having a very good model (since the negative log-likelihood decreases as we add more features to a model). This is referred to as over-fitting. One way to combat over-fitting is with L2-regularization. L2-regularization can be viewed as a penalty for having large parameter values. It is straightforward to use L2-regularization for CRFs (or MRFs) with UGM by using the penalizedL2 function. This function takes another function and values of the L2-regularization parameters, and returns the objective value (and gradient) of the function augmented with L2-regularization. Below, we give an example:
% Set up regularization parameters
lambda = 10*ones(size(w));
lambda(1) = 0; % Don't penalize node bias variable
lambda(14:17) = 0; % Don't penalize edge bias variables
regFunObj = @(w)penalizedL2(w,@UGM_CRF_NLL,lambda,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain);

% Optimize
w = zeros(nParams,1);
w = minFunc(regFunObj,w);
NLL = UGM_CRF_NLL(w,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain)

 Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     2.99873e-05     1.90437e+04     2.55653e+03
         2          3     1.00000e+00     1.76411e+04     2.33918e+03
         3          4     1.00000e+00     1.72763e+04     5.65313e+02
         4          5     1.00000e+00     1.72539e+04     1.94029e+02
         5          6     1.00000e+00     1.72203e+04     4.66367e+02
         6          7     1.00000e+00     1.71863e+04     4.26978e+02
         7          8     1.00000e+00     1.71725e+04     9.30902e+01
         8          9     1.00000e+00     1.71707e+04     8.04480e+01
         9         10     1.00000e+00     1.71694e+04     1.34711e+02
        10         11     1.00000e+00     1.71665e+04     2.03993e+02
        11         12     1.00000e+00     1.71628e+04     2.00020e+02
        12         13     1.00000e+00     1.71597e+04     1.81252e+02
        13         14     1.00000e+00     1.71549e+04     4.18050e+01
        14         15     1.00000e+00     1.71533e+04     7.63220e+01
        15         16     1.00000e+00     1.71520e+04     1.35407e+02
        16         17     1.00000e+00     1.71501e+04     1.57695e+02
        17         18     1.00000e+00     1.71478e+04     9.84706e+01
        18         19     1.00000e+00     1.71467e+04     1.40848e+01
        19         20     1.00000e+00     1.71463e+04     4.76307e+01
        20         21     1.00000e+00     1.71460e+04     7.27852e+01
        21         22     1.00000e+00     1.71455e+04     8.87116e+01
        22         23     1.00000e+00     1.71449e+04     6.67483e+01
        23         24     1.00000e+00     1.71446e+04     1.75071e+01
        24         25     1.00000e+00     1.71445e+04     1.77834e+01
        25         26     1.00000e+00     1.71444e+04     2.74928e+01
        26         27     1.00000e+00     1.71443e+04     3.80644e+01
        27         28     1.00000e+00     1.71442e+04     4.73859e+01
        28         29     1.00000e+00     1.71438e+04     4.55021e+01
        29         30     1.00000e+00     1.71433e+04     4.03336e+01
        30         32     4.93663e-01     1.71429e+04     1.96383e+01
        31         33     1.00000e+00     1.71426e+04     4.42916e+00
        32         34     1.00000e+00     1.71426e+04     7.80547e+00
        33         35     1.00000e+00     1.71426e+04     6.56353e+00
        34         36     1.00000e+00     1.71425e+04     2.32351e+00
        35         37     1.00000e+00     1.71425e+04     2.84096e+00
        36         38     1.00000e+00     1.71425e+04     3.88930e+00
        37         39     1.00000e+00     1.71425e+04     3.93474e+00
        38         40     1.00000e+00     1.71425e+04     6.13024e+00
        39         42     3.79889e-01     1.71424e+04     5.21645e+00
        40         43     1.00000e+00     1.71424e+04     2.93747e+00
        41         44     1.00000e+00     1.71424e+04     5.06808e-01
        42         45     1.00000e+00     1.71424e+04     4.22126e-01
        43         46     1.00000e+00     1.71424e+04     9.28331e-01
        44         47     1.00000e+00     1.71424e+04     7.80794e-01
        45         48     1.00000e+00     1.71424e+04     3.99689e-01
        46         49     1.00000e+00     1.71424e+04     6.24792e-01
        47         50     1.00000e+00     1.71424e+04     1.40733e+00
        48         51     1.00000e+00     1.71424e+04     1.81548e+00
        49         52     1.00000e+00     1.71424e+04     1.19220e+00
        50         53     1.00000e+00     1.71424e+04     2.22150e-01
        51         54     1.00000e+00     1.71424e+04     7.59190e-02
        52         55     1.00000e+00     1.71424e+04     1.00610e-01
        53         56     1.00000e+00     1.71424e+04     1.16663e-01
        54         57     1.00000e+00     1.71424e+04     6.81109e-02
        55         59     5.02027e-01     1.71424e+04     6.86376e-02
        56         60     1.00000e+00     1.71424e+04     6.01253e-02
        57         61     1.00000e+00     1.71424e+04     8.71068e-02
        58         62     1.00000e+00     1.71424e+04     1.11292e-01
        59         63     1.00000e+00     1.71424e+04     8.80091e-02
        60         65     2.88591e-01     1.71424e+04     5.69282e-02
        61         66     1.00000e+00     1.71424e+04     1.04917e-02
        62         67     1.00000e+00     1.71424e+04     1.08230e-02
        63         68     1.00000e+00     1.71424e+04     8.98160e-03
        64         69     1.00000e+00     1.71424e+04     8.42876e-03
        65         70     1.00000e+00     1.71424e+04     3.26717e-03
        66         71     1.00000e+00     1.71424e+04     4.21883e-03
        67         72     1.00000e+00     1.71424e+04     6.02435e-03
        68         73     1.00000e+00     1.71424e+04     6.57011e-03
        69         74     1.00000e+00     1.71424e+04     1.47230e-02
        70         75     1.00000e+00     1.71424e+04     2.70788e-03
        71         76     1.00000e+00     1.71424e+04     1.49507e-03
        72         77     1.00000e+00     1.71424e+04     1.12688e-03
        73         78     1.00000e+00     1.71424e+04     6.33433e-04
        74         79     1.00000e+00     1.71424e+04     8.52142e-04
        75         80     1.00000e+00     1.71424e+04     1.62696e-03
        76         81     1.00000e+00     1.71424e+04     2.34920e-03
        77         82     1.00000e+00     1.71424e+04     2.19590e-03
        78         83     1.00000e+00     1.71424e+04     1.36188e-03
Function Value changing by less than progTol


In the above, the optimization took fewer iterations than in the unregularized (maximum likelihood) case. This is because the regularization makes the problem better-conditioned.

In the above, we follow the usual convention where the bias variables are not regularized. It is often argued that it is important not to regularize the bias variables (to avoid making the estimates depend on the scale of the data), although in practice it often does not make a big difference if they are regularized.

As expected, the regularized estimates have a higher negative log-likelihood on the training data. However, if we carefully select the regularization parameter(s), the regularized estimates will typically have a lower negative log-likelihood than the maximum likelihood estimates on new data (i.e., data that was not used for parameter estimation). Since we typically want a model that does reasonable things given new data, we will typically use a regularized estimate.

Importance of Regularization

In these demos, we only briefly discuss regularization. However, along with choosing the right features/labels/graph, choosing the right amount of regularization is typically crucial to the performance of the model. Indeed, for both MRFs and CRFs the optimal parameters may not even be bounded without regularization. In general, it is a good idea to use a different regularization parameter for the edge weights than for the node weights (we can accomplish this by changing the values of the lambda variable in the demo above). You should then perform a 2D search to find a setting of these two parameters that gives good performance on a held-out validation/test set (a sketch of such a search is given below). Without this search, CRFs often do not perform very well. In contrast, with this search, in the worst case CRFs will still perform as well as L2-regularized logistic regression (which has similar performance to support vector machines).
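A hypothetical sketch of this 2D search (the train/validation split and the grid of lambda values are arbitrary choices, not part of the demo):

trainNdx = 1:floor(nInstances/2);
validNdx = floor(nInstances/2)+1:nInstances;
bestScore = inf;
for lambdaNode = 2.^(0:2:8)
    for lambdaEdge = 2.^(0:2:8)
        lambda = lambdaNode*ones(nParams,1);
        lambda(nNodeFeatures+1:end) = lambdaEdge;  % edge parameters come after node parameters
        lambda(1) = 0;                             % don't penalize node bias
        funObj = @(w)penalizedL2(w,@UGM_CRF_NLL,lambda,Xnode(trainNdx,:,:),...
            Xedge(trainNdx,:,:),y(trainNdx,:),nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain);
        wTrain = minFunc(funObj,zeros(nParams,1));
        % score the regularized estimate on the held-out data
        score = UGM_CRF_NLL(wTrain,Xnode(validNdx,:,:),Xedge(validNdx,:,:),...
            y(validNdx,:),nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain);
        if score < bestScore
            bestScore = score;
            bestLambda = [lambdaNode lambdaEdge];
        end
    end
end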

An alternative to L2-regularization is L1-regularization. In addition to penalizing large parameter values, L1-regularization encourages the weights to be exactly zero. This means that it can be used as a form of feature pruning, since features whose weights are exactly zero can be safely removed from the model without changing anything. Using L1-regularization leads to a non-differentiable optimization problem, so minFunc cannot be used. However, this non-differentiable problem is highly structured and can be solved with the related L1General package.


The main advantage of CRFs is that we don't need to build a model of the features. This is important when very complicated sets of features are needed to model the random variables, as in natural language processing and computer vision applications. CRFs are also useful when some variables are explicitly controlled in an experiment. There is no natural way to model such interventions among the random variables of an undirected graphical model, but it is natural to include variables set by intervention as features in a CRF. We looked at this in:

Modeling Discrete Interventional Data using Directed Cyclic Graphical Models. Mark Schmidt and Kevin Murphy. UAI, 2009.

In this demo, the same features are used for every node and edge. You will often want to use different features for each node and edge; we will see an example of this in the next demo. However, currently UGM assumes that all nodes have the same number of features, and that all edges have the same number of features (but the number of node features doesn't have to be the same as the number of edge features). As with allowing data sets where different training examples have different graph structures, you would simply need to add up the objectives and gradients across training examples with different numbers of features in order to handle this case.
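For instance, a hypothetical sketch of such a combined objective: suppose funObj1 and funObj2 each wrap UGM_CRF_NLL with the data for one group of examples; then a helper saved as combinedNLL.m could sum them, and the result can be passed to minFunc as usual:

function [f,g] = combinedNLL(w,funObj1,funObj2)
% Sum the objectives and gradients of two groups of training examples
[f1,g1] = funObj1(w);
[f2,g2] = funObj2(w);
f = f1 + f2;
g = g1 + g2;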

Currently, UGM only supports log-linear potentials for CRFs, but this is the most widely-used type of potential, and this parameterization is appealing since it is jointly convex in the parameters. Nevertheless, it is still possible to incorporate non-linear effects by replacing/augmenting the features with non-linear transformations of the features. Another limitation of UGM is that it does not currently support missing values or hidden variables.

CRFs were originally proposed as a model for natural language processing applications in:

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum, and Fernando Pereira. ICML, 2001.

UGM implements a much more general notion of CRFs than described in this paper, but the cost of this generality is that several efficiencies available for chain-structured CRFs with high-arity categorical features are not implemented in UGM (further, UGM assumes that all 'sentences' have the same length). The crfChain software that I wrote with Kevin Swersky more closely resembles the type of CRF used in natural language processing: it uses an efficient storage scheme for high-arity categorical features represented as indicator variables, and it naturally handles sentences of different lengths. The specific differences between crfChain and UGM are as follows:

I have worked with Kevin Murphy on the Conditional Random Field Toolbox. However, UGM makes this toolbox obsolete; UGM is faster, simpler, more general, and implements more of the available methods.

