If we are only interested in conditional queries, then a conditional random field (CRF) might be a better option. In CRFs, we have two types of variables: (i) the 'features' (also known as 'covariates') X are treated as fixed non-random variables, and (ii) the 'labels' y are treated as random variables in a UGM, where the parameters of the UGM depend on the features. In our case, the 'labels' will be the 28 variables representing days of the month, and the features will consist of a 'bias' feature and a set of indicator variables that will represent the month. That is, each possible value that the features can take leads to a different UGM model of the labels.
load rain.mat y = int32(X+1); [nInstances,nNodes] = size(y); nStates = max(y); adj = zeros(nNodes); for i = 1:nNodes-1 adj(i,i+1) = 1; end adj = adj+adj'; edgeStruct = UGM_makeEdgeStruct(adj,nStates); maxState = max(nStates);Note that we have used y to denote the samples of the random variables in this demo, since X is typically used to denote the features in a CRF model.
For our first CRF model, we will only use a single 'bias' feature for the node and edge potentials. A 'bias' feature is a feature that always set to one, and it is included to reflect any effects on the states that are independent of the features. MRFs correspond to the special case where we only have a bias variable. In general, the features in CRFs can be arbitrary real-valued numbers and thus we do not necessarily need to include a bias variable. However, unless the states are perfectly balanced in the training data it usually makes sense to include a bias variable. After training this simple model (and verifying that it gives the same results as the previous demo), we will move on to using node and edge features that incorporate the month information to give a better model.
If our CRF has nInstances (conditional) samples of nNodes random variables, and we have nNodeFeatures features for each node, then the Xnode variable will be a nInstances by nNodeFeatures by nNodes array, each element containing the relevant feature for the appropriate node in the particular instance. Therefore, for the simple case where we only have a single unit-valued feature, we would make Xnode as follows:
Xnode = ones(nInstances,1,nNodes);
Similarly, the Xedge variable will be a nInstances by nEdgeFeatures by nEdges array. To make the simple unit-valued edge features we use:
nEdges = edgeStruct.nEdges;
Xedge = ones(nInstances,1,nEdges);
In the case where have only one feature, we don't use the extra dimension of the nodeMap and edgeMap variables so they can be constructed as before:
nodeMap = zeros(nNodes,maxState,'int32');
nodeMap(:,1) = 1;
edgeMap = zeros(maxState,maxState,nEdges,'int32');
edgeMap(1,1,:) = 2;
edgeMap(2,1,:) = 3;
edgeMap(1,2,:) = 4;
To initialize the parameters, we proceed as before:
nParams = max([nodeMap(:);edgeMap(:)]);
w = zeros(nParams,1);
We will again consider minimizing the negative log-likelihood in order to maximize the likelihood. The negative log-likelihood (and its) gradient are computed by UGM_CRF_NLL. With log-linear CRFs, the negative log-likelihood is still differentiable and jointly convex in the node and edge parameters, so we can again computes its optimal value using minFunc. The negative log-likelihood for CRFs takes similar arguments to the MRF case, but we pass in the features instead of sufficient statistics:
w = minFunc(@UGM_CRF_NLL,randn(size(w)),,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain)
Iteration FunEvals Step Length Function Val Opt Cond
1 2 3.19683e-05 2.83032e+04 9.07391e+03
2 3 1.00000e+00 2.42379e+04 7.08784e+03
3 4 1.00000e+00 2.13785e+04 3.39832e+03
4 5 1.00000e+00 1.95148e+04 2.92734e+03
5 6 1.00000e+00 1.79881e+04 5.36254e+02
6 7 1.00000e+00 1.79304e+04 6.91542e+02
7 8 1.00000e+00 1.79114e+04 1.85252e+02
8 10 1.00000e+01 1.78885e+04 7.54564e+02
9 11 1.00000e+00 1.78582e+04 6.71792e+02
10 12 1.00000e+00 1.78035e+04 5.24928e+01
11 13 1.00000e+00 1.78032e+04 1.59129e+01
12 14 1.00000e+00 1.78031e+04 2.55631e+01
13 15 1.00000e+00 1.78028e+04 5.82554e+01
14 16 1.00000e+00 1.78021e+04 9.26333e+01
15 17 1.00000e+00 1.78007e+04 1.23620e+02
16 18 1.00000e+00 1.77990e+04 1.09300e+02
17 19 1.00000e+00 1.77979e+04 3.57499e+01
18 20 1.00000e+00 1.77977e+04 4.54387e+00
19 21 1.00000e+00 1.77977e+04 1.01720e-01
20 22 1.00000e+00 1.77977e+04 3.55628e-02
21 23 1.00000e+00 1.77977e+04 1.99919e-03
Directional Derivative below progTol
As we expected, this degenerate CRF gives us exactly the model we had in the last MRF demo. However, note that training was a bit slower. This is because each instance in a CRF is potentially a different UGM, so the loss function needs to call inferFunc for every sample.
In CRFs, the potentials depend on the features. So to make a set of potentials, we need to choose a set of features. The function UGM_CRF_makePotentials takes a training instance number and returns the corresponding potentials. For example, to return the potentials associated with the first training example we would use
instance = 1;
[nodePot,edgePot] = UGM_CRF_makePotentials(w,Xnode,Xedge,nodeMap,edgeMap,edgeStruct,instance);
As expected, these are the same potentials we had before.
nFeatures = 12; Xnode = zeros(nInstances,nFeatures,nNodes); for m = 1:nFeatures Xnode(months==m,m,:) = 1; endFor example, Xnode(3,4,1) is set to 1 while Xnode(3,[1:3 5:12],1) is 0 because the month variable for the third sample is set to 4 (representing April). We will also want a bias in the model (since we expect there to be some baseline probability of rain that is independent of the month), so we add a feature that is always set to 1:
Xnode = [ones(nInstances,1,nNodes) Xnode]; nNodeFeatures = size(Xnode,2);We now make the nNodes by maxState by nNodeFeatures nodeMap, which lets UGM know which features multiplied together and then added and exponentiated for each variable. In our case, all variables will use all 13 features but share the same parameters, so we use:
nodeMap = zeros(nNodes,maxState,nNodeFeatures,'int32'); for f = 1:nNodeFeatures nodeMap(:,1,f) = f; endWe also need to update the edgeMap, so that the nodeMap and edgeMap numbers don't overlap (it rarely makes sense to share node and edge parameters):
edgeMap = zeros(maxState,maxState,nEdges,'int32'); edgeMap(1,1,:) = nNodeFeatures+1; edgeMap(2,1,:) = nNodeFeatures+2; edgeMap(1,2,:) = nNodeFeatures+3;With the updated nodeMap and edgeMap, training proceeds exactly as before:
nParams = max([nodeMap(:);edgeMap(:)]); w = zeros(nParams,1); w = minFunc(@UGM_CRF_NLL,w,,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain) Iteration FunEvals Step Length Function Val Opt Cond 1 2 5.50957e-05 1.87895e+04 1.71063e+03 2 3 1.00000e+00 1.83557e+04 1.71053e+03 3 4 1.00000e+00 1.76199e+04 3.06222e+03 4 5 1.00000e+00 1.73411e+04 7.91786e+02 5 6 1.00000e+00 1.73097e+04 2.19752e+02 6 7 1.00000e+00 1.72916e+04 2.66328e+02 7 8 1.00000e+00 1.72552e+04 5.10685e+02 8 9 1.00000e+00 1.72132e+04 4.40088e+02 9 10 1.00000e+00 1.71942e+04 8.79763e+01 10 11 1.00000e+00 1.71925e+04 5.99177e+01 11 12 1.00000e+00 1.71918e+04 8.47407e+01 12 13 1.00000e+00 1.71893e+04 1.21715e+02 13 14 1.00000e+00 1.71855e+04 1.22600e+02 14 15 1.00000e+00 1.71756e+04 2.08079e+02 15 16 1.00000e+00 1.71695e+04 1.12069e+02 16 17 1.00000e+00 1.71670e+04 1.95784e+01 17 18 1.00000e+00 1.71668e+04 1.46669e+01 18 19 1.00000e+00 1.71665e+04 1.99287e+01 19 20 1.00000e+00 1.71661e+04 1.94539e+01 20 21 1.00000e+00 1.71659e+04 3.94953e+00 21 22 1.00000e+00 1.71658e+04 1.08766e+00 22 23 1.00000e+00 1.71658e+04 1.80154e-01 23 24 1.00000e+00 1.71658e+04 8.30007e-02 24 25 1.00000e+00 1.71658e+04 2.70300e-02 25 26 1.00000e+00 1.71658e+04 2.17654e-03 26 27 1.00000e+00 1.71658e+04 2.90511e-04 Directional Derivative below progTol w = 0.5662 -0.3639 -0.1622 -0.0970 0.0183 0.1848 0.2275 0.6032 0.6072 0.3090 -0.0487 -0.3435 -0.3684 -0.4218 -1.0356 -0.8798As expected, conditioning on the month information leads to a better fit. In fact, the improvement between the likelihood of the CRF model and the MRF model is much bigger than the improvement that we obtained by using more expressive edge potentials in the previous demo.
sharedFeatures = ones(13,1); Xedge = UGM_makeEdgeFeatures(Xnode,edgeStruct.edgeEnds,sharedFeatures); nEdgeFeatures = size(Xedge,2);We now make the maxState by maxState by nEdges by nEdgeFeatures edgeMap:
edgeMap = zeros(maxState,maxState,nEdges,nEdgeFeatures,'int32'); for edgeFeat = 1:nEdgeFeatures for s1 = 1:2 for s2 = 1:2 f = f+1; edgeMap(s1,s2,:,edgeFeat) = f; end end endWith the updated Xedge and edgeMap, we proceed as before:
nParams = max([nodeMap(:);edgeMap(:)]); w = zeros(nParams,1); w = minFunc(@UGM_CRF_NLL,w,,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain) Iteration FunEvals Step Length Function Val Opt Cond 1 2 2.99873e-05 1.90436e+04 2.55653e+03 2 3 1.00000e+00 1.76400e+04 2.34032e+03 3 4 1.00000e+00 1.72738e+04 5.65236e+02 4 5 1.00000e+00 1.72512e+04 1.96187e+02 5 6 1.00000e+00 1.72162e+04 4.73894e+02 6 7 1.00000e+00 1.71816e+04 4.29853e+02 7 8 1.00000e+00 1.71676e+04 9.36649e+01 8 9 1.00000e+00 1.71657e+04 8.10747e+01 9 10 1.00000e+00 1.71645e+04 1.35856e+02 10 11 1.00000e+00 1.71614e+04 2.06251e+02 11 12 1.00000e+00 1.71575e+04 2.03290e+02 12 13 1.00000e+00 1.71545e+04 1.92202e+02 13 14 1.00000e+00 1.71492e+04 4.64233e+01 14 15 1.00000e+00 1.71475e+04 7.23352e+01 15 16 1.00000e+00 1.71461e+04 1.38713e+02 16 17 1.00000e+00 1.71440e+04 1.75369e+02 17 18 1.00000e+00 1.71414e+04 1.30053e+02 18 19 1.00000e+00 1.71397e+04 2.82008e+01 19 20 1.00000e+00 1.71390e+04 5.28906e+01 20 21 1.00000e+00 1.71386e+04 9.04927e+01 21 22 1.00000e+00 1.71378e+04 1.27399e+02 22 23 1.00000e+00 1.71366e+04 1.29590e+02 23 24 1.00000e+00 1.71350e+04 8.02345e+01 24 25 1.00000e+00 1.71338e+04 2.14189e+01 25 26 1.00000e+00 1.71328e+04 6.68356e+01 26 27 1.00000e+00 1.71323e+04 8.78730e+01 27 28 1.00000e+00 1.71310e+04 1.01143e+02 28 29 1.00000e+00 1.71288e+04 9.07294e+01 29 30 1.00000e+00 1.71255e+04 2.70413e+01 30 31 1.00000e+00 1.71234e+04 2.33995e+01 31 32 1.00000e+00 1.71226e+04 3.42237e+01 32 33 1.00000e+00 1.71223e+04 2.23663e+01 33 34 1.00000e+00 1.71220e+04 5.62033e+00 34 35 1.00000e+00 1.71220e+04 1.69890e+01 35 36 1.00000e+00 1.71218e+04 3.02428e+01 36 37 1.00000e+00 1.71217e+04 2.99272e+01 37 38 1.00000e+00 1.71211e+04 3.03047e+01 38 39 1.00000e+00 1.71202e+04 4.71917e+01 39 40 1.00000e+00 1.71186e+04 1.46650e+01 40 41 1.00000e+00 1.71177e+04 1.22074e+01 41 42 1.00000e+00 1.71172e+04 8.57846e+00 42 43 1.00000e+00 1.71171e+04 1.83532e+01 43 44 1.00000e+00 1.71169e+04 6.84019e+00 44 45 1.00000e+00 1.71168e+04 2.20643e+00 45 46 1.00000e+00 1.71167e+04 3.27873e+00 46 47 1.00000e+00 1.71167e+04 1.20115e+00 47 48 1.00000e+00 1.71167e+04 2.22905e+00 48 49 1.00000e+00 1.71167e+04 5.42830e+00 49 50 1.00000e+00 1.71167e+04 9.00362e+00 50 51 1.00000e+00 1.71166e+04 9.82641e+00 51 52 1.00000e+00 1.71165e+04 1.27488e+01 52 54 4.72543e-01 1.71165e+04 8.33743e+00 53 55 1.00000e+00 1.71164e+04 4.89122e+00 54 56 1.00000e+00 1.71164e+04 6.44481e-01 55 57 1.00000e+00 1.71164e+04 5.46661e-01 56 58 1.00000e+00 1.71164e+04 1.36614e+00 57 59 1.00000e+00 1.71164e+04 2.46083e+00 58 60 1.00000e+00 1.71164e+04 2.87980e+00 59 61 1.00000e+00 1.71164e+04 2.40811e+00 60 62 1.00000e+00 1.71164e+04 3.14707e+00 61 63 1.00000e+00 1.71163e+04 2.01836e+00 62 64 1.00000e+00 1.71163e+04 2.46767e+00 63 65 1.00000e+00 1.71163e+04 2.55408e+00 64 67 4.04260e-01 1.71163e+04 1.56098e+00 65 68 1.00000e+00 1.71163e+04 3.12589e-01 66 69 1.00000e+00 1.71163e+04 2.40309e-01 67 70 1.00000e+00 1.71163e+04 3.51821e-01 68 71 1.00000e+00 1.71163e+04 6.98811e-01 69 72 1.00000e+00 1.71163e+04 1.25232e+00 70 73 1.00000e+00 1.71163e+04 1.74637e+00 71 74 1.00000e+00 1.71163e+04 1.82086e+00 72 76 4.32633e-01 1.71163e+04 1.38697e+00 73 77 1.00000e+00 1.71163e+04 5.39311e-01 74 78 1.00000e+00 1.71163e+04 1.37194e-01 75 79 1.00000e+00 1.71163e+04 4.85571e-02 76 80 1.00000e+00 1.71163e+04 3.82990e-02 77 81 1.00000e+00 1.71163e+04 2.54560e-02 78 82 1.00000e+00 1.71163e+04 4.69787e-02 79 83 1.00000e+00 1.71163e+04 8.66219e-02 80 84 1.00000e+00 1.71163e+04 1.35306e-01 81 85 1.00000e+00 1.71163e+04 8.59283e-02 82 86 1.00000e+00 1.71163e+04 1.67533e-01 83 87 1.00000e+00 1.71163e+04 1.30266e-02 84 88 1.00000e+00 1.71163e+04 1.01122e-02 85 90 1.00000e+01 1.71163e+04 2.79276e-02 86 91 1.00000e+00 1.71163e+04 2.97327e-02 87 92 1.00000e+00 1.71163e+04 4.27361e-02 88 93 1.00000e+00 1.71163e+04 2.10481e-02 89 94 1.00000e+00 1.71163e+04 4.44461e-03 90 95 1.00000e+00 1.71163e+04 4.32501e-03 91 96 1.00000e+00 1.71163e+04 5.82381e-03 92 97 1.00000e+00 1.71163e+04 8.82892e-03 93 98 1.00000e+00 1.71163e+04 1.32052e-02 94 99 1.00000e+00 1.71163e+04 1.62503e-02 95 100 1.00000e+00 1.71163e+04 1.40167e-02 96 101 1.00000e+00 1.71163e+04 3.80935e-03 97 102 1.00000e+00 1.71163e+04 2.07595e-03 98 103 1.00000e+00 1.71163e+04 5.51650e-04 99 104 1.00000e+00 1.71163e+04 8.77187e-04 100 105 1.00000e+00 1.71163e+04 1.56245e-03 101 106 1.00000e+00 1.71163e+04 2.42554e-03 102 107 1.00000e+00 1.71163e+04 2.95997e-03 103 108 1.00000e+00 1.71163e+04 2.76296e-03 104 110 2.44377e-01 1.71163e+04 4.14080e-03 105 111 1.00000e+00 1.71163e+04 2.18171e-03 106 112 1.00000e+00 1.71163e+04 4.59363e-04 107 113 1.00000e+00 1.71163e+04 6.28469e-04 108 114 1.00000e+00 1.71163e+04 1.78548e-03 109 115 1.00000e+00 1.71163e+04 3.70297e-03 110 116 1.00000e+00 1.71163e+04 5.67089e-03 111 117 1.00000e+00 1.71163e+04 6.25322e-03 112 119 2.20033e-02 1.71163e+04 6.23359e-03 113 120 1.00000e+00 1.71163e+04 3.85092e-03 114 121 1.00000e+00 1.71163e+04 8.83878e-04 Directional Derivative below progTol w = 0.6249 -0.4527 -0.6101 -0.3241 0.4608 -0.3367 0.5082 1.0738 1.5220 0.1323 0.1217 -0.2454 -1.2248 0.1201 -0.2996 -0.4049 0.5844 0.0663 -0.1287 0.0724 -0.0100 0.2604 0.1159 -0.1940 -0.1823 0.1009 0.0248 -0.0077 -0.1180 -0.2939 -0.0535 0.2040 0.1434 0.2591 0.0059 0.0167 -0.2816 -0.1454 -0.1638 0.1644 0.1448 -0.1869 -0.7289 0.4606 0.4552 -0.4338 0.0364 -0.2204 0.6178 0.1368 -0.0791 -0.0796 0.0220 -0.0843 0.3841 -0.4049 0.1051 -0.0467 0.3905 -0.4187 0.0749 0.4876 -0.1031 0.0023 -0.3868Although not as drastic as the improvement obtained by adding the node features, adding the edge features leads to a further improvement in the training set negative log-likelihood.
The function UGM_makeEdgeFeatures only represents one possible way to construct the edge features (by simply using the corresponding node features). Provided that Xedge has the right size and is real-valued, you can use any edge features you like. For example, when using continuous node features, using edge features that depend on the differences between node features might make sense. You could also consider edge features that do not depend on the node features.
i = 11; [nodePot,edgePot] = UGM_CRF_makePotentials(w,Xnode,Xedge,nodeMap,edgeMap,edgeStruct,i); samples = UGM_Sample_Chain(nodePot,edgePot,edgeStruct);Below is an example:
Next, we will repeat our scenario from the previous demo where we want to know what the rest of the month will look like given that it rained for the first two days. However, we will now ask the question conditioned on the information that the month is Decmber:
clamped = zeros(nNodes,1);
clamped(1:2) = 2;
condSamples = UGM_Sample_Conditional(nodePot,edgePot,edgeStruct,clamped,@UGM_Sample_Chain);
Here is the result:
With a trained CRF, we don't necessarily have to use the Xnode and Xedge from the training data. In principle, we could make up whatever node and edge features we want, and see what the model predicts for that scenario. For example, instead of looking for a training example in July, we could just make the relevant node and edge features. Below, we make the node/edge features for July, and then generate 100 samples of July (note that UGM_CRF_makePotentials makes the potential for the first training example by default):
XtestNode = [1 0 0 0 0 0 0 1 0 0 0 0 0]; % Turn on bias and indicator variable for July
XtestNode = repmat(XtestNode,[1 1 nNodes]);
XtestEdge = UGM_makeEdgeFeatures(XtestNode,edgeStruct.edgeEnds,sharedFeatures);
[nodePot,edgePot] = UGM_CRF_makePotentials(w,XtestNode,XtestEdge,nodeMap,edgeMap,edgeStruct);
samples = UGM_Sample_Chain(nodePot,edgePot,edgeStruct);
Here is the result:
These samples seem to confirm our intuition that it rains more in December than July.
% Set up regularization parameters lambda = 10*ones(size(w)); lambda(1) = 0; % Don't penalize node bias variable lambda(14:17) = 0; % Don't penalize edge bias variable regFunObj = @(w)penalizedL2(w,@UGM_CRF_NLL,lambda,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain); % Optimize w = zeros(nParams,1); w = minFunc(regFunObj,w); NLL = UGM_CRF_NLL(w,Xnode,Xedge,y,nodeMap,edgeMap,edgeStruct,@UGM_Infer_Chain) Iteration FunEvals Step Length Function Val Opt Cond 1 2 2.99873e-05 1.90437e+04 2.55653e+03 2 3 1.00000e+00 1.76411e+04 2.33918e+03 3 4 1.00000e+00 1.72763e+04 5.65313e+02 4 5 1.00000e+00 1.72539e+04 1.94029e+02 5 6 1.00000e+00 1.72203e+04 4.66367e+02 6 7 1.00000e+00 1.71863e+04 4.26978e+02 7 8 1.00000e+00 1.71725e+04 9.30902e+01 8 9 1.00000e+00 1.71707e+04 8.04480e+01 9 10 1.00000e+00 1.71694e+04 1.34711e+02 10 11 1.00000e+00 1.71665e+04 2.03993e+02 11 12 1.00000e+00 1.71628e+04 2.00020e+02 12 13 1.00000e+00 1.71597e+04 1.81252e+02 13 14 1.00000e+00 1.71549e+04 4.18050e+01 14 15 1.00000e+00 1.71533e+04 7.63220e+01 15 16 1.00000e+00 1.71520e+04 1.35407e+02 16 17 1.00000e+00 1.71501e+04 1.57695e+02 17 18 1.00000e+00 1.71478e+04 9.84706e+01 18 19 1.00000e+00 1.71467e+04 1.40848e+01 19 20 1.00000e+00 1.71463e+04 4.76307e+01 20 21 1.00000e+00 1.71460e+04 7.27852e+01 21 22 1.00000e+00 1.71455e+04 8.87116e+01 22 23 1.00000e+00 1.71449e+04 6.67483e+01 23 24 1.00000e+00 1.71446e+04 1.75071e+01 24 25 1.00000e+00 1.71445e+04 1.77834e+01 25 26 1.00000e+00 1.71444e+04 2.74928e+01 26 27 1.00000e+00 1.71443e+04 3.80644e+01 27 28 1.00000e+00 1.71442e+04 4.73859e+01 28 29 1.00000e+00 1.71438e+04 4.55021e+01 29 30 1.00000e+00 1.71433e+04 4.03336e+01 30 32 4.93663e-01 1.71429e+04 1.96383e+01 31 33 1.00000e+00 1.71426e+04 4.42916e+00 32 34 1.00000e+00 1.71426e+04 7.80547e+00 33 35 1.00000e+00 1.71426e+04 6.56353e+00 34 36 1.00000e+00 1.71425e+04 2.32351e+00 35 37 1.00000e+00 1.71425e+04 2.84096e+00 36 38 1.00000e+00 1.71425e+04 3.88930e+00 37 39 1.00000e+00 1.71425e+04 3.93474e+00 38 40 1.00000e+00 1.71425e+04 6.13024e+00 39 42 3.79889e-01 1.71424e+04 5.21645e+00 40 43 1.00000e+00 1.71424e+04 2.93747e+00 41 44 1.00000e+00 1.71424e+04 5.06808e-01 42 45 1.00000e+00 1.71424e+04 4.22126e-01 43 46 1.00000e+00 1.71424e+04 9.28331e-01 44 47 1.00000e+00 1.71424e+04 7.80794e-01 45 48 1.00000e+00 1.71424e+04 3.99689e-01 46 49 1.00000e+00 1.71424e+04 6.24792e-01 47 50 1.00000e+00 1.71424e+04 1.40733e+00 48 51 1.00000e+00 1.71424e+04 1.81548e+00 49 52 1.00000e+00 1.71424e+04 1.19220e+00 50 53 1.00000e+00 1.71424e+04 2.22150e-01 51 54 1.00000e+00 1.71424e+04 7.59190e-02 52 55 1.00000e+00 1.71424e+04 1.00610e-01 53 56 1.00000e+00 1.71424e+04 1.16663e-01 54 57 1.00000e+00 1.71424e+04 6.81109e-02 55 59 5.02027e-01 1.71424e+04 6.86376e-02 56 60 1.00000e+00 1.71424e+04 6.01253e-02 57 61 1.00000e+00 1.71424e+04 8.71068e-02 58 62 1.00000e+00 1.71424e+04 1.11292e-01 59 63 1.00000e+00 1.71424e+04 8.80091e-02 60 65 2.88591e-01 1.71424e+04 5.69282e-02 61 66 1.00000e+00 1.71424e+04 1.04917e-02 62 67 1.00000e+00 1.71424e+04 1.08230e-02 63 68 1.00000e+00 1.71424e+04 8.98160e-03 64 69 1.00000e+00 1.71424e+04 8.42876e-03 65 70 1.00000e+00 1.71424e+04 3.26717e-03 66 71 1.00000e+00 1.71424e+04 4.21883e-03 67 72 1.00000e+00 1.71424e+04 6.02435e-03 68 73 1.00000e+00 1.71424e+04 6.57011e-03 69 74 1.00000e+00 1.71424e+04 1.47230e-02 70 75 1.00000e+00 1.71424e+04 2.70788e-03 71 76 1.00000e+00 1.71424e+04 1.49507e-03 72 77 1.00000e+00 1.71424e+04 1.12688e-03 73 78 1.00000e+00 1.71424e+04 6.33433e-04 74 79 1.00000e+00 1.71424e+04 8.52142e-04 75 80 1.00000e+00 1.71424e+04 1.62696e-03 76 81 1.00000e+00 1.71424e+04 2.34920e-03 77 82 1.00000e+00 1.71424e+04 2.19590e-03 78 83 1.00000e+00 1.71424e+04 1.36188e-03 Function Value changing by less than progTol NLL = 1.7131e+04In the above, the optimization took fewer iterations than in the unregularized (maximum likelihood case). This is because the regularization makes the problem better-conditioned.
In the above, we follow the usual convention where the bias variables are not regularized. It is often argued that it is important to not regularize the bias variable (to avoid dependence of these weights on the scale), although in practice it often does not make a big difference if is is regularized.
As expected, the regularized estimates had a higher negative log-likelihood of the training data. However, if we carefully select the regularization parameter(s), the regularized estimates will typically have a lower negative log-likelihood than the maximum likelihood estimates on new data (ie. data that was not used for parameter estimation). Since we typically want a model that does reasonable things given new data, we will typically use a regularized estimate.
An alternative to L2-regularization is L1-regularization. In addition to regularizing the parameter weights, L1-regularization also encourages the weights to be exactly zero. This means that it can be used as a form of feature pruning, since features that are exactly zero can be safely removed from the model without changing anything. Using L1-regularization leads to non-differentiable optimization problems so minFunc cannot be used. But, this non-differentiable problem is highly-structured and can be solved with the related L1General package.
In this demo, the same features are used for every node and edge. You will often want to use different features for each node and edge; we will see an example of this in the next demo. However, currently UGM assumes that all nodes have the same number of features, and that all edges have the same number of features (but the number of node features doesn't have to be the same as the number of edge features). As with allowing data sets where different training examples have different graph structures, you would simply need to add up the objectives and gradients across training examples with different numbers of features in order to handle this case.
Currently, UGM only supports log-linear potentials for CRFs, but this is the most widely-used type of potential, and this parameterization is appealing since it is jointly convex in the parameters. Nevertheless, it is still possible to incorporate non-linear effects by replacing/augmenting the features with non-linear transformations of the features. Another limitation of UGM is that it does not currently support missing values or hidden variables.
CRFs were originally proposed as a model for natural language processing applications in:
I have worked with Kevin Murphy on the Conditional Random Field Toolbox. However, UGM makes this toolbox obsolete; UGM is faster, simpler, more general, and implements more of the available methods.
PREVIOUS DEMO NEXT DEMO