load X.mat
[nRows,nCols] = size(X);
nNodes = nRows*nCols;
nStates = 2;
nInstances = 100;

% Make 100 noisy X instances
y = int32(1+X);
y = reshape(y,[1 1 nNodes]);
y = repmat(y,[nInstances 1 1]);
X = y + randn(size(y))/2;

%% Make edgeStruct
adj = sparse(nNodes,nNodes);

% Add Down Edges
ind = 1:nNodes;
exclude = sub2ind([nRows nCols],repmat(nRows,[1 nCols]),1:nCols); % No down edge for last row
ind = setdiff(ind,exclude);
adj(sub2ind([nNodes nNodes],ind,ind+1)) = 1;

% Add Right Edges
ind = 1:nNodes;
exclude = sub2ind([nRows nCols],1:nRows,repmat(nCols,[1 nRows])); % No right edge for last column
ind = setdiff(ind,exclude);
adj(sub2ind([nNodes nNodes],ind,ind+nRows)) = 1; % Right neighbour is ind+nRows in column-major order

% Add Up/Left Edges
adj = adj+adj';
edgeStruct = UGM_makeEdgeStruct(adj,nStates);
nEdges = edgeStruct.nEdges;
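The grid construction above can be sketched in NumPy (an illustration only; the demo itself uses the MATLAB code above). Node numbering follows MATLAB's column-major order, so the down neighbour of node i is i+1 and the right neighbour is i+nRows:

```python
import numpy as np

def lattice_adjacency(n_rows, n_cols):
    """Build the symmetric adjacency matrix of a 2D grid, mirroring the
    MATLAB code above: add Down and Right edges, then symmetrize.
    Nodes are numbered column-major (MATLAB order): node = r + c*n_rows."""
    n = n_rows * n_cols
    adj = np.zeros((n, n), dtype=int)
    for c in range(n_cols):
        for r in range(n_rows):
            i = r + c * n_rows
            if r + 1 < n_rows:       # Down edge (skip last row)
                adj[i, i + 1] = 1
            if c + 1 < n_cols:       # Right edge (skip last column)
                adj[i, i + n_rows] = 1
    return adj + adj.T               # Add Up/Left edges by symmetrizing

adj = lattice_adjacency(3, 4)
```

For a 3-by-4 grid this gives 2*4 down edges plus 3*3 right edges, i.e. 17 undirected edges in total.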

We next make the node and edge features, and the maps from features to parameters:

% Add bias and standardize columns
tied = 1;
Xnode = [ones(nInstances,1,nNodes) UGM_standardizeCols(X,tied)];
nNodeFeatures = size(Xnode,2);

% Make nodeMap
nodeMap = zeros(nNodes,nStates,nNodeFeatures,'int32');
for f = 1:nNodeFeatures
    nodeMap(:,1,f) = f;
end

% Make Xedge
sharedFeatures = [1 0];
Xedge = UGM_makeEdgeFeatures(Xnode,edgeStruct.edgeEnds,sharedFeatures);
nEdgeFeatures = size(Xedge,2);

% Make edgeMap
f = max(nodeMap(:));
edgeMap = zeros(nStates,nStates,nEdges,nEdgeFeatures,'int32');
for edgeFeat = 1:nEdgeFeatures
    edgeMap(1,1,:,edgeFeat) = f+edgeFeat;
    edgeMap(2,2,:,edgeFeat) = f+edgeFeat;
end
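The role of the parameter maps can be sketched as follows (an illustrative NumPy translation, not UGM code): a map entry p > 0 says "this (node, state, feature) combination uses shared parameter w(p)", while 0 means the contribution is fixed at zero, as for state 1 in the nodeMap above:

```python
import numpy as np

def node_potentials(w, Xnode, node_map):
    """Turn shared parameters into per-node potentials:
    logpot[n, s] = sum_f w[node_map[n, s, f]] * Xnode[f, n],
    where a map entry of 0 means 'no parameter' (contribution fixed at 0).
    Map entries are 1-based (MATLAB convention); w is 0-based."""
    n_nodes, n_states, n_feat = node_map.shape
    logpot = np.zeros((n_nodes, n_states))
    for n in range(n_nodes):
        for s in range(n_states):
            for f in range(n_feat):
                p = node_map[n, s, f]
                if p > 0:
                    logpot[n, s] += w[p - 1] * Xnode[f, n]
    return np.exp(logpot)
```

With only state 1 parameterized, state 2 always gets potential exp(0) = 1, so the parameters control the log-odds of state 1 at each node.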

maxIter = 3; % Number of passes through the data set
stepSize = 1e-4;
nParams = max([nodeMap(:);edgeMap(:)]); % Total number of parameters
w = zeros(nParams,1);
for iter = 1:maxIter*nInstances
    i = ceil(rand*nInstances); % Pick a random training example
    funObj = @(w)UGM_CRF_NLL(w,Xnode(i,:,:),Xedge(i,:,:),y(i,:),nodeMap,edgeMap,edgeStruct,@UGM_Infer_LBP);
    [f,g] = funObj(w);
    fprintf('Iter = %d of %d (fsub = %f)\n',iter,maxIter*nInstances,f);
    w = w - stepSize*g;
end
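The same stochastic gradient scheme can be sketched on a toy problem (an illustration in NumPy, not UGM code): pick a random training example, compute the gradient of that example's sub-objective, and take a small step. Here the CRF negative log-likelihood is replaced by a simple least-squares loss:

```python
import numpy as np

# Toy data: n examples, d features, noiseless linear targets.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true

max_iter = 3        # passes through the data set, as in the demo
step_size = 1e-2
w = np.zeros(d)
for it in range(max_iter * n):
    i = rng.integers(n)                   # pick a random training example
    g = 2 * (X[i] @ w - y[i]) * X[i]      # gradient of example i's squared loss
    w = w - step_size * g                 # stochastic gradient step
```

Each iteration touches a single example, so one "pass" costs the same as one full-gradient evaluation but makes n parameter updates.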

In this case, SGD obtains a better result in the time-limited setting. If we ran both methods longer, minFunc would eventually find a better solution and terminate, but the time needed for minFunc to 'catch up' to the simpler method increases with the size and redundancy of the data set. This makes SGD useful for training CRFs on very large data sets.

We proposed to use SGD for training CRFs in:

- Vishwanathan, Schraudolph, Schmidt, and Murphy (2006). Accelerated training of conditional random fields with stochastic gradient methods. International Conference on Machine Learning.

- Ratliff, Bagnell, and Zinkevich (2007). (Online) Subgradient methods for structured prediction. Artificial Intelligence and Statistics.

- Younes (1989). Parameter estimation for imperfectly observed Gibbsian fields. Probability Theory and Related Fields.

However, stochastic gradient methods have several disadvantages:

- We need to carefully choose the sequence of step sizes in order to obtain good performance.
- The method's rate of convergence slows down the longer we run it.
- It is hard to determine the point at which we should terminate the algorithm.
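To illustrate the first point, a typical decaying schedule has the form alpha_t = alpha_0 / (1 + t/tau) (a hypothetical choice shown for illustration; the demo above simply uses a constant stepSize = 1e-4):

```python
# Hypothetical decaying step-size schedule alpha_t = alpha0 / (1 + t/tau);
# alpha0 sets the initial step and tau controls how quickly it decays.
def step_size(t, alpha0=1e-4, tau=100.0):
    return alpha0 / (1.0 + t / tau)

schedule = [step_size(t) for t in range(300)]
```

Choosing alpha0 and tau poorly can make the method either diverge early or crawl later, which is exactly the tuning burden listed above.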

Some of these disadvantages are addressed by the hybrid deterministic-stochastic method described in:

- Friedlander and Schmidt (2011). Hybrid deterministic-stochastic methods for data fitting. Submitted.
