guillaume alain

On Being Reasonable When Vectorizing Code

Code Vectorization in Matlab (or Python)

This guide assumes that you are familiar with the idea of vectorizing your functions to gain speed. I will assume that you know why the code snippet A below runs faster than the snippet B. It all comes down to how Matlab interprets code. Its exp function does things really fast and it operates on arrays internally. If you iterate outside of it you pay a hefty price for the large number of separate calls to exp.

% snippet A, runs fast 
X = rand(1,1000);
Y = exp(13*X);

% snippet B, runs slow 
X = rand(1,1000);
Y = zeros(1,1000);
for k=1:1000
  Y(1,k) = exp(X(1,k));
end 

You can usually combine vectorized functions in a straightforward way, but sometimes it requires a LOT of mental gymnastics. You're left with Matlab code that's hard to read, but the only alternatives are

The bit of Matlab expertise that I want to share here is that sometimes you can easily vectorize your functions along some of the dimensions of your arrays. You avoid the mental gymnastics of 3D arrays colliding and telescoping while still reaping the speed benefits. By the way, this also works in Python with Numpy/Scipy.

An example of application in Machine Learning

On many different occasions I've used a very useful function called sqdist from Tom Minka's Lightspeed toolbox (available for free here). The original function looks like this (just skim through it, nodding your head) :
function m = sqdist(p, q, A)
% SQDIST      Squared Euclidean or Mahalanobis distance. 
% SQDIST(p,q)   returns m(i,j) = (p(:,i) - q(:,j))'*(p(:,i) - q(:,j)). 
% SQDIST(p,q,A) returns m(i,j) = (p(:,i) - q(:,j))'*A*(p(:,i) - q(:,j)). 
 
%  From Tom Minka's lightspeed toolbox 
 
[d, pn] = size(p);
[d, qn] = size(q);
 
if nargin == 2
 
  pmag = sum(p .* p, 1);
  qmag = sum(q .* q, 1);
  m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*q;
  %m = ones(pn,1)*qmag + pmag'*ones(1,qn) - 2*p'*q; 
 
else 
 
  if isempty(A) | isempty(p)
    error('sqdist: empty matrices');
  end 
  Ap = A*p;
  Aq = A*q;
  pmag = sum(p .* Ap, 1);
  qmag = sum(q .* Aq, 1);
  m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*Aq;
end

In all the years that I've used it, I think I traced it by hand only once to see exactly why/how it worked. The, the next day, it was as if I had never done the exercise. Back into ignorance. I can write magic code like that, but it requires planning, deep thinking and acrobatics. Besides, Tom Minka's code works by exploiting certain properties of matrix multiplication that play well with the L2 norm. It would not work for other p-norms (more on that later).

Compromising halfway

Let's look at the following code.

function E = semi_vectorized_sqdist(A, B)
% A pure Matlab implementation of sqdist, vectorized along two dimensions. 
% Same output as the sqdist function by Tom Minka. 
  m = size(A,2);
  n = size(B,2);

  E = zeros(m,n);

  for i=1:m
    % iterate manually over i=1:m and use vectorized operations for the rest 
    D = repmat(A(:,i), [1,n]) - B;
    % W will contain the results for i and all j=1:n 
    W = sum(D.^2,1);
    % put those values inside E where they belong 
    E(i,:) = W';
  end 
end

The most complicated part of this code comes from the line

D = repmat(A(:,i), [1,n]) - B;
which just copies over n times the column A(:,i) in order to subtract it from each column of B. Relatively straightforward, but are we shooting ourselves in the foot with that for i=1:m loop ?

We'll be benchmarking these two versions of the function and we'll even add a third implementation that's even less vectorized.

function E = slow_sqdist(A, B)
    m = size(A,2);
    n = size(B,2);
 
    E = zeros(m,n);
 
    for i=1:m
        for j=1:n
            E(i,j) = sum( (A(:,i) - B(:,j)) .^ 2 );
        end 
    end 
 
end 

The only interesting conclusion justifying this experiment would be that

We will demonstrate exactly that and explain why it matters. Naturally, this will all depend on the number of columns m,n and the dimension d of the vectors compared.

Due diligence, sanity checks, units tests

I will digress a bit here to include another important piece of Matlab advice. When you have multiple implementations of the same function, compare their outputs. People are talking about units tests everywhere. This stuff is gold. When you write a function in C to get insane speed, write the same function in Matlab, using wasteful for loops if you want them for clarity. Run them again each other. Voilà, units tests in Matlab !

% sanity check 
m = 1000;
n = 1000;
d = 17;
 
A = rand(d,m);
B = rand(d,n);
 
reference = sqdist(A,B);

results = semi_vectorized_sqdist(A,B);
fprintf('Fast implementation matched with reference within %f digits.\n',...
      -log10(max(abs(reference(:) - results(:)))));
 
results_s = slow_sqdist(A,B);
fprintf('Slow implementation matched with reference within %f digits.\n',...
      -log10(max(abs(reference(:) - results_s(:)))));

Speed measurements

As mentioned earlier, the performance depends on the particular values of m,n,d. In a naive face-matching algorithm I once used the sqdist function with d=256. At other times, when doing kernel smoothing over data points in two dimensions, I had d=2 but had m=100 points and m=1000 places at which I had to evaluate the density function.

First off, using the values

m = 1000;
n = 1000;
d = 17;

we measure the speed of the functions with this code :

reps = 100;
tic
for r=1:reps
    sqdist(A,B);
end 
time = toc;
fprintf('Reference sqdist implementation took on average %f seconds.\n', time/reps);
 
% vectorized along two dimensions n,d 
tic
for r=1:reps
    semi_vectorized_sqdist(A,B);
end 
time = toc;
fprintf('Semi-vectorized sqdist implementation took on average %f seconds.\n', time/reps);
 
% vectorized along one dimension d 
tic
for r=1:reps
    slow_sqdist(A,B);
end 
time = toc;
fprintf('Slow sqdist implementation took on average %f seconds.\n', time/reps);

and we get

Reference sqdist implementation took on average 0.089418 seconds.
Semi-vectorized sqdist implementation took on average 0.319253 seconds.
Slow sqdist implementation took on average 4.100107 seconds.

We can put this in a table in which we compare results for other values of (m,n,d).

(256,256,4)(1024,1024,4)(1024,1024,8)
0.0019 s0.134 s0.134 s
0.0317 s0.258 s0.293 s
0.2591 s4.163 s4.241 s

I was a bit surprized when I did this experiment "formally" to see that Matlab takes the same average time for sqdist with d=4,8 and even d=16 (not included the in the table).

An example from density estimation

When doing density estimation, one method involves putting Gaussian distributions on top of every observed point from at reference dataset. If we have 30 points in the initial dataset, the resulting density function is a weighted sum of these 30 contributions, as illustrated below :

When plotting values from the density function we need to take the Euclidean distances between these 30 points and all the points from our domain at which we want to evaluate the function. If we are in two-dimensional space and we use a mesh of 256 * 256 points, this translates into d=2,m=30,n=65536. We'll run the numbers in this case to see how the performances compare.

(30,65536,2)(100,100,256)(256,256,1024)
0.210 s0.0014 s0.029 s
0.265 s0.0347 s1.559 s
7.85 s0.0609 s0.815 s

This is the punch line. We got the same order of magnitude of speeds with a ruthlessly vectorized function than we got with the little half-assed semi_vectorized_sqdist function that I wrote in a few minutes for the purposes of this demonstration. It took me years as a Matlab user to realize that "vectorizing my code" meant that I had to vectorize operations at the bottleneck, but not necessarily along all the dimensions of my data.

Making this more robust

I hope the previous table was convincing, but I didn't really cover the case where m > n and where our function could be very lousy because of that. It's relatively easy to fix. I've shown here an example of one potential solution. Adding an extra file isn't very elegant, but it's just to show that it's easy to do. Make sure also that you test results before flipping arguments and inserting a bunch of transpositions where they seem plausible.

function E = prudent_semi_vectorized_sqdist(A, B)
    m = size(A,2);
    n = size(B,2);
 
    if m>n
        E = semi_vectorized_sqdist(B,A)';
    else 
        E = semi_vectorized_sqdist(A,B);
    end 
end 
(30,65536,2)(65536,30,2)
0.21 s0.21 s
0.25 s6.65 s
0.25 s0.25 s

That's it. The important question is always : What takes the Matlab interpreter the most time when looping "manually" ? Is it the overhead costs of the interpretation, or is it the actual functions being evaluated on arrays ? This is just another flavor of the bottleneck optimization principle.

Extra : Other Lp norms

I mentioned earlier that Tom Minka's code was using matrix multiplication to encode the operations that it needs. This works when using Gaussian distributions or mean-square errors, but it would not work if we wanted to use the L1 norm instead of the L2 norm. However, with our code we could get the L1 norm simply by changing the line

W = sum(D.^2,1);
to
W = sum(abs(D),1);
Again, this is just an example to illustrate my point. There are other situations where that kind of flexibility is very welcome.