% snippet A, runs fast X = rand(1,1000); Y = exp(13*X); % snippet B, runs slow X = rand(1,1000); Y = zeros(1,1000); for k=1:1000 Y(1,k) = exp(X(1,k)); end
guillaume alain
This guide assumes that you are familiar with the idea of vectorizing your functions to gain speed. I will assume that you know why the code snippet A below runs faster than the snippet B. It all comes down to how Matlab interprets code. Its exp function does things really fast and it operates on arrays internally. If you iterate outside of it you pay a hefty price for the large number of separate calls to exp.
% snippet A, runs fast X = rand(1,1000); Y = exp(13*X); % snippet B, runs slow X = rand(1,1000); Y = zeros(1,1000); for k=1:1000 Y(1,k) = exp(X(1,k)); end
You can usually combine vectorized functions in a straightforward way, but sometimes it requires a LOT of mental gymnastics. You're left with Matlab code that's hard to read, but the only alternatives are
The bit of Matlab expertise that I want to share here is that sometimes you can easily vectorize your functions along some of the dimensions of your arrays. You avoid the mental gymnastics of 3D arrays colliding and telescoping while still reaping the speed benefits. By the way, this also works in Python with Numpy/Scipy.
function m = sqdist(p, q, A) % SQDIST Squared Euclidean or Mahalanobis distance. % SQDIST(p,q) returns m(i,j) = (p(:,i) - q(:,j))'*(p(:,i) - q(:,j)). % SQDIST(p,q,A) returns m(i,j) = (p(:,i) - q(:,j))'*A*(p(:,i) - q(:,j)). % From Tom Minka's lightspeed toolbox [d, pn] = size(p); [d, qn] = size(q); if nargin == 2 pmag = sum(p .* p, 1); qmag = sum(q .* q, 1); m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*q; %m = ones(pn,1)*qmag + pmag'*ones(1,qn) - 2*p'*q; else if isempty(A) | isempty(p) error('sqdist: empty matrices'); end Ap = A*p; Aq = A*q; pmag = sum(p .* Ap, 1); qmag = sum(q .* Aq, 1); m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*Aq; end
In all the years that I've used it, I think I traced it by hand only once to see exactly why/how it worked. The, the next day, it was as if I had never done the exercise. Back into ignorance. I can write magic code like that, but it requires planning, deep thinking and acrobatics. Besides, Tom Minka's code works by exploiting certain properties of matrix multiplication that play well with the L2 norm. It would not work for other p-norms (more on that later).
Let's look at the following code.
function E = semi_vectorized_sqdist(A, B) % A pure Matlab implementation of sqdist, vectorized along two dimensions. % Same output as the sqdist function by Tom Minka. m = size(A,2); n = size(B,2); E = zeros(m,n); for i=1:m % iterate manually over i=1:m and use vectorized operations for the rest D = repmat(A(:,i), [1,n]) - B; % W will contain the results for i and all j=1:n W = sum(D.^2,1); % put those values inside E where they belong E(i,:) = W'; end end
The most complicated part of this code comes from the line
D = repmat(A(:,i), [1,n]) - B;which just copies over n times the column A(:,i) in order to subtract it from each column of B. Relatively straightforward, but are we shooting ourselves in the foot with that for i=1:m loop ?
We'll be benchmarking these two versions of the function and we'll even add a third implementation that's even less vectorized.
function E = slow_sqdist(A, B) m = size(A,2); n = size(B,2); E = zeros(m,n); for i=1:m for j=1:n E(i,j) = sum( (A(:,i) - B(:,j)) .^ 2 ); end end end
The only interesting conclusion justifying this experiment would be that
We will demonstrate exactly that and explain why it matters. Naturally, this will all depend on the number of columns m,n and the dimension d of the vectors compared.
I will digress a bit here to include another important piece of Matlab advice. When you have multiple implementations of the same function, compare their outputs. People are talking about units tests everywhere. This stuff is gold. When you write a function in C to get insane speed, write the same function in Matlab, using wasteful for loops if you want them for clarity. Run them again each other. Voilà, units tests in Matlab !
% sanity check m = 1000; n = 1000; d = 17; A = rand(d,m); B = rand(d,n); reference = sqdist(A,B); results = semi_vectorized_sqdist(A,B); fprintf('Fast implementation matched with reference within %f digits.\n',... -log10(max(abs(reference(:) - results(:))))); results_s = slow_sqdist(A,B); fprintf('Slow implementation matched with reference within %f digits.\n',... -log10(max(abs(reference(:) - results_s(:)))));
As mentioned earlier, the performance depends on the particular values of m,n,d. In a naive face-matching algorithm I once used the sqdist function with d=256. At other times, when doing kernel smoothing over data points in two dimensions, I had d=2 but had m=100 points and m=1000 places at which I had to evaluate the density function.
First off, using the values
m = 1000; n = 1000; d = 17;
we measure the speed of the functions with this code :
reps = 100; tic for r=1:reps sqdist(A,B); end time = toc; fprintf('Reference sqdist implementation took on average %f seconds.\n', time/reps); % vectorized along two dimensions n,d tic for r=1:reps semi_vectorized_sqdist(A,B); end time = toc; fprintf('Semi-vectorized sqdist implementation took on average %f seconds.\n', time/reps); % vectorized along one dimension d tic for r=1:reps slow_sqdist(A,B); end time = toc; fprintf('Slow sqdist implementation took on average %f seconds.\n', time/reps);
and we get
Reference sqdist implementation took on average 0.089418 seconds. Semi-vectorized sqdist implementation took on average 0.319253 seconds. Slow sqdist implementation took on average 4.100107 seconds.
We can put this in a table in which we compare results for other values of (m,n,d).
| (256,256,4) | (1024,1024,4) | (1024,1024,8) | |
| sqdist | 0.0019 s | 0.134 s | 0.134 s |
| semi_vectorized_sqdist | 0.0317 s | 0.258 s | 0.293 s |
| slow_sqdist | 0.2591 s | 4.163 s | 4.241 s |
I was a bit surprized when I did this experiment "formally" to see that Matlab takes the same average time for sqdist with d=4,8 and even d=16 (not included the in the table).
When doing density estimation, one method involves putting Gaussian distributions on top of every observed point from at reference dataset. If we have 30 points in the initial dataset, the resulting density function is a weighted sum of these 30 contributions, as illustrated below :

When plotting values from the density function we need to take the Euclidean distances between these 30 points and all the points from our domain at which we want to evaluate the function. If we are in two-dimensional space and we use a mesh of 256 * 256 points, this translates into d=2,m=30,n=65536. We'll run the numbers in this case to see how the performances compare.
| (30,65536,2) | (100,100,256) | (256,256,1024) | |
| sqdist | 0.210 s | 0.0014 s | 0.029 s |
| semi_vectorized_sqdist | 0.265 s | 0.0347 s | 1.559 s |
| slow_sqdist | 7.85 s | 0.0609 s | 0.815 s |
This is the punch line. We got the same order of magnitude of speeds with a ruthlessly vectorized function than we got with the little half-assed semi_vectorized_sqdist function that I wrote in a few minutes for the purposes of this demonstration. It took me years as a Matlab user to realize that "vectorizing my code" meant that I had to vectorize operations at the bottleneck, but not necessarily along all the dimensions of my data.
I hope the previous table was convincing, but I didn't really cover the case where m > n and where our function could be very lousy because of that. It's relatively easy to fix. I've shown here an example of one potential solution. Adding an extra file isn't very elegant, but it's just to show that it's easy to do. Make sure also that you test results before flipping arguments and inserting a bunch of transpositions where they seem plausible.
function E = prudent_semi_vectorized_sqdist(A, B) m = size(A,2); n = size(B,2); if m>n E = semi_vectorized_sqdist(B,A)'; else E = semi_vectorized_sqdist(A,B); end end
| (30,65536,2) | (65536,30,2) | |
| sqdist | 0.21 s | 0.21 s |
| semi_vectorized_sqdist | 0.25 s | 6.65 s |
| prudent_semi_vectorized_sqdist | 0.25 s | 0.25 s |
That's it. The important question is always : What takes the Matlab interpreter the most time when looping "manually" ? Is it the overhead costs of the interpretation, or is it the actual functions being evaluated on arrays ? This is just another flavor of the bottleneck optimization principle.
I mentioned earlier that Tom Minka's code was using matrix multiplication to encode the operations that it needs. This works when using Gaussian distributions or mean-square errors, but it would not work if we wanted to use the L1 norm instead of the L2 norm. However, with our code we could get the L1 norm simply by changing the line
W = sum(D.^2,1);to
W = sum(abs(D),1);Again, this is just an example to illustrate my point. There are other situations where that kind of flexibility is very welcome.