I got thrown into writing multi-threaded programs at work when I had to port some code from Visual C++ to GNU C++. Some grad student wrote the original code and I just had to modify it, which was annoying but it gave me a minimal working knowledge of POSIX threading, which is actually far easier than I expected.
While Matlab does a good job of using multiple processor cores on some of its built-in functions, I wanted to get add that to some of the custom mex functions I’ve written. I’ve achieved some great speed increases when re-writing matlab functions in c. Especially with iterative loops (x(i) = x(i-1)+x(i+1)), or times where I can make enormous assumptions about the input to a function and avoid all kinds of time-wasting error detections.
Anyway, I spent a little time trying to figure out how to use threading within a matlab mex function and was able to get it working. The actual code is below, and I assume some familiarity with mex functions and POSIX threading (read the links above). Here’s a brief overview of how it works:
The function called by each thread must be of the form void *function_name(*void).
You can only send one variable to the function, so if you want to send more information than that, create a struct to hold everything. When you create the thread, you’ll have to cast this as a void*.
Once in the thread function, you need to create a pointer of the type that you want to use (the struct that you just made) to the void* that was passed to it.
The code below simply fills a matrix with this very exciting pattern:

Within the mexFunction(), I figure out how many columns of this matrix will be filled by each thread (always try to process by column. See the bottom of this post for more). In my case, I used 7 threads and had a 1000×1000 matrix, so each of the first 6 threads process 142 columns, and the 7th thread processes the remaining 148. I realize it’s better to have 6 threads process 143 and the 7th process 142, but I didn’t really want to get into figuring that out right now.
You can compile and run the code with two simple lines. Make sure that when you compile it, you link against the pthread library. In this example I have each thread print out it’s ID and the number of columns it will be working on.
>> mex -lpthread fill_matrix_par.c
>> x=fill_matrix_par(1000);
In thread 0. Filling 142 columns
In thread 6. Filling 148 columns
In thread 4. Filling 142 columns
In thread 3. Filling 142 columns
In thread 2. Filling 142 columns
In thread 1. Filling 142 columns
#include "mex.h"
#include <math.h>
#include "pthread.h"
/**
* prhs[0] : Nrow - dimension of the square matrix to be output
*
* plhs[0] : CS - an Nrow by Nrow matrix output
*/
/* We need to define a structure to pass to each thread, since you can only
* pass it one variable. This will hold everything needed by the thread.
*/
typedef struct {
int Ncol; /* number of columns each specific thread will process */
int Nrow; /* number of rows in the matrix */
int offset; /* this is for use in filling the matrix */
int i; /* this is just and ID for the thread */
double *col_start; /* This is a pointer to a specific section of the matrix */
} info_T;
/* Define how many threads you want to use.
* This shouldn't exceed the number of cores on your computer. I'm using
* an 8-core server, so I'll use 7 of them for now
*/
#define MAX_NUMBER_THREADS 7
/* Here's the definition of the function that will be called
* in each thread
*/
void *calc_col( void *ptr );
void mexFunction(int nlhs,
mxArray* plhs[],
int nrhs,
const mxArray* prhs[])
{
/* Grab the input from matlab */
int Nrow = *mxGetPr(prhs[0]);
/* Create the output DoubleMatrix */
mxArray* CSarr = mxCreateDoubleMatrix(Nrow, Nrow, mxREAL);
double* CS = mxGetPr(CSarr);
int i;
/* An array of threads and structures to be passed to each thread */
pthread_t pThreads[MAX_NUMBER_THREADS];
info_T pass_struct[MAX_NUMBER_THREADS];
/* Each thread will handle this many columns, except the last thread
* will will handle the remainder as well. I could speed this up by
* evenly distributing the remaineder threads, but I don't.
*/
int cols_per_thread = Nrow / MAX_NUMBER_THREADS;
for (i = 0; i < MAX_NUMBER_THREADS; i++){
/* fill the structure with info to be passed to the given thread */
pass_struct[i].Ncol = cols_per_thread;
pass_struct[i].Nrow = Nrow;
pass_struct[i].col_start = CS + Nrow*i*cols_per_thread;
pass_struct[i].i = i;
pass_struct[i].offset = i*cols_per_thread;
/* The last thread needs to get the remainder of the columns.
*/
if(i == MAX_NUMBER_THREADS - 1){
pass_struct[i].Ncol = Nrow - i*cols_per_thread;
}
/* spawn the thread */
pthread_create(&pThreads[i], NULL,
calc_col, (void*) &pass_struct[i]);
}
/* wait for all of the threads to finish */
for (i = 0; i < MAX_NUMBER_THREADS; i++){
pthread_join( pThreads[i], NULL);
}
/* return CSarr to matlab */
nlhs = 1;
plhs[0] = CSarr;
}
void *calc_col( void *ptr ){
/* Here's the function that calculates the column starting at the
* given point
*/
/* create a local structure to point to the input pointer */
info_T *struct_ptr;
struct_ptr = (info_T *) ptr;
mexPrintf("In thread %i. Filling %i columns\n",
struct_ptr->i, struct_ptr->Ncol);
/* perform whatever you need to do on the matrix here */
int rows, cols;
for(cols=0; cols<struct_ptr->Ncol; cols++){
for(rows=0; rows<struct_ptr->Nrow; rows++){
*(struct_ptr->col_start + rows + cols*struct_ptr->Nrow) =
-rows+cols+struct_ptr->offset;
}
}
/* exit the thread */
pthread_exit(NULL);
}
So here’s the catch: Each of the threads are writing to the same matrix at the same time, each accessing a different section of the memory. struct_ptr->col_start points to column 0, 142, 244, … depending on the thread number.
This is PROBABLY a bad idea and I don’t think it’s normally considered to be thread-safe, but it works here. I most often write specific functions for specific tasks that fit inside specific Matlab scripts. I’d love for someone to correct me on this, but for now I just figured it out on my own.
With regards to processing by column instead of row, it’s always better to work down columns because of how the arrays are stored in memory. The image below sums it up pretty well:

So if you want to fill a matrix by row in c/c++, you would need to use two loops and do lots of multiplying:
for(r=0; r < Nrows; r++){
for(c=0;c < Ncols; c++){
*(array_loc1 + c*Nrows + r) = c+r*Nrows;
}
}
and get this for an output:
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
Whereas to fill it by column you would just loop over the total number of cells:
for(c=0; c < Nrows * Ncols; c++){
*(array_loc2 + c) = c;
}
and you’ll get this:
0 5 10 15 20
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24