Regression in Matrix Form

I’m working on some predictive modelling projects and I need to iteratively compute R² statistics over 100’s of variables. Each time I do the calculations I need to go and have an extended coffee break – and I’m starting to buzz with too much caffeine so I thought I would look to see whether I could make my code more efficient!

Linear regression in matrix form looks like this:

One of the great things about JSL is that I can directly implement this formula:

β = Inv(X`*X)*X`*Y;

Where the grave accent indicates the transpose of the X matrix. That’s it! One line of code to compute the parameter estimates (β) for a set of X and Y data. There’s a direct correspondence between the mathematical form and the code – no need to figure out complex algorithms to convert the problem into JSL. I of course need the matrices, so here is the full code:

// generate matrices
X = Column("height") << Get Values;
Y = Column("weight") << Get Values;
// add a column of 1's for the intercept term
X = J(Nrow(X),1) || X; 
// compute least squares estimates
β = Inv(X`*X)*X`*Y;

Now I have my solution I can use it to compute the R² statistic:

N = NRows(Y);
Ybar = Mean(Y);
R2 = (β`*X`*Y - N*Ybar^2)/(Y`*Y - N*Ybar^2);

In practice I want to perform this for 100’s of variables based on real-world data. That requires a bit more care to handle situations such as missing data or singular values. Below is a more robust implementation:

Of course it’s possible to perform regression in JMP using the Bivariate, and in JSL this is how I would extract the R²value:

biv = Bivariate( 
    Y( :weight), 
    X( :height ), 
    Fit Line, invisible 
);
rep = biv << report;
mat = rep[NumberColBox(1)] << Get As Matrix;
rep << Close Window;
R2 = mat[1];

In fact, if my only goal is the calculation of R² then I could use the Multivariate platform. And then of course there is the Fit Model platform.

How do these methods compare in terms of performance?

Below is a chart of execution times for each method:

The matrix calculations are 5 times faster than Bivariate and over 30 times faster than Fit Model. That last statistic is important because I also want to generalise the method for some forward selection calculations that involve more than one X variable in the model.

3 thoughts on “Regression in Matrix Form”

JMP Ahead

Regression in Matrix Form

3 thoughts on “Regression in Matrix Form”

Leave a Reply Cancel reply

Insights in the use of JMP and the scripting language JSL