Even though sophisticated libraries for regression analysis exist in any well-respected computational software, it may often be useful to do regression from scratch.
In this way, you have access to all the details and may come up with ideas for making your computation more efficient.
Besides that, you may find out how to tweak your method if none of the functions the software provides converges to a solution in your particular case.
I found another reason to do it this way.
I think it is unwise to do every step of regression analysis numerically, applying the same numerical procedure to every mathematical model you try to fit to your data.
Some steps are best performed analytically.
I mean, let’s not become blind slaves of computers, and let’s do by hand those things that are easy to do by hand (and those things are often hard for computers), while letting our computers work on tasks that are hard for us but easy for them.
So, regression analysis in Scilab, … and on paper.
Suppose our data are stored in a comma separated values file,
data.csv. The last column in the file, y, is a function of all other columns (features) X1, X2, … :
y = f(X1, X2, …)
We load the data into Scilab, store the features in the matrix X, and the result in the column vector y:
clear, clc
// Load data
filename = 'data.csv';
data = csvRead(filename);
[m, n] = size(data);
X = data(:, 1:n-1);
y = data(:, $);
We are going to fit these data to some model h(X; θ), where θ is a set of parameters whose values minimise the following cost function (in a matrix form):
J = (h − y)ᵀ(h − y) / (2m) = (hᵀh − 2hᵀy + yᵀy) / (2m)
Here, m is the length of the vector y.
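As a quick sanity check, here is the same cost computed both ways in Python/NumPy (a sketch outside the Scilab session; the tiny data set and all variable names are mine):

```python
import numpy as np

# Tiny made-up example: m = 4 hypothesis values h and targets y
h = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.5, 3.5, 3.5])
m = len(y)

# Direct form: J = (h - y)'(h - y) / (2m)
J_direct = (h - y) @ (h - y) / (2 * m)

# Expanded form: J = (h'h - 2 h'y + y'y) / (2m)
J_expanded = (h @ h - 2 * (h @ y) + y @ y) / (2 * m)
```

Both forms give the same number, as they must; the direct form is the one worth coding, since it avoids the cancellation-prone difference of large terms.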
There are many ways to find the set of parameters θ that minimises the cost J.
In the method called ‘gradient descent’, one starts from an arbitrary vector θ (usually chosen by some heuristic to be not too far from the truth) and then moves through θ-space at a rate α in the direction opposite to the gradient of the cost function,
θ → θ − α∇J = θ − α ∇hᵀ(h − y) / m
If we use gradient descent, then rescaling the features may improve convergence, especially when the values of one feature differ significantly from the values of another feature.
The following code transforms the features so that each of them has zero mean and a standard deviation equal to one:
// Scale features and set them to mean = 0, std = 1
mu = mean(X, 'r');
X = X - repmat(mu, m, 1);
sigma = stdev(X, 'r');
X = X ./ repmat(sigma, m, 1);
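For readers who want to verify what this transformation does, here is the same scaling sketched in Python/NumPy (the feature matrix is made up; note that Scilab's stdev, like NumPy with ddof=1, computes the sample standard deviation):

```python
import numpy as np

# Hypothetical feature matrix whose columns differ wildly in scale
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 6000.0]])

mu = X.mean(axis=0)              # column means, like mean(X, 'r')
sigma = X.std(axis=0, ddof=1)    # sample std, like stdev(X, 'r')
X_scaled = (X - mu) / sigma      # broadcasting plays the role of repmat

# Each column of X_scaled now has mean 0 and sample standard deviation 1
```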
In the simplest case, when h(X; θ) is linear, that is
h = θ0 + θ1X1 + θ2X2 + … + θn−1Xn−1
we can rewrite this formula in matrix form by introducing X0 = 1:
h = X θ
Let’s add the intercept column to X:
// Add intercept term to X
X = [ones(m, 1) X];
To perform gradient descent for the linear case, we need to calculate ∇hᵀ.
For some strange reason, most computational programs do it numerically, wasting time and processor power and introducing errors.
Yet, it’s so easy to do it analytically:
∇hᵀ = Xᵀ
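The claim that the analytic gradient beats the numerical one is easy to check. The following Python/NumPy sketch (synthetic data, all names mine) compares Xᵀ(h − y)/m with a central-difference approximation of ∇J:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n - 1))])
y = rng.normal(size=m)
theta = rng.normal(size=n)

def cost(t):
    r = X @ t - y
    return r @ r / (2 * m)

# Analytic gradient: ∇J = X'(h − y)/m, using ∇h' = X'
grad_analytic = X.T @ (X @ theta - y) / m

# Numerical gradient by central differences: 2n cost evaluations
eps = 1e-6
grad_numeric = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

The two agree to many digits, but the numerical version needs 2n full cost evaluations per gradient and is limited by rounding error in the difference.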
Now, the gradient descent is done in Scilab with the following code:
// === Gradient Descent ===
// Learning rate and number of iterations
a = 0.01;
n_iter = 400;
// Initialize fitting parameters
theta = zeros(n, 1);
// Run Gradient Descent
for iter = 1:n_iter do
    h = X * theta;
    theta = theta - a * X' * (h - y) / m;
    J_history(iter) = (h - y)' * (h - y) / (2*m);
end
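The same loop, transcribed into Python/NumPy on synthetic data standing in for data.csv (all names are mine), shows the cost decreasing steadily at this learning rate:

```python
import numpy as np

# Synthetic regression problem: known coefficients plus a little noise
rng = np.random.default_rng(2)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n - 1))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=m)

a, n_iter = 0.01, 400          # learning rate and iteration count, as above
theta = np.zeros(n)
J_history = np.empty(n_iter)

for it in range(n_iter):
    h = X @ theta
    theta -= a * X.T @ (h - y) / m
    J_history[it] = (h - y) @ (h - y) / (2 * m)

# For a small enough learning rate, J_history decreases at every step
```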
J_history is the cost of linear regression at each iteration step.
The results can be visualised as follows:
// Plot the convergence graph
clf(0); scf(0);
plot(1:n_iter, J_history');
xtitle('Convergence', 'Iterations', 'Cost')
// Display gradient descent's result
disp(theta)
The same procedure can be applied if you fit your data to a non-linear model.
The only difference is the formula for ∇hᵀ. I think it is wise to compute it analytically in this case as well.
If you fit your data to some mathematical model, it is unlikely that your model is so ugly that you cannot take its derivative by hand.
If it is too tedious to do by hand, or you are afraid of making mistakes, you can use computer algebra software such as Maxima.
And this way, you’ll never forget the table of derivatives 🙂
There’s another method in regression analysis – the normal equation. It is preferred to gradient descent when the number of features is not too large. Besides, it doesn’t require rescaling.
If the cost function is minimal at a certain θ then at that θ we have
∇J = 0
∇hᵀ(h − y) = 0
In the linear case, ∇hᵀ = Xᵀ, and this equation has the closed-form solution
θ = (XᵀX)⁻¹Xᵀy
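The closed-form solution θ = (XᵀX)⁻¹Xᵀy is easy to cross-check against a standard least-squares solver. A Python/NumPy sketch on synthetic data (all names are mine):

```python
import numpy as np

# Synthetic data with known coefficients and small noise
rng = np.random.default_rng(1)
m = 50
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)

# Normal-equation solution: θ = (X'X)⁻¹ X'y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against a standard least-squares solver
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes recover the coefficients; in practice, solving the linear system (or using a pseudoinverse, as the Scilab code below does) is preferred over explicitly inverting XᵀX.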
The code for the normal equation method is:
clear, clc
// Load data
filename = 'data.csv';
data = csvRead(filename);
[m, n] = size(data);
X = data(:, 1:n-1);
y = data(:, $);
// Add intercept term to X
X = [ones(m, 1) X];
theta = pinv(X' * X) * X' * y;
disp(theta)
In the non-linear case, I would also recommend going as far as you can analytically, switching to numerical methods only when doing it by hand is no longer possible.
This approach used to be obvious in the last century (when people were not afraid of mathematics and were able to send astronauts to the Moon), but as technology progresses, future generations will have to rediscover again and again that doing something by hand may in some cases be better than using a machine.