Approximate Solutions to Linear Regression Problems
Here we assume that we have a first order solution to a regression problem, in the form
Y = Σ wi Ri,
where Y is the response, wi are the regression coefficients, and Ri
are the independent variables. The number of variables is very high, and the independent variables are highly correlated.
We want to improve the model by considering a second order regression of the form
Y = Σ wi Ri + Σ wij cij mij Ri Rj ,
In practice, some of the Ris are highly correlated and grouped into clusters. These clusters can be identified
by using a clustering algorithm on the cijs. For example, one could think of a model with two clusters A and B such as
- cij = correlation between Ri and Rj
- wij = |wiwj|0.5 x sgn(wiwj)
- mij are arbitrary constants
Y = Σ wi Ri + mA ΣA wij cij Ri Rj
+ mB ΣB wij cij Ri Rj
An interesting case occurs when the cluster structure is
so strong that
- ΣA (resp. ΣB) are taken over all i < j belonging to A (resp. B)
- mij = mA (constant) if i, j belong to cluster A
- mij = mB (constant) if i, j belong to cluster B
This particular case results in
- |cij| = 1 if i and j belong to the same cluster (either A or B)
- cij = 0 otherwise
mA = 4 / [1 + (1+8kA)0.5]
mB = 4 / [1 + (1+8kB)0.5]
kA= ΣA |cij| and kB= ΣB |cij|.
If the cluster structure is moderately strong, with the correlations cij close to 1, -1 or 0, how accurate is the above formula
involving kA and kB? Here we assume that the wis are known or approximated. Typically, wi is a constant or
wi is a simple function of the correlation between Y and Ri.
Let us consider a simplified model involving one cluster, with mij = constant = m. For instance, the unique cluster could
consist of all variables i, j with |cij| > 0.70. The model can be written as
Y = Σ wi Ri + m Σ wij cij Ri Rj.
We want to find m that provides the best improvement over the first order model, in terms of residual error. The first order model corresponds to
m = 0.
Let us introduce the following notations:
Without loss of generality, let us consider the slightly modified (centered) model
- W = Σ wij cij Ri Rj,
- V = W - u, where u = average(W) (Thus V is the centered W, with mean 0),
- S= Σ wi Ri. (average(S) = average(Y) by construction)
Y = S + m V.
m = [ Transposed(V) x (Y-S) ] / [ Transposed(V) x V ],
- Y, S, and V are vectors with n rows,
- n is the number of observations.
The alternate approach could be incorporated in an iterative algorithm, where at each step a new cluster is added. So at each step we would have the same computation for m, optimizing the residual error on
Y = S + m V.
However this time, S would contain all the clusters detected during the previous step, and V would contain the new cluster
being added to the model.