Approximate Solutions to Linear Regression Problems
Here we assume that we have a first-order solution to a regression problem, of the form
Y = Σ w_{i} R_{i},
where Y is the response, w_{i} are the regression coefficients, and R_{i}
are the independent variables. The number of variables is very high, and the independent variables are highly correlated.
We want to improve the model by considering a second order regression of the form
Y = Σ w_{i} R_{i} + Σ w_{ij} c_{ij} m_{ij} R_{i} R_{j} ,
where
 c_{ij} = correlation between R_{i} and R_{j}
 w_{ij} = |w_{i} w_{j}|^{0.5} x sgn(w_{i} w_{j})
 m_{ij} are arbitrary constants
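As a concrete sketch, the interaction term above can be assembled directly from the data. The function below is a minimal illustration (the name `second_order_features` and the NumPy layout are my own; the m_{ij} are passed in as a matrix since they are arbitrary constants):

```python
import numpy as np

def second_order_features(R, w, m):
    """Compute the interaction term  sum_{i<j} w_ij * c_ij * m_ij * R_i * R_j.

    R : (n, p) matrix, one column per independent variable R_i
    w : (p,) first-order regression coefficients w_i
    m : (p, p) matrix of the arbitrary constants m_ij
    """
    n, p = R.shape
    c = np.corrcoef(R, rowvar=False)  # c_ij = correlation between R_i and R_j
    # w_ij = |w_i w_j|^0.5 * sgn(w_i w_j)
    ww = np.outer(w, w)
    w_ij = np.sqrt(np.abs(ww)) * np.sign(ww)
    term = np.zeros(n)
    for i in range(p):
        for j in range(i + 1, p):  # sum over pairs i < j
            term += w_ij[i, j] * c[i, j] * m[i, j] * R[:, i] * R[:, j]
    return term
```

With p = 2 variables the sum has a single term, which makes the construction easy to check by hand.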
In practice, some of the R_{i}s are highly correlated and grouped into clusters. These clusters can be identified
by using a clustering algorithm on the c_{ij}s. For example, one could think of a model with two clusters A and B such as
Y = Σ w_{i} R_{i} + m_{A} Σ_{A} w_{ij} c_{ij} R_{i} R_{j}
+ m_{B} Σ_{B} w_{ij} c_{ij} R_{i} R_{j}
where
 Σ_{A} (resp. Σ_{B}) are taken over all i < j belonging to A (resp. B)
 m_{ij} = m_{A} (constant) if i, j belong to cluster A
 m_{ij} = m_{B} (constant) if i, j belong to cluster B
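The text leaves the choice of clustering algorithm open. As one hedged illustration, a simple union-find pass over the thresholded c_{ij} graph already produces the clusters (the helper name `correlation_clusters` and the threshold parameter are my own; any clustering algorithm on the c_{ij}s would do):

```python
import numpy as np

def correlation_clusters(c, threshold=0.70):
    """Group variables into clusters: i and j land in the same cluster
    whenever they are connected through pairs with |c_ij| > threshold.
    Union-find on the thresholded correlation graph (an assumed choice,
    not the only possible clustering algorithm)."""
    p = c.shape[0]
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(c[i, j]) > threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(p)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]  # cluster label 0..k-1 per variable
```

On a correlation matrix with a clear two-block structure, the two blocks come out as clusters A and B.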
An interesting case occurs when the cluster structure is
so strong that
 c_{ij} = 1 if i and j belong to the same cluster (either A or B)
 c_{ij} = 0 otherwise
This particular case results in
m_{A} = 4 / [1 + (1+8k_{A})^{0.5}]
m_{B} = 4 / [1 + (1+8k_{B})^{0.5}]
where
k_{A} = Σ_{A} c_{ij} and k_{B} = Σ_{B} c_{ij}. (In this particular case, with c_{ij} = 1 within a cluster, k_{A} is simply the number of pairs i < j in cluster A, and similarly for k_{B}.)
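The closed-form values are easy to evaluate; a one-liner, taking the formula as stated (the function name is my own):

```python
import math

def m_cluster(k):
    """m_A = 4 / (1 + sqrt(1 + 8 k_A)), where k_A is the sum of c_ij over
    the cluster (the number of pairs when all within-cluster c_ij = 1)."""
    return 4.0 / (1.0 + math.sqrt(1.0 + 8.0 * k))
```

For example, a cluster with a single pair (k = 1) gives m = 4 / (1 + 3) = 1, and k = 3 gives m = 4 / (1 + 5) = 2/3.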
Question
If the cluster structure is moderately strong, with the correlations c_{ij} close to 1 or 0, how accurate is the above formula
involving k_{A} and k_{B}? Here we assume that the w_{i}s are known or approximated. Typically, w_{i} is a constant or
w_{i} is a simple function of the correlation between Y and R_{i}.
Alternate Approach
Let us consider a simplified model involving one cluster, with m_{ij} = constant = m. For instance, the unique cluster could
consist of all variables i, j with c_{ij} > 0.70. The model can be written as
Y = Σ w_{i} R_{i} + m Σ w_{ij} c_{ij} R_{i} R_{j}.
We want to find m that provides the best improvement over the first order model, in terms of residual error. The first order model corresponds to
m = 0.
Let us introduce the following notations:
 W = Σ w_{ij} c_{ij} R_{i} R_{j},
 V = W - u, where u = average(W) (thus V is the centered W, with mean 0),
 S = Σ w_{i} R_{i}. (average(S) = average(Y) by construction)
Without loss of generality, let us consider the slightly modified (centered) model
Y = S + m V.
Then
m = [ Transposed(V) x (Y - S) ] / [ Transposed(V) x V ],
where
 Y, S, and V are vectors with n rows,
 n is the number of observations.
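This is the ordinary least-squares slope of Y - S regressed on the centered predictor V. A minimal sketch of the computation (the function name `best_m` is my own):

```python
import numpy as np

def best_m(Y, S, W):
    """Optimal m for the model Y = S + m V, with V = W - average(W):
    the least-squares slope m = Vᵀ(Y - S) / (VᵀV)."""
    V = W - W.mean()                 # center W so that V has mean 0
    return V @ (Y - S) / (V @ V)    # closed-form OLS slope
```

If the data are generated exactly as Y = S + m V, the formula recovers m.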
Further Improvements
The alternate approach could be incorporated in an iterative algorithm where, at each step, a new cluster is added. Each step would perform the same computation for m, optimizing the residual error on
Y = S + m V.
However, this time S would contain all the clusters detected in previous steps, and V would contain the new cluster
being added to the model.
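A sketch of this greedy scheme, reusing the closed-form slope from the Alternate Approach at each step (names are illustrative; each W in the list is the interaction sum for one detected cluster):

```python
import numpy as np

def fit_clusters_iteratively(Y, S, cluster_terms):
    """Greedy sketch of the iterative scheme: for each new cluster term W,
    compute its centered version V, get the closed-form slope
    m = Vᵀ(Y - S) / (VᵀV), then absorb m V into S before the next step.

    cluster_terms : list of (n,) arrays, one W per detected cluster.
    Returns the list of fitted m values and the final fitted S."""
    S = S.copy()
    ms = []
    for W in cluster_terms:
        V = W - W.mean()
        m = V @ (Y - S) / (V @ V)
        S = S + m * V  # S now contains all clusters added so far
        ms.append(m)
    return ms, S
```

When the cluster terms are mutually orthogonal, the greedy pass recovers each slope exactly; with correlated terms, later steps only fit what earlier steps left in the residual.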
