Thursday, December 3, 2009

Generating correlated random variables using SAS

This code is based on the discussion on SITMO. It shows two ways to generate correlated random variables. For any valid (positive definite) correlation matrix C:

1) Find the Cholesky decomposition. In SAS, this is the root function in IML, which returns an upper triangular matrix U such that t(U)*U = C. Post-multiply a matrix of independent, randomly generated standard normal numbers (one series per column) by U.

2) Find the eigenvalues and eigenvectors. In SAS, this is call eigen in IML. Post-multiplying the matrix of eigenvectors by the diagonal matrix of the square roots of the eigenvalues gives a matrix V such that V*t(V) = C. Post-multiply the matrix of randomly generated numbers by the transpose of V.

In both cases, the product is a matrix whose columns are correlated series with population correlation matrix C.
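
To see why both methods work, the short derivation below may help. It is a sketch, not part of the SITMO discussion; the symbols Z (the matrix of random numbers), z (one of its rows written as a column vector), A (the matrix used in the multiplication), E (the eigenvector matrix) and \Lambda (the diagonal matrix of eigenvalues) are introduced here for illustration.

```latex
\begin{aligned}
&\text{Let } z \in \mathbb{R}^{k} \text{ be one row of } Z, \text{ with } \operatorname{Cov}(z) = I. \\
&\text{For } x = A^{\top} z \text{ (the matching row of } ZA\text{):}\quad
  \operatorname{Cov}(x) = A^{\top}\operatorname{Cov}(z)\,A = A^{\top}A. \\
&\text{Method 1: } A = U, \quad U^{\top}U = C. \\
&\text{Method 2: } C = E\Lambda E^{\top}, \quad V = E\Lambda^{1/2}, \quad A = V^{\top}
  \;\Rightarrow\; A^{\top}A = VV^{\top} = E\Lambda E^{\top} = C.
\end{aligned}
```

Because the diagonal of C is 1, the covariance matrix of the generated series is also their correlation matrix.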

The code:

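proc iml;
/* Target correlation matrix */
C = {1 0.6 0.3, 0.6 1 0.5, 0.3 0.5 1};
/* Method 1 uses the Cholesky decomposition: U is upper triangular and t(U)*U = C */
U = root(C);
/* Method 2 uses the eigenvalues and eigenvectors: v*t(v) = C */
call eigen(eival, eivec, C);
v = eivec * diag(sqrt(eival));
vt = t(v);
/* Fix the seed so the results are reproducible */
call randseed(12345);
/* Generate 3 independent standard normal series, each 500 in length */
randm = j(500, 3, .);
call randgen(randm, 'NORMAL');
/* Post-multiply the random matrix to induce the correlation */
corr = randm * U;
corrv = randm * vt;
/* Write the uncorrelated and correlated series to data sets */
create random_data from randm;
append from randm;
create correlated_data from corr;
append from corr;
create correlated_data_v from corrv;
append from corrv;
quit;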

title1 'Correlation of randomly generated data';
proc corr data = random_data;
run;

title1 'Correlation of data using Cholesky decomposition';
proc corr data = correlated_data;
run;

title1 'Correlation of data using Eigenvalue and Eigenvector decomposition';
proc corr data = correlated_data_v;
run;
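
As a check that does not depend on random sampling, the two factor matrices can be multiplied back together and compared with C. The following is a minimal sketch using the same C as above; the names errChol and errEig are introduced here for illustration only.

```sas
proc iml;
C = {1 0.6 0.3, 0.6 1 0.5, 0.3 0.5 1};

/* Cholesky root: upper triangular U with t(U)*U = C */
U = root(C);

/* Eigen factor: V = eigenvectors times diag(sqrt(eigenvalues)), so V*t(V) = C */
call eigen(eival, eivec, C);
V = eivec * diag(sqrt(eival));

/* Largest absolute difference between each reconstruction and C */
errChol = max(abs(t(U) * U - C));
errEig  = max(abs(V * t(V) - C));
print errChol errEig;
quit;
```

Both differences should be zero up to floating-point rounding. If either is not, the factor is not a valid square root of C and the generated series will not have the intended correlations.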

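Note that the sample correlation computed from 500 observations will not exactly match the C matrix, because of sampling error. The error shrinks roughly in proportion to 1/sqrt(n), so a longer series, e.g. 1000 or more, gives estimates closer to C.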
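
The effect of the series length can be examined with a sketch like the one below. It repeats the Cholesky method for several lengths and reports the largest absolute gap between the sample correlation matrix and C. The lengths in n and the name maxdev are illustrative choices, and the sketch assumes a SAS/IML release that provides the corr function.

```sas
proc iml;
C = {1 0.6 0.3, 0.6 1 0.5, 0.3 0.5 1};
U = root(C);
call randseed(12345);

/* Series lengths to compare (illustrative values) */
n = {500, 5000, 50000};
maxdev = j(nrow(n), 1, .);

do i = 1 to nrow(n);
   /* Independent standard normal series of length n[i] */
   z = j(n[i], 3, .);
   call randgen(z, 'NORMAL');
   /* Correlate the series with the Cholesky root */
   x = z * U;
   /* Largest absolute deviation of the sample correlation from C */
   maxdev[i] = max(abs(corr(x) - C));
end;

print n maxdev;
quit;
```

The deviation would be expected to shrink as n grows, although any single run is random and the decrease need not be monotone.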

Comments:

Justin said...

Just a note, if you set the correlations between different columns to all be the same:

The Cholesky decomposition method causes the columns to always be ordered as follows: COL1 < COL2 < COL3 in terms of their means. Unfortunately, the eigenvalue method has a similar problem and will always be ordered as follows: COL1 < COL3 < COL2. You need to be very aware of this when you use this method as it can introduce unwanted bias into your simulation study.