Nearest Correlation Matrix
Sometimes what we want really is the Spearman correlation. Then we don't need to do any Pearson matching. All we need to do is estimate/obtain the Spearman correlation of some data, convert it to Pearson, and then simulate. The resulting simulated data will have the same Spearman correlation as the one estimated from the data (up to stochastic error). The problem is that for high dimensional data, the Spearman or converted Pearson correlation matrix may not be positive semidefinite (PSD). The problem is then how to compute the nearest PSD correlation matrix.
We provide the function cor_nearPD
to handle this problem. It is based off of the work of Qi and Sun (2006), and is a quadratically convergent algorithm. Here we use BRCA data to show its use.
julia> m = Matrix(brca)
878×200 Matrix{Float64}: 2.42686e6 47131.0 292717.0 … 93811.0 34168.0 1814.0 719834.0 415707.0 338326.0 27934.0 41261.0 57961.0 1.67475e6 1.52156e6 556783.0 91540.0 61890.0 85477.0 1.14082e6 277162.0 380049.0 62468.0 30171.0 37379.0 660487.0 504250.0 313120.0 34433.0 32450.0 44306.0 1.37852e6 7.15225e6 796746.0 … 58779.0 46555.0 46230.0 577958.0 2.1654e6 398694.0 25625.0 31330.0 20169.0 1.0599e6 445787.0 496284.0 65047.0 38029.0 63358.0 1.70653e6 177563.0 285268.0 93722.0 147704.0 37.0 1.36036e6 2.21442e6 770635.0 64471.0 34574.0 77441.0 ⋮ ⋱ 133177.0 35773.0 55143.0 18901.0 13591.0 126.0 640157.0 1.44312e6 349451.0 … 26857.0 31804.0 52028.0 760560.0 526643.0 402973.0 69531.0 24371.0 42206.0 561799.0 147555.0 848469.0 47962.0 21519.0 333.0 245867.0 224245.0 213969.0 34973.0 31067.0 123707.0 633497.0 581455.0 692539.0 52434.0 22456.0 3441.0 687263.0 163202.0 620491.0 … 34539.0 20698.0 3417.0 946835.0 157578.0 287362.0 33074.0 69588.0 64613.0 499893.0 238058.0 395895.0 33389.0 26519.0 37366.0
julia> spearman_corr = cor(m, Spearman);
julia> pearson_corr = cor_convert(spearman_corr, Spearman, Pearson);
julia> isposdef(spearman_corr), isposdef(pearson_corr)
(true, false)
We see that the converted Pearson correlation matrix is no longer positve definite. This will result in a failure during the multivariate normal generation, particularly during the Cholesky decomposition.
julia> rvec(10, pearson_corr, margins)
ERROR: ArgumentError: `rho` must be a valid correlation matrix
We can fix this by computing the nearest PD correlation.
julia> adjusted_pearson_corr = cor_nearPD(pearson_corr);
julia> isposdef(adjusted_pearson_corr)
true
julia> rvec(10, adjusted_pearson_corr, margins)
10×200 Matrix{Float64}: 1.06629e6 86106.0 … 61005.0 54336.0 34171.0 6823.0 1.01967e6 138953.0 55049.0 73530.0 53421.0 70526.0 2.46427e6 790194.0 61833.0 71181.0 65816.0 6988.0 787046.0 465710.0 45077.0 38466.0 67846.0 65391.0 801175.0 647174.0 26739.0 33854.0 36668.0 77878.0 742760.0 97125.0 … 31898.0 22763.0 41289.0 11608.0 953193.0 1759.0 12919.0 26664.0 40230.0 1476.0 409232.0 54334.0 18361.0 12028.0 20449.0 20214.0 509655.0 3.9661e6 64242.0 40738.0 50559.0 96184.0 1.44324e6 678078.0 36502.0 28065.0 59367.0 25638.0
Benchmarking
What's more impressive is that computing the nearest correlation matrix in Julia is fast!
julia> @benchmark cor_nearPD(pearson_corr)
BenchmarkTools.Trial:
memory estimate: 7.15 MiB
allocs estimate: 160669
--------------
minimum time: 8.968 ms (0.00% GC)
median time: 9.274 ms (0.00% GC)
mean time: 9.752 ms (4.78% GC)
maximum time: 12.407 ms (23.64% GC)
--------------
samples: 513
evals/sample: 1
Let's scale up to a larger correlation matrix:
julia> m3000 = cor_randPSD(3000) |> m -> cor_convert(m, Spearman, Pearson)
3000×3000 Matrix{Float64}: 1.0 0.0372893 0.0161555 … -0.00387422 -0.0139672 0.0372893 1.0 0.0164445 0.00050544 0.0217751 0.0161555 0.0164445 1.0 0.00288241 0.026114 0.0514409 0.0496294 0.0131528 0.0352954 0.000741581 -0.00662413 0.0344721 0.0191696 -0.0112396 0.0149118 0.00298761 -0.0612864 -0.0137199 … -0.00644881 -0.0226285 0.0149567 0.00526769 0.00607779 -0.0289536 0.0220176 -0.0368099 0.0110296 -0.0293787 0.0142162 0.0428297 0.00146157 -0.00152596 -0.00083242 -0.0253234 0.00873587 0.00247538 0.0169011 -0.0340555 -0.00222749 -0.0345521 ⋮ ⋱ 0.00998338 0.009418 0.0226753 -0.00911558 -0.029194 -0.00724692 -0.00303705 0.0285578 0.020127 0.0042419 0.00036744 -0.00436533 -0.0176778 -0.0334419 -0.0129353 0.0153296 -0.0112785 -0.00630443 -0.0107837 -0.0203791 0.0285686 0.00267097 -0.00942461 … -0.0471443 0.00264967 0.0176413 0.0474747 0.0186195 -0.0106618 -0.0161374 -0.00345109 0.00239747 0.0046495 0.0052183 -0.0136783 -0.00387422 0.00050544 0.00288241 1.0 0.020131 -0.0139672 0.0217751 0.026114 0.020131 1.0
julia> m3000_PD = cor_nearPD(m3000);
julia> isposdef(m3000)
false
julia> isposdef(m3000_PD)
true
julia> @benchmark cor_nearPD(m3000)
BenchmarkTools.Trial:
memory estimate: 3.95 GiB
allocs estimate: 78597
--------------
minimum time: 10.648 s (2.66% GC)
median time: 10.648 s (2.66% GC)
mean time: 10.648 s (2.66% GC)
maximum time: 10.648 s (2.66% GC)
--------------
samples: 1
evals/sample: 1
~11 seconds^[using an AMD Ryzen 9 3900X with 32GB of RAM] to convert a 3000x3000 correlation matrix! This even beats previous benchmarks for a 3000x3000 randomly generated pseudo correlation matrix. Here is an excerpt from Defeng Sun's home page where his matlab code is:
For a randomly generated 3,000 by 3,000 pseudo correlation matrix (the code is insensitive to input data), the code needs 24 seconds to reach a solution with the relative duality gap less than 1.0e-3 after 3 iterations and 43 seconds with the relative duality gap less than 1.0e-10 after 6 iterations in my Dell Desktop with Intel (R) Core i7 processor.
We also offer a faster routine that gives up a little accuracy for speed. While cor_nearPD
finds the nearest correlation matrix to the input matrix, cor_fastPD
finds a positive definite correlation matrix that is close to the input matrix.
julia> @benchmark cor_fastPD(m3000) samples=10 seconds=30
BenchmarkTools.Trial:
memory estimate: 628.92 MiB
allocs estimate: 26040
--------------
minimum time: 1.949 s (0.00% GC)
median time: 1.998 s (1.15% GC)
mean time: 2.015 s (2.28% GC)
maximum time: 2.139 s (8.48% GC)
--------------
samples: 10
evals/sample: 1
And it's not too far off from the nearest:
julia> m3000_PD_fast = cor_fastPD(m3000);
julia> norm(m3000 - m3000_PD)
0.7355285242585915
julia> norm(m3000 - m3000_PD_fast)
0.7684890630458305