Nearest Correlation Matrix

Sometimes what we want really is the Spearman correlation. Then we don't need to do any Pearson matching. All we need to do is estimate/obtain the Spearman correlation of some data, convert it to Pearson, and then simulate. The resulting simulated data will have the same Spearman correlation as the one estimated from the data (up to stochastic error). The problem is that for high dimensional data, the Spearman or converted Pearson correlation matrix may not be positive semidefinite (PSD). The problem is then how to compute the nearest PSD correlation matrix.

We provide the function cor_nearPD to handle this problem. It is based off of the work of Qi and Sun (2006), and is a quadratically convergent algorithm. Here we use BRCA data to show its use.

julia> m = Matrix(brca)878×200 Matrix{Float64}:
      2.42686e6   47131.0        292717.0  …  93811.0   34168.0    1814.0
 719834.0        415707.0        338326.0     27934.0   41261.0   57961.0
      1.67475e6       1.52156e6  556783.0     91540.0   61890.0   85477.0
      1.14082e6  277162.0        380049.0     62468.0   30171.0   37379.0
 660487.0        504250.0        313120.0     34433.0   32450.0   44306.0
      1.37852e6       7.15225e6  796746.0  …  58779.0   46555.0   46230.0
 577958.0             2.1654e6   398694.0     25625.0   31330.0   20169.0
      1.0599e6   445787.0        496284.0     65047.0   38029.0   63358.0
      1.70653e6  177563.0        285268.0     93722.0  147704.0      37.0
      1.36036e6       2.21442e6  770635.0     64471.0   34574.0   77441.0
      ⋮                                    ⋱
 133177.0         35773.0         55143.0     18901.0   13591.0     126.0
 640157.0             1.44312e6  349451.0  …  26857.0   31804.0   52028.0
 760560.0        526643.0        402973.0     69531.0   24371.0   42206.0
 561799.0        147555.0        848469.0     47962.0   21519.0     333.0
 245867.0        224245.0        213969.0     34973.0   31067.0  123707.0
 633497.0        581455.0        692539.0     52434.0   22456.0    3441.0
 687263.0        163202.0        620491.0  …  34539.0   20698.0    3417.0
 946835.0        157578.0        287362.0     33074.0   69588.0   64613.0
 499893.0        238058.0        395895.0     33389.0   26519.0   37366.0
julia> spearman_corr = cor(m, Spearman);
julia> pearson_corr = cor_convert(spearman_corr, Spearman, Pearson);
julia> isposdef(spearman_corr), isposdef(pearson_corr)(true, false)

We see that the converted Pearson correlation matrix is no longer positve definite. This will result in a failure during the multivariate normal generation, particularly during the Cholesky decomposition.

julia> rvec(10, pearson_corr, margins)ERROR: ArgumentError: `rho` must be a valid correlation matrix

We can fix this by computing the nearest PD correlation.

julia> adjusted_pearson_corr = cor_nearPD(pearson_corr);
julia> isposdef(adjusted_pearson_corr)true
julia> rvec(10, adjusted_pearson_corr, margins)10×200 Matrix{Float64}: 1.06629e6 86106.0 … 61005.0 54336.0 34171.0 6823.0 1.01967e6 138953.0 55049.0 73530.0 53421.0 70526.0 2.46427e6 790194.0 61833.0 71181.0 65816.0 6988.0 787046.0 465710.0 45077.0 38466.0 67846.0 65391.0 801175.0 647174.0 26739.0 33854.0 36668.0 77878.0 742760.0 97125.0 … 31898.0 22763.0 41289.0 11608.0 953193.0 1759.0 12919.0 26664.0 40230.0 1476.0 409232.0 54334.0 18361.0 12028.0 20449.0 20214.0 509655.0 3.9661e6 64242.0 40738.0 50559.0 96184.0 1.44324e6 678078.0 36502.0 28065.0 59367.0 25638.0

Benchmarking

What's more impressive is that computing the nearest correlation matrix in Julia is fast!

julia> @benchmark cor_nearPD(pearson_corr)
BenchmarkTools.Trial: 
  memory estimate:  7.15 MiB
  allocs estimate:  160669
  --------------
  minimum time:     8.968 ms (0.00% GC)
  median time:      9.274 ms (0.00% GC)
  mean time:        9.752 ms (4.78% GC)
  maximum time:     12.407 ms (23.64% GC)
  --------------
  samples:          513
  evals/sample:     1

Let's scale up to a larger correlation matrix:

julia> m3000 = cor_randPSD(3000) |> m -> cor_convert(m, Spearman, Pearson)3000×3000 Matrix{Float64}:
  1.0          0.0372893    0.0161555   …  -0.00387422  -0.0139672
  0.0372893    1.0          0.0164445       0.00050544   0.0217751
  0.0161555    0.0164445    1.0             0.00288241   0.026114
  0.0514409    0.0496294    0.0131528       0.0352954    0.000741581
 -0.00662413   0.0344721    0.0191696      -0.0112396    0.0149118
  0.00298761  -0.0612864   -0.0137199   …  -0.00644881  -0.0226285
  0.0149567    0.00526769   0.00607779     -0.0289536    0.0220176
 -0.0368099    0.0110296   -0.0293787       0.0142162    0.0428297
  0.00146157  -0.00152596  -0.00083242     -0.0253234    0.00873587
  0.00247538   0.0169011   -0.0340555      -0.00222749  -0.0345521
  ⋮                                     ⋱
  0.00998338   0.009418     0.0226753      -0.00911558  -0.029194
 -0.00724692  -0.00303705   0.0285578       0.020127     0.0042419
  0.00036744  -0.00436533  -0.0176778      -0.0334419   -0.0129353
  0.0153296   -0.0112785   -0.00630443     -0.0107837   -0.0203791
  0.0285686    0.00267097  -0.00942461  …  -0.0471443    0.00264967
  0.0176413    0.0474747    0.0186195      -0.0106618   -0.0161374
 -0.00345109   0.00239747   0.0046495       0.0052183   -0.0136783
 -0.00387422   0.00050544   0.00288241      1.0          0.020131
 -0.0139672    0.0217751    0.026114        0.020131     1.0
julia> m3000_PD = cor_nearPD(m3000);
julia> isposdef(m3000)false
julia> isposdef(m3000_PD)true
julia> @benchmark cor_nearPD(m3000)
BenchmarkTools.Trial: 
  memory estimate:  3.95 GiB
  allocs estimate:  78597
  --------------
  minimum time:     10.648 s (2.66% GC)
  median time:      10.648 s (2.66% GC)
  mean time:        10.648 s (2.66% GC)
  maximum time:     10.648 s (2.66% GC)
  --------------
  samples:          1
  evals/sample:     1

~11 seconds^[using an AMD Ryzen 9 3900X with 32GB of RAM] to convert a 3000x3000 correlation matrix! This even beats previous benchmarks for a 3000x3000 randomly generated pseudo correlation matrix. Here is an excerpt from Defeng Sun's home page where his matlab code is:

For a randomly generated 3,000 by 3,000 pseudo correlation matrix (the code is insensitive to input data), the code needs 24 seconds to reach a solution with the relative duality gap less than 1.0e-3 after 3 iterations and 43 seconds with the relative duality gap less than 1.0e-10 after 6 iterations in my Dell Desktop with Intel (R) Core i7 processor.

We also offer a faster routine that gives up a little accuracy for speed. While cor_nearPD finds the nearest correlation matrix to the input matrix, cor_fastPD finds a positive definite correlation matrix that is close to the input matrix.

julia> @benchmark cor_fastPD(m3000) samples=10 seconds=30
BenchmarkTools.Trial: 
  memory estimate:  628.92 MiB
  allocs estimate:  26040
  --------------
  minimum time:     1.949 s (0.00% GC)
  median time:      1.998 s (1.15% GC)
  mean time:        2.015 s (2.28% GC)
  maximum time:     2.139 s (8.48% GC)
  --------------
  samples:          10
  evals/sample:     1

And it's not too far off from the nearest:

julia> m3000_PD_fast = cor_fastPD(m3000);
julia> norm(m3000 - m3000_PD)0.7355285242585915
julia> norm(m3000 - m3000_PD_fast)0.7684890630458305