Pearson Matching
Correlation Conversion
Let's say we really wanted to estimate the Spearman correlation between the temperature and ozone.
julia> ρ_s = cor(Matrix(df), Spearman)
2×2 Array{Float64,2}:
1.0 0.774043
0.774043 1.0
julia> cor_bounds(margins[1], margins[2], Spearman)
(lower = -1.0, upper = 1.0)
If we just use the Spearman correlation when we simulate data, then the errors are double.
- The NORTA algorithm is expecting a Pearson correlation
- The non-linear transformation in the NORTA step does not guarantee that the input correlation is the same as the output.
julia> D2 = MvDistribution(ρ_s, margins, Spearman);
julia> x_2 = rand(D2, 1_000_000);
julia> cor(x_2, Spearman)
2×2 Array{Float64,2}:
1.0 0.758205
0.758205 1.0
Let's try to address 1 and convert the Spearman correlation to a Pearson correlation.
julia> ρ_p = cor_convert(ρ_s, Spearman, Pearson);
julia> D3 = MvDistribution(ρ_p, margins, Pearson);
julia> x_3 = rand(D3, 1_000_000);
ERROR: MvSim.ValidCorrelationError()
julia> cor(x_3, Pearson)
ERROR: UndefVarError: x_3 not defined
julia> cor(x_3, Spearman)
ERROR: UndefVarError: x_3 not defined
Notice that the estimated Pearson correlation is lower than the target Pearson correlation, but the estimated Spearman correlation is essentially the same as the target. This is because the transformation in the NORTA step is monotonic, which means that rank-based correlations are preserved. As a consequence, we can match the Spearman correlation exactly (up to stochastic error), but not the Pearson.
Pearson Matching
We can overcome this limitation by employing a Pearson matching algorithm that determines the necessary input correlation in order to achieve the target correlation. Let's now address 2.
julia> D4 = pearson_match(D2);
julia> cor(D4)
2×2 Array{Float64,2}:
1.0 0.942665
0.942665 1.0
Notice the signficant change in the input correlation!
julia> x_4 = rand(D4, 1_000_000);
julia> cor(x_4, Pearson)
2×2 Array{Float64,2}:
1.0 0.77293
0.77293 1.0
But the estimated correlation is nearly spot on to the [converted] Pearson correlation (ρ_p).
A better example is using the MvDistribution
. We never estimated the correlation after simulating, so let's look at that now.
julia> cor(rand(D, 1_000_000))
2×2 Array{Float64,2}:
1.0 0.574189
0.574189 1.0
compared to the target correlation:
julia> ρ
2×2 Array{Float64,2}:
1.0 0.69836
0.69836 1.0
The estimated correlation is much too low. Let's do some Pearson matching and observe the results.
julia> D5 = pearson_match(D);
julia> x_5 = rand(D5, 1_000_000);
julia> cor(x_5)
2×2 Array{Float64,2}:
1.0 0.697994
0.697994 1.0
Now the simulated data results in a correlation structure that exactly matches the target Pearson correlation!