Getting Started
We’re going to show the basic use and syntax of MvSim by using the New York air quality data set (airquality) included in the RDatasets package. We will focus specifically on the temperature (degrees Fahrenheit) and ozone level (parts per billion).
using MvSim, Distributions
using RDatasets, DataFrames, Statistics
df = dataset("datasets", "airquality")[:, [:Ozone, :Temp]] |> dropmissing
| Row | Ozone | Temp | | | Int64 | Int64 | |-----|-------|-------| | 1 | 41 | 67 | | 2 | 36 | 72 | | 3 | 12 | 74 | | ⋮ | ⋮ | ⋮ | | 114 | 14 | 75 | | 115 | 18 | 76 | | 116 | 20 | 68 | 110 rows omitted
Let’s look at the joint distribution of the Ozone and Temperature
We can see that not all margins are normally distributed; the ozone level is highly skewed. Though we don’t know the true distribution of ozone levels, we can go forward assuming that it is log-normally distributed.
To simulate observations from this joint distribution, we need to estimate the correlation and the marginal parameters.
Estimating Correlation
To estimate the correlation, we use cor
with an argument specifying the type of correlation to estimate. The options are Pearson
, Spearman
, or Kendall
.
julia> ρ = cor(Matrix(df), Pearson)
2×2 Array{Float64,2}:
1.0 0.69836
0.69836 1.0
Defining Marginal Distributions
Next we can estimate the marginal parameters. Assuming that the Temperature
is normally distributed, it has parameters:
julia> μ_Temp = mean(df.Temp)
77.87068965517241
julia> σ_Temp = std(df.Temp)
9.48548563759966
and assuming that Ozone
is log-normally distributed, it has parameters:
julia> μ_Ozone = mean(log.(df.Ozone))
3.4185151008120065
julia> σ_Ozone = sqrt(mean((log.(df.Ozone) .- mean(log.(df.Ozone))).^2))
0.8617359690270705
Finally we take the parameters and put them into a vector of margins:
julia> margins = [Normal(μ_Temp, σ_Temp), LogNormal(μ_Ozone, σ_Ozone)]
2-element Array{Distribution{Univariate,Continuous},1}:
Normal{Float64}(μ=77.87068965517241, σ=9.48548563759966)
LogNormal{Float64}(μ=3.4185151008120065, σ=0.8617359690270705)
Multivariate Distribution
While the individual components can be used separately within the package, they work best when joined together into a MvDistribution
data type:
julia> D = MvDistribution(ρ, margins, Pearson);
Correlation Bounds
Given a vector of margins, the theoretical lower and upper correlation coefficients can be estimated using simulation:
julia> lower, upper = cor_bounds(D);
julia> lower
2×2 Array{Float64,2}:
1.0 -0.824896
-0.824896 1.0
julia> upper
2×2 Array{Float64,2}:
1.0 0.822551
0.822551 1.0
The pearson_bounds
function uses more sophisticated methods to determine the theoretical lower and upper Pearson correlation bounds. It also requires more computational time.
julia> lower, upper = pearson_bounds(D);
julia> lower
2×2 Array{Float64,2}:
1.0 -0.821122
-0.821122 1.0
julia> upper
2×2 Array{Float64,2}:
1.0 0.821122
0.821122 1.0
Simulating Multivariate Data
Let’s now simulate 10,000 observations from the joint distribution using rvec
:
julia> x = rvec(10_000, ρ, margins)
10000×2 Array{Float64,2}:
73.4429 16.1916
82.3632 24.7819
90.4179 39.5605
84.6615 77.7587
85.5435 41.6268
89.3857 47.055
88.5116 19.0025
85.2236 20.522
82.812 163.006
67.5972 21.8145
⋮
81.8391 44.4356
87.324 40.9421
81.4798 26.3412
81.8011 67.0453
90.2719 250.656
83.5681 38.44
83.806 24.1448
77.6699 21.0085
62.6525 6.07819
Alternatively we can use rand
with the MvDistribution
type:
julia> rand(D, 10)
10×2 Array{Float64,2}:
74.07 95.3598
48.1756 3.83217
67.0746 16.0237
87.163 47.5094
93.1557 71.9095
60.2451 10.2295
76.2776 74.8223
95.8442 142.933
82.7193 72.52
82.0985 37.1756
Visualizing Bivariate Data
df_sim = DataFrame(x, [:Temp, :Ozone]);
histogram2d(df_sim.:Temp, df_sim.:Ozone, nbins=250, legend=false,
xlims=extrema(df.:Temp) .+ (-10, 10),
ylims=extrema(df.:Ozone) .+ (0, 20))
Compared to Uncorrelated Samples
We can compare the bivariate distribution above to one where no correlation is taken into account.
df_sim2 = DataFrame(
Temp = rand(margins[1], 10000),
Ozone = rand(margins[2], 10000)
);
histogram2d(df_sim2.:Temp, df_sim2.:Ozone, nbins=250, legend=false,
xlims=extrema(df.:Temp) .+ (-10, 10),
ylims=extrema(df.:Ozone) .+ (0, 20))