Getting Started

We’re going to show the basic use and syntax of MvSim by using the New York air quality data set (airquality) included in the RDatasets package. We will focus specifically on the temperature (degrees Fahrenheit) and ozone level (parts per billion).

using MvSim, Distributions
using RDatasets, DataFrames, Statistics
df = dataset("datasets", "airquality")[:, [:Ozone, :Temp]] |> dropmissing
| Row | Ozone |  Temp |
|     | Int64 | Int64 |
|-----|-------|-------|
|   1 |    41 |    67 |
|   2 |    36 |    72 |
|   3 |    12 |    74 |
|  ⋮  |   ⋮   |   ⋮   |
| 114 |    14 |    75 |
| 115 |    18 |    76 |
| 116 |    20 |    68 |
       110 rows omitted

Let’s look at the joint distribution of the Ozone and Temperature

We can see that not all margins are normally distributed; the ozone level is highly skewed. Though we don’t know the true distribution of ozone levels, we can go forward assuming that it is log-normally distributed.

To simulate observations from this joint distribution, we need to estimate the correlation and the marginal parameters.

Estimating Correlation

To estimate the correlation, we use cor with an argument specifying the type of correlation to estimate. The options are Pearson, Spearman, or Kendall.

julia> ρ = cor(Matrix(df), Pearson)
2×2 Array{Float64,2}:
 1.0      0.69836
 0.69836  1.0

Defining Marginal Distributions

Next we can estimate the marginal parameters. Assuming that the Temperature is normally distributed, it has parameters:

julia> μ_Temp = mean(df.Temp)
77.87068965517241

julia> σ_Temp = std(df.Temp)
9.48548563759966

and assuming that Ozone is log-normally distributed, it has parameters:

julia> μ_Ozone = mean(log.(df.Ozone))
3.4185151008120065

julia> σ_Ozone = sqrt(mean((log.(df.Ozone) .- mean(log.(df.Ozone))).^2))
0.8617359690270705

Finally we take the parameters and put them into a vector of margins:

julia> margins = [Normal(μ_Temp, σ_Temp), LogNormal(μ_Ozone, σ_Ozone)]
2-element Array{Distribution{Univariate,Continuous},1}:
 Normal{Float64}(μ=77.87068965517241, σ=9.48548563759966)
 LogNormal{Float64}(μ=3.4185151008120065, σ=0.8617359690270705)

Multivariate Distribution

While the individual components can be used separately within the package, they work best when joined together into a MvDistribution data type:

julia> D = MvDistribution(ρ, margins, Pearson);

Correlation Bounds

Given a vector of margins, the theoretical lower and upper correlation coefficients can be estimated using simulation:

julia> lower, upper = cor_bounds(D);

julia> lower
2×2 Array{Float64,2}:
  1.0       -0.824896
 -0.824896   1.0

julia> upper
2×2 Array{Float64,2}:
 1.0       0.822551
 0.822551  1.0

The pearson_bounds function uses more sophisticated methods to determine the theoretical lower and upper Pearson correlation bounds. It also requires more computational time.

julia> lower, upper = pearson_bounds(D);

julia> lower
2×2 Array{Float64,2}:
  1.0       -0.821122
 -0.821122   1.0

julia> upper
2×2 Array{Float64,2}:
 1.0       0.821122
 0.821122  1.0

Simulating Multivariate Data

Let’s now simulate 10,000 observations from the joint distribution using rvec:

julia> x = rvec(10_000, ρ, margins)
10000×2 Array{Float64,2}:
 73.4429   16.1916
 82.3632   24.7819
 90.4179   39.5605
 84.6615   77.7587
 85.5435   41.6268
 89.3857   47.055
 88.5116   19.0025
 85.2236   20.522
 82.812   163.006
 67.5972   21.8145
  ⋮
 81.8391   44.4356
 87.324    40.9421
 81.4798   26.3412
 81.8011   67.0453
 90.2719  250.656
 83.5681   38.44
 83.806    24.1448
 77.6699   21.0085
 62.6525    6.07819

Alternatively we can use rand with the MvDistribution type:

julia> rand(D, 10)
10×2 Array{Float64,2}:
 74.07     95.3598
 48.1756    3.83217
 67.0746   16.0237
 87.163    47.5094
 93.1557   71.9095
 60.2451   10.2295
 76.2776   74.8223
 95.8442  142.933
 82.7193   72.52
 82.0985   37.1756

Visualizing Bivariate Data

df_sim = DataFrame(x, [:Temp, :Ozone]);

histogram2d(df_sim.:Temp, df_sim.:Ozone, nbins=250, legend=false,
			xlims=extrema(df.:Temp) .+ (-10, 10),
			ylims=extrema(df.:Ozone) .+ (0, 20))

Compared to Uncorrelated Samples

We can compare the bivariate distribution above to one where no correlation is taken into account.

df_sim2 = DataFrame(
	Temp  = rand(margins[1], 10000),
	Ozone = rand(margins[2], 10000)
);

histogram2d(df_sim2.:Temp, df_sim2.:Ozone, nbins=250, legend=false,
			xlims=extrema(df.:Temp) .+ (-10, 10),
			ylims=extrema(df.:Ozone) .+ (0, 20))