Getting Started

We’re going to show the basic use and syntax of Bigsimr by using the New York air quality data set (airquality) included in the RDatasets package. We will focus specifically on the temperature (degrees Fahrenheit) and ozone level (parts per billion).

using Distributions, Bigsimr
using RDatasets, DataFrames, Statistics

julia> df = dataset("datasets", "airquality")[:, [:Ozone, :Temp]] |> dropmissing;

| Row | Ozone |  Temp |
|     | Int64 | Int64 |
|-----|-------|-------|
|   1 |    41 |    67 |
|   2 |    36 |    72 |
|   3 |    12 |    74 |
|  ⋮  |   ⋮   |   ⋮   |
| 114 |    14 |    75 |
| 115 |    18 |    76 |
| 116 |    20 |    68 |
       110 rows omitted

Let’s look at the joint distribution of the Ozone and Temperature

We can see that not all margins are normally distributed; the ozone level is highly skewed. Though we don’t know the true distribution of ozone levels, we can go forward assuming that it is log-normally distributed.

To simulate observations from this joint distribution, we need to estimate the correlation and the marginal parameters.

Estimating Correlation

To estimate the correlation, we use cor with an argument specifying the type of correlation to estimate. The options are Pearson, Spearman, or Kendall.

julia> ρ = cor(Pearson, Matrix(df))2×2 Matrix{Float64}:
 1.0      0.69836
 0.69836  1.0

Defining Marginal Distributions

Next we can estimate the marginal parameters. Assuming that the Temperature is normally distributed, it has parameters:

julia> μ_Temp = mean(df.Temp)77.87068965517241
julia> σ_Temp = std(df.Temp)9.48548563759966

and assuming that Ozone is log-normally distributed, it has parameters:

julia> μ_Ozone = mean(log.(df.Ozone))3.418515100812007
julia> σ_Ozone = sqrt(mean((log.(df.Ozone) .- mean(log.(df.Ozone))).^2))0.8617359690270703

Finally we take the parameters and put them into a vector of margins:

julia> margins = [Normal(μ_Temp, σ_Temp), LogNormal(μ_Ozone, σ_Ozone)]2-element Vector{Distributions.Distribution{Distributions.Univariate, Distributions.Continuous}}:
 Distributions.Normal{Float64}(μ=77.87068965517241, σ=9.48548563759966)
 Distributions.LogNormal{Float64}(μ=3.418515100812007, σ=0.8617359690270703)

Correlation Bounds

Given a vector of margins, the theoretical lower and upper correlation coefficients can be estimated using simulation:

julia> lower, upper = cor_bounds(margins, Pearson);
julia> lower2×2 Matrix{Float64}:
  1.0       -0.823041
 -0.823041   1.0
julia> upper2×2 Matrix{Float64}:
 1.0       0.823137
 0.823137  1.0

The pearson_bounds function uses more sophisticated methods to determine the theoretical lower and upper Pearson correlation bounds. It also requires more computational time.

julia> lower, upper = pearson_bounds(margins);
julia> lower2×2 Matrix{Float64}:
  1.0       -0.821122
 -0.821122   1.0
julia> upper2×2 Matrix{Float64}:
 1.0       0.821122
 0.821122  1.0

Simulating Multivariate Data

Let’s now simulate 10,000 observations from the joint distribution using rvec:

julia> x = rvec(10_000, ρ, margins)10000×2 Matrix{Float64}:
 72.5643  10.864
 79.0325   9.52456
 78.8821  48.4813
 80.376   10.3785
 89.3455  63.7452
 86.6853  24.2706
 69.435   17.6312
 80.6068  31.7706
 77.9444  53.3
 75.496   10.5896
  ⋮
 88.4384  45.3207
 84.5002  60.6812
 76.4996  42.1385
 85.4416  72.3477
 72.8968   8.38938
 78.2151  66.3718
 76.5695  19.9163
 67.4049  13.8964
 75.2277  16.1328

Visualizing Bivariate Data

df_sim = DataFrame(x, [:Temp, :Ozone]);

histogram2d(df_sim.:Temp, df_sim.:Ozone, nbins=250, legend=false,
			xlims=extrema(df.:Temp) .+ (-10, 10),
			ylims=extrema(df.:Ozone) .+ (0, 20))

Compared to Uncorrelated Samples

We can compare the bivariate distribution above to one where no correlation is taken into account.

df_sim2 = DataFrame(
	Temp  = rand(margins[1], 10000),
	Ozone = rand(margins[2], 10000)
);

histogram2d(df_sim2.:Temp, df_sim2.:Ozone, nbins=250, legend=false,
			xlims=extrema(df.:Temp) .+ (-10, 10),
			ylims=extrema(df.:Ozone) .+ (0, 20))