library(distplyr)
##
## Attaching package: 'distplyr'
## The following object is masked from 'package:stats':
##
##     sd
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)

One purpose of distplyr is to handle the menial distribution-related calculations for you. Specify a distribution once, and you no longer need to keep track of its parameters and formulas yourself.

For example, suppose you want to compute the variance of a Uniform(-1, 1) distribution, get its 0.25- and 0.75-quantiles, and generate a sample of size 10.

Without distplyr:

a <- -1
b <- 1
# Look up formula for variance:
(b - a) ^ 2 / 12
## [1] 0.3333333
# Get quantiles:
qunif(c(0.25, 0.75), min = a, max = b)
## [1] -0.5  0.5
# Get sample of size 10:
runif(10, min = a, max = b)
##  [1]  0.1174779 -0.9531106 -0.6852088  0.9290763 -0.7067272  0.9681273
##  [7] -0.3001042  0.2677890 -0.5263214  0.4327520

With distplyr:

d <- dst_unif(-1, 1)
variance(d)
## [1] 0.3333333
eval_quantile(d, at = c(0.25, 0.75))
## [1] -0.5  0.5
realise(d, 10)
##  [1]  0.5883782 -0.9982469  0.6478184 -0.6376378 -0.4005474 -0.9865743
##  [7] -0.7744609 -0.6609212  0.4301258  0.2135421

## Functional Representations of a Distribution

A distribution can be represented by different functions, such as a density function, a cumulative distribution function, and others. In distplyr, you can:

• evaluate the function directly by calling eval_*;
• evaluate the function and enframe results together with the function arguments with enframe_*; or
• get the function itself using get_*.

Here are the representations and the corresponding distplyr functions:

| Quantity | distplyr Functions |
|---|---|
| Cumulative Distribution Function | eval_cdf(), get_cdf(), enframe_cdf() |
| Survival Function | eval_survival(), get_survival(), enframe_survival() |
| Quantile Function | eval_quantile(), get_quantile(), enframe_quantile() |
| Hazard Function | eval_hazard(), get_hazard(), enframe_hazard() |
| Cumulative Hazard Function | eval_chf(), get_chf(), enframe_chf() |
| Probability Density Function | eval_density(), get_density(), enframe_density() |
| Probability Mass Function | eval_pmf(), get_pmf(), enframe_pmf() |

These functions all take a distribution object as their first argument, and eval_* and enframe_* have a second argument named at indicating where to evaluate the function. The at argument is vectorized.
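The three call patterns can be illustrated with a minimal base-R sketch for the CDF of a Uniform(-1, 1) distribution. The `*_sketch` helpers and the `.cdf` column name are hypothetical (the column name is modeled on the `.arg` and `.density` columns that appear in the enframed output in this document). This is not how distplyr is implemented, only an illustration of what "evaluate", "get", and "enframe" each return.

```r
# Hypothetical base-R stand-ins for the eval_* / get_* / enframe_* pattern,
# hard-wired to the CDF of a Uniform(min, max) distribution.
eval_cdf_sketch <- function(at, min = -1, max = 1) {
  # Evaluate directly: returns one CDF value per element of `at`.
  punif(at, min = min, max = max)
}

get_cdf_sketch <- function(min = -1, max = 1) {
  # Return the CDF itself as a function, to be called later.
  function(at) punif(at, min = min, max = max)
}

enframe_cdf_sketch <- function(at, min = -1, max = 1) {
  # Pair each argument with its CDF value in a data frame.
  data.frame(.arg = at, .cdf = punif(at, min = min, max = max))
}

at <- c(-1, 0, 0.5, 1)
cdf_values <- eval_cdf_sketch(at)    # numeric vector, vectorized over `at`
cdf_fun    <- get_cdf_sketch()       # a function of `at`
cdf_table  <- enframe_cdf_sketch(at) # arguments and values side by side
```

Calling `cdf_fun(at)` gives the same numbers as `cdf_values`; the "get" form is convenient when the function has to be passed on to something else, such as a plotting or integration routine.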

Here is an example of evaluating the hazard function of a Uniform(-1, 1) distribution and enframing its density:

eval_hazard(d, at = 0:10)
## [1]   1 Inf NaN NaN NaN NaN NaN NaN NaN NaN NaN
enframe_density(d, at = 0:10)
## # A tibble: 11 x 2
##     .arg .density
##    <int>    <dbl>
##  1     0      0.5
##  2     1      0.5
##  3     2      0
##  4     3      0
##  5     4      0
##  6     5      0
##  7     6      0
##  8     7      0
##  9     8      0
## 10     9      0
## 11    10      0
set.seed(10)
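The `1 Inf NaN ...` pattern in the hazard output is easier to read with the standard definition of the hazard as density divided by survival. The base-R sketch below reproduces that pattern for Uniform(-1, 1); whether distplyr computes the hazard this way internally is an assumption.

```r
# Hazard = density / survival, evaluated with base-R uniform functions.
at <- 0:10
density_vals  <- dunif(at, min = -1, max = 1)
survival_vals <- punif(at, min = -1, max = 1, lower.tail = FALSE)

# At 0:    0.5 / 0.5 = 1
# At 1:    0.5 / 0   = Inf  (dunif includes the endpoint; survival is 0)
# Above 1: 0   / 0   = NaN  (both density and survival are 0)
hazard_vals <- density_vals / survival_vals
```

The `NaN` values reflect that the hazard is undefined beyond the upper end of the support, where there is no remaining probability to condition on.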

The enframe_*() functions work particularly well with tibbles and tidyr::unnest():

half_marathon <- tribble(
  ~ person, ~ race_time_min,
  "Vincenzo", dst_norm(130, 25),
  "Colleen", dst_norm(110, 13),
  "Regina", dst_norm(115, 20)
)
half_marathon %>%
  mutate(quartiles = map(race_time_min, enframe_quantile, at = 1:3 / 4)) %>%
  unnest(quartiles)
## # A tibble: 9 x 4
##   person   race_time_min  .arg .quantile
##   <chr>    <list>        <dbl>     <dbl>
## 1 Vincenzo <norm>         0.25      127.
## 2 Vincenzo <norm>         0.5       130
## 3 Vincenzo <norm>         0.75      133.
## 4 Colleen  <norm>         0.25      108.
## 5 Colleen  <norm>         0.5       110
## 6 Colleen  <norm>         0.75      112.
## 7 Regina   <norm>         0.25      112.
## 8 Regina   <norm>         0.5       115
## 9 Regina   <norm>         0.75      118.
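As a cross-check, the quartiles above can be reproduced in base R. The sketch assumes that the second argument of dst_norm() is a variance rather than a standard deviation; that reading is inferred from the printed quartiles (for example, 130 ± 0.674 × 5 ≈ 127 and 133) and is not stated in the text. The `runners` data frame is a hypothetical stand-in for the tibble.

```r
# Hypothetical plain-data version of the half_marathon table.
runners <- data.frame(
  person   = c("Vincenzo", "Colleen", "Regina"),
  mean     = c(130, 110, 115),
  variance = c(25, 13, 20)  # assumption: dst_norm()'s second argument is a variance
)

probs <- 1:3 / 4

# One small data frame of quartiles per runner, stacked row-wise.
quartiles <- do.call(rbind, lapply(seq_len(nrow(runners)), function(i) {
  data.frame(
    person    = runners$person[i],
    .arg      = probs,
    .quantile = qnorm(probs, mean = runners$mean[i], sd = sqrt(runners$variance[i]))
  )
}))
```

The result has nine rows, three per runner, mirroring the shape produced by enframing the quantiles and unnesting.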

## Drawing a Random Sample

To draw a random sample from a distribution, use the realise() or realize() function:

realise(d, n = 5)
## [1]  0.01495641 -0.38646299 -0.14618467  0.38620416 -0.82972806

You can read this call as “realise distribution d five times”. By default, n is 1, so realising a distribution returns a single numeric draw:

realise(d)
## [1] -0.5491268

This default is especially useful when working with distributions in a tibble:

half_marathon %>%
  mutate(actual_time_min = map_dbl(race_time_min, realise))
## # A tibble: 3 x 3
##   person   race_time_min actual_time_min
##   <chr>    <list>                  <dbl>
## 1 Vincenzo <norm>                   127.
## 2 Colleen  <norm>                   111.
## 3 Regina   <norm>                   117.
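For comparison, a rough base-R equivalent of drawing one race time per runner is sketched below. It again assumes that the second dst_norm() argument is a variance, and the seed is arbitrary, so the drawn values will not match the output above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

means     <- c(Vincenzo = 130, Colleen = 110, Regina = 115)
variances <- c(25, 13, 20)  # assumption: variances, not standard deviations

# rnorm() is vectorized over mean and sd, so this draws exactly one value
# per runner, much like mapping a one-draw sampler over a column.
actual_time_min <- rnorm(n = length(means), mean = means, sd = sqrt(variances))
```

Without a distribution object, the parameters have to be carried around and converted by hand, which is the bookkeeping the distribution column avoids.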

Perhaps surprisingly, distplyr does not consider realise() a functional representation of a distribution, even though random sampling belongs to the same family as the stats::p*/d*/q*/r* functions. This is because a finite sample cannot perfectly describe the distribution it was drawn from.
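A small base-R sketch makes the point concrete: quantities estimated from a sample only approximate the true ones. The seed and sample size here are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed
small_sample <- runif(10, min = -1, max = 1)

# True variance of Uniform(-1, 1) versus the estimate from ten draws.
true_variance   <- (1 - (-1))^2 / 12
sample_variance <- var(small_sample)

# The empirical CDF is a step function with jumps of 1/10 at the sampled
# points, whereas the true CDF rises linearly from 0 to 1 on [-1, 1].
empirical_cdf   <- ecdf(small_sample)
cdf_gap_at_zero <- empirical_cdf(0) - punif(0, min = -1, max = 1)
```

Different samples give different estimates, so a sample identifies the distribution only approximately, whereas a CDF, quantile function, or density pins it down exactly.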

## Properties of Distributions

Distributions have various numeric properties. Common examples are the mean and variance, but there are many others as well.

Below is a table of the properties incorporated in distplyr:

| Property | distplyr Function |
|---|---|
| Mean | mean() |
| Median | median() |
| Mode | mode() |
| Variance | variance() |
| Standard Deviation | sd() |
| Inter-Quartile Range | iqr() |
| Entropy | entropy() |
| Skewness | skewness() |
| Excess Kurtosis | kurtosis_exc() |
| Kurtosis | kurtosis_raw() |
| Extreme Value (Tail) Index | evi() |

Here are some properties of our original Uniform(-1, 1) distribution:

mean(d)
## [1] 0
sd(d)
## [1] 0.5773503
evi(d)
## [1] -1
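Several of these properties have closed forms for a Uniform(a, b) distribution, which can be checked by hand in base R. The sketch below uses the textbook formulas; the entropy is the differential entropy in nats, and whether entropy() uses the same units is an assumption.

```r
a <- -1
b <- 1

unif_mean    <- (a + b) / 2                                   # 0
unif_sd      <- sqrt((b - a)^2 / 12)                          # about 0.5773503
unif_iqr     <- diff(qunif(c(0.25, 0.75), min = a, max = b))  # 0.5 - (-0.5) = 1
unif_entropy <- log(b - a)                                    # differential entropy, in nats
```

The mean and standard deviation agree with the mean(d) and sd(d) output above.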