library(distplyr)
## 
## Attaching package: 'distplyr'
## The following object is masked from 'package:stats':
## 
##     sd
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

One purpose of distplyr is to handle the menial distribution-related calculations for you. Just specify a distribution once, and there is no need to manage its components anymore.

Example: suppose you want to compute the variance of a Uniform(-1, 1) distribution, get its 0.25- and 0.75-quantiles, and generate a sample of size 10.

Without distplyr:

a <- -1
b <- 1
# Look up formula for variance:
(b - a) ^ 2 / 12
## [1] 0.3333333
# Get quantiles:
qunif(c(0.25, 0.75), min = a, max = b)
## [1] -0.5  0.5
# Get sample of size 10:
runif(10, min = a, max = b)
##  [1]  0.1174779 -0.9531106 -0.6852088  0.9290763 -0.7067272  0.9681273
##  [7] -0.3001042  0.2677890 -0.5263214  0.4327520

With distplyr:

d <- dst_unif(-1, 1)
variance(d)
## [1] 0.3333333
eval_quantile(d, at = c(0.25, 0.75))
## [1] -0.5  0.5
realise(d, 10)
##  [1]  0.5883782 -0.9982469  0.6478184 -0.6376378 -0.4005474 -0.9865743
##  [7] -0.7744609 -0.6609212  0.4301258  0.2135421

Functional Representations of a Distribution

A distribution can be represented by different functions, such as a density function, a cumulative distribution function, and others. In distplyr, you can:

  • evaluate the function directly by calling eval_*;
  • evaluate the function and enframe results together with the function arguments with enframe_*; or
  • get the function itself using get_*.

Here are the representations and the corresponding distplyr functions:

Quantity                          distplyr Functions
Cumulative Distribution Function  eval_cdf(), get_cdf(), enframe_cdf()
Survival Function                 eval_survival(), get_survival(), enframe_survival()
Quantile Function                 eval_quantile(), get_quantile(), enframe_quantile()
Hazard Function                   eval_hazard(), get_hazard(), enframe_hazard()
Cumulative Hazard Function        eval_chf(), get_chf(), enframe_chf()
Probability Density Function      eval_density(), get_density(), enframe_density()
Probability Mass Function         eval_pmf(), get_pmf(), enframe_pmf()

These functions all take a distribution object as their first argument, and eval_* and enframe_* have a second argument named at indicating where to evaluate the function. The at argument is vectorized.
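The difference between evaluating, getting, and enframing can be illustrated without distplyr. The following is a minimal base-R sketch, not distplyr's implementation; the helper name get_cdf_unif and the column name .cdf are hypothetical and chosen only to mirror the naming used above.

```r
# Hypothetical helper (not part of distplyr): return the cdf of a
# Uniform(min, max) distribution as a function, analogous to get_*.
get_cdf_unif <- function(min, max) {
  function(x) punif(x, min = min, max = max)
}

cdf <- get_cdf_unif(-1, 1)

# Analogous to eval_*: evaluate the function at a vector of points.
at <- c(-1, 0, 1)
evaluated <- cdf(at)

# Analogous to enframe_*: keep the arguments next to the results.
enframed <- data.frame(.arg = at, .cdf = evaluated)
enframed
```

Returning the function itself is convenient when it has to be passed on to something else, such as a plotting or integration routine, while the enframed form keeps inputs and outputs together for further data-frame work.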

Here is an example of evaluating the hazard function of a Uniform(-1, 1) distribution, and enframing its density:

eval_hazard(d, at = 0:10)
##  [1]   1 Inf NaN NaN NaN NaN NaN NaN NaN NaN NaN
enframe_density(d, at = 0:10)
## # A tibble: 11 x 2
##     .arg .density
##    <int>    <dbl>
##  1     0      0.5
##  2     1      0.5
##  3     2      0  
##  4     3      0  
##  5     4      0  
##  6     5      0  
##  7     6      0  
##  8     7      0  
##  9     8      0  
## 10     9      0  
## 11    10      0
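The hazard values printed above can be cross-checked in base R from the standard definitions h(x) = f(x) / S(x) and H(x) = -log S(x), where f is the density and S the survival function. This is only a sketch under those textbook definitions; it is not a claim about how distplyr computes the hazard internally.

```r
a <- -1
b <- 1
at <- 0:10

dens <- dunif(at, min = a, max = b)                      # density f(x)
surv <- punif(at, min = a, max = b, lower.tail = FALSE)  # survival S(x) = 1 - F(x)

hazard <- dens / surv  # h(x) = f(x) / S(x)
chf <- -log(surv)      # cumulative hazard H(x) = -log S(x)

hazard
```

At x = 0 the ratio is 0.5 / 0.5 = 1; at x = 1 the survival function reaches 0 while the density is still 0.5, giving Inf; beyond the support both are 0, giving NaN.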

The enframe_*() functions work particularly well with tibbles and tidyr::unnest():

half_marathon <- tribble(
    ~ person, ~ race_time_min,
    "Vincenzo", dst_norm(130, 25),
    "Colleen", dst_norm(110, 13),
    "Regina", dst_norm(115, 20)
) 
half_marathon %>% 
    mutate(quartiles = map(race_time_min, enframe_quantile, at = 1:3 / 4)) %>% 
    unnest(quartiles)
## # A tibble: 9 x 4
##   person   race_time_min  .arg .quantile
##   <chr>    <list>        <dbl>     <dbl>
## 1 Vincenzo <norm>         0.25      127.
## 2 Vincenzo <norm>         0.5       130 
## 3 Vincenzo <norm>         0.75      133.
## 4 Colleen  <norm>         0.25      108.
## 5 Colleen  <norm>         0.5       110 
## 6 Colleen  <norm>         0.75      112.
## 7 Regina   <norm>         0.25      112.
## 8 Regina   <norm>         0.5       115 
## 9 Regina   <norm>         0.75      118.
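The printed quartiles are consistent with the second argument of dst_norm() being a variance rather than a standard deviation. This is an inference from the output above, not something the text states, so treat it as an assumption. A base-R sketch comparing the two readings for the first row:

```r
p <- 1:3 / 4

# Second argument read as a variance (assumption inferred from the output).
q_var <- qnorm(p, mean = 130, sd = sqrt(25))

# Second argument read as a standard deviation, for comparison.
q_sd <- qnorm(p, mean = 130, sd = 25)

round(q_var)
round(q_sd)
```

Only the variance reading rounds to 127, 130, and 133, the values shown for Vincenzo.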

Drawing a Random Sample

To draw a random sample from a distribution, use the realise() or realize() function:

realise(d, n = 5)
## [1]  0.01495641 -0.38646299 -0.14618467  0.38620416 -0.82972806

You can read this call as “realise distribution d five times”. By default, n is set to 1, so realising a distribution with no further arguments returns a single numeric draw:

realise(d)
## [1] -0.5491268

This default is especially useful when working with distributions in a tibble:

half_marathon %>% 
    mutate(actual_time_min = map_dbl(race_time_min, realise))
## # A tibble: 3 x 3
##   person   race_time_min actual_time_min
##   <chr>    <list>                  <dbl>
## 1 Vincenzo <norm>                   127.
## 2 Colleen  <norm>                   111.
## 3 Regina   <norm>                   117.

Perhaps surprisingly, distplyr does not consider realise() a functional representation of a distribution, even though random sampling belongs to the same family as the stats::p*/d*/q*/r* functions. This is because a finite sample cannot perfectly describe the distribution it was drawn from.
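To see why a sample is not a full description, compare the true cdf of a Uniform(-1, 1) distribution with the empirical cdf of a sample of size 10. This is a base-R sketch; the seed and the evaluation grid are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- runif(10, min = -1, max = 1)

ecdf_x <- ecdf(x)  # step function built from the sample
grid <- seq(-1, 1, by = 0.25)

comparison <- data.frame(
  .arg = grid,
  true_cdf = punif(grid, min = -1, max = 1),
  empirical_cdf = ecdf_x(grid)
)
comparison

# Largest discrepancy on the grid; it cannot be zero here, because the
# empirical cdf only takes multiples of 1/10 while the true cdf does not.
max_gap <- max(abs(comparison$true_cdf - comparison$empirical_cdf))
max_gap
```

The cdf, in contrast, determines the distribution exactly, which is why it counts as a representation while a sample does not.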

Properties of Distributions

Distributions have various numeric properties. Common examples are the mean and variance, but there are many others as well.

Below is a table of the properties incorporated in distplyr:

Property                    distplyr Function
Mean                        mean()
Median                      median()
Mode                        mode()
Variance                    variance()
Standard Deviation          sd()
Inter-Quartile Range        iqr()
Entropy                     entropy()
Skewness                    skewness()
Excess Kurtosis             kurtosis_exc()
Kurtosis                    kurtosis_raw()
Extreme Value (Tail) Index  evi()

Here are some properties of our original Uniform(-1, 1) distribution:

mean(d)
## [1] 0
sd(d)
## [1] 0.5773503
evi(d)
## [1] -1
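The mean and standard deviation above can be cross-checked in base R, using the closed-form expressions for a Uniform(a, b) distribution and, independently, numerical integration of the density. This is a sketch for verification only; the variable names are illustrative and it makes no claim about how distplyr computes these properties.

```r
a <- -1
b <- 1

# Closed-form properties of Uniform(a, b)
mean_unif <- (a + b) / 2
var_unif <- (b - a)^2 / 12
sd_unif <- sqrt(var_unif)
iqr_unif <- diff(qunif(c(0.25, 0.75), min = a, max = b))

# Numerical check of the variance: integrate (x - mean)^2 * f(x) over the support
var_numeric <- integrate(
  function(x) (x - mean_unif)^2 * dunif(x, min = a, max = b),
  lower = a, upper = b
)$value

c(mean = mean_unif, sd = sd_unif, iqr = iqr_unif, var_numeric = var_numeric)
```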