distplyr is still in its infancy, but it has an ambitious vision. To truly empower the analyst, there are some key functionalities that should be developed:

  1. Making new base distributions not in distplyr.
  2. Transforming distributions
  3. Families of distributions instead of an isolated distribution.

This page is a vision of distplyr after having achieved these things, written in the form of a vignette. Some of these things might be too ambitious, but I truly think they are possible, and very powerful. If you believe in empowering analysts to seemlessly draw powerful insights using distributions, please consider contributing to this open source project.

1. Base Distributions not included in distplyr

Of course, distplyr cannot contain an exhaustive list of parametric distributions, nor does such a list exist. But, you can make new ones.

When p/d/q/r Functions Exist

The easiest way to do this is if the functions p/d/q/rfoo() already exist, such as by loading an R package. If so, just call as_dst() with the name of the distribution, followed by parameter specifications:

my_dst <- as_dst("foo", param1 = 3, param2 = 6)

While the above is enough, you might want to consider adding more information. With no knowledge of properties such as mean and variance, these quantities will be computed by their definition, often involving an integral. Instead, you can specify these things using the set_*() functions:

library(magrittr)
my_dst <- my_dst %>%
  set_mean(param1 / (param1 + param2)) %>% 
  set_variance({
    denominator <- param1 + param
    param1 / denominator
  })

In General

In general, you might not have the convenience of having a distribution’s representations already defined. You can make a new distribution using dst(), and specifying at least one of the distribution’s representations:

dst(density = function(x) sqrt(1 - x^2) * 2 / pi,
    support = c(-1, 1), 
    mean = 0)

You can either provide details in the dst() function or outside, using the set_*() family of functions:

dst() %>% 
    set_density(function(x) sqrt(1 - x^2) * 2 / pi) %>% 
    set_support(-1, 1) %>% 
    set_mean(0)

Specifying the components of a distribution does not guarantee that they are in accordance with each other. It’s up to you to double check.

2. Transformations

It’s important to be able to transform distributions. Simple ones include add_by() and divide_by().

For example, here is an empirical distribution of the residuals of a regression model:

model <- lm(mpg ~ I(1 / disp), data = mtcars)
error_dist <-  model %>% 
    residuals() %>% 
    stepdst()

It would be useful to add the mean back in to the error distribution:

broom::augment(model) %>% 
    mutate(dist = map(.fitted, ~ add_by(error_dist, .x)))

Now the error distribution is non-parametric – useful if you are skeptical of assumptions such as the usual Gaussian assumption.

3. Parametric Families

Oftentimes, we work with families of distributions before isolating a single member. distplyr allows you to work with families as well as isolated distributions. For example, don’t specify the parameters of a distribution, and it is a family:

dst_norm()

Resolving a Family to a Distribution

You are able to gradually resolve a distribution from a family using the restrict_*() family of functions.

dst_beta() %>% 
    restrict_parameters(alpha = 5) %>% 
    parameters()
#> alpha   beta 
#>     5     NA 

Note that this is the same as:

dst_beta(alpha = 5) %>% 
    parameters()
#> alpha   beta 
#>     5     NA 

You aren’t forced to use the standard parameterization, either. Perhaps you’d like to consider the family of Beta distributions whose mean is 0.5, or whose 0.25-quantile is 0.5:

dst_beta() %>% 
    restrict_mean(to = 0.5) %>% 
    parameters()
#> mean   beta 
#>  0.5     NA 
dst_beta() %>% 
    restrict_quantile(0.25, to = 0.5) %>%
    parameters()
#> 0.25.quantile   beta 
#>           0.5     NA 

The need for gradually reducing a family of distributions is useful in analyses with more than one goal.

  • Perhaps the mean is a priority to estimate, and you would like to do so without a distributional assumption. Only after the mean’s estimation would you then like to choose a conforming Beta distribution that has the largest likelihood, to set the final parameter.
  • Perhaps you’ve fit a regression model to the 0.25- and 0.75-quantiles, and would like your distribution to adopt those quantiles.

Making your own Parametric Family

You can also make your own parametric family.

If the p/d/q/rfoo functions are already defined, just use as_dst(), but don’t specify the parameters, and you’ll get a family:

as_dst("foo")

Similarly, you can use dst() to specify your own family. To represent a function, use the tilde notation with x as the variable. Here’s an example that starts from scratch:

my_dst2 <- dst() %>% 
  set_cdf(~ 1 - exp(-beta * x)) %>% 
  set_quantile(~ {
    m <- 1 / (1 - x)
    log(m) / beta
  }) %>% 
  set_density(~ beta * exp(-beta * x)) %>% 
  restrict_parameters(beta > 0) %>% 
  set_support(0, Inf) %>% 
  set_mean(1 / beta)

Development note: tidy evaluation might be the best course of action for obtaining this.

This is a family of distributions because beta has not been specified yet. If you’d like to resolve the family, you can go ahead and specify beta:

my_dst2 %>% 
  restrict_parameters(beta = 1.5)