distplyr is still in its infancy, but it has an ambitious vision. To truly empower the analyst, there are some key functionalities that should be developed:

  1. Make a new parametric distribution not included in distplyr.
  2. Manipulate distributions.
  3. Gradually isolate a distribution.

This page is a vision of distplyr after these things have been achieved, written in the form of a vignette. Some of them might be too ambitious, but I truly think they are possible, and very powerful. If you believe in empowering analysts to seamlessly draw powerful insights using distributions, please consider contributing to this open source project.

1. Base Distributions Not Included in distplyr

There are plenty of packages that provide p/d/q/rfoo() functions for distributions not contained in base R, and distplyr can’t include them all. When such functions exist, make a distribution with as_dst(), followed by parameter specifications:

my_dst <- as_dst("foo", param1 = 3, param2 = 6)
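
One way as_dst() could work under the hood is by assembling the p/d/q/rfoo() functions by name. Here is a minimal sketch, assuming a simple list-based "dst" class (an illustration, not a description of distplyr’s actual internals):

as_dst <- function(name, ...) {
  params <- list(...)
  structure(
    list(
      # Look up pfoo/dfoo/qfoo/rfoo by name and fix the parameters.
      cdf      = function(x) do.call(paste0("p", name), c(list(x), params)),
      density  = function(x) do.call(paste0("d", name), c(list(x), params)),
      quantile = function(p) do.call(paste0("q", name), c(list(p), params)),
      realise  = function(n) do.call(paste0("r", name), c(list(n), params))
    ),
    class = "dst"
  )
}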

While specifying the parameters is enough to get going, you might want to add more information. With no knowledge of properties such as the mean and variance, distplyr would have to compute these quantities from their definitions, often via numerical integration. Instead, you can specify them directly using the set_*() functions:

my_dst <- my_dst %>%
  set_mean(param1 / (param1 + param2)) %>% 
  set_variance({
    denominator <- param1 + param2
    param1 / denominator
  })
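
The payoff comes when the quantity is requested: a stored closed form can be returned immediately, with numerical integration only as a fallback. A sketch, assuming set_mean() stores its value in a mean field of the list-based dst class above:

mean.dst <- function(x, ...) {
  if (!is.null(x$mean)) return(x$mean)  # use the stored closed form
  # Otherwise, fall back to the definition of the mean.
  integrate(function(t) t * x$density(t), -Inf, Inf)$value
}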

2. Transformations

It’s important to be able to transform distributions. Simple transformations include add_by() and divide_by(); better yet, binary operators could be defined for distributions, as sketched below.
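
For a location shift, such an operator only needs to shift the distribution’s functional representations. A sketch, again assuming the list-based dst class above (an assumption, not distplyr’s API):

`+.dst` <- function(e1, e2) {
  # Support both dst + numeric and numeric + dst.
  if (is.numeric(e1)) { shift <- e1; d <- e2 } else { shift <- e2; d <- e1 }
  out <- d
  out$quantile <- function(p) d$quantile(p) + shift
  out$cdf      <- function(x) d$cdf(x - shift)
  out$density  <- function(x) d$density(x - shift)
  out
}

The numeric + distribution form is exactly what the residual example below relies on.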

For example, here is an empirical distribution of the residuals of a regression model:

model <- lm(mpg ~ I(1 / disp), data = mtcars)
error_dist <- model %>% 
    residuals() %>% 
    stepdst()

It would be useful to add the mean back in to the error distribution:

library(dplyr)
library(purrr)

broom::augment(model) %>% 
    mutate(dist = map(.fitted, ~ .x + error_dist))

Now the error distribution is non-parametric, which is useful if you are skeptical of assumptions such as the usual Gaussian one.

3. Gradually Isolate a Distribution

It’s useful to gradually isolate a distribution. Consider these use cases.

Example 1

Your client wants you to come up with the best supervised learning model that predicts the mean of Y given X. A secondary goal is to communicate uncertainty in the prediction: you suspect the residuals are Lognormal, but don’t want to make that assumption when fitting your baseline model. You’re willing to consider making the second Lognormal parameter constant across X (let’s say, to keep this demo simple).

To find the distribution of the residuals, set up a generic Lognormal distribution with unspecified parameters a and b, then set its mean to 0. The remaining parameter can be estimated via maximum likelihood.

res <- dst_lnorm() %>% 
    set_mean(0)
f <- get_density(res)   # maximize this likelihood over the remaining parameter

Notice that neither a nor b is the mean of the distribution, yet we can still restrict this distribution down to a 1-dimensional parameter space.
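
Concretely, the one remaining parameter could be estimated by minimizing the negative log-likelihood. A sketch, assuming get_density() returns the density as a function of the data and the free parameter b, and that resid_data holds the observed residuals (both names are assumptions for illustration):

nll <- function(b) -sum(log(f(resid_data, b)))
fit <- optimize(nll, interval = c(1e-6, 10))  # 1-dimensional search over b
b_hat <- fit$minimum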

Example 2

You’re trying to forecast your income, and you know that most of the time you make between 3 and 4 thousand dollars per month. You decide to make this a 90% prediction interval of your income, using a Normal distribution:

dst_norm() %>% 
    set_quantiles(0.05 ~ 3000, 0.95 ~ 4000)

Note that this is different from set_quantile(), which would set the entire quantile function.
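
Under the hood, two quantile constraints pin down both Normal parameters exactly; the arithmetic needs nothing beyond base R:

z_lo <- qnorm(0.05)
z_hi <- qnorm(0.95)
sigma <- (4000 - 3000) / (z_hi - z_lo)  # spread of the interval determines the sd
mu    <- 3000 - sigma * z_lo            # then back out the mean: mu = 3500
c(mu = mu, sigma = sigma)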

Example 3

Maybe you want a Normal distribution whose mean and variance are equal. Although an example like this might not be practical in the univariate setting, it can be in the multivariate setting, where we’d like to set the dependence between two pairs of variables to be equal.

dst_norm(a, a)  # then resolve the single parameter a via MLE
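
Resolving this could reduce to a one-parameter likelihood optimization. A sketch in base R, where x holds the observed data (an assumption) and a plays the role of both the mean and the variance:

nll <- function(a) -sum(dnorm(x, mean = a, sd = sqrt(a), log = TRUE))
a_hat <- optimize(nll, interval = c(1e-6, 100))$minimum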

Example 4

You’d like to fit a GPD (generalized Pareto distribution), and you are confident that the distribution is heavy-tailed, so you would like the shape parameter to be positive.

dst_gpd() %>% 
    set_parameters(shape > 0)
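
One way such a constraint could be honoured during fitting is to optimize over log(shape), which keeps the shape positive automatically. A sketch using a hand-written GPD density and base R’s optim(), where x holds the exceedance data (an assumption):

# GPD density for shape > 0 and scale > 0, valid on x >= 0
dgpd <- function(x, scale, shape) {
  (1 / scale) * (1 + shape * x / scale)^(-1 / shape - 1)
}

nll <- function(par) {  # par = c(log(scale), log(shape))
  scale <- exp(par[1])
  shape <- exp(par[2])
  -sum(log(dgpd(x, scale, shape)))
}

fit <- optim(c(0, 0), nll)
exp(fit$par)  # back-transform: both estimates are guaranteed positive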