`vision.Rmd`

`distplyr` is still in its infancy, but it has an ambitious vision. To truly empower the analyst, there are some key functionalities that should be developed:

- Making new base distributions not in `distplyr`.
- Transforming distributions.
- Families of distributions instead of an isolated distribution.

This page is a vision of `distplyr` after having achieved these things, written in the form of a vignette. Some of these things might be too ambitious, but I truly think they are possible, and very powerful. If you believe in empowering analysts to seamlessly draw powerful insights using distributions, please consider contributing to this open source project.

Making new base distributions not in `distplyr`

Of course, `distplyr` cannot contain an exhaustive list of parametric distributions, nor does such a list exist. But you can make new ones.

`p/d/q/r` Functions Exist

This is easiest when the functions `p/d/q/rfoo()` already exist, for example after loading an R package that provides them. If so, just call `as_dst()` with the name of the distribution, followed by parameter specifications:

`my_dst <- as_dst("foo", param1 = 3, param2 = 6)`

While the above is enough, you might want to add more information. Without knowledge of properties such as the mean and variance, these quantities will be computed from their definitions, which often involves an integral. Instead, you can specify them using the `set_*()` functions:

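```
library(magrittr)
# Supply closed-form properties in terms of the distribution's parameters.
my_dst <- my_dst %>%
  set_mean(param1 / (param1 + param2)) %>%
  set_variance({
    # Illustrative expression only; use the actual variance formula of `foo`.
    denominator <- param1 + param2
    param1 / denominator
  })
```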
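To make concrete what computing the mean and variance "by definition" involves, here is a minimal base R sketch. It assumes, purely for illustration, that the hypothetical `foo` distribution is the Beta distribution with `param1 = 3` and `param2 = 6`, so that `dfoo()` is `dbeta()`. None of this is `distplyr` code; it only shows the numerical integration that a supplied closed form would avoid.

```r
# Assumption for illustration only: "foo" is Beta(3, 6), so its density is dbeta().
param1 <- 3
param2 <- 6
dens <- function(x) dbeta(x, shape1 = param1, shape2 = param2)

# Mean and variance by definition: numerical integrals over the support [0, 1].
mean_numeric <- integrate(function(x) x * dens(x), lower = 0, upper = 1)$value
var_numeric <- integrate(function(x) (x - mean_numeric)^2 * dens(x),
                         lower = 0, upper = 1)$value

# Closed forms for the Beta distribution, which need no integration.
mean_closed <- param1 / (param1 + param2)
var_closed <- param1 * param2 /
  ((param1 + param2)^2 * (param1 + param2 + 1))

# The two routes agree up to numerical error.
all.equal(mean_numeric, mean_closed, tolerance = 1e-6)
all.equal(var_numeric, var_closed, tolerance = 1e-6)
```

The closed forms are both faster and exact, which is the motivation for letting the user set them.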

In general, you might not have the convenience of having a distribution’s representations already defined. You can make a new distribution using `dst()`, specifying at least one of the distribution’s representations:

```
dst(density = function(x) sqrt(1 - x^2) * 2 / pi,
    support = c(-1, 1),
    mean = 0)
```

You can provide details either inside the `dst()` call or afterwards, using the `set_*()` family of functions:

```
dst() %>%
  set_density(function(x) sqrt(1 - x^2) * 2 / pi) %>%
  set_support(-1, 1) %>%
  set_mean(0)
```

Specifying the components of a distribution does not guarantee that they are in accordance with each other. It’s up to you to double check.
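One way to double check is to verify the pieces against each other numerically. The following base R sketch does this for the density, support, and mean used in the example above. The tolerance of `1e-3` is an arbitrary choice for illustration, and nothing here relies on `distplyr`.

```r
# The density, support, and mean from the example above.
dens <- function(x) sqrt(1 - x^2) * 2 / pi
support <- c(-1, 1)
claimed_mean <- 0

# A valid density should integrate to 1 over its support.
total_mass <- integrate(dens, lower = support[1], upper = support[2])$value

# The claimed mean should match the integral of x * density.
implied_mean <- integrate(function(x) x * dens(x),
                          lower = support[1], upper = support[2])$value

# Both checks should be TRUE if the components are in accordance.
abs(total_mass - 1) < 1e-3
abs(implied_mean - claimed_mean) < 1e-3
```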

It’s important to be able to transform distributions. Simple transformations include `add_by()` and `divide_by()`.

For example, here is an empirical distribution of the residuals of a regression model:

```
model <- lm(mpg ~ I(1 / disp), data = mtcars)
error_dist <- model %>%
  residuals() %>%
  stepdst()
```

It would be useful to add the mean back into the error distribution:

```
library(dplyr)  # for mutate()
library(purrr)  # for map()
broom::augment(model) %>%
  mutate(dist = map(.fitted, ~ add_by(error_dist, .x)))
```

Now the error distribution is non-parametric – useful if you are skeptical of assumptions such as the usual Gaussian assumption.
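To illustrate what a shift like `add_by()` amounts to for an empirical distribution, here is a base R sketch that uses `ecdf()` in place of `stepdst()`. Adding a constant `c` to a random variable moves its cdf to `F(y - c)` and moves every quantile up by `c`. The helper `shift_cdf()` is hypothetical and is not part of `distplyr`.

```r
model <- lm(mpg ~ I(1 / disp), data = mtcars)
res <- residuals(model)

# Empirical cdf of the residuals (a step function).
error_cdf <- ecdf(res)

# Hypothetical helper: the cdf of X + shift is F(y - shift).
shift_cdf <- function(cdf, shift) {
  function(y) cdf(y - shift)
}

# Predictive cdf for the first observation: residuals shifted by its fitted value.
fitted_1 <- fitted(model)[[1]]
pred_cdf_1 <- shift_cdf(error_cdf, fitted_1)

# Quantiles shift by the same constant.
pred_median_1 <- quantile(res, probs = 0.5, names = FALSE) + fitted_1

# Evaluating the shifted cdf at (fitted value + 1) matches the error cdf at 1.
pred_cdf_1(fitted_1 + 1) == error_cdf(1)
```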

Oftentimes, we work with *families* of distributions before isolating a single member. `distplyr` allows you to work with families as well as isolated distributions. For example, if you don’t specify the parameters of a distribution, you get a family:

`dst_norm()`

You can gradually resolve a distribution from a family using the `restrict_*()` family of functions.

```
dst_beta() %>%
  restrict_parameters(alpha = 5) %>%
  parameters()
#> alpha  beta
#>     5    NA
```

Note that this is the same as:

```
dst_beta(alpha = 5) %>%
  parameters()
#> alpha  beta
#>     5    NA
```

You aren’t forced to use the standard parameterization, either. Perhaps you’d like to consider the family of Beta distributions whose mean is 0.5, or whose 0.25-quantile is 0.5:

```
dst_beta() %>%
  restrict_mean(to = 0.5) %>%
  parameters()
#> mean beta
#>  0.5   NA
dst_beta() %>%
  restrict_quantile(0.25, to = 0.5) %>%
  parameters()
#> 0.25.quantile          beta
#>           0.5            NA
```
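A restriction of this kind ties the standard parameters together, leaving one fewer free parameter. The base R sketch below shows the idea for the Beta family: a mean restriction can be solved in closed form, while a quantile restriction is solved numerically. The helper names, the search interval, and the choice `beta = 2` are assumptions made for illustration; this is not how `distplyr` is implemented.

```r
# Hypothetical helper: alpha implied by a Beta mean restriction, given beta.
# Solves alpha / (alpha + beta) = mean for alpha.
alpha_given_mean <- function(beta, mean) {
  mean * beta / (1 - mean)
}

# Hypothetical helper: alpha implied by a quantile restriction, given beta.
# Solves qbeta(p, alpha, beta) = value numerically over an assumed interval.
alpha_given_quantile <- function(beta, p, value) {
  uniroot(function(alpha) qbeta(p, alpha, beta) - value,
          interval = c(0.1, 100), tol = 1e-10)$root
}

beta <- 2  # the remaining free parameter, fixed here for illustration
alpha_mean <- alpha_given_mean(beta, mean = 0.5)
alpha_quant <- alpha_given_quantile(beta, p = 0.25, value = 0.5)

# Both restrictions hold for the resolved members.
all.equal(alpha_mean / (alpha_mean + beta), 0.5)
all.equal(qbeta(0.25, alpha_quant, beta), 0.5, tolerance = 1e-6)
```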

Gradually reducing a family of distributions is useful in analyses with more than one goal.

- Perhaps the mean is a priority to estimate, and you would like to do so without a distributional assumption. Only after the mean’s estimation would you then like to choose a conforming Beta distribution that has the largest likelihood, to set the final parameter.
- Perhaps you’ve fit a regression model to the 0.25- and 0.75-quantiles, and would like your distribution to adopt those quantiles.
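The first scenario can be sketched in base R as a two-stage fit. The data are simulated for illustration, and the search interval passed to `optimize()` is an assumption. The mean is estimated by the sample average, and the remaining Beta parameter is then chosen by maximum likelihood among members that have that mean.

```r
set.seed(42)
# Simulated data for illustration only.
x <- rbeta(500, shape1 = 2, shape2 = 5)

# Step 1: estimate the mean without a distributional assumption.
mean_hat <- mean(x)

# Step 2: within the Beta family restricted to that mean, pick the member
# with the largest likelihood. Only beta is free; alpha follows from the mean.
neg_log_lik <- function(beta) {
  alpha <- mean_hat * beta / (1 - mean_hat)
  -sum(dbeta(x, shape1 = alpha, shape2 = beta, log = TRUE))
}
beta_hat <- optimize(neg_log_lik, interval = c(0.01, 100))$minimum
alpha_hat <- mean_hat * beta_hat / (1 - mean_hat)

# The fitted member reproduces the estimated mean by construction.
all.equal(alpha_hat / (alpha_hat + beta_hat), mean_hat)
```

Fitting in stages like this keeps the mean estimate free of the Beta assumption, at the cost of a likelihood that is maximised over a smaller set of candidates.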

You can also make your own parametric family.

If the `p/d/q/rfoo` functions are already defined, just use `as_dst()`, but don’t specify the parameters, and you’ll get a family:

`as_dst("foo")`

Similarly, you can use `dst()` to specify your own family. To represent a function, use the tilde notation with `x` as the variable. Here’s an example that starts from scratch:

```
my_dst2 <- dst() %>%
  set_cdf(~ 1 - exp(-beta * x)) %>%
  set_quantile(~ {
    m <- 1 / (1 - x)
    log(m) / beta
  }) %>%
  set_density(~ beta * exp(-beta * x)) %>%
  restrict_parameters(beta > 0) %>%
  set_support(0, Inf) %>%
  set_mean(1 / beta)
```

*Development note*: tidy evaluation might be the best course of action for obtaining this.
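As a rough illustration of how the tilde notation could be evaluated, the sketch below uses plain base R evaluation as a simpler stand-in for tidy evaluation. The helper `formula_to_function()` is hypothetical. It turns a one-sided formula in `x` into a function whose remaining symbols, such as `beta`, are supplied later as named arguments.

```r
# Hypothetical helper: turn a one-sided formula in `x` into a function whose
# remaining symbols (such as `beta`) are supplied later as named arguments.
formula_to_function <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 2)
  body_expr <- formula[[2]]          # the expression to the right of `~`
  enclosure <- environment(formula)  # where the formula was written
  function(x, ...) {
    eval(body_expr, envir = list(x = x, ...), enclos = enclosure)
  }
}

cdf <- formula_to_function(~ 1 - exp(-beta * x))
quant <- formula_to_function(~ {
  m <- 1 / (1 - x)
  log(m) / beta
})

# With beta supplied, these agree with the Exponential distribution in base R.
all.equal(cdf(2, beta = 1.5), pexp(2, rate = 1.5))
all.equal(quant(0.3, beta = 1.5), qexp(0.3, rate = 1.5))
```

Leaving `beta` unbound until call time is what lets the same formula describe a whole family rather than a single distribution.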

This is a family of distributions because `beta` has not been specified yet. If you’d like to resolve the family, you can go ahead and specify `beta`:

```
my_dst2 %>%
  restrict_parameters(beta = 1.5)
```