`creating.Rmd`

```
library(distplyr)
#> 
#> Attaching package: 'distplyr'
#> The following object is masked from 'package:stats':
#> 
#>     sd
library(magrittr)
library(tibble)
```

You’re able to make a wide range of probability distributions using distplyr’s manipulation functions, but you’ll need to start with more standard, basic distributions first. There are typically three use cases for building a basic distribution:

- Parametric families
- Empirical distributions
- Manually specified distributions

The first case, parametric families, includes distributions like the Normal, Exponential, and Poisson.

distplyr includes the distributions present in base R's `r*`/`p*`/`d*`/`q*` selection. For example, a Normal distribution in base R has the associated functions `rnorm()`, `pnorm()`, `dnorm()`, and `qnorm()`. In distplyr:

```
dst_norm(0, 1)
#> Gaussian Distribution
#> 
#> Parameters:
#> # A tibble: 3 x 2
#>   parameter value
#>   <chr>     <dbl>
#> 1 mean          0
#> 2 variance      1
#> 3 sd            1
#> 
#> Number of Discontinuities: 0
```

distplyr also includes other common distributions not present in base R, such as a generalized Pareto distribution:

```
dst_gpd(0, 1, 1)
#> Generalized Pareto Distribution
#> 
#> Parameters:
#> # A tibble: 3 x 2
#>   parameter value
#>   <chr>     <dbl>
#> 1 location      0
#> 2 scale         1
#> 3 shape         1
#> 
#> Number of Discontinuities: 0
```
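Once built, a parametric distribution can be queried like any other distplyr distribution. Here is a sketch, assuming `eval_quantile()` (used later in this vignette on a step distribution) also works on parametric distributions:

```
gpd <- dst_gpd(0, 1, 1)
# Under the standard GPD parameterization with location 0, scale 1, and
# shape 1, the quantile function is Q(p) = (1 - p)^(-1) - 1, so the
# median evaluates to 1.
eval_quantile(gpd, at = 0.5)
```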

**November 2020**: Until this package gains some stability in its structure, only a limited number of these distributions are available – but there will be plenty more in the not-too-distant future.

Whereas base R only has the `ecdf()` function to handle empirical distributions, distplyr provides full functionality with `stepdst()` (soon to be called something like `dst_emp()`). Here is the empirical distribution of `hp` values in the `mtcars` dataset:

```
(hp <- stepdst(hp, data = mtcars))
#> Step Distribution
#> 
#> Number of Discontinuities: 22
```

The “step” in the name comes from the cdf:

```
plot(hp, "cdf", n = 501)
```
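Individual cdf values can be read off numerically as well. A sketch, assuming distplyr provides an `eval_cdf()` counterpart to the `eval_quantile()` function shown later in this vignette:

```
# Proportion of cars in mtcars with at most 150 hp, i.e. the
# empirical cdf of hp evaluated at 150.
eval_cdf(hp, at = 150)
```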

You can also weight the outcomes differently. This is useful for explicitly specifying a probability mass function, as well as for other applications such as using kernel smoothing to find a conditional distribution. Here is an estimate of the conditional distribution of `hp` given `disp = 150`, with its cdf depicted as the dashed line compared to the marginal's solid line:

```
K <- function(x) dnorm(x, sd = 25)
hp2 <- stepdst(hp, data = mtcars, weights = K(disp - 150))
plot(hp, "cdf", n = 1001)
plot(hp2, "cdf", n = 1001, lty = 2, add = TRUE)
```

The weighting provides us with a far more informative prediction of `hp` when `disp = 150` compared to the loess, which just gives us the mean:

```
mean(hp2)
#> [1] 109.961
```

With a distribution, you can get much more, such as this 90% prediction interval:

```
eval_quantile(hp2, at = c(0.05, 0.95))
#> [1] 62 175
```

Here’s the proportion of variance that’s reduced compared to the marginal:
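A sketch of that computation, using the `sd()` generic that distplyr masks from the stats package (see the attach message at the top of this vignette):

```
# Variance of the conditional distribution relative to the marginal;
# one minus this ratio is the proportion of variance reduced by
# conditioning on disp = 150.
1 - sd(hp2)^2 / sd(hp)^2
```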

It's common to specify a distribution by explicitly providing one of its representations, like a density or cdf. To do this, start with an empty distribution by calling `distribution()`, and then `set_` the representation of interest. Here's a random variable with cdf given by the `sqrt()` function:

```
(manual <- distribution() %>% set_cdf(sqrt))
#> Distribution
#> 
#> Parameters:
#> # A tibble: 0 x 0
#> 
#> Number of Discontinuities:

plot(manual, "cdf", from = 0, to = 1)
```

Down the road, you’ll also be able to use purrr-style functions:

```
distribution() %>%
set_cdf(~ sqrt(.x))
```