A common way of representing data is through a parametric distribution, such as a Normal, Exponential, or Poisson distribution. A useful alternative is an empirical distribution or, more generally, what distplyr calls a step distribution.

To make a step distribution, use the function stepdst():

library(distplyr)
(d1 <- stepdst(hp, data = mtcars))
## Step Distribution
##
## Number of Discontinuities:  22
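
The count of 22 matches the number of distinct hp values in mtcars: with equal weights, a step distribution jumps once at each distinct observation. The base R sketch below illustrates that idea with stats::ecdf() as a stand-in. It is not how distplyr is implemented, and the variable names are only for illustration.

```r
# Base R illustration only; ecdf() is a stand-in, not distplyr internals.
hp <- mtcars$hp

# Each distinct observed value produces one jump in the empirical cdf.
n_jumps <- length(unique(hp))

# ecdf() returns a right-continuous step function; knots() lists its jump locations.
F_hat <- ecdf(hp)
jump_locations <- knots(F_hat)

# With equal weights, the jump at a value is the share of observations equal to it.
jump_sizes <- as.numeric(table(hp)) / length(hp)

n_jumps
```

Tied observations (for example the three cars with hp = 110) are pooled into a single, larger jump rather than counted separately.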

The “step” in the name comes from the cdf:

plot(d1, "cdf", n = 1001)

You can also weight the outcomes differently. This is useful for explicitly specifying a probability mass function, as well as for other applications such as using kernel smoothing to find a conditional distribution. Here is an estimate of the conditional distribution of hp given disp = 150, with its cdf drawn as the dashed line and the marginal cdf as the solid line:

K <- function(x) dnorm(x, sd = 25)
d2 <- stepdst(hp, data = mtcars, weights = K(disp - 150))
plot(d1, "cdf", n = 1001)
plot(d2, "cdf", n = 1001, lty = 2, add = TRUE)

This is much more informative than a point prediction of hp when disp = 150. Such a prediction might be:

get_mean(d2)
##  109.961

With a distribution, you can get much more, such as a prediction interval. Here’s a 90% interval:

eval_quantile(d2, at = c(0.05, 0.95))
##   62 175
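
One way to read these numbers: the p-quantile of a step distribution can be taken as the smallest outcome whose cumulative probability reaches p. The base R sketch below rebuilds the weighted cdf by hand under that assumption, and under the assumption that the weights are normalized to sum to one. It is not distplyr's implementation, and step_quantile is a hypothetical helper.

```r
# Hand-rolled sketch; assumes weights are normalized to sum to one.
K <- function(x) dnorm(x, sd = 25)
w <- K(mtcars$disp - 150)

# Total normalized weight on each distinct hp value, in increasing order of hp.
p <- tapply(w, mtcars$hp, sum) / sum(w)
x <- as.numeric(names(p))
cdf <- cumsum(p)

# Hypothetical helper: smallest outcome whose cumulative probability is at least `at`.
step_quantile <- function(at) x[which(cdf >= at)[1]]

q <- sapply(c(0.05, 0.95), step_quantile)
q
```

Because the cdf only moves at observed values, the interval endpoints are always hp values that occur in the data.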

Here’s the proportion by which the variance is reduced relative to the marginal:

1 - get_variance(d2) / get_variance(d1)
##  0.8031741
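
The calculation behind this ratio can be sketched in base R, assuming the variance of a step distribution is the probability-weighted squared deviation from its mean (a population variance, not the n - 1 version returned by var()). The helper step_var is hypothetical and is not part of distplyr.

```r
# Sketch under the assumption stated above; not distplyr's code.
K <- function(x) dnorm(x, sd = 25)
hp <- mtcars$hp

# Hypothetical helper: variance of outcomes `x` carrying non-negative weights `w`.
step_var <- function(x, w) {
  p <- w / sum(w)        # normalize weights to probabilities
  mu <- sum(p * x)       # weighted mean
  sum(p * (x - mu)^2)    # weighted mean squared deviation
}

v_marginal <- step_var(hp, rep(1, length(hp)))       # equal weights
v_conditional <- step_var(hp, K(mtcars$disp - 150))  # kernel weights
reduction <- 1 - v_conditional / v_marginal
reduction
```

A value near 1 means that conditioning on disp removes most of the spread in hp, while a value near 0 means it removes very little.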

You can extract the step discontinuities of any distribution using the discontinuities() function. It returns the location of each discontinuity and the size of the corresponding jump in the cdf:

discontinuities(d2)
##    location         size
## 1        52 1.471993e-03
## 2        62 1.208194e-01
## 3        65 8.376466e-04
## 4        66 4.247905e-03
## 5        91 6.017982e-02
## 6        93 2.971967e-02
## 7        95 1.138973e-01
## 8        97 5.960867e-02
## 9       105 1.353927e-03
## 10      109 6.219092e-02
## 11      110 2.250234e-01
## 12      113 1.093317e-02
## 13      123 1.902518e-01
## 14      150 7.207284e-10
## 15      175 1.194633e-01
## 16      180 1.160516e-06
## 17      205 1.154540e-37
## 18      215 4.981505e-35
## 19      230 7.355083e-31
## 20      245 1.601543e-15
## 21      264 1.119890e-15
## 22      335 1.458954e-09
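
The sizes can be understood as the kernel weights after normalization, summed over rows that share an hp value. This is why cars whose disp is far from 150 contribute jumps that are nearly zero. A base R sketch, assuming stepdst() rescales the weights to sum to one and pools ties (the names pooled and jumps are illustrative only):

```r
# Assumption: weights are rescaled to sum to one and pooled over tied hp values.
K <- function(x) dnorm(x, sd = 25)
w <- K(mtcars$disp - 150)

# Sum the weights within each distinct hp value, then normalize.
pooled <- tapply(w, mtcars$hp, sum) / sum(w)

jumps <- data.frame(
  location = as.numeric(names(pooled)),
  size = as.numeric(pooled)
)
jumps
```

Since the sizes are probabilities, they sum to one, and the cdf at any point is the sum of the sizes at or below that point.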

For continuous distributions, there are no discontinuities:

discontinuities(dst_norm(0, 1))
##  location size
## <0 rows> (or 0-length row.names)