Class EmpiricalDistribution
- java.lang.Object
  - org.apache.commons.math4.legacy.distribution.AbstractRealDistribution
    - org.apache.commons.math4.legacy.distribution.EmpiricalDistribution

- All Implemented Interfaces:
- org.apache.commons.statistics.distribution.ContinuousDistribution
 
public final class EmpiricalDistribution extends AbstractRealDistribution implements org.apache.commons.statistics.distribution.ContinuousDistribution

Represents an empirical probability distribution: a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.

An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations:
- loading the distribution from "observed" data values
- dividing the input data into "bin ranges" and reporting bin frequency counts (data for histogram)
- reporting univariate statistics describing the full set of data values as well as the observations within each bin
- generating random values from the distribution

EmpiricalDistribution can be used to build grouped frequency histograms representing the input data or to generate random values "like" those in the input, i.e. the generated values will follow the distribution of the input values.

The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:

Digesting the input data
- Pass over the data once to compute min and max.
- Divide the range from min to max into binCount bins.
- Pass over the data again, computing bin counts and univariate statistics (mean and std dev.) for each bin.
- Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
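
The digest steps above can be sketched in plain Java. This is only an illustration under stated assumptions, not the library's code: the class name `DigestSketch` and both method names are hypothetical, the input is assumed non-empty, bins are assumed equal-width with each value assigned to the bin whose upper bound is the first one at or above it, and per-bin mean and std dev are omitted for brevity.

```java
import java.util.Arrays;

// Hypothetical illustration of the digest steps; not part of the library.
final class DigestSketch {

    /** Counts the values falling into each of binCount equal-width bins over [min, max]. */
    static long[] binCounts(double[] data, int binCount) {
        // First pass: min and max of the (non-empty) input.
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        double delta = (max - min) / binCount;
        long[] counts = new long[binCount];
        // Second pass: assign each value to a bin (assumed convention: upper bound inclusive).
        for (double v : data) {
            int i = delta == 0 ? 0 : (int) Math.ceil((v - min) / delta) - 1;
            counts[Math.max(0, Math.min(i, binCount - 1))]++;
        }
        return counts;
    }

    /** Cumulative bin frequencies: upper bounds of the subintervals of (0,1). */
    static double[] generatorUpperBounds(long[] counts) {
        long total = Arrays.stream(counts).sum();
        double[] upper = new double[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            running += counts[i];
            upper[i] = (double) running / total;
        }
        return upper;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] counts = binCounts(data, 3);
        System.out.println(Arrays.toString(counts));                       // [4, 3, 3]
        System.out.println(Arrays.toString(generatorUpperBounds(counts))); // [0.4, 0.7, 1.0]
    }
}
```

Each subinterval's length (0.4, 0.3, 0.3 in the example) equals the corresponding bin's relative frequency, which is what makes a uniform draw select bins in proportion to their counts.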

Generating random values from the distribution
- Generate a uniformly distributed value in (0,1).
- Select the subinterval to which the value belongs.
- Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of the associated bin.
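
The three generation steps above can be sketched as follows. This is a minimal illustration, not the library's sampler: `SamplingSketch`, `selectBin` and `sample` are hypothetical names, the per-bin means and standard deviations are passed in as plain arrays, and `java.util.Random.nextDouble()` draws from [0,1) rather than the open interval (0,1).

```java
import java.util.Random;

// Hypothetical illustration of the generation steps; not part of the library.
final class SamplingSketch {

    /** Returns the index of the first subinterval whose upper bound is at or above u. */
    static int selectBin(double[] generatorUpperBounds, double u) {
        for (int i = 0; i < generatorUpperBounds.length; i++) {
            if (u <= generatorUpperBounds[i]) {
                return i;
            }
        }
        return generatorUpperBounds.length - 1; // guard against rounding in the last bound
    }

    /** Draws one value: pick a bin by its frequency, then sample that bin's Gaussian. */
    static double sample(double[] generatorUpperBounds, double[] binMeans,
                         double[] binStdDevs, Random rng) {
        double u = rng.nextDouble();                              // step 1: uniform draw
        int i = selectBin(generatorUpperBounds, u);               // step 2: pick the subinterval
        return binMeans[i] + binStdDevs[i] * rng.nextGaussian();  // step 3: Gaussian for that bin
    }

    public static void main(String[] args) {
        double[] upper = {0.4, 0.7, 1.0};
        double[] means = {2.5, 6.0, 9.0};
        double[] stdDevs = {1.1, 0.8, 0.8};
        System.out.println(sample(upper, means, stdDevs, new Random(42L)));
    }
}
```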

EmpiricalDistribution implements the ContinuousDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B) be the probability of B and P(B-) the sum of the probabilities of the bins below B. Let K(B) be the mass of B under K (i.e., the integral of the kernel density over B) and let K(B-) be the kernel distribution evaluated at the lower endpoint of B. Then set P(X < x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution evaluated at x. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.

CAVEAT: It is advised that the bin count is about one tenth of the size of the input array.
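
The claim that the cdf matches the grouped frequency distribution at the bin endpoints can be checked directly, using the form of the formula given for cumulativeProbability(double x). The symbols b_lo and b_hi for the lower and upper endpoints of B are introduced here only for illustration, and K(B) > 0 is assumed.

```latex
\[
F(x) = P(B^-) + P(B)\,\frac{K(x) - K(B^-)}{K(B)},
\qquad K(B^-) = K(b_{\mathrm{lo}}),
\qquad K(B) = K(b_{\mathrm{hi}}) - K(b_{\mathrm{lo}})
\]
\[
F(b_{\mathrm{lo}}) = P(B^-) + P(B)\cdot\frac{0}{K(B)} = P(B^-),
\qquad
F(b_{\mathrm{hi}}) = P(B^-) + P(B)\cdot\frac{K(B)}{K(B)} = P(B^-) + P(B)
\]
```

So the cdf equals the cumulative bin frequency at both ends of every bin, and the kernel only shapes the interpolation in between.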

Field Summary
Fields inherited from class org.apache.commons.math4.legacy.distribution.AbstractRealDistribution: SOLVER_DEFAULT_ABSOLUTE_ACCURACY

Method Summary
Each entry lists the modifier and type, the method, and its description.
- double cumulativeProbability(double x): Algorithm description: Find the bin B that x belongs to. Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B. Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B. Return P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the within-bin kernel distribution function evaluated at x. If K is a constant distribution, we return P(B-) + P(B) (counting the full mass of B).
- double density(double x): Returns the kernel density normalized so that its integral over each bin equals the bin mass.
- static EmpiricalDistribution from(int binCount, double[] input): Factory that creates a new instance from the specified data.
- static EmpiricalDistribution from(int binCount, double[] input, Function<SummaryStatistics,org.apache.commons.statistics.distribution.ContinuousDistribution> kernelFactory): Factory that creates a new instance from the specified data.
- int getBinCount(): Returns the number of bins.
- List<SummaryStatistics> getBinStats(): Returns a copy of the SummaryStatistics instances containing statistics describing the values in each of the bins.
- double[] getGeneratorUpperBounds(): Returns the upper bounds of the subintervals of [0, 1] used in generating data from the empirical distribution.
- double getMean()
- StatisticalSummary getSampleStats(): Returns a StatisticalSummary describing this distribution.
- double getSupportLowerBound()
- double getSupportUpperBound()
- double[] getUpperBounds(): Returns the upper bounds of the bins.
- double getVariance()
- double inverseCumulativeProbability(double p): The default implementation returns ContinuousDistribution.getSupportLowerBound() for p = 0, ContinuousDistribution.getSupportUpperBound() for p = 1.

Methods inherited from class org.apache.commons.math4.legacy.distribution.AbstractRealDistribution: createSampler, getSolverAbsoluteAccuracy, logDensity, probability, sample
 
Method Detail

from
public static EmpiricalDistribution from(int binCount, double[] input, Function<SummaryStatistics,org.apache.commons.statistics.distribution.ContinuousDistribution> kernelFactory)
Factory that creates a new instance from the specified data.
- Parameters:
- binCount - Number of bins. Must be strictly positive.
- input - Input data. Cannot be null.
- kernelFactory - Factory for creating within-bin kernels.
- Returns:
- a new instance.
- Throws:
- NotStrictlyPositiveException - if binCount <= 0.
 
from
public static EmpiricalDistribution from(int binCount, double[] input)
Factory that creates a new instance from the specified data.
- Parameters:
- binCount - Number of bins. Must be strictly positive.
- input - Input data. Cannot be null.
- Returns:
- a new instance.
- Throws:
- NotStrictlyPositiveException - if binCount <= 0.
 
getSampleStats
public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.
Preconditions:
- the distribution must be loaded before invoking this method

- Returns:
- the sample statistics
- Throws:
- IllegalStateException - if the distribution has not been loaded
 
getBinCount
public int getBinCount()
Returns the number of bins.
- Returns:
- the number of bins.
 
getBinStats
public List<SummaryStatistics> getBinStats()
Returns a copy of the SummaryStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.
- Returns:
- the bin statistics.
 
getUpperBounds
public double[] getUpperBounds()
Returns the upper bounds of the bins. Assuming array u is returned by this method, the bins are:
- (min, u[0]),
- (u[0], u[1]),
- ... ,
- (u[binCount - 2], u[binCount - 1] = max).

- Returns:
- the upper bounds of the bins.
- Since:
- 2.1
 
getGeneratorUpperBounds
public double[] getGeneratorUpperBounds()
Returns the upper bounds of the subintervals of [0, 1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.
Preconditions:
- the distribution must be loaded before invoking this method

- Returns:
- array of upper bounds of subintervals used in data generation
- Throws:
- NullPointerException - unless a load method has been called beforehand.
- Since:
- 2.1
 
density
public double density(double x)
Returns the kernel density normalized so that its integral over each bin equals the bin mass.
Algorithm description:
- Find the bin B that x belongs to.
- Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
- Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
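
The normalization k(x) * P(B) / K(B) can be illustrated for a single bin with a short sketch. It is not the library's implementation: `DensitySketch` and its methods are hypothetical, and a logistic kernel stands in for the Gaussian kernel purely because its distribution function has a closed form in the JDK. Bin lookup (the first step) is skipped; the bin endpoints `lo` and `hi` are passed in directly.

```java
// Hypothetical single-bin illustration of the density normalization; not part of the library.
final class DensitySketch {

    /** Logistic kernel distribution function (stand-in for the Gaussian kernel). */
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Logistic kernel density k(x). */
    static double kernelPdf(double x, double mu, double s) {
        double c = kernelCdf(x, mu, s);
        return c * (1.0 - c) / s;
    }

    /** k(x) * P(B) / K(B) for x in the bin with endpoints lo and hi and mass binMass. */
    static double density(double x, double lo, double hi, double binMass, double mu, double s) {
        double kB = kernelCdf(hi, mu, s) - kernelCdf(lo, mu, s); // K(B): kernel mass inside the bin
        return kernelPdf(x, mu, s) * binMass / kB;
    }

    public static void main(String[] args) {
        double lo = 0.0, hi = 1.0, binMass = 0.25, mu = 0.5, s = 0.3;
        // Midpoint-rule integral of the density over the bin; it should be close to binMass.
        int n = 10_000;
        double h = (hi - lo) / n;
        double integral = 0.0;
        for (int j = 0; j < n; j++) {
            integral += density(lo + (j + 0.5) * h, lo, hi, binMass, mu, s) * h;
        }
        System.out.println(integral);
    }
}
```

Dividing by K(B) compensates for the part of the kernel that lies outside the bin, so the density integrates to the bin mass over the bin regardless of how wide the kernel is.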

- Specified by:
- density in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
cumulativeProbability
public double cumulativeProbability(double x)
Algorithm description:
- Find the bin B that x belongs to.
- Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
- Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B.
- Return P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the within-bin kernel distribution function evaluated at x. If K is a constant distribution, return P(B-) + P(B) (counting the full mass of B).
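
For one bin, the returned expression can be sketched as below. This is an illustration under assumptions rather than the library's code: `CdfSketch` is a hypothetical name, the bin endpoints and masses are passed in instead of being looked up, and a logistic kernel replaces the Gaussian kernel so that the kernel distribution function has a closed form.

```java
// Hypothetical single-bin illustration of the cdf interpolation; not part of the library.
final class CdfSketch {

    /** Logistic kernel distribution function (stand-in for the Gaussian kernel). */
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** P(B-) + P(B) * [K(x) - K(B-)] / K(B) for x in the bin with endpoints lo and hi. */
    static double cdf(double x, double lo, double hi, double massBelow, double binMass,
                      double mu, double s) {
        double kLower = kernelCdf(lo, mu, s);      // K(B-): kernel cdf at the lower endpoint
        double kB = kernelCdf(hi, mu, s) - kLower; // K(B): kernel mass inside the bin
        return massBelow + binMass * (kernelCdf(x, mu, s) - kLower) / kB;
    }

    public static void main(String[] args) {
        double lo = 4.0, hi = 7.0, massBelow = 0.4, binMass = 0.3, mu = 5.5, s = 1.0;
        System.out.println(cdf(lo, lo, hi, massBelow, binMass, mu, s));  // equals P(B-)
        System.out.println(cdf(5.5, lo, hi, massBelow, binMass, mu, s)); // strictly between the two
        System.out.println(cdf(hi, lo, hi, massBelow, binMass, mu, s));  // about P(B-) + P(B)
    }
}
```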

- Specified by:
- cumulativeProbability in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
inverseCumulativeProbability
public double inverseCumulativeProbability(double p)
The default implementation returns
- ContinuousDistribution.getSupportLowerBound() for p = 0,
- ContinuousDistribution.getSupportUpperBound() for p = 1.

Algorithm description:
- Find the smallest i such that the sum of the masses of the bins through i is at least p.
- Let B be bin i, and define:
  - K = the within-bin kernel distribution for bin i.
  - K(B) = the mass of B under K.
  - K(B-) = K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
  - P(B) = the probability of bin i.
  - P(B-) = the sum of the bin masses below bin i.
  - pCrit = p - P(B-).
- Return the inverse of K evaluated at K(B-) + pCrit * K(B) / P(B).
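
A compact sketch of these steps follows. It is a hedged illustration, not the library's implementation: `InverseCdfSketch` and its parameters are hypothetical, the cumulative bin masses `cumMass` are assumed to end at 1.0, and a logistic kernel is used in place of the Gaussian kernel because both its distribution function and its inverse have closed forms.

```java
// Hypothetical illustration of the inverse cdf steps; not part of the library.
final class InverseCdfSketch {

    /** Logistic kernel distribution function (stand-in for the Gaussian kernel). */
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Inverse of the logistic kernel distribution function. */
    static double kernelInverseCdf(double q, double mu, double s) {
        return mu + s * Math.log(q / (1.0 - q));
    }

    /**
     * @param min         lower end of the first bin
     * @param upperBounds upper bound of each bin
     * @param cumMass     cumulative bin masses (last entry assumed to be 1.0)
     * @param mu          per-bin kernel location
     * @param s           per-bin kernel scale
     */
    static double inverseCdf(double p, double min, double[] upperBounds, double[] cumMass,
                             double[] mu, double[] s) {
        if (p == 0.0) return min;                                  // support lower bound
        if (p == 1.0) return upperBounds[upperBounds.length - 1];  // support upper bound
        int i = 0;
        while (i < cumMass.length - 1 && cumMass[i] < p) {
            i++;                                                   // smallest i with cumulative mass >= p
        }
        double lo = i == 0 ? min : upperBounds[i - 1];
        double massBelow = i == 0 ? 0.0 : cumMass[i - 1];          // P(B-)
        double binMass = cumMass[i] - massBelow;                   // P(B)
        double kLower = kernelCdf(lo, mu[i], s[i]);                // K(B-)
        double kB = kernelCdf(upperBounds[i], mu[i], s[i]) - kLower; // K(B)
        double pCrit = p - massBelow;
        return kernelInverseCdf(kLower + pCrit * kB / binMass, mu[i], s[i]);
    }

    public static void main(String[] args) {
        double[] upperBounds = {4.0, 7.0, 10.0};
        double[] cumMass = {0.4, 0.7, 1.0};
        double[] mu = {2.5, 5.5, 8.5};
        double[] s = {1.0, 1.0, 1.0};
        // p = 0.55 is halfway through the mass of the middle bin.
        System.out.println(inverseCdf(0.55, 1.0, upperBounds, cumMass, mu, s));
    }
}
```

The argument passed to the kernel inverse rescales pCrit from the bin's share of the total mass to the kernel's share inside the bin, which is the cumulativeProbability formula solved for K(x).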

- Specified by:
- inverseCumulativeProbability in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Overrides:
- inverseCumulativeProbability in class AbstractRealDistribution
- Since:
- 3.1
 
getMean
public double getMean()
- Specified by:
- getMean in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
getVariance
public double getVariance()
- Specified by:
- getVariance in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
getSupportLowerBound
public double getSupportLowerBound()
- Specified by:
- getSupportLowerBound in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
getSupportUpperBound
public double getSupportUpperBound()
- Specified by:
- getSupportUpperBound in interface org.apache.commons.statistics.distribution.ContinuousDistribution
- Since:
- 3.1
 
 