This chapter describes the statistical functions in the library. The
basic statistical functions include routines to compute the mean,
variance and standard deviation. More advanced functions allow you to
calculate absolute deviations, skewness, and kurtosis as well as the
median and arbitrary percentiles. The algorithms use recurrence
relations to compute average quantities in a stable way, without large
intermediate values that might overflow.
The functions are available in versions for datasets in the standard
floating-point and integer types. The versions for double precision
floating-point data have the prefix gsl_stats and are declared in the
header file `gsl_statistics_double.h'. The versions for integer data
have the prefix gsl_stats_int and are declared in the header file
`gsl_statistics_int.h'.
- Statistics: double gsl_stats_mean (const double data[], size_t stride, size_t n)
-
This function returns the arithmetic mean of data, a dataset of
length n with stride stride. The arithmetic mean, or
sample mean, is denoted by \Hat\mu and defined as,
\Hat\mu = (1/N) \sum x_i
where x_i are the elements of the dataset data. For
samples drawn from a gaussian distribution the variance of
\Hat\mu is \sigma^2 / N.
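For example, here is a short sketch of using the stride argument to
average every other element of a larger array (the array contents are
illustrative only):

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  /* A stride of 2 selects elements 0, 2 and 4 of the array. */
  double data[6] = {1.0, 99.0, 2.0, 99.0, 3.0, 99.0};
  double mu = gsl_stats_mean (data, 2, 3);
  printf ("mean = %g\n", mu);   /* prints 2 */
  return 0;
}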
- Statistics: double gsl_stats_variance (const double data[], size_t stride, size_t n)
-
This function returns the estimated, or sample, variance of
data, a dataset of length n with stride stride. The
estimated variance is denoted by \Hat\sigma^2 and is defined by,
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - \Hat\mu)^2
where x_i are the elements of the dataset data. Note that
the normalization factor of 1/(N-1) results from the derivation
of \Hat\sigma^2 as an unbiased estimator of the population
variance \sigma^2. For samples drawn from a gaussian distribution
the variance of \Hat\sigma^2 itself is 2 \sigma^4 / N.
This function computes the mean via a call to gsl_stats_mean. If you
have already computed the mean then you can pass it directly to
gsl_stats_variance_m.
- Statistics: double gsl_stats_variance_m (const double data[], size_t stride, size_t n, double mean)
-
This function returns the sample variance of data relative to the
given value of mean. The function is computed with \Hat\mu
replaced by the value of mean that you supply,
\Hat\sigma^2 = (1/(N-1)) \sum (x_i - mean)^2
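For example, here is a minimal sketch of that pattern, computing the
mean once and reusing it (the data values are illustrative only):

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  /* Compute the mean once and reuse it for the variance. */
  double mu  = gsl_stats_mean (data, 1, 5);
  double var = gsl_stats_variance_m (data, 1, 5, mu);
  printf ("mean = %g, variance = %g\n", mu, var);
  return 0;
}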
- Statistics: double gsl_stats_sd (const double data[], size_t stride, size_t n)
-
- Statistics: double gsl_stats_sd_m (const double data[], size_t stride, size_t n, double mean)
-
The standard deviation is defined as the square root of the variance.
These functions return the square root of the corresponding variance
functions above.
- Statistics: double gsl_stats_variance_with_fixed_mean (const double data[], size_t stride, size_t n, double mean)
-
This function computes an unbiased estimate of the variance of
data when the population mean mean of the underlying
distribution is known a priori. In this case the estimator for
the variance uses the factor 1/N and the sample mean
\Hat\mu is replaced by the known population mean \mu,
\Hat\sigma^2 = (1/N) \sum (x_i - \mu)^2
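For example, if the samples are known a priori to come from a
distribution with population mean zero (a hypothetical assumption for
this sketch), the estimator can be used as follows:

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  /* Hypothetical residuals whose population mean is known to be 0. */
  double r[4] = {0.3, -0.1, 0.4, -0.6};
  double var = gsl_stats_variance_with_fixed_mean (r, 1, 4, 0.0);
  printf ("variance about the known mean = %g\n", var);
  return 0;
}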
- Statistics: double gsl_stats_sd_with_fixed_mean (const double data[], size_t stride, size_t n, double mean)
-
This function calculates the standard deviation of data for a
fixed population mean mean. The result is the square root of the
corresponding variance function.
- Statistics: double gsl_stats_absdev (const double data[], size_t stride, size_t n)
-
This function computes the absolute deviation from the mean of
data, a dataset of length n with stride stride. The
absolute deviation from the mean is defined as,
absdev = (1/N) \sum |x_i - \Hat\mu|
where x_i are the elements of the dataset data. The
absolute deviation from the mean provides a more robust measure of the
width of a distribution than the variance. This function computes the
mean of data via a call to gsl_stats_mean.
- Statistics: double gsl_stats_absdev_m (const double data[], size_t stride, size_t n, double mean)
-
This function computes the absolute deviation of the dataset data
relative to the given value of mean,
absdev = (1/N) \sum |x_i - mean|
This function is useful if you have already computed the mean of
data (and want to avoid recomputing it), or wish to calculate the
absolute deviation relative to another value (such as zero, or the
median).
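For example, here is a sketch of computing the absolute deviation about
the median rather than the mean, using gsl_sort and
gsl_stats_median_from_sorted_data described later in this chapter (the
data values are illustrative only):

#include <stdio.h>
#include <gsl/gsl_sort.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  double median, mad;

  gsl_sort (data, 1, 5);   /* the median requires sorted data */
  median = gsl_stats_median_from_sorted_data (data, 1, 5);
  mad    = gsl_stats_absdev_m (data, 1, 5, median);

  printf ("absolute deviation about the median = %g\n", mad);
  return 0;
}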
- Statistics: double gsl_stats_skew (const double data[], size_t stride, size_t n)
-
This function computes the skewness of data, a dataset of length
n with stride stride. The skewness is defined as,
skew = (1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^3
where x_i are the elements of the dataset data. The skewness
measures the asymmetry of the tails of a distribution.
The function computes the mean and estimated standard deviation of
data via calls to gsl_stats_mean and gsl_stats_sd.
- Statistics: double gsl_stats_skew_m_sd (const double data[], size_t stride, size_t n, double mean, double sd)
-
This function computes the skewness of the dataset data using the
given values of the mean mean and standard deviation sd,
skew = (1/N) \sum ((x_i - mean)/sd)^3
This function is useful if you have already computed the mean and
standard deviation of data and want to avoid recomputing them.
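For example, a minimal sketch of this pattern (the data values are
illustrative only):

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  /* Compute the mean and standard deviation once, then reuse them. */
  double mu   = gsl_stats_mean (data, 1, 5);
  double sd   = gsl_stats_sd_m (data, 1, 5, mu);
  double skew = gsl_stats_skew_m_sd (data, 1, 5, mu, sd);
  printf ("skew = %g\n", skew);
  return 0;
}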
- Statistics: double gsl_stats_kurtosis (const double data[], size_t stride, size_t n)
-
This function computes the kurtosis of data, a dataset of length
n with stride stride. The kurtosis is defined as,
kurtosis = ((1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^4) - 3
The kurtosis measures how sharply peaked a distribution is, relative to
its width. The kurtosis is normalized to zero for a gaussian
distribution.
- Statistics: double gsl_stats_kurtosis_m_sd (const double data[], size_t stride, size_t n, double mean, double sd)
-
This function computes the kurtosis of the dataset data using the
given values of the mean mean and standard deviation sd,
kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3
This function is useful if you have already computed the mean and
standard deviation of data and want to avoid recomputing them.
- Function: double gsl_stats_lag1_autocorrelation (const double data[], const size_t stride, const size_t n)
-
This function computes the lag-1 autocorrelation of the dataset data.
a_1 = (\sum_{i = 2}^{n} (x_i - \Hat\mu) (x_{i-1} - \Hat\mu))
      / (\sum_{i = 1}^{n} (x_i - \Hat\mu) (x_i - \Hat\mu))
- Function: double gsl_stats_lag1_autocorrelation_m (const double data[], const size_t stride, const size_t n, const double mean)
-
This function computes the lag-1 autocorrelation of the dataset
data using the given value of the mean mean.
- Function: double gsl_stats_covariance (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n)
-
This function computes the covariance of the datasets data1 and
data2 which must both be of the same length n.
covar = (1/(n - 1)) \sum_{i = 1}^{n} (x_i - \Hat x) (y_i - \Hat y)
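For example, here is a minimal sketch with two illustrative datasets of
equal length:

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double x[4] = {1.0, 2.0, 3.0, 4.0};
  double y[4] = {2.1, 3.9, 6.2, 8.0};
  double c = gsl_stats_covariance (x, 1, y, 1, 4);
  printf ("covariance = %g\n", c);
  return 0;
}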
- Function: double gsl_stats_covariance_m (const double data1[], const size_t stride1, const double data2[], const size_t stride2, const size_t n, const double mean1, const double mean2)
-
This function computes the covariance of the datasets data1 and
data2 using the given values of the means, mean1 and
mean2.
The functions described in this section allow the computation of
statistics for weighted samples. The functions accept an array of
samples, x_i, with associated weights, w_i. Each sample
x_i is considered as having been drawn from a Gaussian
distribution with variance \sigma_i^2. The sample weight
w_i is defined as the reciprocal of this variance, w_i =
1/\sigma_i^2. Setting a weight to zero corresponds to removing a
sample from a dataset.
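For example, if each measurement x_i has a known error \sigma_i then a
weighted mean can be formed with weights w_i = 1/\sigma_i^2; the
measurements and errors in this sketch are illustrative only:

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  /* Three measurements with individual errors sigma_i;
     the weights are w_i = 1/sigma_i^2. */
  double x[3] = {10.2, 9.8, 10.5};
  double s[3] = {0.1, 0.2, 0.5};
  double w[3];
  size_t i;

  for (i = 0; i < 3; i++)
    w[i] = 1.0 / (s[i] * s[i]);

  printf ("weighted mean = %g\n", gsl_stats_wmean (w, 1, x, 1, 3));
  return 0;
}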
- Statistics: double gsl_stats_wmean (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
This function returns the weighted mean of the dataset data with
stride stride and length n, using the set of weights w
with stride wstride and length n. The weighted mean is defined as,
\Hat\mu = (\sum w_i x_i) / (\sum w_i)
- Statistics: double gsl_stats_wvariance (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
This function returns the estimated variance of the dataset data
with stride stride and length n, using the set of weights
w with stride wstride and length n. The estimated
variance of a weighted dataset is defined as,
\Hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2)))
\sum w_i (x_i - \Hat\mu)^2
Note that this expression reduces to an unweighted variance with the
familiar 1/(N-1) factor when there are N equal non-zero
weights.
- Statistics: double gsl_stats_wvariance_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
-
This function returns the estimated variance of the weighted dataset
data using the given weighted mean wmean.
- Statistics: double gsl_stats_wsd (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
The standard deviation is defined as the square root of the variance.
This function returns the square root of the corresponding variance
function gsl_stats_wvariance above.
- Statistics: double gsl_stats_wsd_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
-
This function returns the square root of the corresponding variance
function gsl_stats_wvariance_m above.
- Statistics: double gsl_stats_wvariance_with_fixed_mean (const double w[], size_t wstride, const double data[], size_t stride, size_t n, const double mean)
-
This function computes an unbiased estimate of the variance of weighted
dataset data when the population mean mean of the underlying
distribution is known a priori. In this case the estimator for
the variance replaces the sample mean \Hat\mu by the known
population mean \mu,
\Hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)
- Statistics: double gsl_stats_wsd_with_fixed_mean (const double w[], size_t wstride, const double data[], size_t stride, size_t n, const double mean)
-
The standard deviation is defined as the square root of the variance.
This function returns the square root of the corresponding variance
function above.
- Statistics: double gsl_stats_wabsdev (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
This function computes the weighted absolute deviation from the weighted
mean of data. The absolute deviation from the mean is defined as,
absdev = (\sum w_i |x_i - \Hat\mu|) / (\sum w_i)
- Statistics: double gsl_stats_wabsdev_m (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean)
-
This function computes the absolute deviation of the weighted dataset
data about the given weighted mean wmean.
- Statistics: double gsl_stats_wskew (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
This function computes the weighted skewness of the dataset data.
skew = (\sum w_i ((x_i - xbar)/\sigma)^3) / (\sum w_i)
- Statistics: double gsl_stats_wskew_m_sd (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean, double wsd)
-
This function computes the weighted skewness of the dataset data
using the given values of the weighted mean and weighted standard
deviation, wmean and wsd.
- Statistics: double gsl_stats_wkurtosis (const double w[], size_t wstride, const double data[], size_t stride, size_t n)
-
This function computes the weighted kurtosis of the dataset data.
kurtosis = ((\sum w_i ((x_i - xbar)/\sigma)^4) / (\sum w_i)) - 3
- Statistics: double gsl_stats_wkurtosis_m_sd (const double w[], size_t wstride, const double data[], size_t stride, size_t n, double wmean, double wsd)
-
This function computes the weighted kurtosis of the dataset data
using the given values of the weighted mean and weighted standard
deviation, wmean and wsd.
- Statistics: double gsl_stats_max (const double data[], size_t stride, size_t n)
-
This function returns the maximum value in data, a dataset of
length n with stride stride. The maximum value is defined
as the value of the element x_i which satisfies
x_i >= x_j for all j.
If you want instead to find the element with the largest absolute
magnitude you will need to apply fabs or abs to your data before
calling this function.
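For example, here is a sketch of finding the largest absolute magnitude
by applying fabs to a copy of the data first (the data values are
illustrative only):

#include <math.h>
#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[4] = {-7.5, 2.0, 3.5, -1.0};
  double absdata[4];
  size_t i;

  /* Take absolute values before calling gsl_stats_max,
     so that -7.5 is reported as the largest magnitude. */
  for (i = 0; i < 4; i++)
    absdata[i] = fabs (data[i]);

  printf ("largest magnitude = %g\n", gsl_stats_max (absdata, 1, 4));
  return 0;
}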
- Statistics: double gsl_stats_min (const double data[], size_t stride, size_t n)
-
This function returns the minimum value in data, a dataset of
length n with stride stride. The minimum value is defined
as the value of the element x_i which satisfies
x_i <= x_j for all j.
If you want instead to find the element with the smallest absolute
magnitude you will need to apply fabs or abs to your data before
calling this function.
- Statistics: void gsl_stats_minmax (double * min, double * max, const double data[], size_t stride, size_t n)
-
This function finds both the minimum and maximum values min,
max in data in a single pass.
- Statistics: size_t gsl_stats_max_index (const double data[], size_t stride, size_t n)
-
This function returns the index of the maximum value in data, a
dataset of length n with stride stride. The maximum value is
defined as the value of the element x_i which satisfies
x_i >= x_j for all j. When there are several equal maximum
elements then the first one is chosen.
- Statistics: size_t gsl_stats_min_index (const double data[], size_t stride, size_t n)
-
This function returns the index of the minimum value in data, a
dataset of length n with stride stride. The minimum value
is defined as the value of the element x_i which satisfies
x_i <= x_j for all j. When there are several equal
minimum elements then the first one is chosen.
- Statistics: void gsl_stats_minmax_index (size_t * min_index, size_t * max_index, const double data[], size_t stride, size_t n)
-
This function returns the indexes min_index, max_index of
the minimum and maximum values in data in a single pass.
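For example, a minimal sketch (the data values are illustrative only):

#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  size_t imin, imax;

  gsl_stats_minmax_index (&imin, &imax, data, 1, 5);
  printf ("min %g at index %zu, max %g at index %zu\n",
          data[imin], imin, data[imax], imax);
  return 0;
}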
The median and percentile functions described in this section operate on
sorted data. For convenience we use quantiles, measured on a scale
of 0 to 1, instead of percentiles (which use a scale of 0 to 100).
- Statistics: double gsl_stats_median_from_sorted_data (const double sorted_data[], size_t stride, size_t n)
-
This function returns the median value of sorted_data, a dataset
of length n with stride stride. The elements of the array
must be in ascending numerical order. There are no checks to see
whether the data are sorted, so the function gsl_sort should always be
used first.
When the dataset has an odd number of elements the median is the value
of element (n-1)/2. When the dataset has an even number of
elements the median is the mean of the two nearest middle values,
elements (n-1)/2 and n/2. Since the algorithm for
computing the median involves interpolation this function always returns
a floating-point number, even for integer data types.
- Statistics: double gsl_stats_quantile_from_sorted_data (const double sorted_data[], size_t stride, size_t n, double f)
-
This function returns a quantile value of sorted_data, a
double-precision array of length n with stride stride. The
elements of the array must be in ascending numerical order. The
quantile is determined by f, a fraction between 0 and 1. For
example, to compute the value of the 75th percentile f should have
the value 0.75.
There are no checks to see whether the data are sorted, so the function
gsl_sort should always be used first.
The quantile is found by interpolation, using the formula
quantile = (1 - \delta) x_i + \delta x_{i+1}
where i is floor((n - 1)f) and \delta is (n-1)f - i. Thus the minimum
value of the array (data[0*stride]) is given by f equal to zero, the
maximum value (data[(n-1)*stride]) is given by f equal to one and the
median value is given by f equal to 0.5. Since the algorithm for
computing quantiles involves interpolation this function always returns
a floating-point number, even for integer data types.
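For example, applying this formula to the sorted dataset of the second
example program below (12.6, 16.5, 17.2, 18.1, 18.3) with f = 0.3 gives
(n-1)f = 1.2, so i = 1 and \delta = 0.2, giving
quantile = 0.8 x_1 + 0.2 x_2 = 0.8 (16.5) + 0.2 (17.2) = 16.64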
Here is a basic example of how to use the statistical functions:
#include <stdio.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  double mean, variance, largest, smallest;

  mean     = gsl_stats_mean (data, 1, 5);
  variance = gsl_stats_variance (data, 1, 5);
  largest  = gsl_stats_max (data, 1, 5);
  smallest = gsl_stats_min (data, 1, 5);

  printf ("The dataset is %g, %g, %g, %g, %g\n",
          data[0], data[1], data[2], data[3], data[4]);
  printf ("The sample mean is %g\n", mean);
  printf ("The estimated variance is %g\n", variance);
  printf ("The largest value is %g\n", largest);
  printf ("The smallest value is %g\n", smallest);

  return 0;
}
The program should produce the following output,
The dataset is 17.2, 18.1, 16.5, 18.3, 12.6
The sample mean is 16.54
The estimated variance is 5.373
The largest value is 18.3
The smallest value is 12.6
Here is an example using sorted data,
#include <stdio.h>
#include <gsl/gsl_sort.h>
#include <gsl/gsl_statistics.h>

int
main (void)
{
  double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
  double median, upperq, lowerq;

  printf ("Original dataset: %g, %g, %g, %g, %g\n",
          data[0], data[1], data[2], data[3], data[4]);

  gsl_sort (data, 1, 5);

  printf ("Sorted dataset: %g, %g, %g, %g, %g\n",
          data[0], data[1], data[2], data[3], data[4]);

  median = gsl_stats_median_from_sorted_data (data, 1, 5);
  upperq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.75);
  lowerq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.25);

  printf ("The median is %g\n", median);
  printf ("The upper quartile is %g\n", upperq);
  printf ("The lower quartile is %g\n", lowerq);

  return 0;
}
This program should produce the following output,
Original dataset: 17.2, 18.1, 16.5, 18.3, 12.6
Sorted dataset: 12.6, 16.5, 17.2, 18.1, 18.3
The median is 17.2
The upper quartile is 18.1
The lower quartile is 16.5
The standard reference for almost any topic in statistics is the
multi-volume Advanced Theory of Statistics by Kendall and Stuart.
-
Maurice Kendall, Alan Stuart, and J. Keith Ord.
The Advanced Theory of Statistics (multiple volumes)
reprinted as Kendall's Advanced Theory of Statistics.
Wiley, ISBN 047023380X.
Many statistical concepts can be more easily understood by a Bayesian
approach. The following book by Gelman, Carlin, Stern and Rubin gives a
comprehensive coverage of the subject.
-
Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin.
Bayesian Data Analysis.
Chapman & Hall, ISBN 0412039915.
For physicists the Particle Data Group provides useful reviews of
Probability and Statistics in the "Mathematical Tools" section of its
Annual Review of Particle Physics.
-
Review of Particle Properties
R.M. Barnett et al., Physical Review D54, 1 (1996)
The Review of Particle Physics is available online at
http://pdg.lbl.gov/.