
# Probability and uncertainty analysis

Probability is a mathematical concept that allows predictions to be made in the face of uncertainty. The probabilistic approach on this page defines two types of uncertainty associated with small-scale inherent variability, which commonly occurs at relatively small (meter-length) scales.

## Types of uncertainty associated with small-scale inherent variability

The two types of uncertainty associated with small-scale inherent variability discussed are Measurement Error and Small-scale Geologic Variability.

For both types of uncertainty, it is assumed that there is an underlying population that exactly defines the system of interest. As an example of small-scale inherent variability, the proportion of ripple drift lamination (geologic variability) at any location is a fixed constant, whereas the proportion within the reservoir is the integral of those constant values over the entire reservoir volume. Likewise, the degree of variability in this proportion across all possible locations can be calculated. The data from all possible locations can be assembled into a cumulative frequency distribution. This is rarely accomplished, but when it is, the result is termed a “census” and the knowledge of the population of interest is considered exhaustive.

In the absence of a census, frequency distributions can be approximated using mathematical formulae. Each distribution is completely definable in terms of a few constants (parameters). A Gaussian distribution, for instance, is completely defined by two parameters, population mean (M) and standard deviation (σ). Varying one or the other produces a family of Gaussian distributions that differ in location on the real number line and in spread.
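As a minimal sketch of how two parameters fix a Gaussian distribution, the snippet below evaluates the density for a few mean and standard deviation pairs. The function name `gaussian_pdf` and the porosity-like numbers are illustrative assumptions, not values from this page.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical parameter pairs (porosity-like fractions), for illustration only.
for mean, sigma in [(0.15, 0.02), (0.20, 0.02), (0.20, 0.05)]:
    peak = gaussian_pdf(mean, mean, sigma)  # density at the mean is the peak
    print(f"mean={mean:.2f} sigma={sigma:.2f} peak density={peak:.2f}")
```

Changing the mean shifts the curve along the real number line, and changing the standard deviation widens or narrows it (a larger σ lowers the peak because the total area stays at one).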

### Measurement error

The first type is referred to as measurement error. It is irreducible and generally cannot be perfectly explained. In terms of reservoir geology, for example, it could be the small-scale randomness that is inherent in the reservoir, such as the product of the nonlinear processes of deposition and erosion. The probability of finding a sediment bundle containing one depositional microstructure is simply the ratio of the volume of that microstructure to the volume of the reservoir. If this ratio does not vary systematically across the reservoir, then the volume of the depositional microstructure (say, ripple-laminated sandstone) can be predicted, albeit with the same type of uncertainty as is associated with a coin toss. That is, given a probability of 0.2, the volume represented by this microfacies would be expected to be 0.2 times the local volume of the reservoir; however, we expect the real volume to vary within the limits of our probabilistic estimate. At any point in the reservoir, the random depositional processes will permit the true volume to vary within limits tied to our probability model.
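The coin-toss analogy can be sketched numerically. The example below assumes, purely for illustration, that a local volume consists of 200 independent sediment bundles, each with probability 0.2 of containing the microfacies; the bundle count, the independence assumption, and the function name are hypothetical.

```python
import math
import random

def simulate_proportions(p, n_bundles, n_locations, seed=0):
    """Simulate the microfacies proportion at many locations, treating each
    sediment bundle as an independent Bernoulli trial with probability p."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_locations):
        hits = sum(1 for _ in range(n_bundles) if rng.random() < p)
        proportions.append(hits / n_bundles)
    return proportions

p = 0.2            # probability from the text
n_bundles = 200    # hypothetical number of bundles per local volume
props = simulate_proportions(p, n_bundles, n_locations=2000)

mean_prop = sum(props) / len(props)
# Binomial theory: standard deviation of the local proportion.
expected_sd = math.sqrt(p * (1 - p) / n_bundles)

print(f"mean simulated proportion: {mean_prop:.3f}")
print(f"theoretical spread (1 sd): {expected_sd:.3f}")
print(f"range of simulated values: {min(props):.3f} to {max(props):.3f}")
```

The simulated proportions centre on 0.2 but scatter around it, which is the sense in which the true local volume varies within limits tied to the probability model.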

### Small-scale geologic variability

The second type of uncertainty is the small-scale geologic variability, and it stems from an incomplete sampling of reservoir topology. This can lead to an incomplete knowledge of the connectivity of flow units. The problem here is related to the nature of the flow paths (e.g., the connectivity of permeable and impermeable elements within the reservoir). It is not sufficient to know the probability that a permeable bed will occur in a borehole, nor is it enough to know the bed’s average thickness. We also need to predict the existence and location of “choke points,” permeability restrictions or barriers within the flow paths. These choke points constitute infinitesimal reservoir volume and seldom are documented by borehole data or seismic procedures. Yet, they can be critical to understanding flow because they connect local or even regional barriers. We know of their existence from well tests and pressure data and from our knowledge of outcrop or field analogs.

## Estimating parameters from a sample

Most statistical practice seeks to determine the parameters of a distribution without the cost and effort of a census, and does so by estimating the parameters on the basis of a relatively small set of carefully collected observations (valid sample). An unbiased sample will produce an unbiased estimate of the population parameters; however, it cannot be known with certainty whether any set of sample parameters is identical to the values of the population parameters. Additionally, a collection of such estimates (i.e., statistics) from repeated sampling will be centered on the values of the parameters. The spread of values around the parametric value commonly is an inverse function of the number of observations in a sample.
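A small simulation can illustrate how the spread of repeated estimates shrinks as the sample grows. The sketch below assumes a standard normal population and repeated independent sampling; the function name and the chosen sample sizes are illustrative.

```python
import math
import random
import statistics

def spread_of_sample_means(n_obs, n_repeats=2000, mean=0.0, sigma=1.0, seed=1):
    """Standard deviation of the sample mean over repeated samples of size n_obs."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mean, sigma) for _ in range(n_obs))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)

for n in (4, 25, 100):
    simulated = spread_of_sample_means(n)
    theoretical = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
    print(f"n={n:3d} simulated spread={simulated:.3f} theory={theoretical:.3f}")
```

For independent observations, the spread of the sample mean falls off as σ/√n, so quadrupling the number of observations roughly halves the spread.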

Not all frequency distributions are Gaussian, and the functions that generate estimates of their parameters therefore are different. The Gaussian distribution is special for several reasons, though. First, it is the distribution generated as the product of a simple nonlinear process (e.g., the velocity distribution in turbulent flow) and so is encountered often. Second, the distribution of sample means, whatever the nature of the population distribution, tends to approach the Gaussian as the number of samples increases. Third, statisticians have shown that for a wide range of non-Gaussian distributions, statistical inference of population parameters is robust in that failure of a Gaussian assumption by a surprisingly wide amount does not lead to large errors; however, there exist families of such “pathological” distributions for which common statistical inference never can be robust.

An assumption of an underlying population distribution is attractive because the number of observations needed to estimate the population parameters often is quite small. In addition, procedures exist to determine sample size as a function of the needed precision of the estimate of the parametric values. With this in mind, we now can discuss statistical inference.
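One such procedure, shown here only as a sketch, is the textbook formula for the number of observations needed to estimate a mean to within a margin E at a given confidence level, n = (zσ/E)², assuming a Gaussian population with known σ. The page does not say which procedure it has in mind, so the formula choice and the numbers below are assumptions.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence=0.95):
    """Observations needed so that the sample mean lies within `margin` of
    the population mean at the given two-sided confidence level."""
    if sigma <= 0 or margin <= 0 or not 0 < confidence < 1:
        raise ValueError("sigma and margin must be positive, 0 < confidence < 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical case: porosity with sigma = 0.03, wanted to within 0.01.
print(required_sample_size(sigma=0.03, margin=0.01))
```

Because the required n scales with the square of σ/E, halving the acceptable margin roughly quadruples the number of observations.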

## Statistical inference

Statistics may be described as the science of summarizing large volumes of information using a small set of parameters. Inferential statistics is the science of distinguishing the probable from the possible.[1] [2] In petroleum engineering, for example, reserve estimations often are described probabilistically as proven, probable, or possible (P90, P50, or P10, respectively). The simplest kind of statistical inference is whether two samples are likely to have derived from the same population. The farther apart the sample means are (keeping σ constant), the smaller the chance that they were drawn from the same population. Importantly, though, the likelihood that this is so can be inferred, and probability is what enables the statistician to use information from samples to make such inferences or to describe the population from which the samples were obtained.
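The dependence on the separation of the sample means can be sketched with a two-sample z-test. This assumes a known, common σ and Gaussian sampling, which is a simplification; the function name and the numbers are hypothetical.

```python
import math

def two_sample_z_test(mean_a, mean_b, sigma, n_a, n_b):
    """Two-sided z-test for equal population means with known common sigma.
    Returns the z statistic and the p-value."""
    se = sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)  # standard error of the difference
    z = (mean_a - mean_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical porosity samples: same sigma and size, growing separation.
for mean_b in (0.20, 0.21, 0.22, 0.24):
    z, p = two_sample_z_test(0.20, mean_b, sigma=0.03, n_a=30, n_b=30)
    print(f"difference={mean_b - 0.20:.2f} z={z:+.2f} p={p:.4f}")
```

With σ held constant, the p-value shrinks as the sample means move apart, which quantifies the falling chance that both samples came from the same population.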

Sampling to characterize a frequency distribution is based on a set of assumptions, one of the most important of which is that the samples are mutually independent. If the assumption is violated, statistical inferences can be wrong. For example, traditional statistical inference becomes problematic when samples taken near one another tend to have more similar values than do samples taken farther apart. As the distance between sample locations increases, this dependence decreases until at some threshold distance, the samples become statistically independent. Under such circumstances, the data are said to be spatially correlated. The ability to draw statistical inference from spatially correlated data is the central premise of a subdiscipline known as spatial statistics, or geostatistics.[3]
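The decay of dependence with distance can be illustrated on a synthetic one-dimensional series. The sketch below builds spatially correlated values with a moving average of random noise (an assumption chosen for simplicity, not a geostatistical model from this page) and measures the correlation between values separated by a given lag.

```python
import math
import random
import statistics

def moving_average_series(n, window, seed=2):
    """Synthetic correlated series: each value averages `window` noise terms,
    so values closer together than `window` share some of the same noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n + window - 1)]
    return [statistics.fmean(noise[i:i + window]) for i in range(n)]

def lag_correlation(values, lag):
    """Pearson correlation between values separated by `lag` positions (lag >= 1)."""
    a, b = values[:-lag], values[lag:]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

series = moving_average_series(n=5000, window=10)
for lag in (1, 5, 10, 30):
    print(f"lag={lag:2d} correlation={lag_correlation(series, lag):+.2f}")
```

In this construction the correlation falls roughly linearly with lag and is near zero once the lag reaches the window length, which plays the role of the threshold distance beyond which samples are effectively independent.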

A geostatistical method known as conditional simulation can provide probabilistic information about reservoir properties (e.g., gross rock volume and total and recoverable reserves) as a probability density function (pdf). The conditional simulation results can be summarized in a specific way to determine the probability regarding some aspect of the reservoir (e.g., of exceeding, or not, an economic threshold). These concepts are covered in the Conditional simulation and uncertainty estimation page.
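As a sketch of how a set of simulation results might be summarized, the example below computes the probability of exceeding a threshold and the P90, P50, and P10 values. The realizations are synthetic lognormal draws standing in for conditional simulation output, the threshold is hypothetical, and P90 is taken to be the value exceeded by 90% of realizations, which is an assumed convention.

```python
import random

def exceedance_probability(realizations, threshold):
    """Fraction of realizations that exceed the threshold."""
    return sum(1 for v in realizations if v > threshold) / len(realizations)

def value_exceeded_with_probability(realizations, prob):
    """Nearest-rank value that about a fraction `prob` of realizations
    equal or exceed (e.g. prob=0.9 for P90 under the assumed convention)."""
    ordered = sorted(realizations, reverse=True)
    index = max(0, min(len(ordered) - 1, round(prob * len(ordered)) - 1))
    return ordered[index]

# Synthetic stand-in for recoverable reserves from 1000 realizations.
rng = random.Random(4)
realizations = [rng.lognormvariate(3.0, 0.4) for _ in range(1000)]

threshold = 15.0  # hypothetical economic threshold, same units as the realizations
print(f"P(exceeding threshold) = {exceedance_probability(realizations, threshold):.2f}")
for label, prob in (("P90", 0.9), ("P50", 0.5), ("P10", 0.1)):
    print(f"{label} = {value_exceeded_with_probability(realizations, prob):.1f}")
```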

## Random variables and probability distributions

One of the tasks faced by geoscientists and reservoir engineers is that of estimating a property at a location where there has been no measurement. This requires a model of how the phenomenon behaves at unsampled locations. Without a model, one only has the sample data, and no inference can be made about the values at unsampled locations. The underlying model and its behavior, then, are essential elements of the statistical and geostatistical framework. The geostatistical method clearly identifies the basis of its models,[4] whereas many other estimation methods (e.g., linear regression, inverse distance, or least squares) do not. Furthermore, as with traditional statistics, random variables and their probability distributions are the foundation of the geostatistical method.

### Why a probabilistic approach?

Deterministic models are applicable only when the processes that generated the data are known in enough detail that an accurate description of the entire population can be made from only a few sample values. Unfortunately, though, very few earth science processes are understood so well. Although we know the physics or chemistry of the fundamental processes (e.g., depositional mechanisms, tectonic processes, and diagenetic alterations), the variables we study in earth science data sets are the products of a vast number of complex interactions that are not fully quantifiable. For the great majority of earth-science data sets, we must accept some uncertainty about how the attribute behaves between sample locations.[4] Thus, a probabilistic approach is required. The random function model concept introduced in this section and the next recognizes this fundamental uncertainty and provides the tools not only to estimate values at unsampled locations, but also to measure the reliability of such estimates.

### Random variable defined

A random variable is a numerical function defined over a sample space, whose values are generated randomly according to some probabilistic mechanism.[2] [4] Throwing a die, for example, produces values randomly from the set {1, 2, 3, 4, 5, 6}. A coin toss also produces numbers randomly. If we designate “heads” as zero and “tails” as one, then we can draw randomly from the set {0, 1}. The set of outcomes and their corresponding probabilities is known as the probability law or the probability distribution. There are two classes of random variables, and their distinction is based on the sample interval associated with the measurement. The two classes are the discrete random variable and the continuous random variable.[1]

A discrete random variable is easily identified by examining the number and nature of the values it assumes. If the variable can assume only a finite or a countable infinity of values, it must be discrete. In most practical problems, discrete random variables represent count or classified data, such as point counts of minerals in a thin section or in a facies classification. The die throw and coin toss generate discrete random variables. The probability distribution of a discrete random variable is a formula, table, or graph that provides the probability associated with each value of the discrete random variable. There are four common discrete random variable probability distributions:

• Binomial
• Negative binomial
• Poisson
• Hypergeometric
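Two of these distributions can be written down directly as formulas. The sketch below evaluates binomial and Poisson probabilities with the standard library; the point-count scenario (10 grains, probability 0.2 of a given mineral) is a hypothetical illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """Probability of exactly k events when the expected count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)

# Hypothetical point count: 10 grains, each with probability 0.2 of being
# the mineral of interest. The table of probabilities is the distribution.
n, p = 10, 0.2
for k in range(n + 1):
    print(f"P(count={k:2d}) = {binomial_pmf(k, n, p):.4f}")
```

The printed table is an instance of the formula-or-table description above: each possible count is paired with its probability, and the probabilities sum to one.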

Continuous random variables are associated with sample spaces that represent the infinitely large number of sample points contained on a line interval. The probabilistic model for the frequency distribution of a continuous random variable uses a mathematically defined curve, usually smooth, that is called the probability density function (pdf). Although these distributions assume a variety of shapes, the curves for many random variables observed in nature approximate a bell shape. A variety of terms commonly are used to describe this bell-shaped curve. Practitioners could say that such curves are bell-shaped, Gaussian, or normal in their distribution. The terms are synonymous, and they informally refer only to the shape of the distribution. Most of the variables used in reservoir modeling (e.g., porosity, permeability, thickness) are continuous random variables, so it is important to describe their pdf.

### Frequency distributions of continuous variables

Frequency distributions of continuous random variables follow a theoretical pdf that can be represented by a continuous curve that can have a variety of shapes; however, rather than displaying the functions as curves, the distributions most often are displayed as histograms constructed from the data. Many statistical methods, including some geostatistical ones, are based on the frequent supposition that random variables follow a normal distribution. The central limit theorem (CLT) is the foundation of the normal pdf and warrants some discussion.

The CLT states that under general conditions, as the sample size increases, the sums and means of samples drawn from a population of any distribution will approximate a normal distribution.[2] [3] The significance of the CLT is twofold. First, it explains why some measurements tend to approximate a normal distribution. Second, and more importantly, it simplifies the use of statistical inference and makes it more precise. Many algorithms used to make estimations or simulations require knowledge about the pdf. If we can predict its shape or behavior accurately using only a few descriptive statistics that are representative of the population, then our estimates that are based on such predictions should be reliable. Where the normal approximation holds, knowing only the sample mean (m) and sample standard deviation (σ) is enough to recreate the pdf closely.

The difficulty with the CLT, and with most approximation methods, is that we must have some idea of how large the sample size must be for the approximation to yield useful results. Unfortunately, this leads to circular reasoning. There is no clear-cut way to know the proper number of samples because knowing that depends on knowing the true population pdf in advance; hence, we assume that the CLT applies, and fortunately, as a practical matter, the approximation does tend to behave well, even for small samples.
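The behavior of the CLT for small samples can be checked by simulation. The sketch below draws samples from a strongly skewed exponential population (an assumption chosen for illustration) and records how often the sample mean falls within one standard error of the population mean; for a normal distribution that fraction is about 68%.

```python
import math
import random
import statistics

def fraction_within_one_se(n_obs, n_repeats=5000, seed=3):
    """Fraction of sample means within one standard error of the population
    mean, for samples of size n_obs from an exponential population (rate 1)."""
    rng = random.Random(seed)
    pop_mean, pop_sd = 1.0, 1.0          # exponential with rate 1
    se = pop_sd / math.sqrt(n_obs)       # standard error of the mean
    inside = 0
    for _ in range(n_repeats):
        m = statistics.fmean(rng.expovariate(1.0) for _ in range(n_obs))
        if abs(m - pop_mean) < se:
            inside += 1
    return inside / n_repeats

for n in (1, 5, 30):
    print(f"n={n:2d} fraction within one standard error: {fraction_within_one_se(n):.3f}")
```

For single observations the fraction is far from the normal value because the population is skewed, and it moves toward roughly 0.68 as the sample size grows.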

### Properties of the normal distribution

The histogram of a normal distribution is symmetrical about the mean. Therefore, the mean, median, and mode of the normal distribution occur at the same point. This histogram is referred to as the normal frequency distribution. The following percentages of the total area of the normal frequency distribution lie within these limits:

• m ± σ contains 68.27% of the data
• m ± 2σ contains 95.45% of the data
• m ± 3σ contains 99.73% of the data

Directly calculating any portion of the area under the normal curve requires an integration of the normal distribution function. Fortunately, for those of us who have forgotten our calculus, this integration is available in tabular form.[3] [5]
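As an alternative to tables, the integral can be evaluated with the error function, since the area within k standard deviations of the mean is erf(k/√2). The sketch below uses Python's standard library; the function name is illustrative.

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"m ± {k}σ contains {100.0 * area_within(k):.2f}% of the data")
```

The same function gives the area for any other multiple of σ, for example `area_within(1.96)` for the limits used in a 95% interval.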

### Application of the normal distribution

The normal frequency distribution is the most widely used distribution in statistics. There are three important applications of this distribution.[3]

1. To determine whether a given sample is, in fact, normally distributed before applying certain tests. Most geostatistical simulation methods require that the data have a normal distribution. If they do not, the simulation results can be inaccurate and a transformation is required. To determine whether a sample comes from a normal distribution, we must calculate the expected frequencies for a normal curve of the same m and σ, and then compare them with the observed frequencies.
2. To test underlying hypotheses about the nature of a phenomenon being studied.
3. To make reliable predictions about the phenomenon. For geoscientists, this produces a better or an unbiased estimation of reservoir properties between the well data.
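The comparison of observed and expected frequencies in the first application can be sketched as follows. The example bins a sample at whole multiples of σ around m, computes the counts a normal curve with the same m and σ would predict, and summarizes the mismatch with a chi-square statistic. The binning scheme, the chi-square summary, and the synthetic samples are assumptions for illustration; the page does not prescribe a specific test.

```python
import random
from statistics import NormalDist, fmean, stdev

def observed_and_expected(sample):
    """Observed counts and normal-curve expected counts in bins bounded by
    m - 3s, ..., m + 3s, with open-ended tails, using the sample m and s."""
    m, s = fmean(sample), stdev(sample)
    dist = NormalDist(m, s)
    edges = [float("-inf")] + [m + k * s for k in range(-3, 4)] + [float("inf")]
    n = len(sample)
    observed, expected = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        observed.append(sum(1 for v in sample if lo <= v < hi))
        expected.append(n * (dist.cdf(hi) - dist.cdf(lo)))
    return observed, expected

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

rng = random.Random(5)
normal_sample = [rng.gauss(0.2, 0.03) for _ in range(1000)]   # synthetic, normal
skewed_sample = [rng.expovariate(1.0) for _ in range(1000)]   # synthetic, skewed

for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    obs, exp = observed_and_expected(sample)
    print(f"{name}: chi-square = {chi_square_statistic(obs, exp):.1f}")
```

A small statistic indicates that the observed frequencies track the normal curve, while a large one, as for the skewed sample, suggests that a transformation may be needed before simulation.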
