Three Paradigms of Inverse Problems: Algebraic vs. Frequentist vs. Bayesian Inference

Consider a sensor that measures physical parameters like temperature, pressure, or velocity. The sensor introduces perturbations and noise; hence, one key problem is the optimal inference of the parameter $\boldsymbol{x}$ from the measurement $\boldsymbol{y}$ delivered by the sensor. Such parameter inference appears in many research areas, including telecommunications, finance, medicine, and the social sciences. When we speak about inference, we have to ask: What is optimal inference? Which criterion should we use? A natural criterion is the inference error, but how should that error be defined? In the following, I address these questions and give an overview.

The Forward Model Describes the Sensing

First, we need a model of the measurement procedure. The forward model maps the parameter $\boldsymbol{x}$ to the measurement $\boldsymbol{y}$, namely,

\[\boldsymbol{y} = \boldsymbol{y}(\boldsymbol{x})~.\]

With a slight abuse of notation, we use $\boldsymbol{y}$ both for the measurement vector and, as $\boldsymbol{y}(\boldsymbol{x})$, for the vector-valued forward function.
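
As a concrete illustration, the sketch below assumes a linear forward model $\boldsymbol{y} = \boldsymbol{H}\boldsymbol{x}$; the sensor matrix $\boldsymbol{H}$ and the parameter values are hypothetical and chosen only for demonstration.

```python
import numpy as np

# Minimal sketch of a deterministic forward model: a linear map y = H x.
# The sensor matrix H and the parameter x_true are hypothetical values.
H = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.3, 0.7]])        # 3 measurements of 2 parameters

def forward(x):
    """Map the parameter vector x to the measurement vector y."""
    return H @ x

x_true = np.array([2.0, -1.0])
y = forward(x_true)               # the (noise-free) sensor output
print(y)
```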

I compare inference approaches for three distinct models:

  • Both parameter and measurement are deterministic.
  • The parameter is deterministic, but the measurement is random. The randomness is introduced by noise or by a lack of knowledge.
  • Both parameter and measurement are random. This allows us to model statistical knowledge of the parameter.

Usually, estimation refers to the inference of real- or complex-valued parameters from real- or complex-valued measurements, whereas detection uses real- or complex-valued measurements to infer parameters (symbols) from a finite alphabet.
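
To make the distinction concrete, the following sketch contrasts the two for a noisy scalar measurement; the alphabet, symbol, and noise level are illustrative assumptions.

```python
import numpy as np

# Estimation vs. detection for a noisy scalar measurement y = x + v.
# The alphabet, symbol, and noise level below are illustrative assumptions.
rng = np.random.default_rng(0)
alphabet = np.array([-1.0, 1.0])   # finite symbol alphabet (binary symbols)
x_symbol = 1.0                     # true transmitted symbol
y = x_symbol + 0.3 * rng.standard_normal()

x_estimate = y                                          # estimation: real-valued result
x_detected = alphabet[np.argmin(np.abs(alphabet - y))]  # detection: nearest alphabet symbol
print(x_estimate, x_detected)
```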

The Loss Describes the Inference Error

A loss function defines the inference error between the estimate $\hat{\boldsymbol{x}}(\boldsymbol{y})$ and the true parameter vector $\boldsymbol{x}$. Popular choices of the loss function are the []{#square-error-loss} square error

\[\ell^{\mathrm{SE}}(\hat{\boldsymbol{x}},\boldsymbol{x}) = || \hat{\boldsymbol{x}}(\boldsymbol{y}) - \boldsymbol{x} ||^2~,\]

the hit-or-miss error

\[\ell^{\mathrm{HoM}}(\hat{\boldsymbol{x}},\boldsymbol{x}) =1_{|| \hat{\boldsymbol{x}}(\boldsymbol{y}) - \boldsymbol{x} || > \epsilon}~, \quad \epsilon > 0~,\]

for a continuous random vector $\boldsymbol{x} \in \mathbb{R}^N$ or

\[\ell^{\mathrm{HoM}}(\hat{\boldsymbol{x}},\boldsymbol{x}) = 1_{ \hat{\boldsymbol{x}}(\boldsymbol{y}) \neq \boldsymbol{x} } \]

for $\boldsymbol{x}$ in a finite alphabet, and the absolute error

\[\ell^{\mathrm{abs}}(\hat{\boldsymbol{x}},\boldsymbol{x}) = \sum_{n=1}^{N} | \hat{x}_n - x_n |~,\]

where the indicator function $1_{A}$ equals one if the condition $A$ is true and zero otherwise, and the scalar $x_n$ denotes the $n$th element of $\boldsymbol{x}$.
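
The three loss functions are straightforward to write down in code. The sketch below evaluates them for an assumed estimate and true parameter; the threshold $\epsilon$ and the vectors are arbitrary illustrative choices.

```python
import numpy as np

def squared_error(x_hat, x):
    """Square-error loss: squared Euclidean distance between estimate and truth."""
    return np.sum((x_hat - x) ** 2)

def hit_or_miss(x_hat, x, eps=0.1):
    """Hit-or-miss loss for continuous x: 1 if the error norm exceeds eps, else 0."""
    return float(np.linalg.norm(x_hat - x) > eps)

def absolute_error(x_hat, x):
    """Absolute-error loss: sum of element-wise absolute errors."""
    return np.sum(np.abs(x_hat - x))

x_true = np.array([1.0, 2.0])      # illustrative true parameter
x_hat = np.array([1.2, 1.9])       # illustrative estimate
print(squared_error(x_hat, x_true),
      hit_or_miss(x_hat, x_true),
      absolute_error(x_hat, x_true))
```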

The loss function weights the error and should be chosen carefully. The square-error loss weights large errors more heavily than small ones, the hit-or-miss loss weights errors independently of their magnitude, and the absolute-error loss grows linearly with the error. John D. Cook presented a nice example.

There are three main motivations for the square-error loss:

  • The square of a signal represents power; here it is the power of the error.
  • The square stems from the exponent of the Gaussian probability density (although often no Gaussian assumption is made).
  • A closed-form solution often exists.

What is Optimal?

There is no unique optimality criterion for an estimator or detector. Therefore, we focus on a popular criterion: we seek an estimator $\hat{\boldsymbol{x}}(\boldsymbol{y})$ that minimizes the risk

\[ \hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}} R~.\]

In short, the risk is the expected loss. The word expected has a different meaning for each probabilistic description of the forward model.

Algebraic Inference

 First, we consider a deterministic parameter $\boldsymbol{x}$ and a deterministic forward model $\boldsymbol{y}(\boldsymbol{x})$. If the mapping $\boldsymbol{y}(\boldsymbol{x})$  is bijective, then an inverse exists and $\boldsymbol{x} = \boldsymbol{y}^{-1}(\boldsymbol{y})$. 

If the mapping $\boldsymbol{y}(\boldsymbol{x})$ is not bijective, then we search for a parameter $\boldsymbol{x}$ that fulfills our criterion of a minimal risk, here $R = \ell$. One prominent loss is the square error $\ell^{\mathrm{SE}}(\hat{\boldsymbol{x}},\boldsymbol{x})$ between the estimate $\hat{\boldsymbol{x}}$ and the true parameter $\boldsymbol{x}$.

An estimator is obtained by 

\[\hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}} R = \arg\min_{\boldsymbol{x}} \ell(\hat{\boldsymbol{x}},\boldsymbol{x})~.\]

The result for $\ell = \ell^{\mathrm{SE}}$ is a least-squares solution using the pseudo-inverse. Replacing the $L_2$-norm in $\ell^{\mathrm{SE}}(\hat{\boldsymbol{x}},\boldsymbol{x})$ by a weighted norm leads to weighted least-squares solutions.
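
For a linear forward model $\boldsymbol{y} = \boldsymbol{H}\boldsymbol{x}$, the least-squares estimate can be computed with the pseudo-inverse. The sketch below reuses a hypothetical sensor matrix and weight matrix; it is an illustration under these assumptions, not a general recipe for ill-conditioned problems.

```python
import numpy as np

# Least-squares solution via the pseudo-inverse for a linear forward model y = H x.
# H and x_true are hypothetical; the system is overdetermined, so no exact inverse exists.
H = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.3, 0.7]])
x_true = np.array([2.0, -1.0])
y = H @ x_true

x_hat = np.linalg.pinv(H) @ y      # minimizes || y - H x ||^2
print(x_hat)                       # matches x_true up to numerical precision

# Weighted least squares with a (hypothetical) diagonal weight matrix W:
W_sqrt = np.diag(np.sqrt([1.0, 2.0, 0.5]))
x_hat_w = np.linalg.pinv(W_sqrt @ H) @ (W_sqrt @ y)   # minimizes || y - H x ||_W^2
print(x_hat_w)
```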

Frequentist Inference

We now model the measurement $\boldsymbol{y}$ as a random vector, i.e., the forward mapping $\boldsymbol{y}(\boldsymbol{x})$ is a random function. A simple example is additive noise $\boldsymbol{v}$,

\[\boldsymbol{y} = \boldsymbol{x} + \boldsymbol{v}~,\]

where $\boldsymbol{v}$ is a random vector, which implies that the measurement vector $\boldsymbol{y}$ is also random.

We use the same approach as in the previous section and define the frequentist risk as the expected loss, i.e., $R = \mathrm{E}_{\boldsymbol{y}}(\ell)$. The estimate is

\[\hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}} R = \arg\min_{\boldsymbol{x}} \mathrm{E}_{\boldsymbol{y}}(\ell)~.\]

If we use the square-error loss for a continuous-valued parameter $\boldsymbol{x}$ and an unbiased estimator exists, then we obtain the minimum variance unbiased (MVU) estimator. Using the hit-or-miss loss, we obtain the maximum likelihood (ML) estimator $\hat{\boldsymbol{x}} = \arg\max_{\boldsymbol{x}} v(\boldsymbol{y}|\boldsymbol{x})$. Here, the likelihood function $v(\boldsymbol{y}|\boldsymbol{x})$ is the conditional probability density function or probability mass function of $\boldsymbol{y}$ given $\boldsymbol{x}$.
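
As a sketch of the frequentist setting, assume repeated measurements of a scalar parameter in additive white Gaussian noise (all values below are illustrative). The likelihood is then Gaussian, the ML estimate is the sample mean, and a numerical maximization confirms the closed form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# ML estimation for y_k = x + v_k with v_k ~ N(0, sigma^2); values are illustrative.
rng = np.random.default_rng(1)
x_true, sigma, K = 3.0, 0.5, 100
y = x_true + sigma * rng.standard_normal(K)

def neg_log_likelihood(x):
    """Negative Gaussian log-likelihood up to additive constants."""
    return np.sum((y - x) ** 2) / (2 * sigma ** 2)

x_ml_numeric = minimize_scalar(neg_log_likelihood).x   # numerical maximization
x_ml_closed = y.mean()                                  # closed form: the sample mean
print(x_ml_numeric, x_ml_closed)
```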

Bayesian Inference

In Bayesian inference, the parameter $\boldsymbol{x}$ itself is modeled as random, so our belief about its distribution influences the inference. The Bayesian risk is defined as the expectation of the loss function with respect to both $\boldsymbol{y}$ and $\boldsymbol{x}$. That is,

\[\hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}} R = \arg\min_{\boldsymbol{x}} \mathrm{E}_{\boldsymbol{x},\boldsymbol{y}}(\ell) ~.\]

The square-error loss leads to the minimum mean square-error (MMSE) solution $\hat{\boldsymbol{x}} = \mathrm{E}(\boldsymbol{x}|\boldsymbol{y})$, the hit-or-miss loss to the maximum a posteriori (MAP) solution $\hat{\boldsymbol{x}} = \arg\max_{\boldsymbol{x}} v(\boldsymbol{x}|\boldsymbol{y})$, and the absolute-error loss to the posterior-median solution $\hat{\boldsymbol{x}} = \mathrm{median}(\boldsymbol{x}|\boldsymbol{y})$.
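
For a scalar linear-Gaussian example (a zero-mean Gaussian prior on $x$ and additive Gaussian noise; the variances below are illustrative), the posterior is Gaussian, so the MMSE, MAP, and median solutions coincide and have the familiar closed form sketched here.

```python
import numpy as np

# Bayesian estimation for y = x + v with x ~ N(0, sx2) and v ~ N(0, sv2).
# The posterior p(x|y) is Gaussian, so the MMSE (posterior mean), MAP (posterior
# mode), and posterior median all coincide. The variances are illustrative.
rng = np.random.default_rng(2)
sx2, sv2 = 4.0, 1.0
x = np.sqrt(sx2) * rng.standard_normal()       # draw a parameter from the prior
y = x + np.sqrt(sv2) * rng.standard_normal()   # noisy measurement

gain = sx2 / (sx2 + sv2)                       # Wiener/LMMSE gain
x_mmse = gain * y                              # posterior mean = MAP = median here
posterior_var = sx2 * sv2 / (sx2 + sv2)        # posterior variance
print(x, x_mmse, posterior_var)
```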

Performance Bounds for Estimators

In frequentist and Bayesian estimation, performance bounds are used to lower-bound the achievable mean square error when no analytic solution for an estimator exists. Prominent performance bounds are listed below; a numerical sketch of the Cramér-Rao bound follows the list.

  • Frequentist lower bounds
    • Cramér-Rao
    • Bhattacharyya
    • Barankin
  • Bayesian lower bounds
    • Bayesian (Van Trees) Cramér-Rao
    • Bayesian Bhattacharyya
    • Bobrovsky-Zakai
    • Weiss-Weinstein
  • Ziv-Zakai bounds (Bayesian)
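
As an example of the frequentist bounds, consider estimating the mean of $K$ Gaussian samples with known variance $\sigma^2$. The Fisher information is $K/\sigma^2$, so the Cramér-Rao bound is $\sigma^2/K$, and the sample mean attains it. The Monte Carlo sketch below (with illustrative values) checks this numerically.

```python
import numpy as np

# Cramér-Rao bound check: estimate the mean x of K Gaussian samples with known
# variance sigma^2. The Fisher information is K / sigma^2, so CRB = sigma^2 / K,
# and the sample mean attains the bound. All values are illustrative.
rng = np.random.default_rng(3)
x_true, sigma, K, trials = 1.0, 2.0, 50, 20000

y = x_true + sigma * rng.standard_normal((trials, K))
x_hat = y.mean(axis=1)                       # sample-mean estimator per trial
mse = np.mean((x_hat - x_true) ** 2)         # Monte Carlo mean square error
crb = sigma ** 2 / K                         # Cramér-Rao lower bound
print(mse, crb)                              # mse is close to crb
```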

Bibliography

  • Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking by Harry L. Van Trees, Kristine L. Bell, 2007, Wiley, ISBN 0-470-12095-9
  • Fundamentals of Statistical Signal Processing: Estimation Theory by Steven M. Kay, 1993, Prentice Hall, ISBN 0-13-345711-7
  • Fundamentals of Statistical Signal Processing: Detection Theory by Steven M. Kay, 1998, Prentice Hall, ISBN 0-13-504135-X
  • Lecture Notes on Bayesian Estimation and Classification by Mario A. T. Figueiredo (http://www.lx.it.pt/~mtf/learning/Bayes_lecture_notes.pdf)