Statistical Signal Processing

# New Directions in Statistical Signal Processing

## 1 Modeling the Mind: From Circuits to Systems

### 1.1 Introduction

How does the brain process, represent, and act on sensory signals? Through the use of computational models, we are beginning to understand how neural circuits perform these remarkably complex information-processing tasks. Psychological and neurobiological studies have identified at least three distinct long-term memory systems in the brain: (1) the perceptual/semantic memory system in the neocortex learns gradually to represent the salient features of the environment; (2) the episodic memory system in the medial temporal lobe learns rapidly to encode complex events, rich in detail, characterizing a particular episode in a particular place and time; (3) the procedural memory system, encompassing numerous cortical and subcortical structures, learns sensory-motor mappings. In this chapter, we consider several major developments in computational modeling that shed light on how the brain learns to represent information at three broad levels, reflecting these three forms of memory: (1) sensory coding, (2) episodic memory, and (3) representations that guide actions. Rather than providing a comprehensive review of all models in these areas, our goal is to highlight some of the key developments in the field, and to point to the most promising directions for future work.

### 1.2 Sensory Coding

#### 1.2.1 Linsker's Infomax Principle

According to Linsker's Infomax principle, neurons should maximize the mutual information between their input $x$ and output $y$:

(1)
\begin{align} I_{x;y} = \left\langle \ln\left[\frac{p(x|y)}{p(x)}\right] \right\rangle \end{align}
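As a minimal sketch of what equation (1) measures, the following computes the mutual information of a small discrete joint distribution. The function name `mutual_information` and the two toy channels are illustrative assumptions, not part of the original text. The code uses the identity $p(x|y)/p(x) = p(x,y)/(p(x)\,p(y))$ and takes the average over the joint distribution.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is p(x=i, y=j); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:
                # p(x|y) / p(x) equals p(x, y) / (p(x) p(y))
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Hypothetical noiseless binary channel: y copies x, so I equals H(x) = ln 2.
noiseless = [[0.5, 0.0], [0.0, 0.5]]
# Hypothetical useless channel: y is independent of x, so I equals 0.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(noiseless), mutual_information(independent))
```

An output that copies its input attains the largest value the input entropy allows, while an output that ignores its input carries no information about it.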

One of the major developments in this field is Bell and Sejnowski's Infomax-based independent components analysis (ICA) algorithm, which applies to nonlinear mappings with equal numbers of inputs and outputs (Bell and Sejnowski, 1995).

#### 1.2.2 Barlow's Redundancy Reduction Principle

Implicit in Linsker's work was the constraint of dimension reduction. However, in the neocortex, there is no evidence of a progressive reduction in the number of neurons at successively higher levels of processing.

One simple way for a learning algorithm to lower redundancy is to reduce correlations among the outputs (Barlow and Foldiak, 1989). This can remove second-order but not higher-order dependencies.
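The limitation noted above can be illustrated with a small numerical sketch. The two outputs `y1` and `y2` below are hypothetical toy data chosen for this example: `y2` is a deterministic function of `y1`, yet their covariance is exactly zero, so a decorrelating algorithm would treat them as non-redundant.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance (second-order statistic) of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return mean([(x - ma) * (y - mb) for x, y in zip(a, b)])

# Symmetric toy samples, so the odd moments of y1 vanish.
y1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
# y2 is a deterministic function of y1, hence fully dependent on it.
y2 = [v * v for v in y1]

# Second-order statistic: zero, so the outputs look decorrelated.
second_order = covariance(y1, y2)
# A higher-order statistic (covariance of y1 squared with y2) exposes the dependency.
higher_order = covariance([v * v for v in y1], y2)

print(second_order, higher_order)
```

Here the second-order measure reports no relationship while the higher-order one is clearly nonzero, which is why decorrelation alone cannot guarantee statistically independent outputs.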

Atick and Redlich proposed minimizing the following measure of redundancy (Atick and Redlich, 1990):

(2)
\begin{align} R=1-\frac{I_{y;s}}{C_{out}(y)} \end{align}

#### 1.2.3 Becker and Hinton's Imax Principle

Becker and Hinton (1992) proposed the Imax principle for unsupervised learning, which dictates that signals of interest should have high mutual information across different sensory channels. In the simplest case, illustrated in figure 1.3, there are two input sources, $x_1$ and $x_2$, conveying information about a common underlying Gaussian signal of interest, $s$, and each channel is corrupted by independent, additive Gaussian noise:

$x_1 = s + n_1,$
$x_2 = s + n_2.$

However, the input may be high dimensional and may require a nonlinear transformation in order to extract the signal. Thus the goal of the learning is to transform the two input signals into outputs, $y_1$ and $y_2$, having maximal mutual information.
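For the simplest Gaussian case above, the mutual information between the two channels has a closed form. The sketch below assumes, purely for illustration, that both noise terms have the same variance and that the outputs are the raw inputs ($y_1 = x_1$, $y_2 = x_2$). It relies on the standard result that two jointly Gaussian variables with correlation $\rho$ have mutual information $-\tfrac{1}{2}\ln(1-\rho^2)$. The function name `gaussian_channel_mi` is hypothetical.

```python
import math

def gaussian_channel_mi(signal_var, noise_var):
    """Mutual information (nats) between x1 = s + n1 and x2 = s + n2.

    Assumes s, n1, n2 are independent zero-mean Gaussians, that both noise
    terms share the variance noise_var, and that noise_var > 0.
    """
    # cov(x1, x2) = var(s); var(x1) = var(x2) = var(s) + var(n)
    rho = signal_var / (signal_var + noise_var)
    return -0.5 * math.log(1.0 - rho * rho)

# The shared signal becomes easier to extract as it grows relative to the noise.
for signal_var in (0.25, 1.0, 4.0):
    print(signal_var, gaussian_channel_mi(signal_var, noise_var=1.0))
```

A learning rule that raises this quantity is pushed toward output features dominated by the common signal rather than by the channel-specific noise.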

#### 1.2.4 Rissanen's Minimum Description Length Principle

The minimum description length (MDL) principle, first introduced by Rissanen (1978), favors models that provide accurate encoding of the data using as simple a model as possible. The rationale behind the MDL principle is that the criterion of discovering statistical regularities in data can be quantified by the length of the code generated to describe the data.
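The trade-off between accuracy and simplicity can be sketched with a two-part code for a binary sequence. This is an illustrative toy, not the formulation used in the cited work: the model cost of $\tfrac{1}{2}\log_2 n$ bits per fitted parameter is a common asymptotic approximation assumed here, and the function name `description_length_bits` is hypothetical.

```python
import math

def description_length_bits(n_ones, n_total, fit_bias):
    """Two-part code length, in bits, for a binary sequence.

    fit_bias=False: fixed model with p(1) = 0.5 and no parameters to encode.
    fit_bias=True:  Bernoulli model with p(1) estimated from the data, charged
                    0.5 * log2(n_total) bits for that one parameter (assumed cost).
    """
    if fit_bias:
        p = n_ones / n_total
        model_bits = 0.5 * math.log2(n_total)
    else:
        p = 0.5
        model_bits = 0.0
    data_bits = 0.0
    for count, prob in ((n_ones, p), (n_total - n_ones, 1.0 - p)):
        if count > 0:
            data_bits -= count * math.log2(prob)
    return model_bits + data_bits

# Balanced data: the extra parameter does not pay for itself.
print(description_length_bits(50, 100, False), description_length_bits(50, 100, True))
# Skewed data: the regularity shortens the code by far more than the parameter costs.
print(description_length_bits(90, 100, False), description_length_bits(90, 100, True))
```

The richer model is preferred only when the regularity it captures shortens the data code by more than the cost of describing the model itself.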

Algorithms which perform clustering, when cast within a statistical framework, can also be viewed as a form of MDL learning. Nowlan derived such an algorithm, called maximum likelihood competitive learning (MLCL), for training neural networks using the expectation maximization (EM) algorithm (Jacobs et al., 1991; Nowlan, 1990).

### 1.3 Models of Episodic Memory

Moving beyond sensory coding to high-level memory systems in the medial temporal lobe (MTL), the brain may use very different optimization principles aimed at the memorization of complex events or spatiotemporal episodes, and at subsequent reconstruction of details of these episodic memories.

Here, rather than recoding the incoming signals in a way that abstracts away unnecessary details, the goal is to memorize the incoming signal as accurately as possible in a single learning trial.

The hippocampus is a key structure in the MTL that appears to be crucial for episodic memory. It receives input from most cortical regions, and is at the point of convergence between the ventral and dorsal visual pathways.

### 1.4 Representations that Guide Action Selection

#### TD-learning

The temporal difference (TD) learning algorithm (Sutton, 1988; Sutton and Barto, 1981) provides a rule for incrementally updating an estimate $\hat{V}_t$ of the true value function at time $t$ by an amount called the TD-error: TD-error = $r_{t+1} + \gamma \hat{V}_{t+1} - \hat{V}_t$, which makes use of $r_{t+1}$, the amount of reward received at time $t+1$, the discount factor $\gamma$, and the value estimates at the current and the next time step. It has been proposed that the TD-learning algorithm may be used by neurobiological systems, based on evidence that firing of midbrain dopamine neurons correlates well with TD-error (Montague et al., 1996).
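A minimal tabular sketch of this update follows. The environment is a hypothetical deterministic chain of states ending in a terminal state, with a reward of 1 on the final transition. The learning rate `alpha`, the chain itself, and the function name `td0_chain` are assumptions made for illustration.

```python
def td0_chain(n_states=3, gamma=0.9, alpha=0.1, episodes=500):
    """Tabular TD(0) on a deterministic chain 0 -> 1 -> ... -> terminal.

    Returns the learned value estimates for the non-terminal states.
    """
    # The last entry is the terminal state, whose value stays at 0.
    values = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for state in range(n_states):
            next_state = state + 1
            reward = 1.0 if next_state == n_states else 0.0
            # TD-error = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values[:n_states]

print(td0_chain())
```

In this toy chain the estimates approach $\gamma^2$, $\gamma$, and 1 for the three states, which are the discounted returns obtained by following the chain to its end.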

#### Q-learning

The Q-learning algorithm (Watkins, 1989) extends the idea of TD learning to the problem of learning an optimal control policy for action selection. The goal for the agent is to maximize the total future expected reward. The agent learns incrementally by trial and error, evaluating the consequences of taking each action in each situation. Rather than using a value function, Q-learning employs an action value function, $Q(s_t, a_t)$, which represents the value of taking an action $a_t$ when the state of the environment is $s_t$. The learning algorithm for incrementally updating estimates of Q-values is directly analogous to TD learning, except that the TD-error is replaced by a temporal difference between Q-values at successive points in time.
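The update can be sketched in tabular form as follows. The corridor environment, the purely random exploration policy, the learning rate `alpha`, and the function name `q_learning_corridor` are all illustrative assumptions. The update shown is the standard one-step rule, in which the temporal difference uses the largest Q-value available in the next state.

```python
import random

def q_learning_corridor(n_states=4, gamma=0.9, alpha=0.5, episodes=500, seed=0):
    """Tabular Q-learning on a hypothetical corridor whose last state is terminal.

    Returns q[state][action], with action 0 = left and action 1 = right.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)  # trial and error: purely random exploration
            step = 1 if action == 1 else -1
            next_state = min(max(state + step, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Temporal difference between Q-values at successive time steps.
            td_error = reward + gamma * max(q[next_state]) - q[state][action]
            q[state][action] += alpha * td_error
            state = next_state
    return q

q = q_learning_corridor()
greedy_policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(3)]
print(q, greedy_policy)
```

Although the agent behaves randomly while learning, the greedy policy read off the learned Q-values moves right in every state, which is the reward-maximizing choice in this corridor.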

Becker and Lim (2003) proposed a model of controlled memory retrieval based upon Q-learning. People have a remarkable ability to encode and retrieve information in a flexible manner.

### 1.5 New Directions: Integrating Multiple Memory Systems

Somehow, the brain accomplishes all of these functions, and it is highly unlikely that they are carried out in isolation from one another. For example, we now know that striatal dopaminergic pathways, presumed to carry a reinforcement learning signal, affect sensory coding even in early sensory areas such as primary auditory cortex (Bao et al., 2001).

## 11 Turbo Processing

Turbo processing is the way to process data in communication receivers so that no information stemming from the channel is wasted. The first application of the turbo principle was in error correction coding, which is an essential function in modern telecommunications systems. A novel structure of concatenated codes, nicknamed turbo codes, was devised in the early 1990s in order to benefit from the turbo principle.

The turbo principle, also called the message-passing principle or belief propagation, is exploitable in signal processing other than error correction, such as detection and equalization.

### 11.1 Introduction

Error correction coding, also known as channel coding, is a fundamental function in modern telecommunications systems. Its purpose is to make these systems work even in tough physical conditions, due for instance to a low received signal level, interference, or fading. Another important field of application for error correction coding is mass storage (computer hard disk, CD and DVD-ROM, etc.), where the ever-continuing miniaturization of the elementary storage pattern makes reading the information more and more tricky.

Error correction is a digital technique, that is, the information message to protect is composed of a certain number of digits drawn from a finite alphabet. Most often, this alphabet is binary, with logical elements or bits 0 or 1. Then, error correction coding, in the so-called systematic way, involves adding some number of redundant logical elements to the original message, the whole being called a codeword. The mathematical law that is used to calculate the redundant part of the codeword is specific to a given code. Besides this mathematical law, the main parameters of a code are as follows.

• The code rate: the ratio between the number of bits in the original message and in the codeword. Depending on the application, the code rate may be as low as 1/6 or as high as 9/10.
• The minimum Hamming distance (MHD): the minimum number of bits that differ from one codeword to any other. The higher the MHD, the more robust the associated decoder confronted with multiple errors.
• The ability of the decoder to exploit soft (analog) values from the demodulator, instead of hard (binary) values. A soft value (that is, the sign and the magnitude) carries more information than a hard value (only the sign).
• The complexity and the latency of the decoder.
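The first two parameters can be made concrete with a small systematic code. The sketch below uses the classical (7,4) Hamming code purely as an illustration; it is not a code discussed in the text above, and the particular parity equations are one common choice among several equivalent ones.

```python
from itertools import combinations, product

def encode(message):
    """Systematic (7,4) Hamming encoder: 4 message bits followed by 3 parity bits."""
    d1, d2, d3, d4 = message
    parity = (d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4)
    return tuple(message) + parity

def hamming_distance(a, b):
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# Enumerate all 16 codewords and compare every pair.
codewords = [encode(m) for m in product((0, 1), repeat=4)]
code_rate = 4 / 7
min_distance = min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

print(code_rate, min_distance)
```

Because the code is systematic, the first four bits of each codeword are the original message. The exhaustive pairwise comparison is feasible only for very short codes like this one.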

## 12 Blind Signal Processing Based on Data Geometric Properties

### 12.1 Introduction

Blind signal processing deals with the outputs of unknown systems excited by unknown inputs. At first sight the problem seems intractable, but a closer look reveals that certain signal properties allow us to extract the inputs or to identify the system up to some, usually unimportant, ambiguities.