Representation and generalisation in pattern recognition

6 October 2006, 2-5pm
The Royal Statistical Society
12 Errol Street
London, UK

This is a joint half-day meeting of the British Classification Society (BCS) and the Statistical Computing Section of the Royal Statistical Society.


Speakers

Robert P.W. Duin, Delft University of Technology
"Representation related problems in pattern recognition"

David M.J. Tax, Delft University of Technology
"Outlier detection using ball descriptions with adjustable metric"

Dick de Ridder, Delft University of Technology
"Integrative bioinformatics"

Elzbieta Pekalska, University of Manchester and Delft University of Technology
"Statistical learning with general proximity measures"


Registration

Pre-registration is recommended. You can register by email at meetings@rss.org.uk or by phone on (020) 7638 8998. The meeting is free and open to all.
Information about getting to Errol Street is available from the RSS.
For further information, contact Niall Adams (RSS), Jo Padmore (BCS) or Piotr Juszczak.

Abstracts

Representation related problems in pattern recognition


Robert Duin

Traditionally, objects to be recognized by a pattern recognition system are represented by independently measured features. How to train such systems from a representative set of independently drawn examples has been studied extensively, and many useful procedures have been developed within pattern recognition as well as outside this field, e.g. in statistics and machine learning. The next steps, demanded by the field of applications, are to weaken these traditional assumptions. Several examples of such weakened assumptions will be discussed.


Outlier detection using ball descriptions with adjustable metric


David Tax

One-class classification is the problem of distinguishing one class of objects from the rest of the feature space. It can be used to detect outliers or novel data. Such outliers may indicate an interesting rare event, or they may have to be disregarded because they cannot be processed reliably further on. An example is machine condition monitoring, where failure of a machine should be detected. It is possible to sample at will from all normal operating conditions (called the target class), but sampling from the failure class (the outlier class) is very hard, very expensive or in some cases impossible. Therefore a classifier should be constructed that relies mainly on examples of healthy machines and that can cope with a poorly sampled class of failing machines.

In contrast with standard classification and regression problems, one-class classification requires a closed decision boundary around the data. By minimizing the volume that this boundary encloses, the chance of falsely classifying an outlier object as a genuine target object is minimized. In the ideal case the target class forms a tight, spherical cluster and all outliers are scattered around it. To identify outliers one then measures the distance from an object to the cluster center and thresholds this distance. Clearly, when the threshold on the distance (the radius of the ball) is increased, the error on the target class decreases, but at the cost of accepting more outlier data. The optimal ball has minimum volume while still enclosing a large fraction of the target data.
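
As a rough illustration of this idea (not the method presented in the talk), the following Python sketch fits the simplest possible ball description: the centre is taken as the mean of the target data and the radius is set by a quantile of the training distances; all names and the quantile choice are illustrative assumptions.

    # Minimal sketch of a ball description for one-class classification.
    # The mean-based centre and quantile-based radius are illustrative
    # choices, not the estimator discussed in the talk.
    import numpy as np

    def fit_ball(X_target, target_accept=0.95):
        """Fit a ball (center, radius) around the target class so that
        roughly `target_accept` of the training targets fall inside."""
        center = X_target.mean(axis=0)
        dists = np.linalg.norm(X_target - center, axis=1)
        radius = np.quantile(dists, target_accept)
        return center, radius

    def is_target(X, center, radius):
        """Accept objects inside the ball; flag the rest as outliers."""
        return np.linalg.norm(X - center, axis=1) <= radius

    # Usage: train on healthy-machine data only, then test new objects.
    X_train = np.random.randn(200, 5)          # toy "normal operation" data
    center, radius = fit_ball(X_train)
    accepted = is_target(np.random.randn(10, 5), center, radius)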

When the assumption of many small noise contributions does not hold, and the central limit theorem cannot be applied, other distance measures can be used. One flexible parameterization is the $\ell_p$-distance. This distance has one free parameter $p$ that rescales distances non-linearly along the individual axes before adding the contributions to the final distance. Thresholding this distance defines an $\ell_p$-ball as the decision boundary around the target class. The advantage of the ball description is that only a few parameters have to be fitted to obtain a good description of the target class. This is particularly useful when the outlier detector is applied in high-dimensional feature spaces and with small training set sizes. A second advantage is that the volume captured by the ball can be computed analytically (see the expression below). Experiments show that very good classification results are obtained for some real-world datasets and that, more specifically, the $\ell_1$-distance is particularly suited for datasets containing discrete feature values.
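
In standard notation, the $\ell_p$-distance from an object $\mathbf{x}$ to the ball centre $\mathbf{c}$ in $d$ dimensions, and the well-known analytic volume of the corresponding $\ell_p$-ball of radius $r$, are

    $$ d_p(\mathbf{x},\mathbf{c}) \;=\; \Big( \sum_{i=1}^{d} |x_i - c_i|^p \Big)^{1/p},
       \qquad
       V_p(r,d) \;=\; \frac{\big(2\,\Gamma(1+1/p)\big)^{d}}{\Gamma(1+d/p)}\; r^{d}, $$

so that $p=2$ recovers the usual Euclidean ball and $p=1$ a diamond-shaped (cross-polytope) ball.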


Integrative bioinformatics


Dick de Ridder

In molecular biology, high-throughput measurement techniques are nowadays routinely used to obtain a global view of the living cell. These techniques often create datasets with small sample sizes (5-500) but extremely large numbers of features (1,000-1,000,000). To process such datasets in a sensible way, any available prior knowledge on the problem at hand has to be used to the fullest extent possible. Furthermore, there is an increasing call from biologists for the integration of measurements at various levels of cellular organisation: DNA, RNA, proteins and metabolites. Integrative bioinformatics is the field of research that studies methods for doing so.

In our work, we study several aspects of integration in bioinformatics.

Besides these applications, we study the more theoretical question of how best to translate available prior knowledge on objects and their relations into a common language for integration: probabilities, distances or kernels.
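
As a hedged illustration of the kernel route (not the speaker's own method), once each data source has been translated into a positive semidefinite kernel matrix, a non-negative weighted sum of those matrices is again a valid kernel and can serve as a single integrated representation; all names and data in the Python sketch below are illustrative assumptions.

    # Illustrative sketch: integrating heterogeneous sources once each has
    # been expressed as a kernel. A non-negative weighted sum of positive
    # semidefinite kernels is again positive semidefinite, so the combined
    # matrix can be fed to any single kernel-based learner.
    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        """Gaussian (RBF) kernel matrix for a vectorial data source."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)

    def combine_kernels(kernels, weights):
        """Weighted sum of kernel matrices (weights assumed non-negative)."""
        return sum(w * K for w, K in zip(weights, kernels))

    # Toy example: an expression-based kernel and a placeholder for a
    # kernel derived from, e.g., a protein-interaction network.
    n = 20
    K_expr = rbf_kernel(np.random.randn(n, 50))
    K_net  = np.eye(n)
    K_int  = combine_kernels([K_expr, K_net], [0.7, 0.3])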


Statistical learning with general proximity measures


Elzbieta Pekalska

Statistical learning is one of the primary approaches to pattern recognition; it assumes that knowledge can be represented by a set of discriminative features. As a result, objects are represented as vectors in feature spaces. Thanks to the well-developed theory of vector spaces, the Euclidean distance has been widely accepted as (one of) the most suitable measures on which to base our models and explain the world. The reason is that a Euclidean vector space is simultaneously an inner product space, a normed space and a metric space. While this provides an arsenal of powerful approaches, we should not forget that such a vectorial representation is often a reduction of complex phenomena.

One reason is that knowledge is usually qualitative. Moreover, many problems deal with structural data descriptions which are non-vectorial by origin, such as strings, trees, shapes or binary files. The most basic quality to be used in learning is therefore an observation of how much objects have in common or how much they differ. As a result, a suitable proximity measure can be designed for a given context in order to identify patterns and model clusters. Such a measure becomes powerful if it is derived by focussing on differences occurring in the structure of objects. Consequently, objects can be represented by vectors of proximity values to a set of chosen prototypes. These vectors are then used to construct suitable representation spaces. Since expert knowledge can be incorporated in the definition of the measure, proximity representations enable the natural integration of qualitative knowledge with numerical methodology.
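
A minimal Python sketch of this construction follows; the objects, prototypes and the crude dissimilarity are toy choices made purely for self-containedness and are not the measures discussed in the talk.

    # Proximity representation: each object is described by its
    # dissimilarities to a small set of prototype objects, after which any
    # ordinary classifier can be trained in the resulting vector space.
    import numpy as np

    def proximity_representation(objects, prototypes, dissimilarity):
        """Map each object to a vector of dissimilarities to the prototypes."""
        return np.array([[dissimilarity(x, p) for p in prototypes]
                         for x in objects])

    # Crude string dissimilarity, used only to keep the example self-contained.
    def toy_dissimilarity(a, b):
        return abs(len(a) - len(b)) + sum(c1 != c2 for c1, c2 in zip(a, b))

    objects    = ["pattern", "patterns", "proximity", "kernel", "kernels"]
    prototypes = ["pattern", "kernel"]          # chosen prototype objects
    X = proximity_representation(objects, prototypes, toy_dissimilarity)
    # X is now an ordinary numeric matrix; a linear classifier, an SVM or
    # any other standard method can be trained on it.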

Kernel methods are mathematically elegant approaches which lie within this proximity-based paradigm. They rely on a relatively narrow class of (conditionally) positive semidefinite kernels. In our work we extend this class to indefinite kernels and general proximity representations. Classifiers trained in proximity-based representation spaces tend to outperform the traditionally applied nearest neighbor rule, especially for small training sets. These recognition techniques can be successful whenever the measure is informative, independently of whether or not it has a metric or Euclidean behaviour.
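
One standard way to check the metric or Euclidean behaviour mentioned above, sketched here in Python under the usual classical-scaling assumptions (a sketch, not the speaker's implementation), is to double-centre the matrix of squared dissimilarities: negative eigenvalues of the centred matrix mean the measure admits no Euclidean embedding and the associated kernel is indefinite.

    # Check whether a symmetric dissimilarity matrix D is Euclidean by
    # double-centering -0.5 * D**2 (classical scaling); negative eigenvalues
    # of the centred matrix signal a non-Euclidean, hence indefinite, kernel.
    import numpy as np

    def centred_gram(D):
        """Double-centred Gram matrix associated with dissimilarity matrix D."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
        return -0.5 * J @ (D ** 2) @ J

    # Toy dissimilarities violating the triangle inequality (hence non-metric).
    D = np.array([[0.0, 1.0, 2.5],
                  [1.0, 0.0, 1.0],
                  [2.5, 1.0, 0.0]])
    eigvals = np.linalg.eigvalsh(centred_gram(D))
    print("indefinite (negative eigenvalues present):", bool((eigvals < -1e-10).any()))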