Self-Organising Data Mining
Extracting Knowledge From Data
By Johann-Adolf Müller and Frank Lemke
Preface
This book is dedicated to Prof. A.G. Ivakhnenko,
the father of GMDH, on his eighty-fifth birthday
The rapid development of information technology, continuing
computerization in almost every field of human activity and
distributed computing have led to a flood of data stored in
databases and data warehouses. In the 1960s, Management Information
Systems (MIS) and then, in the 1970s, Decision Support Systems
(DSS) were praised for their potential to supply executives
with mountains of data needed to carry out their jobs. While
these systems have supplied some useful information for executives,
they have not lived up to their proponents' expectations.
They simply supplied too much data and not enough information
to be generally useful.
Today, there is an increased need for information - contextual
data that is non-obvious and valuable for decision making - drawn
from a large collection of data. Commonly, a large data set is one
that has many cases or records. In this book, however, 'large'
refers rather to the number of variables describing each record.
When there are more variables than cases, most well-known algorithms
run into problems (in mathematical statistics, for instance,
the covariance matrix becomes singular so that inversion
is impossible; Neural Networks fail to learn). Even if the
data are well-behaved, a large number of variables means that
the data are sparsely distributed in a high-dimensional hypercube,
causing the well-known dimensionality problem. Decision making
based on analysing data is therefore an interactive and iterative
process of various subtasks and decisions, called Knowledge
Discovery from Data. The engine of Knowledge Discovery - where
data is transformed into knowledge - is Data Mining.
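The singularity mentioned above can be checked on a toy data set.
The following is a minimal sketch, not taken from the book: the
helper names (covariance, rank) and the sample sizes are assumptions
chosen for illustration. It builds a sample covariance matrix from
4 cases described by 6 variables and computes its rank exactly;
because the centred data have at most (cases - 1) independent rows,
the rank stays below the number of variables and the matrix cannot
be inverted.

```python
from fractions import Fraction
import random


def covariance(rows):
    """Sample covariance matrix of a list of cases over p variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def rank(matrix):
    """Rank by Gaussian elimination; exact because entries are Fractions."""
    m = [row[:] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Hypothetical toy data: more variables (6) than cases (4).
random.seed(0)
n_cases, n_vars = 4, 6
data = [[Fraction(random.randint(-9, 9)) for _ in range(n_vars)]
        for _ in range(n_cases)]

cov = covariance(data)
cov_rank = rank(cov)
print("rank", cov_rank, "of a", n_vars, "x", n_vars, "covariance matrix")
```

Exact fractions are used here only so that the rank is not blurred
by rounding; with floating-point data the same effect shows up as
a numerically near-zero determinant.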
There are very different data mining tools available and
many papers are published describing data mining techniques.
We think that it is most important for a more sophisticated
data mining technique to limit the user involvement in the
entire data mining process to the inclusion of well-known
a priori knowledge. This makes the process more automated
and more objective. Most users' primary interest is in generating
useful and valid model results without needing extensive
knowledge of mathematical, cybernetic and statistical techniques
or much time for complex dialog-driven modelling tools.
Soft computing, i.e., Fuzzy Modelling, Neural Networks, Genetic
Algorithms and other methods of automatic model generation,
is a way to mine data by generating mathematical models from
empirical data more or less automatically.
In recent years there has been much publicity about the
ability of Artificial Neural Networks to learn and to generalize,
despite important problems with the design, development and
application of Neural Networks:
- Neural Networks have no explanatory power by default to
describe why results are as they are. This means that the
knowledge (models) extracted by Neural Networks is still
hidden and distributed over the network.
- There is no systematic approach for designing and developing
Neural Networks. It is a trial-and-error process.
- Training of Neural Networks is a kind of statistical estimation
often using algorithms that are slower and less effective
than algorithms used in statistical software.
- If noise is considerable in a data sample, the generated
models systematically tend to be overfitted.
In contrast to Neural Networks that use
- Genetic Algorithms as an external procedure to optimize
the network architecture and
- several pruning techniques to counteract overtraining,
this book introduces principles of evolution - inheritance,
mutation and selection - for generating a network structure
systematically enabling automatic model structure synthesis
and model validation. Models are generated from the data
in the form of networks of active neurons in an evolutionary
fashion of repetitive generation of populations of competing
models of growing complexity and their validation and selection
until an optimal complex model - not too simple and not
too complex - has been created. That is, a tree-like
network is grown out of seed information (input and output
variables' data) in an evolutionary fashion of pairwise combination
and survival-of-the-fittest selection, from a simple single
individual (neuron) to a desired final, not overspecialized
behavior (model). Neither the number of neurons and layers
in the network nor the actual behavior of each created
neuron is predefined. All of this is adjusted during the
process of self-organisation, which is why the approach
is called self-organising data mining.
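The layer-by-layer procedure described above can be sketched in a
few lines. This is a simplified illustration under stated assumptions,
not the algorithm implemented in KnowledgeMiner: each neuron here is
a bilinear polynomial a0 + a1*u + a2*v + a3*u*v (a reduced form chosen
for brevity), the selection criterion is the mean squared error on a
separate validation sample, and the names gmdh, fit_neuron, keep and
max_layers are hypothetical. Neurons are formed from all pairs of
inputs, the best ones survive as inputs of the next layer, and growth
stops as soon as the validation error no longer improves.

```python
import itertools
import random


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for i in range(n):
            if i != col:
                f = m[i][col] / m[col][col]
                m[i] = [u - f * v for u, v in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def features(u, v):
    """Terms of one neuron: a0 + a1*u + a2*v + a3*u*v (assumed form)."""
    return [1.0, u, v, u * v]


def fit_neuron(u_col, v_col, y):
    """Least-squares weights via slightly regularised normal equations."""
    rows = [features(u, v) for u, v in zip(u_col, v_col)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (1e-6 if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(ata, aty)


def predict(w, u_col, v_col):
    return [sum(wi * fi for wi, fi in zip(w, features(u, v)))
            for u, v in zip(u_col, v_col)]


def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def gmdh(train_cols, y_train, valid_cols, y_valid, keep=4, max_layers=5):
    """Grow layers of pairwise neurons until validation error stops falling."""
    best_err, history = float("inf"), []
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(len(train_cols)), 2):
            w = fit_neuron(train_cols[i], train_cols[j], y_train)
            out_train = predict(w, train_cols[i], train_cols[j])
            out_valid = predict(w, valid_cols[i], valid_cols[j])
            candidates.append((mse(out_valid, y_valid), out_train, out_valid))
        candidates.sort(key=lambda c: c[0])   # selection on unseen data
        survivors = candidates[:keep]
        if survivors[0][0] >= best_err:       # no improvement: stop growing
            break
        best_err = survivors[0][0]
        history.append(best_err)
        train_cols = [c[1] for c in survivors]  # outputs feed the next layer
        valid_cols = [c[2] for c in survivors]
    return best_err, history


def sample(n):
    """Hypothetical noisy data: y depends on x0, x1, x2; x3 is irrelevant."""
    xs = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(4)]
    y = [1 + 2 * a * b - c + random.gauss(0, 0.05)
         for a, b, c in zip(xs[0], xs[1], xs[2])]
    return xs, y


random.seed(1)
train_cols, y_train = sample(40)
valid_cols, y_valid = sample(20)
best_err, history = gmdh(train_cols, y_train, valid_cols, y_valid)
print("validation error per layer:", history)
```

The number of layers is not fixed in advance: it is whatever the
stopping rule yields for the given data, which is the point of the
sketch.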
Self-organising data mining creates optimal complex models
systematically and autonomously by employing both parameter
and structure identification. An optimal complex model
is a model that optimally balances model quality on a given
learning data set ("closeness of fit") and its generalisation
power on new, not previously seen data with respect to the
data's noise level and the task of modelling (prediction,
classification, modelling, etc.). It thus solves a basic
problem of experimental systems analysis: systematically
avoiding "overfitted" models using only the information in the
data. This makes self-organising data mining a highly automated,
fast and very efficient supplement and alternative to other
data mining methods.
The differences between Neural Networks and this new approach
focus on Statistical Learning Networks and induction. The
first Statistical Learning Network algorithm of this new type,
the Group Method of Data Handling (GMDH), was developed by
A.G. Ivakhnenko in 1967. Considerable improvements were introduced
in the 1970s and 1980s by versions of the Polynomial Network
Training algorithm (PNETTR) by Barron and the Algorithm for
Synthesis of Polynomial Networks (ASPN) by Elder, when Adaptive
Learning Networks and GMDH were converging. Further
enhancements of the GMDH algorithm have been realized in the
"KnowledgeMiner" software described in and enclosed with this book.
KnowledgeMiner is a powerful and easy-to-use modelling and
prediction tool designed to support the knowledge extraction
process on a highly automated level. It implements three
advanced self-organising modelling technologies: GMDH, Analog
Complexing and self-organising Fuzzy Rule Induction. Three
different GMDH modelling algorithms are implemented - active
neurons, enhanced network synthesis and creation of systems
of equations - to make knowledge extraction systematic,
fast and easy even for large and complex systems. The
Analog Complexing algorithm is suitable for prediction of
the most fuzzy processes, such as financial or other markets.
It is a multidimensional search engine that selects the most
similar past system states compared with a chosen (actual) reference
state from a given data set. All selected patterns are
synthesized into a most likely, a most optimistic and a most
pessimistic prediction. KnowledgeMiner does this in an objective
way, using GMDH to find the optimal number of synthesized patterns
and their composition. Fuzzy modelling is an approach to form
a system model using a description language based on fuzzy
logic with fuzzy predicates. Such a language can describe
a dynamic multi-input/multi-output system qualitatively by
means of a system of fuzzy rules.
Therefore, the generated models can be
- linear/nonlinear time series models,
- static/dynamic linear/nonlinear multi-input/single-output
models,
- systems of linear/nonlinear difference equations (multi-input/multi-output
models),
- systems of static/dynamic multi-input/multi-output fuzzy
rules (all four of the above described analytically), as well
as
- nonparametric models obtained by Analog Complexing.
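The pattern search behind Analog Complexing, as outlined earlier in
this preface, can be illustrated with a minimal sketch. It rests on
several assumptions that are not from the book: a one-dimensional
series, a Euclidean distance between mean-centred windows, a fixed
number of analogs (KnowledgeMiner is said to determine the number and
composition of patterns with GMDH instead), and a simple min / mean /
max synthesis. The name analog_forecast and its parameters are
hypothetical.

```python
import math


def analog_forecast(series, window=4, n_analogs=3):
    """One-step forecast from the continuations of similar past windows."""
    ref = series[-window:]                 # reference (actual) state
    ref_mean = sum(ref) / window
    scored = []
    # Candidate windows must end early enough to have a known continuation.
    for start in range(len(series) - window):
        cand = series[start:start + window]
        cand_mean = sum(cand) / window
        # Compare shapes after removing each window's level (mean).
        dist = math.sqrt(sum(((c - cand_mean) - (r - ref_mean)) ** 2
                             for c, r in zip(cand, ref)))
        # Transfer the candidate's continuation to the reference level.
        continuation = series[start + window] - cand_mean + ref_mean
        scored.append((dist, start, continuation))
    scored.sort()                          # most similar patterns first
    chosen = scored[:n_analogs]
    values = [c for _, _, c in chosen]
    return {
        "pessimistic": min(values),
        "likely": sum(values) / len(values),
        "optimistic": max(values),
        "analog_starts": [s for _, s, _ in chosen],
    }


# Hypothetical periodic series ending in the pattern 1, 3, 6, 4.
base = [1, 3, 6, 4, 2, 0]
series = base * 4 + [1, 3, 6, 4]
result = analog_forecast(series)
print(result)
```

On this strictly periodic toy series every selected analog continues
with the same value, so the three predictions coincide; on noisy
data they would spread apart.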
This book provides a thorough introduction to self-organising
data mining technologies for business executives, decision
makers and specialists involved in developing Executive Information
Systems (EIS) or in modelling, data mining or knowledge discovery
projects. It is a book for working professionals in many fields
of decision making: Economics (banking, financing, marketing),
business oriented computer science, ecology, medicine and
biology, sociology, engineering sciences and all other fields
of modelling of ill-defined systems.
Each chapter includes some practical examples and a reference
list for further reading. The accompanying diskette/internet
download contains the KnowledgeMiner Demo version and several
executable examples. This book offers a comprehensive view
of all major issues related to self-organising data mining
and its practical application for solving real-world problems.
It not only gives an introduction to self-organising data
mining, but also provides answers to questions like:
- what is self-organising data mining compared with other
known data mining techniques,
- what are the pros, cons and difficulties of the main data
mining approaches,
- what problems can be solved by self-organising data mining,
specifically by using the KnowledgeMiner modelling and prediction
tool,
- what is the basic methodology for self-organising data
mining and application development, illustrated by a set of
real-world business problems,
- how to use KnowledgeMiner and how to prepare a problem
for solution.
The book spans eight chapters. Chapter 1 gives an introductory
overview of several aspects of knowledge discovery from data,
such as why it is worth building models for decision support
and how we think forecasting can be applied today to get
valuable predictive control solutions. Also considered are
the pros, cons and difficulties of the two main approaches
to modelling: theory-driven and data-driven modelling.
Chapter 2 explains the idea of self-organising data mining
and puts it in the context of several automated data-driven
modelling approaches. The algorithm of self-organising data mining
is introduced and we describe how self-organisation works
generally, what conditions it requires, and how existing theoretical
knowledge can be embedded into the process.
Chapter 3 introduces and describes some important terms in
self-organising modelling: Statistical Learning Networks,
the inductive approach, GMDH, nonphysical models, and the model
of optimal complexity.
Chapter 4 focuses on parametric, regression-based GMDH algorithms.
Several algorithms based on the principles of self-organisation
are considered, and the important problem of choosing selection
criteria and some aspects of model validation are discussed.
In chapter 5, three nonparametric algorithms are discussed.
First, there is the Objective Cluster Analysis algorithm that
operates on pairs of closely spaced sample points. Second, for
the most fuzzy objects, the Analog Complexing algorithm is recommended,
which selects the most similar patterns from a given data set.
Third, self-organising fuzzy-rule induction can help to
describe and predict complex objects qualitatively.
Chapter 6 points to some application opportunities
of self-organising data mining from our own experience. Selected
application fields are suggested, along with ideas on how a
self-organising modelling approach can help improve the results
of other modelling methods - simulation, Neural Networks and
econometric modelling (statistics). Also included in this chapter
is a discussion of the synthesis of model results, its goals
and its options, while the last part gives a short overview
of existing self-organising data mining software.
In chapter 7 the KnowledgeMiner software is described
in more detail to give the reader an understanding of its
self-organising modelling implementations and to help in examining
the examples included on the accompanying diskette or Internet
download.
Chapter 8 explains, based on several sample applications from
economics, ecology, medicine and sociology, how it is possible
to solve complex modelling, prediction, classification or
diagnosis tasks systematically and quickly using the knowledge
extraction capabilities of a self-organising data mining approach.
Since self-organising data mining, and especially the
KnowledgeMiner software, will evolve quickly, the following
Internet addresses can be consulted for news, updates and new
versions, as well as for new research results and other
discussions and comments relating to this book:
http://www.knowledgeminer.com
http://www.informatik.htw-dresden.de/~muellerj
We would like to extend our thanks to our wives and our entire
families for their encouragement and understanding during
the writing of this book. Special thanks go to Julian Miller
from Script Software International for his support
and for his active promotion of the KnowledgeMiner software
from the beginning. Thank you also to Russell
Gum for proofreading the manuscript and to all those
who helped with their comments, suggestions and criticism.
This book serves all KnowledgeMiner users as documentation
and a guide to the theory and application of self-organising
data mining. It may also form the basis for discussing these
topics in the KnowledgeMiner discussion forum.
Suggestions and remarks are appreciated.
March 6, 2000
Johann-Adolf Müller Frank Lemke