Self-Organising Data Mining
From the preface of "Self-Organising
Data Mining" by J.-A. Müller and Frank Lemke
"Today, there is an increased need for information - contextual
data - non obvious and valuable for decision making from a large
collection of data. Commonly, a large data set is one that has
many cases or records. With this book, however, 'large' rather
refers to the number of variables describing each record. When
there are more variables than cases, the most known algorithms
are running into some problems (in mathematical statistics,
for instance, covariance matrix becomes singular so that inversion
is impossible; Neural Networks fail to learn). Even if the data
are well-behaved, a large number of variables means that the
data are distributed in a high dimensional hypercube, causing
the known dimensionality problem. Therefore, decision making
based on analysing data is an interactive and iterative process
of various subtasks and decisions and is called Knowledge Discovery
from Data. The engine of Knowledge Discovery - where data is
transformed into knowledge - is Data Mining.
There are very different data mining tools available and
many papers are published describing data mining techniques.
We think that it is most important for a more sophisticated
data mining technique to limit the user involvement in the
entire data mining process to the inclusion of well-known
a priori knowledge. This makes the process more automated
and more objective. Most users' primary interest is in generating
useful and valid model results without having to have extensive
knowledge of mathematical, cybernetic and statistical techniques
or sufficient time for complex dialog driven modelling tools.
Soft computing, i.e., Fuzzy Modelling, Neural Networks, Genetic
Algorithms and other methods of automatic model generation,
is a way to mine data by generating mathematical models from
empirical data more or less automatically.
In the past years there has been much publicity about the
ability of Artificial Neural Networks to learn and to generalize
despite important problems with design, development and application
of Neural Networks:
- Neural Networks have no explanatory power by default to
describe why results are as they are. This means that the
knowledge (models) extracted by Neural Networks is still
hidden and distributed over the network.
- There is no systematical approach for designing and developing
Neural Networks. It is a trial-and-error process.
- Training of Neural Networks is a kind of statistical estimation
often using algorithms that are slower and less effective
than algorithms used in statistical software.
- If noise is considerable in a data sample, the generated
models systematically tend to being overfitted.
In contrast to Neural Networks that use
- Genetic Algorithms as an external procedure to optimize
the network architecture and
- several pruning techniques to counteract overtraining,
this book introduces principles of evolution - inheritance,
mutation and selection - for generating a network structure
systematically enabling automatic model structure synthesis
and model validation. Models are generated from the data
in the form of networks of active neurons in an evolutionary
fashion of repetitive generation of populations of competing
models of growing complexity and their validation and selection
until an optimal complex model - not too simple and not
too complex - has been created. That is, growing a tree-like
network out of seed information (input and output variables'
data) in an evolutionary fashion of pairwise combination
and survival-of-the-fittest selection from a simple single
individual (neuron) to a desired final, not overspecialized
behavior (model). Neither, the number of neurons and the
number of layers in the network, nor the actual behavior
of each created neuron is predefined. All this is adjusted
during the process of self-organisation, and therefore,
is called self-organising data mining.
A self-organising data mining creates optimal complex models
systematically and autonomously by employing both parameter
and structure identification. An optimal complex model
is a model that optimally balances model quality on a given
learning data set ("closeness of fit") and its generalisation
power on new, not previously seen data with respect to the
data's noise level and the task of modelling (prediction,
classification, modelling, etc.). It thus solves the basic
problem of experimental systems analysis of systematically
avoiding "overfitted" models based on the data's information
only. This makes self-organising data mining a most automated,
fast and very efficient supplement and alternative to other
data mining methods.
The differences between Neural Networks and this new approach
focus on Statistical Learning Networks and induction. The
first Statistical Learning Network algorithm of this new type,
the Group Method of Data Handling (GMDH), was developed by
A.G. Ivakhnenko in 1967. Considerable improvements were introduced
in the 1970s and 1980s by versions of the Polynomial Network
Training algorithm (PNETTR) by Barron and the Algorithm for
Synthesis of Polynomial Networks (ASPN) by Elder when Adaptive
Learning Networks and GMDH were flowing together. Further
enhancements of the GMDH algorithm have been realized in the
"KnowledgeMiner" software described and enclosed in this book.
- ...
This book provides a thorough introduction to self-organising
data mining technologies for business executives, decision
makers and specialists involved in developing Executive Information
Systems (EIS) or in modelling, data mining or knowledge discovery
projects. It is a book for working professionals in many fields
of decision making: Economics (banking, financing, marketing),
business oriented computer science, ecology, medicine and
biology, sociology, engineering sciences and all other fields
of modelling of ill-defined systems.
Each chapter includes some practical examples and a reference
list for further reading. The accompanying diskette/internet
download contains the KnowledgeMiner Demo version and several
executable examples. This book offers a comprehensive view
to all major issues related to self-organising data mining
and its practical application for solving real-world problems.
It gives not only an introduction to self-organising data
mining, but provides answers to questions like:
- what is self-organising data mining compared with other
known data mining techniques,
- what are the pros, cons and difficulties of the main data
mining approaches,
- what problems can be solved by self-organising data mining,
specifically by using the KnowledgeMiner modelling and prediction
tool,
- what is the basic methodology for self-organising data
mining and application development using a set of real-world
business problems exemplarily,
- how to use KnowledgeMiner and how to prepare a problem
for solution. ..."
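The preface's remark that the covariance matrix becomes singular when there
are more variables than cases can be checked numerically. The following is a
minimal sketch, assuming NumPy is available; the sizes (5 cases, 10 variables)
and the random data are illustrative assumptions, not taken from the book.

```python
import numpy as np

# Illustrative assumption: 5 cases (records) described by 10 variables.
rng = np.random.default_rng(0)
n_cases, n_vars = 5, 10
X = rng.normal(size=(n_cases, n_vars))

# Sample covariance matrix of the variables (n_vars x n_vars).
cov = np.cov(X, rowvar=False)

# After centering, at most n_cases - 1 directions carry variance, so the
# rank is below n_vars and the matrix has no ordinary inverse.
rank = np.linalg.matrix_rank(cov)
print(f"covariance shape: {cov.shape}, rank: {rank}")
```

Because the rank cannot exceed the number of cases minus one, any method that
relies on inverting this matrix breaks down in the "more variables than cases"
setting the preface describes.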
Why Data Mining is needed
Decision making in every field of human activity requires problem
detection in addition to a decision maker's feeling that a
problem exists or that something is wrong. Every decision rests
on models. Building models to aid decision making is worthwhile
because models make it possible:
- to recognize the structure and function of complicated
objects (subject of identification) which leads to deeper
understanding of the problem. Models can usually be analysed
more readily than the original problem;
- to find appropriate means which can be used for exercising
an active influence on the objects (subject of control);
- to predict what the respective objects have to expect
in the future (subject of prediction) but also to experiment
with models, and thus to answer "what-if" type questions.
Mathematical modeling therefore forms the core of almost
all decision support systems.
Models can be derived from existing theory (theory driven
approach or theoretical systems analysis) and/or from data
(data driven approach or experimental systems analysis).
a. Theory-driven approach
For complex ill-defined systems, such as economic, ecological,
social, biological and other systems, we have insufficient a priori
knowledge about the relevant theory of the system under research.
Modeling based on a theory-driven approach is considerably
affected by the fact that the modeler often has to know things
about the system that are generally impossible to find out. This
concerns uncertain a priori information about the
selection of the model structure (factors of influence and
functional relations) as well as insufficient knowledge about
interference factors (actual interference factors and factors
of influence which cannot be measured). Accordingly, the
a priori knowledge about the object under research that is
typically insufficient relates to:
- the main factors of influence (endogenous variables or
input variables) and also the classification of variables
as endogenous and exogenous;
- the functional form of the relation between the variables
including the dynamic specification of the model;
- the description of errors such as their correlation structure.
To overcome these problems and to deal with ill-defined
systems and, in particular, with insufficient a priori knowledge,
ways are needed to reduce, with the help of emergent information
engineering, the time- and resource-intensive model formation
process required before initial task solving can start.
Computer-aided design of mathematical models may soon prove
highly valuable in bridging the gap.
b. Data-driven approach
Modern information technologies deliver a flood of data,
and the question is how to leverage it. Commonly, statistically
based principles are used for model formation. But they
always require a priori knowledge about
the structure of the mathematical model.
In addition to the epistemological problems of commonly used
statistical principles of model formation, several
methodological problems may arise from insufficient
a priori information. They stem from the indeterminacy
of the starting position, which is marked by the subjectivity
and incompleteness of the theoretical knowledge and by an
insufficient data basis.
Knowledge discovery from data, and specifically data mining
techniques and tools, can assist humans in analyzing the mountains
of data and in turning information located in the data into successful
decision making.
Data mining includes not just a single analytical technique
but many methods and techniques, depending on the nature of
the enquiry. These methods include data visualization, tree-based
methods and methods of mathematical statistics as well as
those for knowledge extraction from data using self-organizing
modeling.
Data mining is an interactive and iterative process of numerous
subtasks and decisions such as data selection and pre-processing,
choice and application of data mining algorithms and analysis
of the extracted knowledge. For a more sophisticated
data mining application, it is most important to limit the involvement
of users in the overall data mining process to the inclusion
of existing a priori knowledge, making this process more
automated and more objective.
Automatic model generation methods such as GMDH, Analog Complexing
and Fuzzy Rule Induction are based on these demands and sometimes
provide the only way to generate models of ill-defined problems.
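The GMDH idea outlined in the preface (pairwise combination of inputs,
validation of competing models, survival-of-the-fittest selection, and
layer-by-layer growth until the model stops improving) can be sketched
roughly as follows. This is a simplified illustration, not the algorithm
implemented in KnowledgeMiner: the function names (fit_neuron, eval_neuron,
gmdh_sketch), the bilinear neuron form, the use of a separate checking
sample as the selection criterion, and the synthetic data are all assumptions
made for the example.

```python
import itertools
import numpy as np


def fit_neuron(xi, xj, y):
    # Assumed neuron form: y ~ a0 + a1*xi + a2*xj + a3*xi*xj,
    # with parameters estimated by least squares on the learning data.
    A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def eval_neuron(coef, xi, xj):
    return coef[0] + coef[1] * xi + coef[2] * xj + coef[3] * xi * xj


def gmdh_sketch(X_learn, y_learn, X_check, y_check, keep=4, max_layers=5):
    # Grow layers of pairwise neurons. Stop as soon as the best error on
    # the checking data no longer improves (the model is complex enough).
    best_err, layers = np.inf, 0
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(X_learn.shape[1]), 2):
            coef = fit_neuron(X_learn[:, i], X_learn[:, j], y_learn)
            out_learn = eval_neuron(coef, X_learn[:, i], X_learn[:, j])
            out_check = eval_neuron(coef, X_check[:, i], X_check[:, j])
            err = float(np.mean((y_check - out_check) ** 2))
            candidates.append((err, out_learn, out_check))
        # Selection: only the best neurons survive into the next layer.
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:keep]
        if survivors[0][0] >= best_err:
            break
        best_err = survivors[0][0]
        layers += 1
        # Survivor outputs become the inputs of the next layer.
        X_learn = np.column_stack([c[1] for c in survivors])
        X_check = np.column_stack([c[2] for c in survivors])
    return best_err, layers


# Synthetic example data (an assumption): y depends on x0*x1 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=100)
best_err, layers = gmdh_sketch(X[:60], y[:60], X[60:], y[60:])
print(f"layers kept: {layers}, checking-sample MSE: {best_err:.4f}")
```

Neither the number of layers nor the neurons that survive are fixed in
advance. Both follow from the error on data not used for parameter
estimation, which is how the sketch guards against overfitted models.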