Frequently Asked Questions
Q: What does GMDH stand for?
A: GMDH stands for Group Method of Data Handling. It is an inductive,
statistical learning network technology that uses the cybernetic approach
of self-organization, drawing on systems, information and control
theory and on computer science. GMDH is not a traditional statistical
modeling method. It is an interdisciplinary approach designed to overcome
some of the main disadvantages of statistics and neural networks (NNs).
Below is a description of GMDH from the preface to Farlow's book.
"In statistics nowadays there is a distinguishable
trend away from the restrictive assumptions of parametric
analysis and toward the more computer-oriented area of what
is generally known as nonparametric data analysis. One of
the more fascinating concepts from this new generation of
research is what is known as the GMDH algorithm, which was
introduced and is currently being developed by the Ukrainian
cyberneticist and engineer A.G.Ivakhnenko.
What is known these days as a heuristic, the GMDH algorithm
constructs high-order regression-type models for complex systems
and has the advantage over traditional modeling in that the
modeler can more-or-less throw into the algorithm all sorts
of input/output types of observations, and the computer does
the rest. The computer self-organizes the model from a simple
one to one of optimal complexity by a methodology not unlike
the process of natural evolution. It is the purpose of this
book to introduce to English-speaking people the basic GMDH
algorithm, present variations and examples of its use and
list a bibliography of all published work in this growing
area of research."
From: S. J. Farlow (ed.), Self-Organizing Methods in Modeling:
GMDH Type Algorithms (1984)
Here is how Prof. A.G. Ivakhnenko describes GMDH:
"The Group Method of Data Handling (GMDH) is a self-organizing
approach based on sorting out gradually more complicated models
and evaluating them by an external criterion on a separate part
of the data sample. Any parameters that can influence the process
can be used as input variables.
Linear or non-linear, probabilistic models or clusterizations
are selected by the minimal value of an external criterion.
The sorting algorithms are rather simple and they get information
directly from the data sample. The effective input variables,
the number of layers and of neurons in hidden layers, and the optimal
model structure are determined automatically. This is based on
the fact that the external criterion characteristic has a minimum
as the model structure becomes more complicated. The GMDH inductive
approach is different from commonly used deductive techniques and
networks.
The GMDH was developed for complex systems modelling, forecasting
and data mining, analysis of multivariate processes, decision
support through "what-if" scenarios, diagnostics, pattern recognition
and clusterization of data samples. Since 1968 many books and
more than 230 doctoral dissertations have been devoted to investigations
in very different fields. It was proved that for inaccurate,
noisy or small data samples an optimal simplified model can be found
whose accuracy is higher and whose structure is simpler
than that of the usual full physical model. For real problems
with noisy or short data samples, simplified forecast models
become more effective.
Recent developments of the GMDH have led to neuronets with
active neurons, which realize a twice-multilayered structure:
the neurons are multilayered and they are connected into a multilayered
structure. This makes it possible to optimize the set of
input variables at each layer, while the accuracy increases.
The accuracy of forecasting, approximation or pattern recognition
can be increased beyond the limits which are reached by
neuronets or statistical methods.
Not only GMDH algorithms, but many modeling or pattern
recognition algorithms can be used as active neurons. Their
accuracy can be increased in two ways:
- each output of an algorithm (active neuron) generates a new
variable which can be used as a new factor in the next layers
of the neuronet;
- the set of input factors can be optimized at each layer.
In the usual once-multilayered NN the set of input variables
can be chosen once only. The output variables of previous
layers in such networks are very effective secondary inputs
for the neurons of the next layers."
Neuronets with active neurons and the basic GMDH algorithms
are described in:
- Ivakhnenko, A.G., Ivakhnenko, G.A., Mueller, J.A.: Self-Organization
of Neuronets with Active Neurons. Pattern Recognition and Image
Analysis, 1994, vol. 4, no. 2.
- Ivakhnenko, A.G., Mueller, J.A.: Self-Organization of Nets of Active
Neurons. SAMS, 1995, vol. 20, pp. 93-106.
The GMDH theory was also published in:
- Madala, H.R., Ivakhnenko, A.G.: Inductive Learning Algorithms for
Complex System Modeling. CRC Press, 1994.
- Farlow, S.J. (ed.): Self-Organizing Methods in Modeling
(Statistics: Textbooks and Monographs, vol. 54). Marcel Dekker Inc., 1984.
You can find a short introduction in Paper
1, or a comprehensive treatment in the new book by Mueller/Lemke,
"Self-Organising Data Mining".
You may also want to look at the publications
area for more information.
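The selection principle described in this answer can be illustrated with a small sketch. It is not KnowledgeMiner's implementation: the six-term two-input neuron, the fixed split into a training and a validation part, the tiny ridge term that keeps the normal equations solvable, and all function names are assumptions chosen for illustration. Each layer fits one neuron per pair of inputs on the training part, ranks the neurons by their error on the separate validation part (the external criterion), and stops adding layers as soon as that error no longer improves.

```python
import itertools
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def features(u, v):
    """Terms of a second-order neuron with two inputs."""
    return [1.0, u, v, u * v, u * u, v * v]

def fit_neuron(us, vs, ys, ridge=1e-8):
    """Least-squares coefficients via the normal equations (ridge is an assumption)."""
    rows = [features(u, v) for u, v in zip(us, vs)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)

def predict(coef, us, vs):
    return [sum(c * f for c, f in zip(coef, features(u, v))) for u, v in zip(us, vs)]

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def gmdh(train_cols, valid_cols, y_train, y_valid, keep=4, max_layers=5):
    """Add layers while the external (validation) criterion keeps improving.

    Returns (best validation error, number of layers used)."""
    best_err, layers = float("inf"), 0
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(len(train_cols)), 2):
            coef = fit_neuron(train_cols[i], train_cols[j], y_train)
            out_t = predict(coef, train_cols[i], train_cols[j])
            out_v = predict(coef, valid_cols[i], valid_cols[j])
            candidates.append((mse(out_v, y_valid), out_t, out_v))
        candidates.sort(key=lambda c: c[0])
        if candidates[0][0] >= best_err:
            break  # the external criterion has passed its minimum
        best_err, layers = candidates[0][0], layers + 1
        # outputs of the best neurons become the inputs of the next layer
        train_cols = [c[1] for c in candidates[:keep]]
        valid_cols = [c[2] for c in candidates[:keep]]
    return best_err, layers

# Synthetic demo data (hypothetical): 4 input columns, 40 samples, noisy target.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(40)] for _ in range(4)]
y = [X[0][k] * X[1][k] + 0.5 * X[2][k] + rng.gauss(0, 0.05) for k in range(40)]
err, layers = gmdh([c[:28] for c in X], [c[28:] for c in X], y[:28], y[28:])
print(f"validation MSE {err:.4f} after {layers} layer(s)")
```

Note that the model is never chosen by its fit to the training part; only the validation error decides which neurons survive and when growth stops.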
Q: Would you consider your products suitable for financial
and product-demand forecasting using numerous variable inputs?
A: Yes, this is one of the primary application fields
for KnowledgeMiner. In contrast to statistics or NNs, you
can use more variables than there are samples available for modeling.
For example, you can create a prediction model (e.g. a linear system
of equations) of 40 variables even if only 30 observations
of each variable are available. You can consider up to 500
input variables (lagged and unlagged) in KnowledgeMiner to
model complex time processes. Additionally, KnowledgeMiner
implements Analog Complexing, a powerful
prediction technique for fuzzy processes such as financial markets.
KnowledgeMiner when used on financial markets could really
strike gold!
Q: It "feels" like KnowledgeMiner might assist in detecting
relationships among certain patient groups by clinical criteria
vs. fluid measurements that may be missed by an individual.
If I understand the application of KnowledgeMiner, I believe
I should be able to take our database of eye features along
with the diopter measurements for each patient in the database,
plug those into KnowledgeMiner and then KnowledgeMiner will
derive an equation for calculating the diopter measurement
of a patient as a function of the patient's image features.
Is this true?
A: Yes, exactly. This is something KnowledgeMiner
can do.
Q: First, I'd like to compliment you on your choice of
platform ;-). With Motorola's new math libraries and the higher
clock speeds of their chips, math intensive applications such
as KnowledgeMiner (KM) are best run on a Mac; Byte's recent
benchmarks show the new Macs running twice as fast in SpecInt
and 50% faster in SpecFPU than comparable P5 or P6 chips running
at the same speed.
I'm a defense contractor in the U.S. and I also work as a
consultant doing image processing programming and object classification
work. I downloaded the KM Demo last night and was very impressed
with what I think I saw! Congratulations on a very impressive
algorithm and its implementation into a GUI that everyone
can use. One of the difficulties with Statistical Pattern
Recognition in my application is that one might not get a
sufficiently sophisticated classifier to give the best possible
results (for example using a linear classifier instead of
a more complex quadratic classifier). It appears that KM does
not suffer from this problem because it appears to produce
a bona fide nonlinear equation which should optimally accommodate
any irregular shaping of the class populations in feature
space. Is this true?
A: Yes, you are correct. One important feature of
KnowledgeMiner is that it creates models in an evolutionary
way: from very simple models to increasingly more complex
ones. It stops automatically when an optimally complex model
is found, that is, at the point where any further complexity
would begin to overfit the design data
(the data used to create relationships between variables).
Q: The possibility of a time-lag model is really interesting
too. In human training studies, the number of measurements per
year is very low (2-6) compared to the number of test variables (10-20).
How many subjects are necessary?
A: The same is true if you want to create a dynamic
model. In contrast to statistics or Neural Networks, KnowledgeMiner
can deal with a very small number of cases (6+). In fact,
the number of cases used for modeling can be smaller than
the number of variables (so-called under-determined tasks).
So it is indeed possible to use 10 variables and
only 6-10 samples to create a linear system of equations.
Q: What would be the largest table (columns and rows) KnowledgeMiner
could accommodate if allocated 100 MB of RAM?
A: The table gives approximate values for orientation:
Rows      Inputs
100       500
200       350
300       280
400       240
500       210
1000      150
2000      110
5000      70
KnowledgeMiner optimizes several modeling tasks, so it is
not possible to give exact values in advance. The real memory
requirements may actually be smaller.
Q: We've purchased NGO and our company is Windows-based.
I'll be doing KM at home on my Performa 6400/180. If I get
the full version, what kind of performance can I expect on
the Performa?
A: Two aspects matter: processor speed and RAM. 180 MHz is
fine even for large problems. For small modeling problems
(< 50 inputs and < 100 samples), creating a GMDH model takes a few
minutes and temporarily needs roughly 100 KB-2 MB of RAM
(once you are familiar with the program). However, RAM requirements
grow rapidly (10-100 MB and more) for larger modeling problems
(> 100 inputs and > 500 samples). It can then take
up to an hour or two to get a model. Alternative
methods would take days or weeks on problems of this
complexity.
Q: How many records of data can I put into KM if I have
11 inputs and 6 outputs? Will I need a different data sheet
for each output even if the input values are the same? Is
copying and pasting the easiest way, or should I save subsequent
output entries under different names?
A: KM can handle up to 500
inputs (including lagged variables for dynamic modeling) and
a virtually unlimited number of outputs (read: models) in
a single document using the same physical data sheet without
copying/pasting any data. All models are stored in a model
base and for each column of the sheet, 4 different model types
can be created and stored simultaneously: a time series model
(auto-regressive), an input-output model (static or dynamic),
a system model (multi-input/multi-output) and an Analog Complexing
model.
KM 3.0 implements a third modeling method: self-organizing
fuzzy-rule induction, or Fuzzy-GMDH. So a fifth model can
be added to the model base for each column. KM 3 will also
extend the spreadsheet from the current 1,000 rows to 10,000
rows.
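To make the phrase "lagged variables" concrete, here is a minimal sketch of how one column can be expanded into lagged inputs for a time series (auto-regressive) model. The function name and the layout are assumptions for illustration; KnowledgeMiner builds these inputs internally from the data sheet.

```python
def lagged_inputs(series, max_lag):
    """Return (inputs, targets): inputs[k] holds x(t-1) ... x(t-max_lag) for target x(t)."""
    inputs, targets = [], []
    for t in range(max_lag, len(series)):
        inputs.append([series[t - lag] for lag in range(1, max_lag + 1)])
        targets.append(series[t])
    return inputs, targets

inputs, targets = lagged_inputs([1, 2, 3, 4, 5], max_lag=2)
print(inputs)   # [[2, 1], [3, 2], [4, 3]]
print(targets)  # [3, 4, 5]
```

Every lag of every variable counts as one input, so the number of inputs grows quickly with the maximum lag, while the first max_lag rows are lost as targets.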
Q: I've recently downloaded KM and I am wondering how it
compares to NGO for Windows. I'm going to compare models and
"closeness" of fit between the two but I'm concerned the demo
version will cut me out at 4 levels and not fit as well as
it may have if I had the full potential of the full edition.
What are the advantages of KM over NGO or even more expensive
software packages such as GenSym?
A: This is described in part elsewhere in this
FAQ. Another important advantage is that KM always produces
a model description usable for interpretation and analysis.
You can see why results are as they are and which variables
KM has selected as relevant. For fuzzy-rule induction,
for example, you will get models in an almost natural language,
as this model from the wine recognition example shows:
IF N_Flavanoids & NOT_N_Nonflavanoid phenols & NOT_N_Color intensity
OR NOT_N_Ash & N_OD280/OD315 of diluted wines
OR NOT_N_Color intensity & NOT_P_Magnesium & N_Flavanoids
   & NOT_N_Alcalinity of ash & NOT_P_Hue
THEN wine_cultivar #3
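One way to read such a rule is as an OR of AND-terms over fuzzy membership values between 0 and 1. The sketch below evaluates the rule above under assumptions that are not stated in this FAQ: AND is taken as the minimum, OR as the maximum, and NOT as 1 minus the membership, and the membership values are made up. KnowledgeMiner's actual fuzzy operators may differ.

```python
# Each inner list is one AND-term; a literal is (membership name, negated?).
RULE = [
    [("N_Flavanoids", False), ("N_Nonflavanoid phenols", True),
     ("N_Color intensity", True)],
    [("N_Ash", True), ("N_OD280/OD315 of diluted wines", False)],
    [("N_Color intensity", True), ("P_Magnesium", True), ("N_Flavanoids", False),
     ("N_Alcalinity of ash", True), ("P_Hue", True)],
]

def rule_degree(rule, membership):
    """Degree to which the rule fires: max over AND-terms of the min over literals."""
    def literal(name, negated):
        value = membership[name]
        return 1.0 - value if negated else value
    return max(min(literal(name, neg) for name, neg in term) for term in rule)

# Hypothetical fuzzified measurements of one wine sample.
sample = {
    "N_Flavanoids": 0.9, "N_Nonflavanoid phenols": 0.2, "N_Color intensity": 0.1,
    "N_Ash": 0.4, "N_OD280/OD315 of diluted wines": 0.7, "P_Magnesium": 0.3,
    "N_Alcalinity of ash": 0.5, "P_Hue": 0.2,
}
print(round(rule_degree(RULE, sample), 3))  # 0.8, from the first AND-term
```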
The main difference, however, is that KM, in addition to
the black-box approach and the connectionism of NNs, is based
on a third principle called inductive self-organization.
Inductive self-organizing modeling theory and practice have
shown that "closeness of fit" cannot be the only criterion
for finding a "best" model. It is necessary, during modeling,
to validate each model candidate's performance on some new
data. If this step is missing (as is commonly seen in neuro-fuzzy-genetic
approaches), models will inherently tend to be overfitted.
This is because it is always possible (at least theoretically)
to formulate a model that fits any given (finite) learning
data set with almost 100% accuracy, following the rule "the
more complicated the model is, the more accurately it will
fit the given data." This is also true for completely random
samples. For noisy data, this means that, at a certain point
in modeling, the model begins to fit the noise (overfitting),
which results in bad or catastrophic performance on new data.
The model fits the design data better, but at the same time
it loses accuracy when applied to previously unseen data:
it is too complex. So the problem is to find the point where
a model begins to reflect random relations. We call this creating
an optimally complex model. GMDH can do this.
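The claim that a sufficiently complicated model fits any finite data set can be checked with a small experiment. The sketch below is illustrative only: the ten data points, the fixed "noise" values and the function names are made up. It compares a straight line with the degree-9 polynomial that passes through all ten design points, first on the design data and then on new points that follow the noise-free relation y = x.

```python
def interpolate(xs, ys, x):
    """Value at x of the polynomial of degree len(xs)-1 through all (xs, ys) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

# Design data: y = x plus fixed pseudo-noise (hypothetical values).
xs = [float(i) for i in range(10)]
noise = [0.3, -0.4, 0.1, -0.2, 0.5, -0.3, 0.2, -0.5, 0.4, -0.1]
ys = [x + e for x, e in zip(xs, noise)]
# New data: points between the design points, following the true relation y = x.
new_xs = [x + 0.5 for x in xs[:-1]]
new_ys = list(new_xs)

a, b = fit_line(xs, ys)
train_line = mse([a + b * x for x in xs], ys)
new_line = mse([a + b * x for x in new_xs], new_ys)
train_poly = mse([interpolate(xs, ys, x) for x in xs], ys)
new_poly = mse([interpolate(xs, ys, x) for x in new_xs], new_ys)

print(f"line:       design MSE {train_line:.4f}, new-data MSE {new_line:.4f}")
print(f"polynomial: design MSE {train_poly:.4f}, new-data MSE {new_poly:.4f}")
```

The polynomial reproduces the design data exactly, noise included, and pays for it on the new points; the simpler line fits the design data less closely but predicts the new points far better.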
Q: I seem to be having a problem defining input/output variables for a
multi-input/multi-output GMDH model. I have 87 input variables, and I want
to use only these inputs to estimate 174 output (target) variables. I want
to employ a nonlinear, static, multi-input/multi-output model. All variables
(input and output) exhibit strong collinearity among nearby variables (they
are ordered). I want my resulting equations to be expressed only in terms of
the 87 input variables (i.e., not in some combination of the inputs and
outputs). However, this is not what I see in the results; in many of the
equations, both various input and output variables are used. This occurs
even when the "systems of equations" options is NOT selected. Thus, I have a
feeling that the problem may be with how I am selecting my output variables.
When the selection mask is on, I define the first Y (output variable), and I
then select ALL other variables (otherwise, KM won't consider a variable
in any way, as input or output, if it is not selected). I have tried to use
the Range option to enter all the target (output) variables, but exercising
this option does not seem to have the desired effect. So my question is: HOW
DO I EXPLICITLY DEFINE MULTIPLE OUTPUT VARIABLES?
A:
i.) The objective of systems of equations is to describe the interdependence
structure of a set of system variables. The variables depend on each other,
so each one is treated as both an input and an output variable. In a coming
version, it will also be possible to mark a variable exclusively as an input
(exogenous variable). But I understand that you do not want to model the
interdependence structure, just a set of multi-input/single-output
models. If this is true and you use a Mac, you can solve this problem by
running an AppleScript. For example, if your inputs are located in columns
X1 - X87, followed by the 174 output variables, use something like this:
tell application "KnowledgeMiner 4.0"
    activate
    set i to 89 -- the first target column
    repeat 174 times
        select cell i -- select the target variable
        select cells 2 thru 88 with union -- select the inputs
        create new GMDH model with properties {linear:false} -- create the nonlinear model
        set i to i + 1 -- next one
    end repeat
end tell
ii.) If you know a priori that certain inputs are highly collinear, remove
redundant variables before modeling if possible. KM checks for collinearity
during modeling, but you can save modeling time by reducing the number of
variables.
Q: The AppleScript worked great. Thanks. I was just wondering though how to
include other options/switches in the AppleScript? For instance, how would
you define the data length, or choose whether normalization is used,
how/whether layer break-through is applied, the % of active neurons, and the
number of network layers via an AppleScript. I guess I just need to know the
syntax. With the AppleScript that you provided me, is the default for these
other parameters/options the same as what appears in the dialog box once
"More options" are shown (i.e., normalization =yes; layer break-through on
lagged and non-lagged inputs (which, in the case of a static model, only
non-lagged inputs would be considered for layer break-through); active
neurons=10%; default # of network layers, etc.). Is this correct? How
sensitive is the model output to the # of network layers option anyway?
A: Dragging and dropping KM onto the Script Editor icon opens the KM
dictionary with the commands and objects it supports. You can also find a summary in
the KnowledgeMiner Dictionary PDF file located in KM's AppleScript support
folder. Generally, when GMDH models are created via AppleScript, the default
parameters of the "Fewer Choices" dialog window are used: normalization is on,
layer break-through applies to all inputs, and the share of active neurons and
the number of models selected are self-adjusting. To keep it simple, these parameters
cannot be modified via AppleScript (yet). They have been fine-tuned recently
and will be available in the coming version. Other parameters, those listed in the
dictionary, can be set up in a script.
Q: I am a user of your product. Recently I have been doing some research on
GMDH, so I use KnowledgeMiner very frequently. It's very good, but I don't know
whether I can choose the external criteria and the grouping of data. If your
software has this feature, could you tell me? Thank you!
A: KnowledgeMiner has a built-in hierarchy of criteria:
- a main criterion: Prediction Error Sum of Squares (PESS), and
- an auxiliary criterion: Approximation Error Variance.
We have chosen PESS because it has turned out to be very effective and stable, and
KM's main purpose is prediction, i.e., high generalization power is
required. In the KM implementation, PESS performs a leave-one-out
cross-validation on every generated model candidate during modeling, so the data
are subdivided into training and testing sets internally. There is no
need for a user to consider this regularization task. However, you can also hold
out an examination data set, which is used to measure a model's performance on that
data during modeling. The examination data are always cut from the end
of the data. You will find more information in our book, which will hopefully be
available in Chinese in 2002.
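The leave-one-out idea behind PESS can be sketched in a few lines. This is an illustration, not KM's code: the two toy candidate models (a constant and a line through the origin), the data and the function names are assumptions. Each sample is predicted by a model fitted without that sample, and the squared prediction errors are summed.

```python
def pess(fit, xs, ys):
    """Prediction error sum of squares by leave-one-out cross-validation.

    `fit(xs, ys)` must return a function that predicts y for a given x."""
    total = 0.0
    for k in range(len(xs)):
        model = fit(xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:])  # fit without sample k
        total += (ys[k] - model(xs[k])) ** 2                   # test on sample k only
    return total

def fit_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_proportional(xs, ys):
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: b * x

# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
for name, fit in [("constant", fit_constant), ("proportional", fit_proportional)]:
    print(name, round(pess(fit, xs, ys), 3))
```

The candidate with the smaller PESS is the one expected to predict new data better; no separate test set has to be set aside by hand.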
Q: As a clarification of an earlier question, I think what I was really
wondering about the complexity of the output model equations is this: are
the outputs constrained to be of order 2? I had previously been under the impression that, if needed, KnowledgeMiner has the capability to output models of order n. However, after examining the
output that I have so far, as well as the documentation at the beginning of
the KM chapter (chpt 7?) in the self-organizing book, it now seems to me
that the most nonlinear polynomials KnowledgeMiner can output is only
second-order polynomials. Is this true? If so, what is the justification? It seems like there might have been one in the book, but I didn't really understand it. If true, how would GMDH compete with an artificial neural network in this respect if the ANN has the
capability to model highly nonlinear relationships while this GMDH
implementation (may) only model second-order relationships?
A: Your initial impression was correct. A nonlinear GMDH model can of course be of order n. Only the transfer function of a single active neuron has a maximum order of 2; because the outputs of neurons become inputs of neurons in later layers, the order of the overall model can grow with each layer.
A key difference between ANN and GMDH here is that GMDH generates optimally
complex models due to its inherent noise filtration. The point is that you
can model any finite data set with almost zero error by just increasing the
complexity (polynomial order, number of inputs) of the model. However, if the data contain noise, and this is
the case for any real-world data, such an "ideally" fitted model also
describes that noise and will therefore do badly on new data. Only an
optimally complex model, with respect to the noise variance in the data, will
not overfit the data and will generalize well. ANNs do not have such a built-in mechanism to avoid
overfitting during learning. Looking exclusively at an error criterion computed on the learning data
is not sufficient to judge a model good or bad. Although this is still
widely done in data mining, it is hype.
So noise filtration is key to avoid modeling stochastic correlations. Version 5 of KnowledgeMiner adds a second level of validation to every parametric GMDH model.
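How second-order neurons add up to a higher-order model can be seen by composing them. In this sketch polynomials in one variable x are stored as coefficient lists, and a neuron with the assumed six-term form a0 + a1*u + a2*v + a3*u*v + a4*u^2 + a5*v^2 is applied layer after layer to its own output. All coefficients are set to 1 purely for illustration.

```python
def poly_add(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0.0) + (q[i] if i < len(q) else 0.0)
            for i in range(n)]

def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def neuron(u, v, a=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Second-order transfer function applied to the polynomials u and v."""
    terms = [[1.0], u, v, poly_mul(u, v), poly_mul(u, u), poly_mul(v, v)]
    out = [0.0]
    for coef, term in zip(a, terms):
        out = poly_add(out, [coef * c for c in term])
    return out

def degree(p):
    return max(i for i, c in enumerate(p) if c != 0.0)

signal = [0.0, 1.0]  # the polynomial "x"
degrees = []
for _ in range(3):   # three layers
    signal = neuron(signal, signal)
    degrees.append(degree(signal))
print(degrees)  # [2, 4, 8]
```

Each layer can double the order, so a network with k layers of such neurons can represent polynomials of order up to 2^k, while the selection by the external criterion decides how much of that is actually used.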
Q: I have been using KnowledgeMiner for nearly two years now, and have
found it extremely useful. I have a particular problem at hand on which I
am seeking your advice as to how best to use KM.
Context
I have a data set about 35 variables wide and many thousands of records
long. It is not a time series. The records are not inter-dependent.
The objective is to predict a single variable. This is a classification
problem: the variable can only take on certain values. In the simplest
abstraction, it assumes binary characteristics (the element represented by
the variable is either on or off, say a 0 or 1). In its most complex
abstraction, it can assume four states only.
The data is noisy and the correlation to a unifying rule appears weak.
Using GMDH Input/Output modelling, the approximate error variance is
usually around 0.83! Nonetheless, there is just sufficient discrimination
to determine a pattern emerging to make me think that an FRI approach may
yield better results.
A: We've seen similar results on long and noisy (incomplete) datasets for classification. Here, the Approximation Error Variance (AEV) criterion is not a very appropriate
measure. For binary classification, ROC analysis provides better quality
measures. In fact, even if AEV is high, ROC often shows that the
discrimination is quite good.
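A small made-up example shows how both statements can hold at once. The normalized squared error below is only in the spirit of AEV (KM's exact definition is not given in this FAQ), and the area under the ROC curve is computed by directly comparing every positive case with every negative case. The labels and scores are hypothetical.

```python
def normalized_error(labels, scores):
    """Squared error of the scores relative to that of always predicting the mean label."""
    mean = sum(labels) / len(labels)
    return (sum((y - s) ** 2 for y, s in zip(labels, scores))
            / sum((y - mean) ** 2 for y in labels))

def auc(labels, scores):
    """Area under the ROC curve: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]
scores = [0.40, 0.42, 0.44, 0.46, 0.48]  # compressed range, but correctly ordered
print(round(normalized_error(labels, scores), 2))  # 0.91
print(auc(labels, scores))                         # 1.0
```

The model output barely moves away from the mean, so the squared-error measure looks poor, yet every positive case scores above every negative one, which is what matters once a decision threshold is chosen.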
Q: I am not familiar with the ROC function introduced in KMv3.3. Where can I find out more about it?
A: We have put some general information about ROC analysis on the site.
Q: How does one combine the predictions of an FRI Model and an Input/Output model (the book suggests you take the average)?
A: The simplest way is to use their mean. Another way is to create a hybrid model: use the outputs of all your models (their data representation)
as inputs of a GMDH model. The result is a combined model or, when only
one model is chosen, just a vote. Then apply this model to the
predictions of each selected individual model to get a final prediction.
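Both options can be sketched as follows. The mean is exactly what the answer describes. For the hybrid model, a single least-squares blending weight is used here as a simple stand-in for the GMDH model that would normally be built on top of the individual predictions; that substitution, the data and the names fri_pred and io_pred are assumptions for illustration.

```python
def mean_combination(p1, p2):
    return [(a + b) / 2 for a, b in zip(p1, p2)]

def fit_blend_weight(p1, p2, y):
    """Least-squares w for y ~ w*p1 + (1-w)*p2, fitted on cases with known outcome."""
    num = sum((a - b) * (t - b) for a, b, t in zip(p1, p2, y))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5  # identical predictions: any weight works

def blend(p1, p2, w):
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]

# Hypothetical predictions of an FRI model and an input-output model.
y_known = [1.0, 2.0, 3.0, 4.0]
fri_pred = [1.2, 1.9, 3.3, 3.8]
io_pred = [0.7, 2.4, 2.6, 4.5]

print(mean_combination(fri_pred, io_pred))
w = fit_blend_weight(fri_pred, io_pred, y_known)
print(round(w, 3), blend(fri_pred, io_pred, w))
```

A weight near 1 or 0 corresponds to the "vote" case in which effectively only one of the models is chosen.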