BUG

Issues related to “Sim Gen” node ([email protected])
Modeler 16 is brilliant, wonderful and amazing, but there are issues with the “Sim Gen” node. This
document has two parts: 1. The issues with the Sim Gen node, and 2. The evidence that the results the IBM
Modeler produces may not be optimal, and a demonstration that it’s possible to produce results which
are statistically within reasonable bounds. The concept is valuable and the implementation probably needs
some “tweaks”.
Some of the issues I have found with the “Sim Gen” node:
1. You can generate numeric values which are cast as Strings from within the Sim Gen node
and which are, of course, uninterpretable. It should not be possible (or even “legal”) to do this.
2. When a request is made for Integers that range from 1 to 9, it produces Reals that range
from 1.0 to 8.0.
3. When a request is made for the uniform distribution (the equivalent of a Beta with shape1
and shape2 = 1.0), it produces distributions which are substantially, significantly, and
systematically different from the correct shape and from what I would expect in a reasonable
universe.
4. When a request is made for a set of correlations between variables, there is apparently no
test to determine if the matrix can be made positive definite (a minimal requirement for a
correlation matrix to be useful).
5. Finally, when categorical variables are created, I could not find any evidence of appropriate
correlations between levels of factors and continuous variables.
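Issue 2 in the list above is easy to check mechanically once the generated field is exported. The following is only a minimal sketch using the Python standard library; the function name check_integer_field and the two sample lists are hypothetical, and the "bad" list merely mimics the reported behaviour (Reals from 1.0 to 8.0) rather than reproducing actual Modeler output.

```python
import random

def check_integer_field(values, low, high):
    """Return a list of problems for a field requested as integers in [low, high]."""
    problems = []
    if not all(isinstance(v, int) for v in values):
        problems.append("non-integer storage type")
    if min(values) != low or max(values) != high:
        problems.append(
            f"observed range {min(values)}..{max(values)}, requested {low}..{high}"
        )
    return problems

rng = random.Random(16)

# What a request for Integers 1..9 should give: randint is inclusive on both ends.
good = [rng.randint(1, 9) for _ in range(100_000)]

# Hypothetical stand-in for the reported behaviour: Reals from 1.0 to 8.0.
bad = [float(rng.randint(1, 8)) for _ in range(100_000)]

print("good:", check_integer_field(good, 1, 9))
print("bad: ", check_integer_field(bad, 1, 9))
```

With 100,000 cases every value from 1 to 9 should appear, so a missing upper bound is a defect rather than sampling noise.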
What follows is evidence of these issues and a demonstration that a correct result can be produced
which differs from what the “Sim Gen” node produces.
Distribution Errors
Assume that we begin with a “Sim Gen” node and I request a distribution that looks like this:
What it produces is this (and similar results in other trials):
I can test whether this deviates significantly from a uniform distribution (at, say, the 0.00000001
level).
This is evident from this histogram which looks at the differential slope across the distribution.
And I can do a direct test which proves the distribution is not close to what one might anticipate in a
random uniform universe where we sample 100,000 cases.
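One common way to run such a direct test is the one-sample Kolmogorov-Smirnov test against Uniform(0, 1). The sketch below is an assumption about how it could be done, not the test used for the figures in this report; the function names are hypothetical, the critical value uses the standard asymptotic approximation, and the sample is generated by Python itself rather than exported from Modeler.

```python
import math
import random

def ks_uniform(sample):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Compare the empirical CDF just after and just before x with F(x) = x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def ks_critical(n, alpha):
    """Asymptotic critical value: solve alpha = 2 * exp(-2 * n * d**2) for d."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

rng = random.Random(2014)
n = 100_000
sample = [rng.random() for _ in range(n)]  # replace with the exported field

d = ks_uniform(sample)
crit = ks_critical(n, 1e-8)
print(f"D = {d:.5f}, critical value at alpha = 1e-8: {crit:.5f}")
print("reject uniformity" if d > crit else "consistent with uniform")
```

To test a field generated by the Sim Gen node, the exported values (rescaled to the 0 to 1 range if necessary) would replace the sample list.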
Following are the statistical tests demonstrating that the shape1 and shape2 values are outliers
relative to what we would anticipate when sampling 100,000 cases.
REAL1: Shape1 = 1.0 – 0.046 = 0.954 shape2 = 1.0 – 0.046 = 0.954
REAL2: Shape1 = 1.0 – 0.035 = 0.965 shape2 = 1.0 – 0.029 = 0.971
REAL3: Shape1 = 1.0 – 0.053 = 0.947 shape2 = 1.0 – 0.055 = 0.945
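To get a feel for how far from 1.0 the shape values of a genuinely uniform sample of 100,000 cases tend to fall, one can fit Beta shapes to a few simulated samples. The sketch below uses method-of-moments estimates, which is an assumption made for brevity; the shape values reported above may have been fitted differently (for example by maximum likelihood), so the comparison is indicative only. The function name is hypothetical.

```python
import random

def beta_shapes_by_moments(sample):
    """Method-of-moments estimates of (shape1, shape2) for data on [0, 1]."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

rng = random.Random(954)
estimates = []
for _ in range(5):
    sample = [rng.random() for _ in range(100_000)]
    estimates.append(beta_shapes_by_moments(sample))

for trial, (shape1, shape2) in enumerate(estimates, start=1):
    print(f"trial {trial}: shape1 = {shape1:.4f}, shape2 = {shape2:.4f}")
```

For an exactly uniform distribution the mean is 1/2 and the variance is 1/12, so both estimates equal 1.0; the spread across trials shows how much deviation sampling alone produces at this sample size.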
And when we look at the distribution of scores broken up into 10 equal-width bins along the
x-axis, we get the following, which shows a distinct non-uniformity.
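The 10-bin view lends itself to a chi-square goodness-of-fit test with 9 degrees of freedom. The following is a sketch under stated assumptions: the function name is hypothetical, the cutoff 27.877 is the standard chi-square critical value for 9 degrees of freedom at the 0.001 level, and the second sample is deliberately drawn from a Beta(0.954, 0.954) distribution (the shape values listed for REAL1) purely as a simulated contrast. It is not Modeler output.

```python
import random

def chi_square_uniform(sample, bins=10):
    """Counts per equal-width bin on [0, 1] and the chi-square statistic."""
    counts = [0] * bins
    for x in sample:
        # min() keeps a value of exactly 1.0 in the last bin.
        counts[min(int(x * bins), bins - 1)] += 1
    expected = len(sample) / bins
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return counts, stat

CRITICAL_DF9_P001 = 27.877  # chi-square, 9 degrees of freedom, alpha = 0.001

rng = random.Random(10)
n = 100_000
uniform_sample = [rng.random() for _ in range(n)]
beta_sample = [rng.betavariate(0.954, 0.954) for _ in range(n)]  # simulated contrast

uniform_counts, uniform_stat = chi_square_uniform(uniform_sample)
beta_counts, beta_stat = chi_square_uniform(beta_sample)

for label, counts, stat in (("uniform", uniform_counts, uniform_stat),
                            ("beta(0.954, 0.954)", beta_counts, beta_stat)):
    verdict = "reject" if stat > CRITICAL_DF9_P001 else "do not reject"
    print(f"{label}: counts = {counts}")
    print(f"{label}: chi-square = {stat:.1f} -> {verdict} uniformity")
```

Shape values slightly below 1.0 put extra mass in the first and last bins, which is the kind of pattern a binned test picks up readily at this sample size.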
What I believe it SHOULD look like:
On the other hand, I (and I’m sure many others) can produce a uniform distribution which does not
deviate substantially from what we’d anticipate in a random universe which produces a uniform
distribution. We can see here that the shape values are all very close to 1.0.
We can also see this in the histogram below. This is what I produced: the distribution is nearly flat
and not systematically skewed one way or another. This is what a randomly generated uniform distribution
ought to look like.
Issues with the correlations:
There are similar issues with the correlations. When a request is made for the following correlations
(table below), it produces correlations which are substantially different. It appears that it does not
attempt to create data which fit these constraints while also yielding a positive definite matrix.
You can see below that the requested 0.800 becomes 0.726 and the requested 0.60 becomes 0.500. With
a sample size of 100,000 (which is what is given in the IBM/SPSS demo), the correlations should not
be that far off.
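One standard way to generate data with requested correlations is to take the Cholesky factor of the requested matrix and apply it to independent standard normal draws; the factorisation fails exactly when the matrix is not positive definite, so it doubles as the missing validity test. The sketch below is illustrative only: it assumes multivariate normal variables, the function names are hypothetical, and the third correlation (0.5) is an assumed value, since only 0.8 and 0.6 are quoted in the text.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular factor of matrix; raises ValueError if not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Requested correlations; 0.8 and 0.6 come from the text, 0.5 is an assumption.
target = [[1.0, 0.8, 0.6],
          [0.8, 1.0, 0.5],
          [0.6, 0.5, 1.0]]

lower = cholesky(target)  # raises here if the request cannot be honoured
rng = random.Random(100000)
n = 100_000
cols = [[], [], []]
for _ in range(n):
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    for i in range(3):
        cols[i].append(sum(lower[i][k] * z[k] for k in range(i + 1)))

r12 = pearson(cols[0], cols[1])
r13 = pearson(cols[0], cols[2])
r23 = pearson(cols[1], cols[2])
print(f"requested 0.800 -> {r12:.3f}")
print(f"requested 0.600 -> {r13:.3f}")
print(f"requested 0.500 -> {r23:.3f}")
```

For normal data the standard error of a sample correlation is roughly (1 - r^2) / sqrt(n), which is about 0.001 for r = 0.8 and n = 100,000, so a gap such as 0.800 versus 0.726 is far larger than sampling variation would explain.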
Here is one example of what the correlations should look like with a sample size of 100,000:
Using the reporting capabilities of “gen sim” (which I think are quite valuable), you can see I can produce
reasonable correlations when we have a sample size of 100,000.
It is also straightforward to demonstrate, and to accomplish, the fitting of categorical variables to
known correlations. This ought to be a part of the deliverable.
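One possible way to do this, sketched here for a single two-level factor, is to threshold a latent normal variable that is correlated with the continuous variable. This is an assumed approach, not necessarily the one used for the results in this report; the target correlation 0.4, the level proportion 0.3, and the function names are all hypothetical. It relies on the fact that, for standard normal X and Z with correlation rho, the correlation between X and the indicator of Z > t is rho * pdf(t) / sqrt(p * (1 - p)), where p = P(Z > t).

```python
import math
import random
from statistics import NormalDist

def latent_settings(target_r, p):
    """Latent correlation and threshold giving a 0/1 level with P(level = 1) = p
    and correlation target_r with the continuous variable."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - p)
    rho = target_r * math.sqrt(p * (1.0 - p)) / std_normal.pdf(threshold)
    if abs(rho) >= 1.0:
        raise ValueError("target correlation is not attainable for this p")
    return rho, threshold

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

target_r, p = 0.4, 0.3  # hypothetical request
rho, threshold = latent_settings(target_r, p)

rng = random.Random(59)
n = 100_000
x, level = [], []
for _ in range(n):
    xi = rng.gauss(0.0, 1.0)
    # Latent variable with correlation rho to x, then cut at the threshold.
    zi = rho * xi + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    x.append(xi)
    level.append(1 if zi > threshold else 0)

observed_r = pearson(x, level)
observed_p = sum(level) / n
print(f"requested r = {target_r}, observed r = {observed_r:.3f}")
print(f"requested p = {p}, observed p = {observed_p:.3f}")
```

Factors with more than two levels could be handled the same way with several thresholds, at the cost of solving for the latent correlation numerically.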