Issues related to “Sim test” node ([email protected])

Modeler 16 is brilliant, wonderful and amazing, but there are issues with the “Sim Gen” node. This document has two parts:

1. The issues with the Sim Gen node, and
2. The evidence that the results IBM Modeler produces may not be optimal, and a demonstration that it is possible to produce results which are statistically within reasonable bounds.

The concept is valuable and the implementation probably needs some “tweaks”.

Some of the issues I have found with the “Sim Gen” node:

1. You can generate numeric values which are cast as Strings from within the Sim Gen node and which are, of course, uninterpretable. It should not be possible (or even “legal”) to do this.
2. When a request is made for Integers that range from 1 to 9, it produces Reals that range from 1.0 to 8.0.
3. When a request is made for the uniform distribution (the equivalent of a Beta with shape1 and shape2 = 1.0), it produces distributions which are not just different from the correct shape but are substantially, significantly, and systematically different from what I think a reasonable random universe would produce.
4. When a request is made for a set of correlations between variables, there is apparently no test to determine whether the matrix can be made positive definite (a minimal requirement for a correlation matrix to be useful).
5. Finally, when categorical variables are created, I could not find any evidence of appropriate correlations between levels of factors and continuous variables.

What follows is the evidence for these issues and a demonstration that a correct result can be produced which differs from what the “Sim Gen” node produces.

Distribution Errors

Assume that we begin with a “Sim Gen” node and I request a distribution that looks like this:

What it produces is this (and similar results in other trials):

I can test whether this deviates significantly from a uniform distribution (at, say, the 0.00000001 level).
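One way such a test could be set up is sketched below. This is only an illustration, not the procedure used for the figures in this document: it assumes the generated field has been rescaled to the interval [0, 1], it estimates the Beta shape1 and shape2 values by the method of moments, and it computes a Pearson chi-square statistic over 10 equal-width bins. The function names, the seed, and the use of Python's own generator as a reference sample are all assumptions made for the example.

```python
import random
import statistics


def beta_shapes_mom(xs):
    """Method-of-moments estimates of Beta shape1 and shape2 for data in [0, 1]."""
    m = statistics.fmean(xs)
    v = statistics.pvariance(xs, mu=m)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def chi_square_uniform(xs, bins=10):
    """Pearson chi-square statistic against equal expected counts in each bin."""
    counts = [0] * bins
    for x in xs:
        # Clamp so that a value of exactly 1.0 falls in the last bin.
        counts[min(int(x * bins), bins - 1)] += 1
    expected = len(xs) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Reference sample: 100,000 draws from Python's uniform generator (assumed seed).
rng = random.Random(12345)
sample = [rng.random() for _ in range(100_000)]

shape1, shape2 = beta_shapes_mom(sample)
chi2 = chi_square_uniform(sample)

print(f"shape1 = {shape1:.3f}, shape2 = {shape2:.3f}")
print(f"chi-square over 10 bins (9 degrees of freedom) = {chi2:.2f}")
```

For a truly uniform field, both shape estimates should sit very close to 1.0 and the chi-square statistic should be unremarkable for 9 degrees of freedom. The same two functions could be applied to a column exported from the node to see how far it departs from that.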
The deviation is evident from the histogram, which looks at the differential slope across the distribution. I can also do a direct test which shows the distribution is not close to what one would anticipate in a random uniform universe where we sample 100,000 cases.

Following are the statistical estimates demonstrating that the shape1 and shape2 values are outliers relative to what we would anticipate when sampling 100,000 cases.

REAL1: shape1 = 1.0 - 0.046 = 0.954, shape2 = 1.0 - 0.046 = 0.954
REAL2: shape1 = 1.0 - 0.035 = 0.965, shape2 = 1.0 - 0.029 = 0.971
REAL3: shape1 = 1.0 - 0.053 = 0.947, shape2 = 1.0 - 0.055 = 0.945

And when we look at the distribution of scores broken up into 10 equal-width intervals along the x-axis, we get the following, showing a distinct non-uniformity.

What I believe it SHOULD look like:

On the other hand, I (and I’m sure many others) can produce a uniform distribution which does not deviate substantially from what we’d anticipate in a random universe which produces uniform distributions. We can see here that the shape values are all very close to 1.0. We can also see this with the histogram below…

This is what I produced: the distribution is nearly flat and not consistently skewed one way or another. This is what a randomly generated uniform distribution ought to look like.

Issues with the correlations:

There are similar issues with the correlations. When a request is made for the following correlations (table below), it produces correlations which are substantially different. First, it is apparent that it does not attempt to create data which fit these constraints and yield a positive definite matrix at the same time. You can see below that the requested 0.800 becomes 0.726 and the requested 0.60 becomes 0.500. With a sample size of 100,000 (which is what is given in the IBM/SPSS demo), the correlations should not be that far off.
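To put "not that far off" on a numeric footing, the gap between a requested and an observed correlation can be expressed in standard errors. The sketch below uses the Fisher z-transformation, whose standard error is approximately 1/sqrt(n - 3). That approximation assumes roughly bivariate normal data, which is an assumption of this example and may not hold for the fields the node generates, so the result is a rough gauge and not an exact test. The function name is hypothetical.

```python
import math


def fisher_z_score(requested, observed, n):
    """Gap between requested and observed r, in standard errors on the Fisher z scale."""
    standard_error = 1.0 / math.sqrt(n - 3)
    return (math.atanh(requested) - math.atanh(observed)) / standard_error


n = 100_000
# Requested versus produced correlations quoted in the text above.
pairs = [(0.800, 0.726), (0.60, 0.500)]
z_scores = [fisher_z_score(req, obs, n) for req, obs in pairs]

for (req, obs), z in zip(pairs, z_scores):
    print(f"requested {req:.3f}, observed {obs:.3f}: {z:.1f} standard errors apart")
```

Under that approximation, both gaps come out at more than forty standard errors, which is far outside what sampling variation alone would explain at 100,000 cases.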
Here is one example of what the correlations should look like with a sample size of 100,000:

Using the reporting capabilities of “Sim Gen” (which I think are quite valuable), you can see that I can produce reasonable correlations when we have a sample size of 100,000. It is also straightforward to fit categorical variables to known correlations and to demonstrate that fit. This ought to be a part of the deliverable.
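The point about correlations at 100,000 cases can be illustrated with a standard technique: a Cholesky factorisation of the target matrix, which also serves as the positive definiteness check missing from the node. This is a sketch under stated assumptions and not the method used to produce the results reported here. It generates normally distributed fields instead of uniform ones, and the third correlation (0.4) in the target matrix is a made-up value, since only 0.8 and 0.6 are quoted in the text. All function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a; raises ValueError if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def correlated_normals(target, n, seed):
    """Return one list per variable, with correlations close to target."""
    low = cholesky(target)
    rng = random.Random(seed)
    dim = len(target)
    cols = [[] for _ in range(dim)]
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for i in range(dim):
            cols[i].append(sum(low[i][k] * z[k] for k in range(i + 1)))
    return cols


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Target matrix: 0.8 and 0.6 come from the text, 0.4 is an assumed filler value.
TARGET = [[1.0, 0.8, 0.6],
          [0.8, 1.0, 0.4],
          [0.6, 0.4, 1.0]]

cols = correlated_normals(TARGET, 100_000, seed=2024)
observed = {(i, j): pearson(cols[i], cols[j]) for i in range(3) for j in range(i)}

for (i, j), r in sorted(observed.items()):
    print(f"requested {TARGET[i][j]:.3f}, observed {r:.3f}")
```

A request that cannot form a valid correlation matrix makes `cholesky` raise an error before any data are generated, which is the kind of up-front test the node appears to lack.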