
DEXTER: a light USER MANUAL
Support for DEXTER V.1.0.0.1
This document is a swift survey of DEXTER features (free version), derived from an internal NEHOOV
tool used for data wrangling and data analytics. We designed features that are useful to us,
for instance upstream of a Machine Learning layer such as PREDICT or NEURALSIGHT.
We hope you will enjoy the design of DEXTER & this way of doing analytics! It is a summary of 15
years of operational experience. Without programming, DEXTER allows you to obtain relevant
findings in just a few clicks.
We look forward to receiving your feedback – ideally improvement proposals and suggestions for
adding other features.
Last remark: this user manual was written by a non-native English speaker, so please be indulgent.
Purpose of DEXTER
DEXTER is a tool devoted to processing data, with the purpose of (1) streamlining data and (2) offering
features for data analytics, including some stochastic methods. Following our ECO DATA ANALYTICS
approach, we aim at running DEXTER on just a laptop, even on databases with tens or hundreds
of millions of records. Minimal technical requirements are an Intel Core i5 CPU and at least 6 GB of RAM. But 12 or
16 GB is better, of course.
We decided to design DEXTER with some free packages, such as for the graph or the table
components – these are not top-of-the-line features, but they have the advantage of being free. We didn't
want to fall into the trap of non-free graphical components, because of… you know, our ECO
DATA ANALYTICS approach ;)
Opening DEXTER
After launching DEXTER, you reach the general panel, mostly blank:
Fig. 1
Throughout this document, we will write VZ for the left vertical zone, GZ for the top horizontal zone ('Graph
Zone') and TZ for the bottom horizontal zone ('Table Zone'). There is also an info bar IB at the
bottom, in blue, but it is mostly unused in this free version.
As advised on IB when you open DEXTER, you first need to create a universe. It is a DEXTER object that
will allow you to import databases, represented as nodes, as you will see later. In a universe, you
also specify some default technical import parameters.
As you can see in Fig. 1, DEXTER features are split into 4 feature menus: Universe, Field, Wrangling and
Tool. By clicking on the About menu, you will discover that this is version 1.0.1, without any
limitation of use!
Importing data
After your first universe creation, as advised on IB, you now need to import database(s) just by
opening the file menu and clicking on "Import". In this free version, importing directly from
databases (such as SQL or ORACLE) is not available. You can only import csv or text file(s).
Fig. 2
The file browser import zone allows you to select multiple .csv or text files, which need to have the same
"shape" in terms of data localization. There is a "smart" import process that can handle files
with void lines or comment cells. For instance, the following csv file can be imported:
Fig. 3
In DEXTER you obtain:
Fig. 4
The following example, containing introduction lines, with line 10 blank and column H void
except for H2,
Fig. 5
is imported into DEXTER like this:
Fig. 6
Note that column H of the imported file, which contains only one non-void value, is discarded by
DEXTER, because we won't do much with it. So DEXTER has a rudimentary AI for import.
Before continuing, let's introduce some DEXTER vocabulary: in a data table, lines are called records,
and columns, fields. In fact, a field is more than that: it is an attribute that has a name, a type, and
a list of values. A record is just a realization of a collection of fields.
Note also (Fig. 6) that after the import, void values are coloured in orange, and the "date" value of
field y is coloured in red: DEXTER identified field y as being of type integer, hence a date value is not
considered acceptable. We will talk about cleaning later.
Coming back to the import pop-up (Fig. 2), there are 4 options:
• Default block size allows you to reserve part of the RAM for DEXTER's use.
• Max degree of parallelism is the maximum number of simultaneous threads allowed to DEXTER. These 2 parameter values come from the universe preferences (see later).
• Comprehensive field identification, when ticked, analyses all the fields during the import (e.g. different date formats in a field).
• Compute stat when import runs (or not) a rudimentary statistic report on all the fields.
These last 2 options can also be triggered later, after the import, and even field by field, to save time (on
a huge data set, you probably don't need stats on all the fields).
After validation, the import process starts in background mode. The background area at the top of
VZ allows you to follow the progress of the import.

Fig. 7
When the import has ended, clicking on the open button creates 2 nodes in the graph zone GZ:
Fig. 8
In Figure 8 the file "evap" was imported, creating the data node "evap" and an associated node that
stores the stat results. In VZ, a description zone allows you to change the name of the node and to
store comments (by default it is the path of the imported file(s)):
Fig. 9
Finally, if you click directly on the node in GZ, it turns "green", and a node zone appears in
VZ, providing some basic info about the node:
Fig. 10
As you will see later, DEXTER handles many kinds of objects (data sets, visualization reports, stat
reports…), all represented by nodes, with possible links. Here is an example of a more substantial
universe:
Fig. 11
Figure 11 shows a universe of 12 data nodes, with many son nodes. It is clear that this user didn't opt
for clarity in the choice of his node names! Of course, with such a substantial universe, it is relevant
to focus on specific zones: for this, in VZ, there is the view zone:
Fig. 12
Clicking on the name of a data node refreshes GZ and TZ, showing only this node and its sons.
Clicking on the Overview button brings you back to the full view of the universe. You can also play
with the crux button of GZ (bottom right), which allows you to design your view yourself (we
describe the zoom feature in the next section).
A remark: when a node is added to a universe, an algorithm recalculates the positions of all the other
nodes in order to keep a suitable general view. This might be quite surprising at the beginning.
I forgot an obvious thing: when you click on a data node, an EXTRACT of the data is presented in the
table zone TZ. When importing your first node, you should see something like this:
Fig. 13
In VZ, at the bottom, there is a Log zone that stores all the actions you performed on the selected
node. For instance:
Fig. 14
One could say a lot more about all of this, and it will partly be done in the next sections. Just remember
that this document is a light user manual. You will see when using DEXTER that a lot of information is
also given throughout the GUI ("graphical user interface") to help you understand the features!
The graph view philosophy
Why did we at NEHOOV opt for a graph view of data sets and son reports? After all, a universe is just
some kind of folder where data sets and reports are stored! The answer is simple: in DEXTER, most
of the actions you perform on a node can be performed by creating a subnode!
It was then obvious to us to link all these nodes in graphs.
Lots of people do data manipulation directly on an Excel file or a database, and sometimes they
corrupt the data set. In fact, not sometimes, most of the time! Remember that DEXTER is not devoted to
data miners, but to operational people! Trying to help them organize their work sessions was
one of our goals. Indeed, by creating subnodes, we keep the sanity of the father node. For lots of actions
you are in no way obliged to create a subnode; you can stay in your data node. But we strongly
advise you to do it!
This is also a nice way to work: you can "fork" your father node into 2 subnodes, for instance, representing
two distinct pieces of data (different time windows, different types of clients…). Working
with a graph view forces you, and helps you, to streamline your actions.
After settling on this graph view paradigm, it was obvious to us to do the same for all the reports:
we designed them as specific nodes. In this free version of DEXTER, they are not numerous. Below is
the list of all our node icons:
[Fig. 15 – node icons: Data, Stat report, Cross stat report, F-inverse clustering report, Coloured XY-chart visualization, Topological N-binning visualization, 2D t-SNE projection visualization]
Hence one icon for data, 3 for reports, and 3 for visualization features. In fact, the Topological N-binning report is half a report, half a visualization… we will see this later, as well as the fact that reports
& visualisations are performed in TZ, temporarily replacing the data table.
Let's go back to the graph visualization feature. When the graph is huge, as already touched upon, there
is a crux at the bottom right that provides some zoom features. Figure 16 gives the example of
a 4-binning graph, in the TZ zone, where the zoom crux is also present:
Fig. 16
After clicking on the crux, the zoom window opens:
Fig. 17
The left & right arrows allow you to go back & forth over the last selected zoom options. The 'house' icon
zooms on a small area at the top left of the graph. The first of the 3 remaining icons dezooms
up to the global overview of the graph, the second one zooms on the 'center' of the graph, and the
third one keeps the current zoom but centers the view on the middle of the graph. The visible area in
the zoom window is framed by a red square, and the borders of this square can be moved in order to resize
it. One can obtain, for instance, the following view:
Fig. 18
If you click on the red square and keep the button pressed, you can move the square to another area,
allowing you to wander in the graph! You can also click on any node to get information in
VZ. Figure 19 provides such info for a node of a 4-binning, where min & max for some fields are
given:
Fig. 19
Again, the zoom features can be used both in GZ and TZ. Hence, for GZ, the data node and report
node area, you have 2 possibilities if you want to zoom on a data node and its sons: the zoom crux,
or the data node list in VZ.
The table zone TZ
First, an important point: DEXTER is not EXCEL. You cannot wander through the data table, changing
values or sorting in various ways. We decided not to allow free manual updating of data. The TZ view
is also more rudimentary than Excel's (remember, we use free components). In fact, in practice, TZ
should be reduced to its minimum: the data table is just here for illustration. And we decided to
display by default just the first few records – this is also meaningful when one handles huge data sets.
In fact, DEXTER is devoted to data reshaping BY USING ALGORITHMS (not manually), data analytics
reports and data visualization. That is it! If you want to manipulate the data you created
in DEXTER yourself, just export it and use your favourite tool!
Anyway, let’s have a survey of TZ:
Fig. 20
We added a frozen counter as the first column. Above the table, you can modify the first record you
want to see, and the number of following records to view ("Nb records"). Same for the fields, with First
visible field and Nb fields. You can also sort data according to one selected field in a list box, and select
the level of precision you want in the display of numbers. That's all! These very basic options
allow you to view pieces of data in a very big data set. Again, DEXTER is not a tool devoted to data
table viewing.
A useful remark: if you reduce the number of visible fields, then leave DEXTER and forget about it,
you might be quite surprised when you come back: don't be afraid that a part of your data has
disappeared, just check the value of Nb fields and compare it with the total number of fields given in
the node zone!
Another important remark: in DEXTER the last field is flagged "Output" (see the purple-circled last
field in Fig. 20). This is the default and you can change it in the Field menu (see later). This is an
implicit choice, and the reason lies in the forecasting features: Neural Network models generated by
PREDICT can be used directly in DEXTER, and by default in PREDICT the forecasting target is the last
field.
The Universe menu
We already used this menu for import, but there are some other features:
- Create subnode makes a copy of the currently selected node, as a son.
- Duplicate data node makes a copy of the currently selected node, as an independent node (no link with the selected node).
- Delete node allows you to delete a node.
Fig. 21
- Create and duplicate are useful if you don't want to overload the graph of the current node.
- Duplicate & Delete can also be triggered just via a right click on a node. Caution: the erased node is the node clicked by the user, NOT the currently selected node. Thus you can select a data node, and then delete a son node just by right clicking!
- Manage nodes is a feature disabled in the current version.
- Export creates a csv copy of all the data in the selected node.
- Smart Export creates several csv files from 2 sets of fields. This is an example of a feature requested by one of our clients! Selecting this feature makes the following GUI appear:
Fig. 22
The user selects one or several fields F1…Fp in the left table, and other fields G1…Gq in the right
table. After clicking 'ok', it is mandatory to give a name to this smart export, for instance
'sm1'. Then q files are created, with names 'sm1 N(G1)', …, 'sm1 N(Gq)':
Fig. 23
Each file 'sm1 N(Gi)' contains all the data related to fields F1,…,Fp, and Gi.
Here is an obvious example of use: if a data node contains 12 fields of monthly cash flow
called 'January',…,'December', and for some reason you need to build 12 files each containing
just one of these monthly fields plus all the remaining data, you just have to
- select all the fields except the 12 monthly fields in the left table,
- select the 12 monthly fields in the right table.
Then the 12 files will be generated.
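For readers who like to see the logic spelled out, here is a minimal Python sketch of what a Smart Export produces, assuming the data node was first exported to a hypothetical file 'sm_source.csv' and using made-up field names; DEXTER does this internally without any programming, this is only an illustration.

```python
# Minimal sketch of the Smart Export logic (hypothetical file and field names).
import pandas as pd

df = pd.read_csv("sm_source.csv")            # full content of the data node
common = ["ClientID", "Country"]             # the F1…Fp fields selected in the left table
split = ["January", "February", "March"]     # the G1…Gq fields selected in the right table

for g in split:
    # each output file holds the common fields plus exactly one of the G fields
    df[common + [g]].to_csv(f"sm1 {g}.csv", index=False)
```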
- Export descriptor generates a csv file that contains information about each field:
Fig. 24
It summarizes, for each field, the info given in the node info zone and in the statistic report
subnode of the selected node.
Finally, the last feature, Universe preferences, allows the user to set the following parameters:
Fig. 25
The 3 deactivation options are linked with the Field menu features (see below):
when creating a new classified/truncated/shortened field F' from a field F, if the option is ticked, the field F is
deactivated after the creation of F'. Layout algorithm proposes 8 algorithms for the graph
representation of the universe. Just choose the one you like!
Default block size / max degree of parallelism and the next 2 tick boxes are the default values of
these 4 parameters when importing files. Default block size should be around your RAM size / 20, so that you can
keep using your laptop. For a night computation, you can increase the value, but one rule: stay
below 70-80% of your RAM size, to avoid multiple Windows swaps!
Binning size and binning overlap are the default values used by the 1-binning algorithm of the
Wrangling menu.
The Field menu
This menu is for field management.
Fig. 26
- Activate/Deactivate field makes a field visible or not in the data table. In fact, in DEXTER one
cannot delete a field, one can only deactivate it. However, if you move some fields to the deactivation
zone (right table of Figure 27), select some of them and click on the 'purge' button, then
these fields will be erased forever (no possible recovery).
Fig. 27
- Move field is just a feature to change the place of one or several fields in the data table, by
selecting the field(s) to move and a target field, and then choosing whether the fields will be moved
before or after the target field (move before / move after).
- The 3 next features are just a swift way to deactivate fields when missing values, missing values or errors, or bad
correlations occur. Bad correlation uses the correlation table given in the stat report (see later), and
deactivates all fields whose correlation coefficient with the output field is below a value specified by
the user.
- The next 6 features allow you to create fields. Truncated & shortened fields are obvious, and explained in
the GUI.
- Classify field allows you to create a field F' from a field F, where F' has only a finite number of values N. The
range [min, max] of F is sliced into N equal classes, and a value of F' is computed from the value of F
according to its place in the classes: the value of F' can be either the middle or the average of the class.
Figure 28 provides an example:
Fig. 28
Here the output field Lid120m was used to create a classification field with 4 classes, giving only 4
possible middle values: 45, 135, 225 and 315. The creation of a classification field is very useful: you can
create one such field and then trigger a clustering statistic report (see later)!
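As an illustration only, here is a minimal Python sketch of the "middle of the class" variant described above; the field name and values are made up, and DEXTER of course computes this without any programming:

```python
# Sketch of Classify field, "middle of the class" variant (hypothetical data).
import numpy as np
import pandas as pd

def classify_field(values: pd.Series, n_classes: int) -> pd.Series:
    """Slice [min, max] into n_classes equal classes and map each value to the middle of its class."""
    lo, hi = values.min(), values.max()
    width = (hi - lo) / n_classes
    idx = np.clip(((values - lo) // width).astype(int), 0, n_classes - 1)   # class index 0..n_classes-1
    return lo + (idx + 0.5) * width

power = pd.Series([12.0, 47.5, 88.0, 301.0, 359.9])
print(classify_field(power, 4))   # 4 possible middle values
```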
- Create counter field just provides a field that contains the values 1, 2, …, Number of records. Sometimes
useful when sorting.
- Create target value field allows you to create one or several fields from one or several values
given by the user. When one selects this option, the pop-up of Figure 29 appears:
Fig. 29
A selection of fields has to be made in the left table, and one or more values have to be given in the
right Edit zone (the 'target values'). Once this is done, a (Boolean) field is created for each target value,
having value 1 for a record if at least one selected field has the target value, 0 otherwise.
Fig. 30
Figure 30 shows the 2 target fields created, 'Target Soo, Nox=52' and 'Target Soo, Nox=211'. Records
2 and 3 have 'Target Soo, Nox=52' equal to 1, since Soo=52, and record 3 has 'Target Soo, Nox=211'
equal to 1, since Nox=211. CAUTION: there is a bug inside the Microsoft component that drives the
data table, entailing the replacement of '.' by 'DOT' in the column titles…
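A minimal Python sketch of the same rule, with made-up values, just to make the "1 if at least one selected field hits the target value" logic explicit:

```python
# Sketch of Create target value field (hypothetical data).
import pandas as pd

df = pd.DataFrame({"Soo": [7, 52, 52, 9], "Nox": [3, 8, 211, 4]})
selected = ["Soo", "Nox"]
targets = [52, 211]

for v in targets:
    # 1 if at least one selected field equals the target value, 0 otherwise
    df[f"Target Soo, Nox={v}"] = (df[selected] == v).any(axis=1).astype(int)
print(df)
```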
- Create Markov extension from a field F allows you to "translate" the values of a field F according to a second
field D of type date. It creates multiple new fields. Looking at D, there is a minimal time step t for
passing from one record to the next (in chronological order). We build new fields from F by
"time shifting":
Fig. 31
You select the date field D in the left table, and the numeric field F in the right table. Depth is the
size of the shift (you can increase or decrease it); in Fig. 31 it is 3. Then three Markov extensions F',
F'' and F''' are created from F simply by using F'(d)=F(d-t), F''(d)=F(d-2t), F'''(d)=F(d-3t), where "d-x*t"
means we take the value of F on the record whose D-value is d-x*t. In practice, this is simply a
shift, as you can see in Fig. 32:
Fig. 32
Of course, the first 3 records have voids in their Markov values. Markov extensions are interesting,
for instance, when building forecasting models or when you want to study time dependencies (just
create Markov extensions, and recompute the stat report).
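When the time step is regular, the Markov extensions reduce to a shift of k records; here is a minimal pandas sketch of that case (dates and values are made up):

```python
# Sketch of Create Markov extension, assuming a regular time step t (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "D": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"]),
    "F": [10.0, 12.0, 9.0, 15.0, 11.0],
})

depth = 3
df = df.sort_values("D").reset_index(drop=True)    # chronological order
for k in range(1, depth + 1):
    # F'(d) = F(d - k*t): with a regular step this is a shift of k records
    df[f"F_markov_{k}"] = df["F"].shift(k)
print(df)    # the first k records of each extension are void, as in Fig. 32
```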
- Create neighbor field allows you to create 2 new fields. The first field provides, for each record, the record
number of the closest record according to a (quadratic) distance. The second field simply gives the value of
this distance. As usual, in order to define this distance, the user can select the desired fields in a list,
and also fix the values of the weights (1 by default). A cut-off value, epsilon, can also be specified: as
soon as a record has its distance below epsilon, it is designated as the neighbor.
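To make the weighted quadratic distance and the epsilon cut-off concrete, here is a small brute-force Python sketch (field names, weights and epsilon are placeholders, and a real implementation would of course be far more optimized):

```python
# Sketch of Create neighbor field: closest record per weighted quadratic distance, with epsilon cut-off.
import numpy as np
import pandas as pd

def neighbor_fields(df, fields, weights=None, eps=0.0):
    X = df[fields].to_numpy(dtype=float)
    w = np.ones(len(fields)) if weights is None else np.asarray(weights, dtype=float)
    idx, dist = [], []
    for i in range(len(X)):
        best_j, best_d = -1, np.inf
        for j in range(len(X)):
            if j == i:
                continue
            d = np.sqrt(np.sum(w * (X[i] - X[j]) ** 2))
            if d < best_d:
                best_j, best_d = j, d
            if best_d < eps:            # cut-off: accept this record as the neighbor and stop searching
                break
        idx.append(best_j)
        dist.append(best_d)
    return pd.Series(idx, index=df.index), pd.Series(dist, index=df.index)
```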
- Create formula field is a feature that is unavailable in this free version of DEXTER. It allows you to create
a field from others, by defining a python formula on the selected fields.
Finally, the last feature of the Field menu is just the classical matrix transpose function. A simple pop-up just asks whether a subnode has to be created or not (otherwise the node itself is transposed).
The Wrangling menu
In this menu you will find features for "wrangling", that is, for modifying your data.
Fig. 33
There are 4 families of features:
- Extract a subset of records (this is called a 'sampling')
- Clean ambiguous, error or missing values
- Define what an outlier value is, and clean it
- Delete some records, replace some values by others
The last family is obvious and well explained inside DEXTER, and as for the 6 sampling features, I won't develop
them here; sampling allows you to consider a subset of records by processing one of the 5 algorithms:
from simply taking 1 record every N, to estimating the probability distribution of the data with a
first huge sampling and then building the final sampling according to it. Just remember that when
doing a sampling, it is better to create a subnode (the box is ticked by default). If not, you will
definitively lose the records discarded during the sampling process!
Cleaning values is also easy in DEXTER: after specifying what you want to clean (errors, missing
values, or both), you choose the method you want to apply, as illustrated in Fig. 34:
Fig. 34
You can simply delete the records containing errors or missing values, or replace these values by 2*max of
the concerned field(s) in order to isolate them. You can also replace by the average, or by the neighbor,
that is, the record closest (for the quadratic distance) to the one being processed. You can also replace by
the closest neighbor but normalized – that is, all records are normalized in ]-1,1[.
When you select neighbor or normalized neighbor, a pop-up asks you to specify the fields you want
to use for the definition of the distance, as well as a parameter, epsilon. Epsilon is here to stop the
process of searching for a neighbor: when the computed distance is below epsilon, the algorithm ends
and that last record is used to replace the values of the field. By defining your distance and
epsilon for the definition of the neighbor, you have various ways to replace the erroneous or missing data
in the fields of your data node!
Note that by default the create subnode box is ticked, so you will get a subnode after this cleaning
process.
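Here is a small Python sketch of the two simplest replacement methods (by 2*max and by the average); it is only an illustration of the rules described above, on made-up field names:

```python
# Sketch of two cleaning methods for missing values: replace by 2*max, or by the field average.
import pandas as pd

def clean_missing(df, fields, method="2max"):
    out = df.copy()                     # keep the "father" data intact, like a DEXTER subnode
    for f in fields:
        fill = 2 * out[f].max() if method == "2max" else out[f].mean()
        out[f] = out[f].fillna(fill)
    return out
```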
Defining outliers is also easy in DEXTER: you select the fields you want to work with and then, by
clicking "show curves", you get a look at the duration curves, that is, the field values sorted from min to max:
Fig. 35
You can normalize the curves in ]-1, 1[, and you can also merge the curves, that is, create a vector
containing all the values of the selected fields and then sort it. Figure 36 gives an example
with 4 selected fields, the left chart showing the 4 normalized duration curves, and the right chart the merged
normalized duration curve.
Fig. 36
Merging is relevant when the selected fields are "of the same kind", for instance the generation of 4 wind
turbines of the same power in a wind farm.
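A duration curve is nothing more than the sorted values of a field; the short sketch below reproduces the normalize and merge options described above (it maps to [-1, 1] for simplicity, and the field names are placeholders):

```python
# Sketch of duration curves with optional normalization and merging.
import numpy as np

def duration_curves(df, fields, normalize=True, merge=False):
    series = [df[f].dropna().to_numpy(dtype=float) for f in fields]
    if merge:
        series = [np.concatenate(series)]          # one vector with all the selected values
    curves = []
    for v in series:
        v = np.sort(v)                             # values sorted from min to max
        if normalize:
            v = 2 * (v - v.min()) / (v.max() - v.min()) - 1
        curves.append(v)
    return curves
```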
After viewing the curve(s), you can then decide on min and/or max values that will define outliers,
for the fields you select in the bottom left zone: all values of these fields that are not in [min, max] will
be considered outliers. Figure 37 provides such an example, for a merge:
Fig.37
The min & max values selected by the user now appear on the outlier chart, and if you come back to the
data table, outliers are shown in orange (note that normalized min & max are shifted back to true values).
After defining outliers, you can then clean them just by using the "clean outliers" option, leading you
to the GUI of Figure 38:
Fig. 38
The table provides, for each field, the min & max values defined by the user. The methods are almost the
same as for error or missing value cleaning. Here you can simply replace by the selected min & max,
or you can specify min & max values for all the fields at once. For replacement with a neighbor, you
specify a stopping criterion epsilon: when the distance between the processed record and a tested
record is below epsilon, the algorithm stops and takes it as the neighbor. In any case, even after the
cleaning, values that were outliers stay coloured in orange in the data table.
Hence, with the Wrangling menu, you can modify your data by using algorithms!
The Tools menu
- Feature #records between values is obvious, giving just the number of records whose field value lies between two given values.
- Run Predict Model allows the user to launch Predict Neural Networks, when installed. Hence you have the capability to generate forecasts directly in DEXTER.
- The Self-Organizing Map feature is disabled in this free version.
- Compute stat allows you to generate or regenerate the statistic report, as during the import process. I used the word regenerate because, after several data manipulations (e.g. cleaning), it can be relevant to recompute the stat report.
Fig. 39
- Create XY-graph allows you to build an (X, Y) chart according to 2 fields X and Y: this is the set of points (X, Y) for all the records of the data node. You can also use a third field Z to colour the points.
Fig. 40
When you select this feature, TZ is replaced by the XY-editor, as in Fig. 40, where you select
your (X, Y) fields and, optionally, the third field for the colour. If you give a label name and
then save, a node is created. So you can view an XY-chart with or without saving it! When you
save it, an XY-node is created and your chart is stored (you can delete it via a delete button).
On a huge data set of 100 000 000 records, for instance, the XY-chart feature runs an
algorithm that selects a sample of relevant points (roughly speaking, 20 000 points on the
convex hulls of the connected components, and 30 000 points inside).
Another hidden option is to generate a Y-graph, that is, simply the curve of a series (the
sequential order of the values of the field), just by selecting a field in the Y-zone and nothing
in the X-zone:
Fig. 41
One can also select a field F in the Y-zone and a field C for the Colour: in that case, the series F
is split into n sub-series such that C is constant on each sub-series:
Fig. 42
Figure 42 shows 6 series, because the colorization was performed using a 'classification'
field with 6 classes. The length of each of these 6 curves varies, depending on the size of the classes
of the classification field.
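The split rule is simply a group-by on the colour field; a tiny sketch, with made-up values:

```python
# Sketch of the Y-graph colour split: one sub-series per value of the colour field C.
import pandas as pd

df = pd.DataFrame({"F": [3.0, 4.1, 2.2, 5.0, 4.8, 1.9],
                   "C": ["a", "a", "b", "b", "b", "a"]})
sub_series = {c: g["F"].reset_index(drop=True) for c, g in df.groupby("C")}
for c, s in sub_series.items():
    print(c, list(s))          # curve lengths follow the class sizes
```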
- F-inverse stat report is a clustering report based on the values of one selected field F: the algorithm creates clusters of records on which F has a constant value, and we provide the diameters and the distance matrix between clusters (the diameter of a set of records is the maximal distance between its pairs of records; the distance between 2 clusters is the minimal value of all distances between a point of the first cluster and a point of the second). The right table allows you to select the fields for the definition of the (quadratic) distance, so you can get various reports just by playing with the distance!
Fig. 43
You need to give a label name to your report, because a node is created that stores the report
(the delete button makes this node disappear, if wanted).
If, for instance, you have a marketing data node that contains records of client info, by
selecting the field "postal code", the F-inverse stat report provides you with metric information
about the clusters of records that share the same postal code. You can, for instance, define the
metric distance just with the product ID field, to see whether some clusters are neighbors or
strangers!
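For the curious, here is a compact brute-force sketch of the quantities computed in an F-inverse report (cluster diameters and min inter-cluster distances); the field names and weights are placeholders, and the O(n²) loops are only for readability:

```python
# Sketch of an F-inverse report: clusters = records sharing the same value of F.
import numpy as np
import pandas as pd
from itertools import combinations

def f_inverse_report(df, f, dist_fields, weights=None):
    w = np.ones(len(dist_fields)) if weights is None else np.asarray(weights, dtype=float)
    d = lambda a, b: np.sqrt(np.sum(w * (a - b) ** 2))      # weighted quadratic distance
    clusters = {v: g[dist_fields].to_numpy(dtype=float) for v, g in df.groupby(f)}
    diameters = {v: max((d(a, b) for a, b in combinations(X, 2)), default=0.0)
                 for v, X in clusters.items()}              # max pairwise distance inside a cluster
    keys = list(clusters)
    dist_matrix = pd.DataFrame(0.0, index=keys, columns=keys)
    for u, v in combinations(keys, 2):
        m = min(d(a, b) for a in clusters[u] for b in clusters[v])   # min distance between 2 clusters
        dist_matrix.loc[u, v] = dist_matrix.loc[v, u] = m
    return diameters, dist_matrix
```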
- Field cluster stat provides statistic reports according to a value-clustering on a set of fields you select. For example, if you select F1 and F2 for clustering, you can define clusters of records just by requiring (F1, F2) to have a constant value (x_C, y_C) over all the records of a same cluster C. Then you can produce the stat reports on the remaining fields, for all these clusters!
The fields for clustering are selected in the left table of Figure 44.
Fig. 44
In the right table you select the fields for the stat report, you give a label (because a node
will be created), and you select a splitted or a crossed report. The crossed report performs a
clustering as given in the above example, whereas the splitted report performs a clustering (and
then a report) for each field separately.
Here is an example, given in Figure 45, where the field cluster stat is based on a classification
field (8 classes, just look at the names of the clusters):
Fig. 45
First we created a classification field of 10 values from a field Power; then we called an F-inverse stat report on this classification field. The first column of Fig. 45 provides the 10 values of
this field, and the next columns the statistics for all the remaining fields. Note that you can
export the table to csv, or you can export this table as a NEW NODE, in order to be able to
apply DEXTER features to these data!
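In data-frame terms, a crossed report is a group-by on all the clustering fields at once and a splitted report is one group-by per field; a minimal sketch (the aggregation functions are placeholders for the real report content):

```python
# Sketch of Field cluster stat: "crossed" vs "splitted" reports.
import pandas as pd

def crossed_cluster_stat(df, cluster_fields, report_fields):
    # one cluster per distinct combination of values of the clustering fields
    return df.groupby(cluster_fields)[report_fields].agg(["count", "min", "mean", "max"])

def splitted_cluster_stat(df, cluster_fields, report_fields):
    # one independent clustering (and report) per clustering field
    return {f: df.groupby(f)[report_fields].agg(["count", "min", "mean", "max"])
            for f in cluster_fields}
```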
Now it is time to explain 2 advanced features of this menu.
- t-SNE. t-Distributed Stochastic Neighbor Embedding (t-SNE) is an algorithm created by
Laurens van der Maaten. It performs a 2D-projection of the node records
and tries to preserve the distribution of their pairwise distances. See Wikipedia or
https://www.youtube.com/watch?v=RJVL80Gg3lA&list=UUtXKDgv1AVoG88PLl8nGXmw
for more detail. At NEHOOV we wrote our own t-SNE algorithm from the academic papers.
When you select this feature, a pop-up appears with some explanations and 3 default
parameters. After clicking the generate button, the t-SNE is computed (in this free version, not
in background mode). The result is an XY-chart, such as in Fig. 46:
Fig. 46
The (X, Y)-coordinates can be exported to a .csv file, or you can also save them as fields in the
data node. You can also save this as a graph node, to store the chart.
It is interesting to perform a t-SNE, save the coordinates as fields, create 2 associated
classification fields according to the number of "areas" in the t-SNE, and then process a 2-binning
(see the next section).
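DEXTER embeds its own t-SNE implementation, but you can reproduce the same workflow outside DEXTER with, for example, scikit-learn; a sketch with a hypothetical exported csv and default parameters:

```python
# Sketch of the t-SNE workflow outside DEXTER (hypothetical file name, scikit-learn implementation).
import pandas as pd
from sklearn.manifold import TSNE

df = pd.read_csv("node_export.csv")
numeric = df.select_dtypes("number").fillna(0)
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(numeric.to_numpy())
df["tsne_x"], df["tsne_y"] = xy[:, 0], xy[:, 1]   # same idea as saving the coordinates back as fields
```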
- N-binning. We implemented a nice version of topological N-binning, for any integer N>0. This
provides a clustering of all the records of a data node, represented as a graph where edges
show "some" connections between clusters. You can skip the following if you don't want the
"math" of this, and jump to Figure 50. Let's start with N=1, that is, let's present what a 1-binning is.
First we need a field F, a number c, an overlap % p, a cut-off value v and a method, SET or
RANGE. Then we start by processing 3 steps, as in Figure 47, to create an overlapping initial
clustering:
Fig. 47
We sort the records according to the increasing values of F (step 1); then we slice our set of records
into c equal pieces (step 2), which we call initial clusters: if the selected method is SET, the set
of increasing values of F is split into c pieces of the same size; if the selected method is RANGE,
we slice the range [min, max] of F-values into c equal pieces. In the latter case, it is obvious that our
clusters won't have the same size (in terms of number of elements), because the values of F
are not necessarily uniformly distributed. This is the difference between the SET and RANGE
methods.
Then we "dilate" the size of each piece by p % in order to get overlaps (step 3): this means
we duplicate some records during this dilatation process. For instance, in Figure 47, the
record "red star" in the second initial cluster is duplicated in order to have this record in both
initial clusters 1 and 2.
Once we have this, we process each initial cluster in order to get a sub-clustering. For this, we
need a distance d, by default the weighted quadratic distance on fields, but the user can
change it. For any initial cluster C containing records R_1…R_q, we compute the sequence
of consecutive differences X_2 = d(R_2, R_1), X_3 = d(R_3, R_2), X_4 = d(R_4, R_3), etc., the distance
between consecutive records being evaluated on the fields selected for d.
Now we sort the X_i's increasingly, and we take the v-th greatest value V of the sequence,
where v is our cut-off value. Then we can define the sub-clusters of C by simply parsing the
records R_1, R_2, etc. The first sub-cluster C_1 contains R_1. We add the records R_2, R_3,
etc. to C_1 as long as X_2, X_3, etc. stays below V. When we reach a record R_i that verifies X_i > V, we
create a new sub-cluster C_2 and we restart the process.
This gives us the nodes of our graph, and we say that 2 nodes are linked if there is a record
that belongs to both of them. In the end one gets a graph, such as in Figure 48:
Fig. 48
In DEXTER, nodes are labeled "A x B", where A is the number of the initial cluster and B the
number of the sub-cluster in the decomposition of initial cluster A.
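Before moving on to N>1, here is a minimal Python sketch of the 1-binning just described, for the SET method and with the distance between consecutive records simplified to |ΔF|; it is only meant to make the steps concrete, not to mirror DEXTER's implementation:

```python
# Sketch of 1-binning: SET method, distance between consecutive records taken as |difference of F|.
import numpy as np

def one_binning(F, c=4, overlap=0.10, cutoff=2):
    F = np.asarray(F, dtype=float)
    order = np.argsort(F)                                   # step 1: sort records by F
    n = len(order)
    bounds = [round(i * n / c) for i in range(c + 1)]       # step 2: c initial clusters of ~equal size
    nodes = []
    for k in range(c):
        extra = int(round(overlap * (bounds[k + 1] - bounds[k])))     # step 3: dilate by overlap %
        lo, hi = max(0, bounds[k] - extra), min(n, bounds[k + 1] + extra)
        cluster = order[lo:hi]                              # record indices of the dilated initial cluster
        diffs = np.abs(np.diff(F[cluster]))                 # consecutive differences X_2, X_3, …
        V = np.sort(diffs)[-cutoff] if len(diffs) >= cutoff else np.inf   # the cutoff-th greatest value
        sub, subs = [cluster[0]], []
        for r, x in zip(cluster[1:], diffs):
            if x > V:                                       # gap above V: start a new sub-cluster
                subs.append(sub)
                sub = []
            sub.append(r)
        subs.append(sub)
        nodes += [(f"{k + 1} x {b}", s) for b, s in enumerate(subs, start=1)]
    # 2 nodes are linked when they share at least one record (possible thanks to the dilation)
    edges = [(na, nb) for i, (na, sa) in enumerate(nodes)
             for nb, sb in nodes[i + 1:] if set(sa) & set(sb)]
    return nodes, edges
```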
The 1-binning description is complete. For an N-binning with N>1, we select N fields F_1…F_N. For
each i=1,…,N we split the range [min_i, max_i] of F_i into c clusters C_(i,j) according to the SET or
RANGE method, as described before. We thus obtain c^N clusters, defined by the Cartesian
products C_(1,j_1) × … × C_(N,j_N), providing N-dimensional pavings. Then the p% dilatation is obvious,
as depicted for instance in Figure 49, where N=2 (we colored 2 dilated clusters for illustration):
Fig. 49
The left diagram shows the clusters C_(i,j) before dilatation, the right diagram after it. Then the building
of the sub-clusters follows exactly the process depicted in the 1-binning case, and one obtains a
graph like the one given in Figure 48.
So far we have fully presented the building of the binning graph, but there is also another result of
the binning process: the 'distance' matrix. For this, a new set of fields needs to be chosen by
the user; it will be used to compute the distances between the sub-clusters of the binning graph.
In that way, one can say that the binning graph is built with a first set of fields, and the distance
evaluation with a second one. An example: one can build a binning graph by using some
customer attribute fields (e.g. localization, age, etc.), and then compute the distance between
clusters of customers by using the monthly amount of purchases, allowing you to identify
dissimilarities between clusters (that is, those with high distances).
The distance matrix has the following shape:
Fig. 50
The first column provides the names of the sub-clusters. For each of them, we get the diameter (= the
maximal pairwise distance value), the min & max of the fields that define the distance, the count, and the
min & max NON-ZERO values of the pairwise distances between sub-clusters (we write min* and
max* with a '*' to underline the fact that they are non-zero). Finally, each distance between the
sub-cluster and the others is given.
Playing with all the binning parameters (e.g. your own weighted quadratic distances for the
sub-clustering and for the matrix evaluation) allows you to obtain very different nodes and very
different dissimilarities between nodes!
Figure 51 shows the GUI:
Fig. 51
In the first table you can choose the N fields for the N-binning. In the second you define the distance
that will be used for the final computation of the sub-clusters using the cut-off value (in red). The initial
cluster number N is compulsory, as is a label for the creation of the node. The third table is used
to define the distance used for the matrix report that provides the diameters and
inter-distances between sub-clusters (see Figure 50). These distances are weighted quadratic; you can
modify the "1" values to fix your own weights. Finally, the sorting method radio-buttons allow you to
choose how the initial clusters are computed: by slicing the range [min, max] of the field into N equal
pieces (option range: initial clusters can have different numbers of records), or by slicing the set of
records into N equal subsets according to F increasing (option set).
The results are given in the data table zone; they are presented in 3 folders you can reach by clicking on
the 3 radio-buttons at the top left. See Figure 48 for the first folder, which presents the sub-cluster graph.
The second folder provides the graph of initial clusters:
Fig. 52
The third folder gives the metric matrix:
Fig. 53
In all 3 folders, you can add a binning field to the data node: this field provides, for each record,
the name of the cluster it belongs to. Thus, you can carry out clever downstream analysis by using F-inverse or field cluster stat! You can also export the matrix as a new data node, or simply as a csv file.
The stat node
Just by clicking on this node, some obvious indicators are given, by field:
Fig. 54
Each field has, at its top right, a button for relaunching the stat computation for that field (to be used,
for instance, after a cleaning, or when the stat computation was not requested during the import
process). At the top right of the table there is a button that relaunches the computation for
all the fields.
Field Type is a list box that allows you to change the type of the field if needed. For instance, if a field
has only 0 or 1 values, DEXTER will define its type as 'boolean', and you might want it as integer.
Cardinal gives the number of different values of
the field; singleton shows the number of values
of the field that occur only once; P10, P50 and
P90 are percentiles. The buttons on the right allow you to
compute the stats field by field, in case this was
not done during the import process.
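For reference, here is a tiny Python sketch of how those per-field indicators can be reproduced on an exported field; it is only illustrative and mirrors the definitions above:

```python
# Sketch of the per-field indicators of the stat node.
import pandas as pd

def field_stats(s: pd.Series) -> dict:
    counts = s.value_counts(dropna=True)
    return {
        "cardinal": s.nunique(),                 # number of distinct values
        "singleton": int((counts == 1).sum()),   # number of values occurring only once
        "P10": s.quantile(0.10),
        "P50": s.quantile(0.50),
        "P90": s.quantile(0.90),
    }
```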
Below the first table lies the correlation table
of all fields with the output field:
Fig. 55
Time of use
In this free version of DEXTER, almost all the algorithms work in background mode (except
t-SNE and Add Field), allowing the user to continue working in DEXTER. However, in this free
version it is not possible to launch more than one background algorithm: you must wait for
the end of the background task before launching another one.
Nevertheless, we give below a few assessments of computation time on big data sets. M stands for
million, k for thousand. The laptop used had 6 GB of RAM and an Intel Core i5 (1 dual-core CPU). The universe
preferences were 600 MB for the block size and 5 threads:
- Import: 20 k records, 2 k fields
- Import: 1 M records, 100 fields
- Import: 75 M records, 13 fields
- R-reservoir: 1 M records, 30 fields
- R-reservoir: 75 M records, 13 fields
- 1-binning: 75 M records, 13 fields
Fig. 56 (computation times)
In Figure 57, a 4-binning is shown on a data node of 100 000 records and 200 fields, with N=8,
overlap = 2% and cut-off = 2.
Fig. 57
This binning graph has 1 500 nodes and was computed in 218 133 seconds, that is, a bit more than 2
days!! This on a tiny laptop, because we have plenty of time… this is ECO DATA ANALYTICS!!
Copyright NEHOOV©, 2016
This material is free to use.
http://www.nehoov.com