DEXTER: a light USER MANUAL
Support for DEXTER V.1.0.0.1

This document is a quick survey of the features of the free version of DEXTER, which comes from an internal NEHOOV tool used for data wrangling and data analytics. We designed features that are useful to us, for instance upstream of a Machine Learning layer such as PREDICT or NEURALSIGHT. We hope you will enjoy the design of DEXTER and this way of doing analytics! It is the summary of 15 years of operational experience. Without any programming, DEXTER lets you obtain relevant findings in just a few clicks. We would love to receive your feedback, ideally as improvement proposals and suggestions for additional features. Last remark: this user manual was written by a non-native English speaker, so please be indulgent.

Purpose of DEXTER
DEXTER is a tool devoted to processing data, with two purposes: (1) streamline data, and (2) offer features for data analytics, including some stochastic methods. Following our ECO DATA ANALYTICS approach, we aim at running DEXTER on a simple laptop, even on databases with tens or hundreds of millions of records. The minimal technical requirements are an Intel Core i5 CPU and at least 6 GB of RAM; 12 or 16 GB is better, of course. We decided to build DEXTER on free packages for components such as the graphs and the tables. These are not top-of-the-line components, but they have the advantage of being free: we did not want to fall into the trap of expensive proprietary graphical components, because of... you know, our ECO DATA ANALYTICS approach ;)

Opening DEXTER
After launching DEXTER, you reach the general panel, mostly blank:
http://www.nehoov.com
Fig. 1
Throughout this document, we will write VZ for the left vertical zone, GZ for the top horizontal zone ("Graph Zone") and TZ for the bottom horizontal zone ("Table Zone"). There is also an info bar IB, in blue, which is rarely used in this free version.
As advised on IB when you open DEXTER, you first need to create a universe. A universe is the DEXTER object through which you import databases, represented as nodes, as you will see later. A universe also specifies some default technical import parameters. As you can see in Fig. 1, DEXTER features are split into 4 feature menus: Universe, Field, Wrangling and Tool. By clicking on the About menu, you will discover that this is the 1.0.1 version, without any limitation of use!

Importing data
After creating your first universe, as advised on IB, you now need to import database(s) by opening the Universe menu and clicking on "Import". In this free version, importing directly from databases (such as SQL or ORACLE) is not available: you can only import csv or text file(s).
Fig. 2
The file browser of the import zone lets you select multiple .csv or text files, which must have the same "shape" in terms of data localization. A "smart" import process handles files with empty lines or comment cells. For instance, the following csv file can be imported:
Fig. 3
In DEXTER you obtain:
Fig. 4
The following example, which contains introduction lines, has line 10 blank, and has column H empty except for H2,
Fig. 5
is imported in DEXTER as this:
Fig. 6
Note that column H of the imported file, which contains only one non-void value, is discarded by DEXTER, because not much can be done with it. So DEXTER has a rudimentary AI for import. Before continuing, let's introduce some DEXTER vocabulary: in a data table, lines are called records, and columns are called fields. In fact, a field is more than a column: it is an attribute that has a name, a type and a list of values. A record is just a realization of a collection of fields. Note also (Fig.
6) that after the import, void values are coloured in orange, and the "date" value of field y is coloured in red: DEXTER identified that field y is of type integer, hence a date value is not acceptable. We will talk about cleaning later. Coming back to the import pop-up (Fig. 2), there are 4 options:
- Default block size reserves a part of the RAM for DEXTER's use.
- Max degree of parallelism is the maximum number of simultaneous threads allowed to DEXTER. These 2 parameter values come from the universe preferences (see later).
- Comprehensive field identification, when ticked, analyses all the fields during the import (e.g. different date formats within a field).
- Compute stat when import runs (or not) a rudimentary statistical report on all the fields.
These last 2 options can also be triggered later, after the import, even field by field, to save time (on a huge data set, you probably don't need stats on all the fields). After validation, the import process starts in background mode. The background area at the top of VZ lets you follow the progression of the import.
Fig. 7
When the import has ended, clicking on the open button creates 2 nodes in the graph zone GZ:
Fig. 8
In Figure 8 the file "evap" was imported, creating the data node "evap" and an associated node that stores the stat results. In VZ, a description zone lets you change the name of the node and store comments (by default, the path of the imported file(s)):
Fig. 9
Finally, if you click directly on the node in GZ, it turns green and a node zone appears in VZ, providing some basic info about the node:
Fig. 10
As you will see later, DEXTER handles many kinds of objects (data sets, visualization reports, stat reports...), all represented by nodes, with possible links. Here is an example of a more substantial universe:
Fig. 11
Figure 11 shows a universe of 12 data nodes, with many son nodes.
It is clear that this user did not opt for clarity in the choice of his node names! Of course, with such a substantial universe, it is useful to focus on specific zones: for this, VZ contains the view zone:
Fig. 12
Clicking on the name of a data node refreshes GZ and TZ, showing only this node and its sons. Clicking on the Overview button brings you back to the full view of the universe. You can also play with the crux button of GZ (bottom right), which lets you design your view yourself (we will describe the zoom feature in the next section). A remark: when a node is added to a universe, an algorithm recalculates the positions of all the other nodes in order to keep a suitable general view. This can be quite surprising at the beginning. An obvious thing we forgot: when you click on a data node, an EXTRACT of the data is presented in the table zone TZ. When importing your first node, you should see something like this:
Fig. 13
At the bottom of VZ there is a Log zone, which stores all the actions you performed on the selected node. For instance:
Fig. 14
A lot more could be said about all of this, and part of it will be covered in the next sections. Just remember that this document is a light user manual. You will see while using DEXTER that a lot of information is also given throughout the GUI ("graphical user interface") to help you understand the features!

The graph view philosophy
Why did we, at NEHOOV, opt for a graph view of data sets and son reports? After all, a universe is just some kind of folder where data sets and reports are stored! The answer is simple: in DEXTER, most of the actions you perform on a node can be performed by creating a subnode. It was then obvious to us to link all these nodes in graphs. Many people manipulate data directly in an Excel file or a database, and sometimes they corrupt the data set. In fact, not sometimes: most of the time!
Remember that DEXTER is devoted not to data miners, but to operational people! Helping them organize their work sessions was one of our goals. Indeed, by creating subnodes, we preserve the sanity of the father node. For many actions you are in no way forced to create a subnode; you can stay in your data node. But we strongly advise you to do it! This is also a nice way to work: you can "fork" your father node into 2 subnodes, for instance, representing two distinct pieces of data (different time windows, different types of clients...). Working with a graph view forces you, and helps you, to streamline your actions. After settling on this graph view paradigm, it was obvious to us to do the same for all the reports: we designed them as specific nodes. In this free version of DEXTER, they are not numerous. Below is the list of all our node icons:
- Data
- Stat report
- Cross stat report
- F-inverse clustering report
- Topological N-binning report
- Coloured XY-chart visualization
- 2D t-SNE projection visualization
Fig. 15
Hence one icon for data, 3 for reports, and 3 for visualization features. In fact, the Topological N-binning report is half report, half visualization... we will see this later, as well as the fact that reports and visualizations are performed in TZ, temporarily replacing the data table. Let's go back to the graph visualization feature. When the graph is huge, as already touched upon, there is a crux at the bottom right that provides some zoom features. Figure 16 gives the example of a 4-binning graph, in the TZ zone, where the zoom crux is also present:
Fig. 16
After clicking on the crux, the zoom window opens:
Fig. 17
Left and right arrows let you go back and forth through the last selected zoom options. The "house" icon zooms on a small area at the top left of the graph.
The first of the 3 remaining icons dezooms up to the global overview of the graph, the second one zooms on the "center" of the graph, and the third one keeps the current zoom level but centers the view on the middle of the graph. The visible area in the zoom window is framed in red, and the borders of this square can be moved in order to resize it. One can obtain, for instance, the following view:
Fig. 18
If you click on the red square and hold, you can move the square to another area, allowing you to wander around the graph! You can also click on any node to get information in VZ. Figure 19 provides such info for a node of a 4-binning, where min and max for some fields are given:
Fig. 19
Again, the zoom features can be used both in GZ and TZ. Hence, for GZ, the area of data nodes and report nodes, you have 2 possibilities if you want to zoom in on a data node and its sons: the zoom crux, or the data node list in VZ.

The table zone TZ
First, an important point: DEXTER is not EXCEL. You cannot wander through the data table, changing values or sorting in various ways. We decided not to allow free manual updates of the data. The TZ view is also more rudimentary than the Excel one (remember, we use free components). In fact, in practice, TZ should be reduced to its minimum: the data table is just here for illustration, and we decided to display by default just the first few records, which is also meaningful when handling huge data sets. DEXTER is devoted to data reshaping BY USING ALGORITHMS (not manually), data analytics reports and data visualization. That's it! If you want to manipulate the data you created in DEXTER by yourself, just export it and use your favourite tool! Anyway, let's have a quick tour of TZ:
Fig. 20
We added a frozen counter in the first column. Above the table, you can modify the first record you want to see, and the number of following records to view ("Nb records").
The same holds for the fields, with First visible field and Nb fields. You can also sort the data according to one field selected in a list box, and select the precision of the digit display of numbers. That's all! These very basic options let you view pieces of data in a very big data set. Again, DEXTER is not a tool devoted to data table viewing. A useful remark: if you reduce the number of visible fields, leave DEXTER and forget about it, you might be quite surprised when you come back. Don't be afraid that part of your data disappeared: just check the value of Nb fields, and compare it with the total number of fields given in the node zone! Another important remark: in DEXTER the last field is flagged "Output" (see the purple-circled last field in Fig. 20). This is the default, and you can change it in the Field menu (see later). This implicit choice comes from the forecasting features: Neural Network models generated by PREDICT can be used directly in DEXTER, and by default the forecasting target in PREDICT is the last field.

The Universe menu
We already used this menu for import, but there are some other features:
- Create subnode makes a copy of the currently selected node, as a son node.
- Duplicate data node makes a copy of the currently selected node, as an independent node (no link with the selected node).
- Delete node deletes a node.
Fig. 21
Create and Duplicate are useful if you don't want to overload the graph of the current node. Duplicate and Delete can also be triggered via a right click on a node. Caution: the erased node is the node clicked by the user, NOT the currently selected node. Thus you can select a data node, and then delete a son node just by right-clicking!
- Manage nodes is disabled in the current version.
- Export creates a csv copy of all the data in the selected node.
- Smart Export creates several csv files from 2 sets of fields.
This is an example of a feature requested by one of our clients! Selecting this feature makes the following GUI appear:
Fig. 22
The user selects one or several fields F1...Fp in the left table, and other fields G1...Gq in the right table. After clicking "ok", it is mandatory to give a name to this smart export, for instance "sm1". Then q files are created, with names "sm1 N(G1)", ..., "sm1 N(Gq)":
Fig. 23
Each file "sm1 N(Gi)" contains all the data related to fields F1, ..., Fp, and Gi. Here is an obvious example of use: if a data node contains 12 fields of monthly cash flow called "January", ..., "December", and for some reason you need to build 12 files each containing just one of these monthly fields plus all the remaining data, you just have to:
- select all the fields except the 12 monthly fields in the left table,
- select the 12 monthly fields in the right table.
Then the 12 files will be generated.
- Export descriptor generates a csv file that contains information about each field:
Fig. 24
It summarizes, for each field, the info given in the node info zone and in the statistical report subnode of the selected node. Finally, the last feature, Universe preferences, lets the user set the following parameters:
Fig. 25
The 3 deactivation options are linked to the Field menu features (see below): when creating a new classified/truncated/shortened field F' from a field F, if the option is ticked, the field F is deactivated after the creation of F'. Layout algorithm proposes 8 algorithms for the graph representation of the universe. Just choose the one you like! Default block size, Max degree of parallelism and the next 2 tick boxes are the default values of these 4 parameters when importing files. Default block size should be around your RAM size divided by 20, so that you can keep using your laptop. For an overnight computation, you can increase the value, but one rule: stay below 70-80% of your RAM size, in order to avoid multiple Windows swaps!
Binning size and Binning overlap are the default values used by the 1-binning algorithm of the Wrangling menu.

The Field menu
This menu is for field management.
Fig. 26
- Activate/Deactivate field makes a field visible or not in the data table. In fact, in DEXTER one cannot delete a field, one can only deactivate it. However, if you move some fields to the deactivation zone (right table of Figure 27), select some of them and click on the "purge" button, then these fields will be erased forever (no possible recovery).
Fig. 27
- Move field changes the place of one or several fields in the data table: select the field(s) to move and a target field, then choose whether the fields will be moved before or after the target field (move before / move after).
- The 3 next features are just a swift way to deactivate fields when missing values, missing values or errors, or bad correlation occur. Bad correlation uses the correlation table given in the stat report (see later), and deactivates all fields whose correlation coefficient with the output field is below a value specified by the user.
- The next 6 entries create fields. Truncated and shortened fields are obvious, and explained in the GUI.
- Classify field creates, from a field F, a field F' that takes only a finite number N of values. The range [min, max] of F is sliced into N equal classes, and the value of F' is computed from the value of F according to its class: the value of F' can be either the middle or the average of the class. Figure 28 provides an example:
Fig. 28
Here the output field Lid120m was used to create a classification field of 4 classes, giving only 4 possible middle values: 45, 135, 225 and 315. Creating a classification field is very useful: you can create such a field and then trigger a clustering statistic report (see later)!
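As an illustration, the classify logic described above can be sketched in a few lines of Python (our own reconstruction, not DEXTER code; the function name is ours, and we only show the "middle of class" method):

```python
def classify_field(values, n_classes):
    """Slice [min, max] into n_classes equal classes and map each
    value to the middle of its class (hypothetical reconstruction)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    out = []
    for v in values:
        # Index of the class containing v; the last class includes the max.
        k = min(int((v - lo) / width), n_classes - 1)
        out.append(lo + (k + 0.5) * width)  # middle of class k
    return out

# Example: a field ranging over [0, 360] sliced into 4 classes of width 90
# yields only the 4 middle values 45, 135, 225 and 315, as in Fig. 28.
print(classify_field([10, 100, 200, 359, 0, 360], 4))
```

With 4 classes over [0, 360], each value lands on one of the middles 45, 135, 225 or 315, matching the Lid120m example.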
- Create counter field just creates a field containing the values 1, 2, ..., number of records. Sometimes useful for sorting.
- Create target value field creates one or several fields from one or several values given by the user. When you select this option, the pop-up of Figure 29 appears:
Fig. 29
A selection of fields has to be made in the left table, and one or more values have to be given in the right edit zone (the "target values"). Once this is done, a (Boolean) field is created for each target value, having value 1 for a record if at least one selected field has the target value, and 0 otherwise.
Fig. 30
Figure 30 shows the 2 target fields created, "Target Soo, Nox=52" and "Target Soo, Nox=211". Records 2 and 3 have "Target Soo, Nox=52" equal to 1, since Soo=52, and record 3 has "Target Soo, Nox=211" equal to 1, since Nox=211. CAUTION: there is a bug in the Microsoft component that drives the data table, causing the replacement of "." by "DOT" in the column titles...
- Create Markov extension from a field F "translates" the values of a field F according to a second field D of type date. It creates multiple new fields. Looking at D, there is a minimal time step t for passing from one record to the next (in chronological order). We build new fields from F by "time shifting":
Fig. 31
You select the date field D in the left table, and the numeric field F in the right table. Depth is the size of the shift (you can increase or decrease it); in Fig. 31 it is 3. Then three Markov extensions F', F'' and F''' are created from F by using F'(d) = F(d-t), F''(d) = F(d-2t), F'''(d) = F(d-3t), where "d - x*t" means we take the value of F on the record whose D-value is d - x*t. In practice, this is simply a shift, as you can see in Fig. 32:
Fig. 32
Of course, the first 3 records have voids for their Markov values.
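The time-shift behaviour above can be sketched as follows (a hypothetical Python reconstruction, assuming the records are already sorted chronologically with a constant time step t):

```python
def markov_extensions(values, depth):
    """Build `depth` shifted copies of a series: extension i holds the
    value i steps back in time, with None (void) where no history exists."""
    exts = []
    for i in range(1, depth + 1):
        # F^(i)(d) = F(d - i*t): prepend i voids, drop the i last values.
        exts.append([None] * i + values[:-i])
    return exts

# Example with depth 3: as in Fig. 32, the first records get voids.
print(markov_extensions([10, 20, 30, 40, 50], 3))
```

For depth 3, record 4 of F' holds the value of record 3, record 4 of F'' the value of record 2, and so on; the first 3 records have voids.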
Markov extensions are interesting, for instance, when building forecasting models or when you want to study time dependencies (just create Markov extensions and recompute the stat report).
- Create neighbor field creates 2 new fields. The first field provides, for each record, the record number of the closest record according to a (quadratic) distance. The second field simply gives the value of this distance. As usual, in order to define this distance, the user can select the desired fields in a list, and also fix the values of the weights (1 by default). A cut-off value, epsilon, can also be specified: as soon as a record has its distance below epsilon, it is designated as the neighbor.
- Create formula field is unavailable in this free version of DEXTER. It creates a field from others, by defining a Python formula on the selected fields.
- Finally, the last feature of the Field menu is just the classical matrix transpose. A simple pop-up asks whether a subnode has to be created or not (otherwise the node itself is transposed).

The Wrangling menu
In this menu you will find features for "wrangling", that is, modifying your data.
Fig. 33
There are 4 families of features:
- Extract a subset of records (this is called a "sampling")
- Clean ambiguous, error or missing values
- Define what an outlier value is, and clean it
- Delete some records, or replace some values by others
The last family is obvious and well explained inside DEXTER. As for the sampling features, I won't develop them here; sampling considers a subset of records by processing one of the algorithms, ranging from simply taking 1 record every N, to estimating a probability distribution of the data with a first large sampling and then building the final sampling according to it. Just remember that when doing a sampling, it is better to create a subnode (the box is ticked by default).
If not, you will definitely lose the records discarded during the sampling process! Cleaning values is also easy in DEXTER: after specifying what you want to clean (errors, missing values, or both), you choose the method you want to apply, as illustrated in Fig. 34:
Fig. 34
You can simply delete the records containing errors or missing values, or replace these values by 2*max of the concerned field(s) in order to isolate them. You can also replace by the average, or by the neighbor, that is, the record closest (for the quadratic distance) to the one being processed. You can also replace by the closest normalized neighbor, where all records are first normalized into ]-1, 1[. When you select neighbor or normalized neighbor, a pop-up asks you to specify the fields you want to use for the definition of the distance, and also a parameter, epsilon. Epsilon is here to stop the search for a neighbor: when the computed distance falls below epsilon, the algorithm ends and this last record is used to replace the values of the field. By choosing your distance and epsilon for the definition of the neighbor, you have various ways to replace the error or missing data in the fields of your data node! Note that the Create subnode box is ticked by default, so you will get a subnode after this cleaning process. Defining outliers is also easy in DEXTER: you select the fields you want to work with and then, by clicking "show curves", you get a look at the duration curves, that is, the fields sorted from min to max:
Fig. 35
You can normalize the curves into ]-1, 1[, and you can also merge the curves: this creates a vector containing all the values of the selected fields, and then sorts it. Figure 36 gives an example on 4 selected fields, the left chart showing the 4 normalized duration curves, and the right chart the merged normalized duration curve. Fig.
36
Merging is relevant when the selected fields are "of the same kind", for instance the generation of 4 wind turbines of the same power in a wind farm. After viewing the curve(s), you can then decide on min and/or max values that define the outliers, for the fields you select in the bottom left zone: all values of these fields that are not in [min, max] will be considered as outliers. Figure 37 provides such an example, for a merge:
Fig. 37
The min and max values selected by the user now appear on the outlier chart, and if you come back to the data table, outliers are shown in orange (note that normalized min and max are shifted back to true values). After defining outliers, you can clean them using the "clean outliers" option, leading you to the GUI of Figure 38:
Fig. 38
The table provides, for each field, the min and max values defined by the user. The methods are almost the same as for error or missing value cleaning. Here you can replace simply by the selected min and max, or you can specify min and max values for all the fields at once. For replacement by neighbor, you specify a stopping criterion epsilon: when the distance between the processed record and a tested record is below epsilon, the algorithm stops and takes it as the neighbor. In any case, even after the cleaning, values that were outliers stay coloured in orange in the data table. Hence, with the Wrangling menu, you can modify your data by using algorithms!

The Tools menu
- Feature #records between values is obvious: it just gives the number of values of a field between two records.
- Run Predict Model lets the user launch Predict Neural Networks, when installed. Hence you have the capability to generate forecasts directly in DEXTER.
- The Self-Organizing Map feature is disabled in this free version.
- Compute stat generates or regenerates the statistical report, as during the import process. I used the word "regenerate" because after several data manipulations (e.g.
cleaning) it can be relevant to recompute the stat report.
Fig. 39
- Create XY-graph builds an (X, Y) chart from 2 fields X and Y: this is the set of points (X, Y) over all the records of the data node. You can also use a third field Z to colour the points.
Fig. 40
When you select this feature, TZ is replaced by the XY-editor, as in Fig. 40, where you select your (X, Y) fields, and optionally a third field for the colour. If you give a label name and then save, a node is created. So you can view an XY-chart with or without saving it! When you save it, an XY-node is created and your chart is stored (you can delete it via a delete button). On a huge data set, of 100 000 000 records for instance, the XY-chart feature runs an algorithm that selects a sample of relevant points (roughly speaking, 20 000 points on the convex hulls of the connected components, and 30 000 points inside). Another hidden option is to generate a Y-graph, that is, simply the curve of a series (the sequential order of the values of the field), just by selecting a field in the Y-zone and nothing in the X-zone:
Fig. 41
One can also select a field F in the Y-zone and a field C for the colour: in that case, the series F is split into n sub-series such that C is constant on each sub-series:
Fig. 42
Figure 42 shows 6 series, because the colorization was performed using a "classification" field with 6 classes. The length of each of these 6 curves varies, depending on the size of the classes of the classification field.
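The split of the series F by the colour field C can be sketched as follows (our own Python reconstruction; we assume one sub-series per distinct value of C, as suggested by the 6-class example above):

```python
def split_series_by_colour(f_values, c_values):
    """Group the values of F into one sub-series per distinct value of
    the colour field C (hypothetical reconstruction of the Y-graph split)."""
    sub_series = {}
    for f, c in zip(f_values, c_values):
        # Append each F-value to the sub-series of its C-class.
        sub_series.setdefault(c, []).append(f)
    return sub_series

# 2 classes of C -> 2 sub-series, whose lengths depend on the class sizes.
print(split_series_by_colour([10, 20, 30, 40, 50], ['a', 'b', 'a', 'a', 'b']))
```

This also shows why the curve lengths differ: each sub-series is exactly as long as its class is populated.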
- F-inverse stat report is a clustering report based on the values of one selected field F. The algorithm creates clusters of records on which F has a constant value, and we provide the diameters and the distance matrix between clusters (the diameter of a set of records is the maximal distance between its pairs of records; the distance between 2 clusters is the minimal value of all distances between a point of the first cluster and a point of the second). The right table lets you select fields for the definition of the (quadratic) distance, so you can get various reports just by playing with the distance!
Fig. 43
You need to give a label name to your report, because a node is created to store it (the delete button makes this node disappear, if wanted). If, for instance, you have a marketing data node containing client records, then by selecting the field "postal code", the F-inverse stat report gives you metric information about the clusters of records that share the same postal code. You could, for instance, define the metric distance just with the product ID field, to see whether some clusters are neighbors or strangers!
- Field cluster stat provides statistical reports according to a value-clustering on a class of fields you select. As an example, if you select F1 and F2 for clustering, you can define clusters of records just by requiring that (F1, F2) has a constant value (x_C, y_C) over all the records of a same cluster C. Then the stat reports are provided on the remaining fields, for each of these clusters! The fields for clustering are selected in the left table of Figure 44.
Fig. 44
In the right table you select the fields for the stat report, you give a label (because a node will be created), and you select a split or a crossed report. A crossed report performs a clustering as in the example above, whereas a split report performs a clustering (and then a report) for each field separately.
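A minimal sketch of the F-inverse idea (our own Python reconstruction: a plain quadratic distance with unit weights, and brute-force pairwise scans; DEXTER's actual implementation is certainly more efficient):

```python
import math
from collections import defaultdict

def f_inverse_report(records, f_index, dist_indices):
    """Cluster records by the value of field f_index, then report each
    cluster's diameter and its min distance to every other cluster."""
    def dist(a, b):
        # Unweighted quadratic distance over the chosen fields.
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dist_indices))

    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[f_index]].append(rec)

    report = {}
    for key, recs in clusters.items():
        # Diameter = maximal pairwise distance inside the cluster.
        diameter = max(dist(a, b) for a in recs for b in recs)
        # Inter-cluster distance = minimal pairwise distance between clusters.
        others = {k2: min(dist(a, b) for a in recs for b in recs2)
                  for k2, recs2 in clusters.items() if k2 != key}
        report[key] = (diameter, others)
    return report

# Field 0 plays the role of F (e.g. a postal code); fields 1 and 2 define
# the distance. Two records share F-value 1, one record has F-value 2.
recs = [(1, 0.0, 0.0), (1, 3.0, 4.0), (2, 0.0, 1.0)]
print(f_inverse_report(recs, 0, (1, 2)))
```

Changing `dist_indices` (the fields defining the distance) changes the whole report, which is exactly the "play with the distance" idea above.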
Here is an example, given in Figure 45, where the field cluster stat is based on a classification field (8 classes, just look at the names of the clusters):
Fig. 45
First we created a classification field of 10 values from a field Power; then we called an F-inverse stat report on this classification field. The first column of Fig. 45 provides the 10 values of this field, and the next columns the statistics for all the remaining fields. Note that you can export the table to csv, or export it as a NEW NODE, in order to be able to apply DEXTER features to these data! Now it is time to explain the 2 advanced features of this menu.
- t-SNE. t-Distributed Stochastic Neighbor Embedding (t-SNE) is an algorithm created by Laurens van der Maaten. It performs a 2D projection of the node records while trying to preserve the distribution of their pairwise distances. See Wikipedia or https://www.youtube.com/watch?v=RJVL80Gg3lA&list=UUtXKDgv1AVoG88PLl8nGXmw for more detail. At NEHOOV we wrote our own t-SNE implementation from the academic papers. When you select this feature, a pop-up appears with some explanations and 3 default parameters. After clicking the Generate button, the t-SNE is computed (in this free version, not in background mode). The result is an XY-chart, as in Fig. 46:
Fig. 46
The (X, Y) coordinates can be exported to a .csv file, or saved as fields in the data node. You can also save this as a graph node, to store the chart. It is interesting to perform a t-SNE, save the coordinates as fields, create 2 associated classification fields according to the number of "areas" in the t-SNE, and then run a 2-binning (see next section).
- N-binning. We implemented a nice version of topological N-binning, for any integer N > 0. It produces a clustering of all the records of a data node, represented as a graph whose edges show "some" connections between clusters.
You can skip what follows if you don't want the "math" behind this, and jump to Figure 50. Let's start with N=1, that is, let's present what a 1-binning is. First we need a field F, a number c, an overlap percentage p, a cut-off value v and a method, SET or RANGE. We start by processing 3 steps, as in Figure 47, to create an overlapping initial clustering:
Fig. 47
We sort the records according to the increasing values of F (step 1); then we slice our set of records into c equal pieces (step 2), which we call initial clusters: if the selected method is SET, the set of increasing values of F is split into c pieces of the same size; if the selected method is RANGE, we slice the range [min, max] of the F-values into c equal pieces. In that case, it is obvious that our clusters won't have the same size (in terms of number of elements), because the values of F are not necessarily uniformly distributed. This is the difference between the SET and RANGE methods. Then we "dilate" each piece by p% in order to get overlaps (step 3): that means we duplicate some records during this dilatation. For instance, in Figure 47, the "red star" record of the second initial cluster is duplicated in order to have this record in both initial clusters 1 and 2. Once we have this, we process each initial cluster in order to get a sub-clustering. For this, we need a distance d, by default the weighted quadratic distance on the fields, but the user can change it. For any initial cluster C containing records R_1...R_q, we compute the sequence of consecutive differences X_2 = d(F_2, F_1), X_3 = d(F_3, F_2), X_4 = d(F_4, F_3), etc., where F_i is the value of F in the record R_i. Then we sort the X_i's increasingly, and we take the v-th greatest value V of the sequence, where v is our cut-off value. We can now define the sub-clusters of C by simply parsing the records R_1, R_2, etc. The first sub-cluster C_1 contains R_1. We add records R_2, R_3, etc. to C_1 as long as X_2, X_3, etc. stays below V.
When we reach a record R_i such that X_i > V, we create a new sub-cluster C_2 and restart the process. This gives us the nodes of our graph, and we say that 2 nodes are linked if there is a record that belongs to both of them. In the end, one gets a graph such as in Figure 48:

Fig. 48

In DEXTER, nodes are labeled "A x B", where A is the number of the initial cluster and B the number of the sub-cluster within initial cluster A. This ends the description of the 1-binning.

For N-binning with N > 1, we select N fields F_1, …, F_N. For each i = 1, …, N we split the range [min_i, max_i] of F_i into c clusters C_(i,j) according to the SET or RANGE method, as described before. We thus obtain c^N clusters, defined by the Cartesian products C_(1,j_1) x … x C_(N,j_N), providing N-dimensional pavings. Then the p% dilation works as before, as depicted for instance in Figure 49, where N = 2 (we colored 2 dilated clusters for illustration):

Fig. 49

The left diagram shows the clusters C_(i,j) before dilation, the right diagram after it. Then the building of sub-clusters follows exactly the process described in the 1-binning case, and one obtains a graph as given in Figure 48.

So far we have fully presented the building of the binning graph, but there is also another result of the binning process: the 'distance' matrix. For this, a new set of fields needs to be specified by the user; it will be used to compute distances between the sub-clusters of the binning graph. In that way, one can say that the binning graph is built with a first set of fields, and the distance evaluation with a second one. An example: one can build a binning graph by using some customer attribute fields (e.g. location, age, etc.), and then compute the distance between clusters of customers by using the monthly amount of purchases, allowing you to identify dissimilarities between clusters (that is, those with high distances).

The distance matrix has the following shape: Fig.
50

The first column provides the names of the sub-clusters. For each of them, we get the diameter (= the maximal pairwise distance value), the min & max of the fields that define the distance, the count, and the min & max NON-ZERO values of the pairwise distances to the other sub-clusters (we write min* and max*, with a '*', to underline the fact that they are non-zero). Finally, each distance between the sub-cluster and the others is given.

Playing with all the binning parameters (e.g. your own weighted quadratic distances for sub-clustering and for the matrix evaluation) allows you to obtain very different nodes and very different dissimilarities between nodes! Figure 51 shows the GUI:

Fig. 51

In the first table you choose the N fields for the N-binning. In the second you define the distance that will be used for the final computation of the sub-clusters, by using the cut-off value (in red). The initial cluster number c is compulsory, as is a label for the creation of the node. The third table is used to define the distance used for the matrix report that provides diameters and inter-distances between sub-clusters (see Fig. 50). These distances are weighted quadratic; you can modify the "1" values to set your own weights. Finally, the sorting-method radio buttons let you choose how the initial clusters are computed: by slicing the range [min, max] of the field into c equal intervals (option RANGE; initial clusters can then have different numbers of records), or by slicing the set of records into c equal subsets according to F increasing (option SET).

Results are given in the data table zone. They are presented in 3 folders you can reach by clicking on the 3 radio buttons on the top left; see Figure 48 for the first folder, which presents the sub-cluster graph. The second folder provides the graph of initial clusters:

Fig. 52

The third folder gives the metric matrix: Fig.
53

In all these 3 folders, you can add a binning field to the data node: this field provides, for each record, the name of the cluster it belongs to. Thus, you can process clever downstream analyses by using Finverse or the field cluster stat! You can also export the matrix as a new data node, or simply as a csv file.

The stat node

Just by clicking on this node, some obvious indicators are given, by field:

Fig. 54

Each field has, on its top right, a button for relaunching the stat computation for that field (to be used for instance after a cleaning, or when the stat computation was not requested during the import process). On the top right of the table there is a button that relaunches the computation for all the fields. Field Type is a list box that allows you to change the type of the field if needed. For instance, if a field has only 0 or 1 values, DEXTER will set its type to 'boolean', and you might want it as integer. Cardinal gives the number of distinct values of the field; singleton gives the number of values of the field that occur only once; P10, P50 and P90 are percentiles.

Below the first table lies the correlation table, of all fields to the output field:

Fig. 55

Time of use

In this free version of DEXTER, almost all the algorithms run in background mode (except t-SNE and Add Field), allowing the user to keep working in DEXTER. However, in this free version it is not possible to launch more than one background algorithm: you must wait for the end of the running background task before launching another one.

Nevertheless, we give below a few assessments of computation times on big data sets. M stands for million, k for thousand. The laptop used had 6 GB of RAM and an Intel Core i5 (one dual-core CPU).
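Back to the stat node for a moment: the per-field indicators it reports (cardinal, singleton, percentiles) are easy to reproduce. Here is a stdlib Python sketch of how such a stat report can be computed; it is our own illustration, not DEXTER's code, and it uses a nearest-rank percentile, which may differ from DEXTER's exact percentile method.

```python
from collections import Counter

def field_stats(values):
    """Compute stat-node-style indicators for one field (a sketch):
    cardinal  = number of distinct values,
    singleton = number of values occurring exactly once,
    P10/P50/P90 = nearest-rank percentiles."""
    counts = Counter(values)
    s = sorted(values)
    def pct(p):  # nearest-rank percentile over the sorted values
        k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
        return s[k]
    return {
        "count": len(values),
        "cardinal": len(counts),
        "singleton": sum(1 for v in counts.values() if v == 1),
        "min": s[0], "max": s[-1],
        "P10": pct(10), "P50": pct(50), "P90": pct(90),
    }
```

For a boolean-looking field (only 0/1 values), cardinal would be 2, which is how a tool can guess the 'boolean' type before the user overrides it.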
The universe preferences were 600 MB for the block size and 5 threads. Fig. 56 gives the measured times for:

- 20k records, 2k fields — Import
- 1M records, 100 fields — Import
- 75M records, 13 fields — Import
- 1M records, 30 fields — R-reservoir
- 75M records, 13 fields — R-reservoir
- 75M records, 13 fields — 1-binning

Fig. 56

In Figure 57 a 4-binning is given, on a data node of 100 000 records and 200 fields, with 8 initial clusters, overlap = 2% and cut-off = 2.

Fig. 57

This binning graph has 1 500 nodes and was computed in 218 133 seconds, that is, about 2.5 days!! All this on a tiny laptop, because we have plenty of time… this is ECO DATA ANALYTICS!!

Copyright NEHOOV©, 2016. This material is free of use.