Lavastorm Desktop Professional Tech Note

Correlating Data
Lavastorm Desktop Professional allows the user to acquire and transform data from multiple
data sources. Often a correlation between two of the acquired data sources is required.
Furthermore, data may need to be enriched with a second data source. The Correlation library
contains multiple nodes to perform these functions. This Tech Note details the Lookup node
which enriches a data source with reference data and the Join nodes which combine two data
sets by a specified key value.
Lookup Node
The Lookup node is used to enrich a data source with reference data. The primary data source
must be connected to the first input pin and the reference data to the second. If the inputs are
swapped, the node enriches the reference data instead, producing unexpected output. The
Lookup node works best when the second data set is small, because the second data set is
loaded into memory when the node is processed.
After connecting the data source and reference data to the Lookup node’s input pins, the node
can be configured by double-clicking on it. The configuration window is shown in Figure 2.
Figure 1 - Lookup Node.
1 2012 | www.lavastorm.com
Lavastorm Analytics©
Figure 2 - Lookup node configuration.
First, be sure to provide the node with a meaningful name. Next, the matching key from each
data set must be specified. Right-click within the InputKey area to select the key value from the
first data set (Figure 3). Repeat within the LookupKey area to select the key value from the
second set. Once the node is run, all records from the first data set are output along
with the fields of any matching record from the second data source. If there is no matching
record, nulls appear in the second data set fields (Figure 4). Unmatched records can be omitted
from the output by un-commenting (removing the # in front of) line 5, which uses the phrase
where matchIsFound (Figure 5). The output of the Lookup node can be further customized using
BRAINscript.
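The behavior described above can be modeled outside the product. The following is a minimal Python sketch, not Lavastorm code: the function name lookup, the match_only flag (standing in for un-commenting where matchIsFound) and the sample fields are all hypothetical, and dropping the lookup key from the added fields is an assumption.

```python
def lookup(primary, reference, input_key, lookup_key, match_only=False):
    """Enrich each primary record with the fields of its matching reference record."""
    # The reference set is held in memory, indexed by its key.
    index = {row[lookup_key]: row for row in reference}
    # Assumption: the lookup key itself is not repeated in the output.
    ref_fields = [f for f in (reference[0] if reference else {}) if f != lookup_key]
    out = []
    for row in primary:
        match = index.get(row[input_key])
        if match is None and match_only:
            continue  # models enabling "where matchIsFound"
        enriched = dict(row)
        for field in ref_fields:
            # Unmatched records receive nulls (None) in the reference fields.
            enriched[field] = match[field] if match is not None else None
        out.append(enriched)
    return out


# Hypothetical sample data.
accounts = [{"AccountID": 1, "Amount": 10}, {"AccountID": 2, "Amount": 20}]
regions = [{"AccountID": 1, "Region": "East"}]

all_rows = lookup(accounts, regions, "AccountID", "AccountID")
matched = lookup(accounts, regions, "AccountID", "AccountID", match_only=True)
```

Here all_rows keeps both accounts, with Region set to None for AccountID 2, while matched keeps only AccountID 1.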
Figure 3 – Right-clicking to select the key value input field.
Figure 4 - Null field resulting from no matching key in the reference data.
Figure 5 - Controlling the output of unmatched records
Join Nodes
Data records from two full data sources can be combined using the Join nodes, which reside in
the Lavastorm Correlation library (Figure 6). Specialized Join nodes allow you to perform the
four types of joins (inner, outer, left and right) and combinations of them. To use a Join
node, double-click on the specific type of join that you wish to perform. Each type of Join node
has the same configuration parameters, which are shown in Figure 7.
In order to perform the join between two data sources, the records in each data source must
be sorted by the input’s join key. The Join node provides two parameters, SortLeftInput and
SortRightInput, which, when set to true, will sort the inputs by the key. If the input data is not
sorted, the error in Figure 8 appears.
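As a rough illustration of the sorting requirement, the Python sketch below either sorts an input by its key or rejects it when it is out of order. It is only a model: prepare_input and its sort_input flag are hypothetical stand-ins for the SortLeftInput and SortRightInput parameters, and the error raised is not the message shown in Figure 8.

```python
def prepare_input(records, key, sort_input=False):
    """Return records ordered by key, or fail if they are unsorted."""
    if sort_input:
        # Models setting SortLeftInput / SortRightInput to true.
        return sorted(records, key=lambda r: r[key])
    # Otherwise the input must already be in key order.
    for prev, cur in zip(records, records[1:]):
        if cur[key] < prev[key]:
            raise ValueError(f"input is not sorted by {key!r}")
    return records


# Hypothetical unsorted input.
left = [{"ID": "B"}, {"ID": "A"}]
sorted_left = prepare_input(left, "ID", sort_input=True)
```

Calling prepare_input(left, "ID") without the flag raises ValueError, because "A" follows "B" in the input.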
Figure 6 – Join nodes in the Correlation library.
Figure 7 – Configuration window for the Join nodes
Figure 8 – Join node error which occurs if the data is not sorted by the data source’s join input key.
The left data source input key on which the join will occur is entered after LeftInputKey. The
field can be accessed directly by right-clicking within the LeftInputKey area and selecting the
proper input as was shown in Figure 3.
By default, the RightInputKey contains the text {{^LeftInputKey^}}. Any word enclosed in {{^^}}
is a parameter. In this case, the parameter refers to the LeftInputKey entered in the previous
step. If both the right and left input keys have the same name, the default RightInputKey can
be used. If the keys are different, the right input key must be entered. Right-click within the
RightInputKey area to select the appropriate field.
Expressions can also be entered within the InputKey fields. Figure 9 displays an example in
which the BRAINscript function int() is used to convert the AccountID field in the right input
data source to an integer. In this case, the conversion was necessary because the AccountID
field in the right input is a string whereas the AccountID on the left is an integer. The input
keys of the two data sources being joined must have the same data type.
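The need for matching data types can be seen in a short Python sketch. It uses Python's built-in int() as a stand-in for the BRAINscript int() function, and the key values are hypothetical.

```python
left_key = 1001      # AccountID stored as an integer on the left
right_key = "1001"   # AccountID stored as a string on the right

# Compared as-is, an integer and a string never match.
matches_raw = left_key == right_key

# Converting the right key to an integer makes the keys comparable.
matches_converted = left_key == int(right_key)
```

Without the conversion the two keys do not compare equal, so no records would be joined on them.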
The Script section is specific to the type of Join node used. The default script does not need to
be modified; however, BRAINscript can be added to customize the output.
X-Ref Node
The X-Ref node (Figure 9), found in the Correlation library, provides full visibility into the two
joined data sources by performing a left, inner and right join. This powerful node contains
three outputs, one for each type of join it performs, resulting in a transparent view of all the
data separated into the records that matched and the orphan records on each side.
The X-Ref configuration window is shown in Figure 10. The configuration parameters are the
same as those in the other Join nodes and are configured in the same way as explained in the
Join Node section. The three output pins that result from running the node can be seen in
Figure 10. The top pin contains the orphan records on the left; the middle pin contains all the
matching records; and the bottom pin contains all orphaned records on the right.
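The three outputs can be pictured with a small Python sketch. This models only the pin contents described above; the function xref and the sample fields are hypothetical, and the real node works on sorted inputs rather than the nested loop used here.

```python
def xref(left, right, left_key, right_key):
    """Split two data sets into left orphans, matches and right orphans."""
    left_keys = {r[left_key] for r in left}
    right_keys = {r[right_key] for r in right}
    # Top pin: left records with no partner on the right.
    left_orphans = [r for r in left if r[left_key] not in right_keys]
    # Middle pin: every matching pair, merged into one record.
    matches = [
        {**l, **r} for l in left for r in right if l[left_key] == r[right_key]
    ]
    # Bottom pin: right records with no partner on the left.
    right_orphans = [r for r in right if r[right_key] not in left_keys]
    return left_orphans, matches, right_orphans


# Hypothetical sample data.
left = [{"ID": "A", "L": 1}, {"ID": "B", "L": 2}]
right = [{"ID": "B", "R": 3}, {"ID": "C", "R": 4}]
left_orphans, matches, right_orphans = xref(left, right, "ID", "ID")
```

With this data, ID A is a left orphan, ID B is the only match and ID C is a right orphan.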
Figure 9 – X-Ref node.
Figure 10 – X-Ref node configuration window.
The Results of Duplicate Key Values in Lookup and Join Nodes
Lookup Node
Multiple records containing the same join key in a data set will result in different outputs
depending on whether a Lookup node or one of the Join nodes is used. In the Lookup node,
duplicate key values in the reference (second) data source are ignored; only one record per key
value is used for the lookup. Because the duplicate records in the second data source are
discarded, the lookup may result in unexpected outputs if duplicates aren’t properly handled
ahead of time. Records with duplicate key values in the first data set are all passed through
the Lookup node, and the reference data is added to each of them. Figure 11 provides an
example of how duplicate key values are handled in the Lookup node.
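A minimal Python sketch of this behavior follows. It is an illustration only: the data is hypothetical, and keeping the first reference record per key is an assumption, since the text above states only that one record per key value is used.

```python
# Hypothetical data: the key "A" is duplicated on both sides.
reference = [{"ID": "A", "Name": "first"}, {"ID": "A", "Name": "second"}]
primary = [{"ID": "A", "Amount": 1}, {"ID": "A", "Amount": 2}]

index = {}
for row in reference:
    # Assumption: the first record per key wins; later duplicates are ignored.
    index.setdefault(row["ID"], row)

# Every primary record passes through and is enriched from the single kept record.
result = [{**row, "Name": index[row["ID"]]["Name"]} for row in primary]
```

Both primary records appear in result, and both carry the same Name value; the second reference record never contributes.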
Figure 11 – Results of duplicate records in a Lookup node.
Join Nodes
In Join nodes, multiple records containing the same key value will result in a Cartesian product
of the records. A Cartesian product pairs every record in one set with every record in the
other. An example of the result of records with duplicate key values for the Join node is shown
in Figure 12. In this example, the ID field is the join key. For ID value A, there are two records
on the left side of the join and one record on the right, which results in an output of two
records (2x1=2). For ID B, there are two records in the left data source and two records in the
right data source. The Cartesian product of these records results in four records (2x2=4),
which is every possible pairing of left and right records for ID B.
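The record counts in this example can be reproduced with a short Python sketch. The key counts follow the text (two left and one right record for A, two left and two right for B); the other field names and values are hypothetical, since the contents of Figure 12 are not reproduced here.

```python
from collections import Counter

# Two left records for A and two for B.
left = [
    {"ID": "A", "L": 1}, {"ID": "A", "L": 2},
    {"ID": "B", "L": 3}, {"ID": "B", "L": 4},
]
# One right record for A and two for B.
right = [{"ID": "A", "R": 10}, {"ID": "B", "R": 20}, {"ID": "B", "R": 30}]

# Pair every left record with every right record that shares its key.
joined = [{**l, **r} for l in left for r in right if l["ID"] == r["ID"]]

# Count the output records per key.
counts = Counter(row["ID"] for row in joined)
```

The join yields two records for A and four for B, six in total. The output for a key grows as the product of the duplicate counts on each side, which is why large numbers of duplicates can exhaust memory.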
Figure 12 – Results of duplicate records in a Join node.
Checking for Duplicates
To avoid unexpected results in the Lookup node, or memory errors in the Join nodes caused by
an unexpectedly large number of duplicate key values, it is best to check for duplicates using
the Duplicate Detector node (Figure 13) found in the Profiling and Patterns library.
Figure 13 – The Duplicate Detection Node.
The configuration window of the Duplicate Detector node is shown in Figure 14. To check for
duplicates on the join key of your data sources, enter the join key field in the InputExpr by
right-clicking within the InputExpr expression and selecting the correct field. If
ErrorIfDuplicates is set to true (the default), the node will error if duplicates are found. The
Duplicate Detector has two outputs: the first contains all the unique records and the second
contains all the duplicate records.
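The check can be sketched in Python as follows. This is a model under stated assumptions, not the node's implementation: detect_duplicates and error_if_duplicates are hypothetical names, and treating the first occurrence of a key as unique and later occurrences as duplicates is an assumption about how the two outputs are divided.

```python
def detect_duplicates(records, key, error_if_duplicates=True):
    """Split records into first occurrences and repeated key values."""
    seen = set()
    unique, duplicates = [], []
    for row in records:
        if row[key] in seen:
            duplicates.append(row)  # assumption: repeats go to the second output
        else:
            seen.add(row[key])
            unique.append(row)
    if duplicates and error_if_duplicates:
        # Models ErrorIfDuplicates set to true.
        raise ValueError(f"{len(duplicates)} duplicate record(s) on {key!r}")
    return unique, duplicates


# Hypothetical sample data with one repeated key.
records = [{"ID": "A", "V": 1}, {"ID": "B", "V": 2}, {"ID": "A", "V": 3}]
unique, duplicates = detect_duplicates(records, "ID", error_if_duplicates=False)
```

With error_if_duplicates left at its default of True, the same call raises ValueError instead of returning the two lists.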
Figure 14 – Duplicate Detection node configuration.
Handling Duplicate Records
If required, duplicate records can be removed from the data set using either the Sort node
or the Agg Ex node, both found in the Aggregation and Transformation library. The Sort node is
configured to remove duplicates by entering the join key field in the CompareOrderExpr area
and setting Unique to true. The Unique parameter removes duplicates in a non-deterministic
manner, so which record is kept for each key is not guaranteed. The Agg Ex node can be used
as an alternative to the Sort node to remove duplicates.
To configure the Agg Ex node, first set the SortInput parameter to true if the data is not already
sorted by the join key. Enter the join key in the GroupBy field and replace
referencedFields(1,{{^GroupBy^}}) with a *. The Script section should then read as follows:
emit *
where lastInGroup
The Agg Ex will then sort the records and output only the last record for each key.
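The effect of emitting only where lastInGroup can be modeled in Python. This is a sketch, not BRAINscript: the sample data is hypothetical, and because Python's sorted() is stable, "last" here means last in the original input order within each key, which may not match the ordering the Agg Ex node's own sort produces.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data with duplicate keys.
records = [
    {"ID": "B", "V": 1}, {"ID": "A", "V": 2},
    {"ID": "B", "V": 3}, {"ID": "A", "V": 4},
]

# Sort by the join key (models SortInput set to true).
ordered = sorted(records, key=itemgetter("ID"))

# Keep only the last record of each key group (models "where lastInGroup").
deduped = [list(group)[-1] for _, group in groupby(ordered, key=itemgetter("ID"))]
```

The result holds one record per key: the record with V 4 for A and the record with V 3 for B.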
Lookup vs. Join Nodes
While both the Lookup and Join nodes provide similar functions by correlating two data sets,
there are different cases in which one should be used over the other. Figure 15 summarizes the
difference in use and functionality between the two types of nodes.
Figure 15 – Lookup vs. Join nodes.
