
VizJSON Data Guide
Introduction
Simple data tables are specified by giving the fields and rows of data values, as described in the
VizJSON Reference Guide. This document describes the specification of derived data. Derived
data tables use the same tabular model as simple data, where the table has zero or more fields
(columns) with properties including a unique identifier, and zero or more rows of data values.
Derived fields and values can thus be used in the grammar part of the VizJSON specification in
exactly the same way as simple data.
Derived data tables are specified in the data section of a VizJSON specification, with an entity
that declares how the table is to be created. The fields and data values of a derived table are
computed from another data table called the source table, specified using the source property
with the source table's identifier. In the VizJSON specification, the derived table must appear
after the source table; an exception is thrown if this is not the case. The source table may be a
simple table, a derived table, or a table created by a data provider.
The fields and data rows of the derived table are calculated from those of the source table. The
calculation is specified with the type property which gives the transformation to be applied to the
source table data, and group, input, and output properties which specify how the transform is to
be applied to create fields and how the rows of the derived table are to be calculated.
Derived data is built in the following manner:
- The fields are determined by examining the specification. Both group and output
  properties can create fields in the derived table.
- The group property is used to construct groups. The source data rows are broken into
  separate groups, one for each unique combination of values in the group fields.
- The data transformation is applied separately to each group of source rows, producing
  one or more rows of the derived table. The group values are copied into the derived
  fields. The output values are computed from the values in the group of source rows,
  using information in the output specification for the derived field and (usually) the input
  specifications.
Thus typically a derived table will consist of fields defined by group and fields defined by the
output transforms applied to the input fields. This makes the derived data specification look a
little like a SQL statement such as:
SELECT groupVar1 AS groupOutput1, ..., groupVarN AS groupOutputN,
transform1(input1, ..., inputM) AS output1,
...,
transformP(input1, ..., inputM) AS outputP
FROM source
GROUP BY groupVar1, ..., groupVarN
The transform type affects how the derived rows are calculated. One important aspect of this is
how many derived rows are produced per source group or row. There are three general patterns:
- One derived row per group. The output values are calculated from all the source values
  in the group. This is used by the summary transformation, for example, where the output
  values are obtained by applying statistical summary operations such as sum, average,
  median, or count.
- One derived row per source row. The output values are still calculated from all the
  input values in the group, but usually one of the source rows has some special role. This
  is used by the percent transformation, for example, where the output value is the source
  value as a percentage of the sum of its group values. This type of transform also usually
  allows copying any source field into the derived table without transforming it. Each
  derived row also has what is called a primary row in the source table, which is available
  through the RAVE data API.
- Multiple derived rows per group. The output values generally represent the results of a
  curve-fitting algorithm and consist of multiple (x,y) pairs which trace the output curve for
  each group. This is used by the smooth.regression transformation, for example.
Specification
Unlike simple tables, derived data specifications do not have fields or rows properties. The type
and source properties must be present for a data specification to be recognized as a derived
table. The following properties are used:
id
[String] Identifier to be used to refer to this data table. Unique within this specification.
type
[String] Type of transformation to use. Currently supported transformations are summary,
order, percent, integrate, smooth.regression, smooth.kernel, density.kernel, and
distribution.normal.
source
[Reference to Data entity] This is a reference to the source table that will be used to
compute the derived values in this data entity.
group
[Array of Entity] Defines source fields used to partition the source data rows into groups.
Group fields may be copied into the derived table.
input
[Array of Entity] Defines the input fields from the source data. Input fields are used by
the data transformations but are not copied into the derived table.
output
[Array of Entity] Defines fields in the derived table and the way in which each is
calculated from the source fields.
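Putting these together, the overall shape of a derived-data specification is sketched below. This
is an outline only; the identifiers are hypothetical, and the group, input, and output entries are
described in the sections that follow.

{
  "id": "derivedTable",
  "type": "summary",
  "source": { "$ref": "sourceTable" },
  "group": [ ... ],
  "input": [ ... ],
  "output": [ ... ]
}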
group
Defines groupings for the transforms to work on. If not present, only one group is used. The
values of the group fields may optionally be copied into the derived table. Each entry defines a
field to be used for grouping, and has the following structure:
input
[Reference to Field] Required. Defines the field to use for the grouping. Must be a field
of the source data.
output
[Entity] Optional. Defines the derived field to which the input group field will be copied.
If missing, the field is not included in the derived table. The field definition uses the same
properties as in a simple table field specification. When present, the field definition must
have at least the "id" property. All other field parameters are optional, and if not present
default values calculated from the input field will be used.
granularity
[Number] Optional. Defines a granularity for this field, with exactly the same results as
defining the granularity in the output field specification.
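For example, a group entry that copies its grouping field into the derived table, with a
granularity, might look like the following sketch (the identifiers "price" and "price_group" and
the granularity value are hypothetical):

{
  "input": { "$ref": "price" },
  "output": { "id": "price_group" },
  "granularity": 10
}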
input
Defines input fields used to calculate the output fields. In some cases input fields are not required
or can be specified in the output, but some transforms require multiple fields which play
different roles in the transformation. For example, a weighted sum uses two fields, one of which
acts as a weight and multiplies the values of the other before they are summed. The input entity
defines these fields and their roles, which are used by all output fields.
field
[Reference to Field] Required. Reference to a source field that is to be used in a given
role.
role
[String] Optional (but usually necessary). The role that the input field plays in the
transformation, for example x, y, sort, or weight. The applicable roles depend on the
transformation type.
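For the weighted-sum example above, a sketch of the input entries might look like this (the field
id "unit_weight" is hypothetical; the "weight" role is described under the summary transform's
sum function):

"input": [
  { "field": { "$ref": "unit_weight" }, "role": "weight" }
]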
output
Defines output fields. Each entry has a definition for a field and information used to
calculate the values of the field, including input fields, output roles, and parameters used by the
transformation.
field
[Entity] Required. A field specification for the output field using the same properties as
in a simple table field specification. The field definition must have at least the "id"
property. All other field parameters are optional, and if not present default values
calculated from the input field will be used.
role
[String] Optional. The role of the output field for the results of the transformation. Some
transformations require roles, others do not.
input
[Reference to Field] Optional. Used for output transformations that operate on a single
input field and whose results do not depend on, or affect, any other outputs that are
calculated.
parameters
[Entity] Parameters used to calculate the output values. The applicable parameters are
dependent upon the transformation.
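A typical output entry therefore combines a field definition, an input reference, and parameters,
as in this sketch (the ids "amount" and "total_amount" are hypothetical):

{
  "field": { "id": "total_amount" },
  "input": { "$ref": "amount" },
  "parameters": { "summaryFunction": "sum" }
}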
Note: The input property of the data specification defines source fields which are used by all the
data transformations. The input property of an output field specification defines one source field
which is used by that output. The input property of a group specification defines one source field
which is used for grouping.
Note: The output property of the data specification defines a field present in the output data
which is computed by a data transformation. The output property of a group specification
indicates that the field will be copied into the output data.
Examples of Table Calculation
This section gives several examples of derived tables, with a brief explanation of how the
derived table rows are calculated. The examples use this source table with id "salesTable". It has
two categorical fields giving a product type and a specific product, and numerical fields with
sales and other information.
product_id  product_type  sales  ship_weight  year_introduced
10          2             125    2.5          1998
8           0             63     1.2          1998
23          1             84     2.2          2006
7           0             15     1.8          1998
60          0             145    3.4          2013
48          2             210    1.5          2010
29          1             78     0.9          2002
35          1             77     2.5          2002
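For reference, the simple-table syntax is defined in the VizJSON Reference Guide; assuming the
fields-and-rows form described there, "salesTable" might be declared roughly as follows (a
sketch only, not the normative syntax):

{
  "id": "salesTable",
  "fields": [
    { "id": "product_id" }, { "id": "product_type" }, { "id": "sales" },
    { "id": "ship_weight" }, { "id": "year_introduced" }
  ],
  "rows": [
    [10, 2, 125, 2.5, 1998],
    [8, 0, 63, 1.2, 1998],
    [23, 1, 84, 2.2, 2006],
    [7, 0, 15, 1.8, 1998],
    [60, 0, 145, 3.4, 2013],
    [48, 2, 210, 1.5, 2010],
    [29, 1, 78, 0.9, 2002],
    [35, 1, 77, 2.5, 2002]
  ]
}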
Summary Example
The summary transform produces one output row per source group. In this specification we
group on the "product_type" field of the source table, and include that group field in the output
fields with name "ptype". We also calculate two summaries of the "sales" field, one applying the
"sum" operation and one the "mean". The data specification for the table is:
{
  "id": "summaryTable",
  "type": "summary",
  "source": { "$ref": "salesTable" },
  "group": [
    { "input": { "$ref": "product_type" }, "output": { "id": "ptype" } }
  ],
  "output": [
    { "input": { "$ref": "sales" }, "field": { "id": "total_sales" },
      "parameters": { "summaryFunction": "sum" } },
    { "input": { "$ref": "sales" }, "field": { "id": "average_sales" },
      "parameters": { "summaryFunction": "mean" } }
  ]
}
When this specification is processed it is recognized as a derived-data specification because the
"source" and "type" properties are present. The source table is located, and the ids which are
used as source fields in the specification ("product_type" and "sales") are checked; an exception
is thrown if any of these are not source table fields.
The rows of that table are grouped by the value of the "product_type" field, creating three groups
for the values 0, 1, 2. Since the "type" is "summary", one derived row will be created for each
group. The derived rows will be in the order of the group values. The group for value 0 has three
rows, in which the "sales" field has values 63, 15, and 145. When this group is processed:
- The value 0 of group field "product_type" is copied to the output "ptype". (Note that all
  the rows necessarily have the same value for the group field.)
- The values 63, 15, and 145 of the "sales" field are combined with the "sum" operation,
  giving value 223 for the output "total_sales".
- The values 63, 15, and 145 of the "sales" field are combined with the "mean" operation,
  giving value 74.333 for the output "average_sales".
The groups for values 1 and 2 are processed similarly, and the resulting table is:
ptype  total_sales  average_sales
0      223          74.333
1      239          79.667
2      335          167.5
Since the table has one row per group, the derived rows do not have a primary row in the source
table; it is not possible to say that any one row of the source is uniquely associated with the
output. For the same reason, it is not possible to copy non-group fields into the derived table. For
example the "ship_weight" field for the group with value 0 has the values 1.2, 1.8, and 3.4, so
there is no single value that could be assigned to the output field. We could of course apply a
summary function to the "ship_weight" field and use the result as a field of the output table.
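For example, adding this output entry (the id "total_weight" is hypothetical) would sum the ship
weights instead; for the group with value 0 it would produce 1.2 + 1.8 + 3.4 = 6.4:

{ "input": { "$ref": "ship_weight" }, "field": { "id": "total_weight" },
  "parameters": { "summaryFunction": "sum" } }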
The unspecified properties of the fields are inherited from the source fields, or calculated from
the result rows. For example the "ptype" field was created from the categorical "product_type"
field, so it will have the same category labels as that field (and thus the value 0 will have the
same label in both).
The SQL equivalent for this specification is:
SELECT product_type AS ptype, SUM(sales) AS total_sales, AVG(sales) AS average_sales
FROM salesTable
GROUP BY product_type;
Percent Example
The percent transform produces one output row per input row. In this specification we group on
the "product_type" field of the source table, and include that group field in the output fields with
name "ptype". We apply the "percent" operation to the "sales" field to create the "sales_percent",
but we also copy both the "sales" and the "year_introduced" fields (renaming them, as is
necessary since all RAVE identifiers must be unique) into the derived table. The data
specification for the table is:
{
  "id": "percentTable",
  "type": "percent",
  "source": { "$ref": "salesTable" },
  "group": [
    { "input": { "$ref": "product_type" }, "output": { "id": "ptype" } }
  ],
  "output": [
    { "input": { "$ref": "sales" }, "field": { "id": "percent_sales" },
      "parameters": { "summaryFunction": "percent" } },
    { "input": { "$ref": "sales" }, "field": { "id": "sales2" } },
    { "input": { "$ref": "year_introduced" }, "field": { "id": "year2" } }
  ]
}
The specification is again recognized as a derived-data specification, and the source table is
located and the field identifiers are checked. The groups on "product_type" are then formed as
before. Since the "type" is "percent", one output row is created from each input row, but the
other rows of its group also contribute to the calculation.
The rows are processed and output in the order of the group values, so the first source row used
is the one with "product_id" 8, "product_type" 0, "sales" 63, "ship_weight" 1.2, and
"year_introduced" 1998.
- The value 0 of group field "product_type" is copied to the output "ptype".
- The output field "percent_sales" has input field "sales" and the "percent" summary
  function. The values of the "sales" input field for the group are 63, 15, and 145. The
  percent is calculated as 100*63/(63+15+145), or 28.251. Note here the dependency on
  the source row, since the value 63 is the one used in the numerator; the dependency on
  the rest of the group is in the sum in the denominator.
- The output field "sales2" has input field "sales" but no summary function, so the value 63
  is copied into the output.
- Similarly the output field "year2" is set by copying the value 1998 from the source
  "year_introduced".
The remaining rows are processed similarly, and the resulting table is:
ptype  percent_sales  sales2  year2
0      28.251         63      1998
0      6.726          15      1998
0      65.023         145     2013
1      35.146         84      2006
1      32.636         78      2002
1      32.218         77      2002
2      37.313         125     1998
2      62.687         210     1998
Each derived row has a primary row in the source table. This is also what makes it possible to
copy the non-grouped, non-transformed values into the "sales2" and "year2" fields of the derived
table.
Density Kernel Example
The density.kernel transform produces multiple output rows per input group, giving the (x,y)
coordinates of a kernel density estimate fit to the input data. In this specification we group on the
"product_type" field of the source table, and include that group field in the output fields with
name "ptype". The "ship_weight" is used as the input "x" role, and the "x" and "y" output roles
become the "kernelX" and "kernelY" derived fields. The data specification for the table is:
{
  "id": "kernelTable",
  "type": "density.kernel",
  "source": { "$ref": "salesTable" },
  "group": [
    { "input": { "$ref": "product_type" }, "output": { "id": "ptype" } }
  ],
  "input": [
    { "field": { "$ref": "ship_weight" }, "role": "x" }
  ],
  "output": [
    { "field": { "id": "kernelX" }, "role": "x" },
    { "field": { "id": "kernelY" }, "role": "y" }
  ]
}
After finding the source table and checking the field references, the groups on "product_type" are
formed. Since the transform "type" is "density.kernel", an Epanechnikov function will be fit to
the "ship_weight" data values for each group, generating numerous rows of data for each group.
For example the group for value 0 with input values 1.2, 1.8, and 3.4 generates 41 rows with "x"
values ranging from 0.04 to 4.55 and "y" values between 0 and 1.22. These are combined with a
similar number of rows for each of the groups with values 1 and 2, and the resulting table is (in
part):
ptype  kernelX  kernelY
0      0.048    0.0
0      0.160    0.121
0      0.273    0.229
...    ...      ...
0      4.552    0.0
1      -0.252   0.0
1      -0.154   0.106
...    ...      ...
1      3.652    0.0
2      0.348    0.0
2      0.430    0.090
...    ...      ...
2      3.570    0.090
2      3.652    0.0
Since each derived row corresponds to a group in the source table rather than to a single source
row, the derived rows do not have primary rows, and it is not possible to copy fields other than
the group fields from the source table to the derived table.
Summary data transform
The summary transform computes basic statistical summaries similar to SQL SUM or COUNT
operations. In the basic form of this transform, the output will contain one row per group.
That row will contain the summary values defined as output fields, in addition to specified group
fields.
There is a second form of this transform, called replicate, which can be requested in the output
parameters. When using replicate mode, the output will contain the same number of rows as the
input, and all output fields that reference input data fields and have no summary function
specified will be copied to the output. This means you may have the same input data as part of
the output plus some extra columns from the summarized values. This is similar to doing an SQL
summary and joining the result with the original table on the group columns.
The data parameters are used as follows:
type
Always summary.
group
Used to specify the grouping. Group fields may be copied into the derived table.
input
Used to specify the input fields and their roles in the output values, e.g. to specify the
field that plays the role of a "weight".
output
The parameters entity defines the summary operation. The parameters must have
property summaryFunction, described below. They may also have property replicate,
which if true causes the output table to have one row per input row, as previously
described.
The input property is the field to which the summary is applied. This is optional for the
count summary function and required for the other summary functions, all of which
operate on this field. It is required for fields that are to be copied to the output when using
replicate.
The role property is not used by summary data.
For example, the following summary specification uses replicate mode, copying the grouping
field and two input fields and adding the per-group maximum of "Quantity" to every row:

{
  "id": "summaryData",
  "type": "summary",
  "group": [
    {
      "input": {"$ref": "Order method"},
      "output": {"id": "summary_OrderMethod"}
    }
  ],
  "output": [
    {
      "field": {"id": "summary_ProductLine"},
      "input": {"$ref": "Product line"}
    },
    {
      "field": {"id": "summary_Quantity"},
      "input": {"$ref": "Quantity"}
    },
    {
      "field": {"id": "maxQuantity"},
      "input": {"$ref": "Quantity"},
      "parameters": {
        "replicate": true,
        "summaryFunction": "max"
      }
    }
  ],
  "source": {"$ref": "data"}
}
Summary Functions
The summaryFunction property of the output parameters defines how the value in the output
field is calculated from the input field. This is a string property with allowed values:
count
A count of the number of rows (similar to SQL COUNT). This function does not require
an input field.
percentOfCount
A percentage value of the number of rows counted in a group against the total rows. This
function does not require an input field.
sum
The sum of the values of the input field (the one in the output specification), similar to
SQL SUM. If the data input fields include one with role "weight", a weighted sum is
produced with each input field value multiplied by the weight field value before being
added to the sum.
mean
The sum of the input data field divided by the count (similar to SQL AVG). Weighting is
not used.
min
The minimum of the input data field (similar to SQL MIN). Weighting is not used.
max
The maximum of the input data field (similar to SQL MAX). Weighting is not used.
median
The 50th weighted percentile: 50% of the total (weight of the) values are less than or
equal to the median. Note that the median might fall between two data points; in that case,
the two neighboring values are interpolated.
lowerFence
The smallest value in the data that is greater than or equal to the lower hinge minus 1.5
times the difference between the upper and lower hinges.
lowerHinge
The 25th weighted percentile: 25% of the total (weight of the) values are less than or
equal to the lower hinge. Note that the lower hinge might fall between two data points; in
that case, the two neighboring values are interpolated.
upperHinge
The 75th weighted percentile: 75% of the total (weight of the) values are less than or
equal to the upper hinge. Note that the upper hinge might fall between two data points; in
that case, the two neighboring values are interpolated.
upperFence
The largest value in the data that is less than or equal to the upper hinge plus 1.5 times
the difference between the upper and lower hinges.
Note: The median, lowerFence, lowerHinge, upperHinge, and upperFence summary functions
are intended for use with the boxplot schema element. The median is the first position in the
boxplot element, and the other four functions (in that order) are the four components of the
element.
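For example, output entries feeding a boxplot element might compute all five statistics from a
single input field, as in this sketch (the input id "fY" and the output ids are hypothetical):

"output": [
  { "field": { "id": "mid" }, "input": { "$ref": "fY" }, "parameters": { "summaryFunction": "median" } },
  { "field": { "id": "lf" }, "input": { "$ref": "fY" }, "parameters": { "summaryFunction": "lowerFence" } },
  { "field": { "id": "lh" }, "input": { "$ref": "fY" }, "parameters": { "summaryFunction": "lowerHinge" } },
  { "field": { "id": "uh" }, "input": { "$ref": "fY" }, "parameters": { "summaryFunction": "upperHinge" } },
  { "field": { "id": "uf" }, "input": { "$ref": "fY" }, "parameters": { "summaryFunction": "upperFence" } }
]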
For example, these output entries compute a count, a percentage of count, and the minimum and
maximum of the input field "fX":

"output": [
  {
    "field": {"id": "binnedCount"},
    "parameters": {"summaryFunction": "count"}
  },
  {
    "field": {"id": "percentBinnedCount"},
    "parameters": {"summaryFunction": "percentOfCount"}
  },
  {
    "field": {"id": "binnedLower"},
    "input": {"$ref": "fX"},
    "parameters": {"summaryFunction": "min"}
  },
  {
    "field": {"id": "binnedUpper"},
    "input": {"$ref": "fX"},
    "parameters": {"summaryFunction": "max"}
  }
],
Using Custom Summary Functions
Users of the summary transform can also provide and use their own Summary Functions. To do
this they need to register a SummaryFunctionProviderFactory object into the
SummaryFunctionService. This factory will be responsible for returning SummaryFunctions by
ID (the ID will be the value provided in the spec for the summaryFunction attribute). As an
example (using pseudo-code):
public class MyCustomSummaryFunctionProviderFactory implements SummaryFunctionProviderFactory {
    public List<String> getSummaryFunctionIDList() {
        ArrayList<String> list = new ArrayList<String>();
        list.add("external.myCustomFunction");
        return list;
    }

    public SummaryFunction getSummaryFunction(String functionID) {
        if ("external.myCustomFunction".equals(functionID)) {
            return new MySummaryFunction();
        } else {
            return null;
        }
    }
}

// later, the factory can be registered by doing:
SummaryFunctionProviderFactory myFactory = new MyCustomSummaryFunctionProviderFactory();
SummaryFunctionService.getInstance().registerFactory(myFactory);
And then, the spec could look like:
{
  "id": "data",
  "source": { "$ref": "base" },
  "type": "summary",
  "group": [
    {
      "input": { "$ref": "base_type" },
      "output": { "id": "type" }
    }
  ],
  "output": [
    {
      "field": {"id": "myCustomFieldID"},
      "parameters": {"summaryFunction": "external.myCustomFunction"}
    }
  ]
}
In this case, the SummaryFunction with ID "external.myCustomFunction" will be used when
computing the summary value for the field "myCustomFieldID".
The SummaryFunction needs to implement the method compute. This method will receive as
parameters the values of the current "column" that must be aggregated, and the corresponding
weights for those column values. There is a third parameter called precomputedData which is
optionally used by summary functions that may want to share data between invocations. For
example, to implement an unweighted sum function:
public Number compute(Number[] values, double[] weights, Object precomputedData) {
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
        if (values[i] != null) {
            sum += values[i].doubleValue();
        }
    }
    return sum;
}
Order transform
The order data transform creates one derived output row per input row. The transform re-orders
the rows by the values of specified fields, with an option to sort the fields in ascending or
descending order.
type
Always order.
group
Not used and has no effect.
input
Not used and has no effect.
output
[Array of Entity] Each entity defines one derived field as a copy of a source field, using
the input and field properties in the usual way. The role property is described below. The
parameters property may have an ascending boolean property, default false, which
determines the sort order.
The output properties define the source fields to be copied to the derived table and those fields
which are to be used for sorting. Any or all fields may be copied. The sorting is applied on those
fields with a role of "sort" or "sortN" where N is a non-negative number, called the sort
priority; "sort" is treated as "sort0" with sort priority 0. The rows are sorted by fields with a
sort role, using a stable sort in which null values are sorted before non-null values. The fields are
processed in the order of increasing sort priority, with ties broken by the original order of the
fields in the output array.
For example, the specification:

{
  "id": "orderTable",
  "type": "order",
  "source": { "$ref": "sourceTable" },
  "output": [
    { "input": { "$ref": "f0" }, "field": { "id": "s_f0" } },
    { "input": { "$ref": "f3" }, "field": { "id": "s_f3" }, "role": "sort1" },
    { "input": { "$ref": "f1" }, "field": { "id": "s_f1" }, "role": "sort1",
      "parameters": { "ascending": true } },
    { "input": { "$ref": "f2" }, "field": { "id": "s_f2" } },
    { "input": { "$ref": "f4" }, "field": { "id": "s_f4" }, "role": "sort" }
  ]
}
will copy the values for each of the source fields "f0", "f1", etc. into the corresponding derived
field "s_f0", "s_f1", etc. Fields "f0" and "f2" have no sort role so are simply copied; they do
not affect the order of the rows in the derived table. The other fields have a sort role, with field
"f4" having sort priority 0 and fields "f3" and "f1" both having sort priority 1. Field "f3"
appears before "f1" in the list, so the rows will be stable-sorted first by field "f4", then by field
"f3", and finally by field "f1". The first two sorts are in the default descending order, and the
last in ascending order. Note that this sorting affects only the derived table, and the source table
is unchanged.
The row order in RAVE data tables has relatively little impact on the final chart. Three
applications where the order transform may be useful are:
- Categorical spans. When specifying a span of a scale for a categorical data value, the
  span property "method" can be set to "data". This causes the categories to be ordered by
  their first appearance in the data rows. Ordering the rows by a field will thus indirectly
  order the categories.
- Inputs to other derived tables. The data transformations that create derived tables
  process groups in the order of the values, and within each group process the rows in their
  table order. In some cases, using an order transform to create a derived table then basing
  another derived table on that ordered table can affect the results.
- Z-ordering. The z-order (drawing order) of the chart shapes is sometimes the same as the
  row order. However, the z-order is also affected by the application of aesthetics (the
  shapes are grouped by common aesthetic value, then output in group order) and by
  coordinate transforms such as stacking, so except in the simplest charts this is not a
  reliable way to control z-order.
An example combining the first and second of these cases is found in the pareto chart, which has
a categorical X-axis and plots some other field on the Y-axis. The chart shows both an interval
(bar) with the second field, and a line with the cumulative total of the second field. The bars are
to be sorted, left-to-right, in descending order. To produce this we first apply an order transform
on the original data, copying the X and Y fields and sorting in descending order on the Y field.
This gives the table for the interval element. It has the rows in the desired order, so we include
"method":"data" in the span specifications for the X scale. We then use an integrate transform
on the sorted table to create another derived table, copying the X values and computing the
running sum of the Y values (using the "y1" role). This gives the table for the line element.
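A sketch of that two-table chain follows (the table and field ids are hypothetical; the sort
defaults to descending order, and the "y1" role is described under the integrate transform
below):

{
  "id": "sortedData",
  "source": { "$ref": "rawData" },
  "type": "order",
  "output": [
    { "input": { "$ref": "category" }, "field": { "id": "sortedX" } },
    { "input": { "$ref": "value" }, "field": { "id": "sortedY" }, "role": "sort" }
  ]
},
{
  "id": "cumulativeData",
  "source": { "$ref": "sortedData" },
  "type": "integrate",
  "input": [
    { "role": "x", "field": { "$ref": "sortedX" } },
    { "role": "y", "field": { "$ref": "sortedY" } }
  ],
  "output": [
    { "role": "x", "field": { "id": "paretoX" } },
    { "role": "y1", "field": { "id": "cumulativeY" } }
  ]
}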
Percent transform
A percent transform creates one derived row per input row. The rows are grouped and, for each
row in the group, the transform computes the percentage value of a given output field relative to
the sum of the absolute values of all the input values in that group. The percentage values are on a scale of 0 to
100, so if an input group had rows with values 1, 3, and 4, the corresponding output rows would
have values 12.5, 37.5, and 50. It is also possible to copy any of the source fields to the derived
table without applying a transform.
type
Always percent.
group
Used to specify the grouping. Group fields may be copied into the derived table.
input
Not used for percent transform.
output
[Array of Entity] Each entity defines one derived field from a source field, using the
input and field properties in the usual way. If the parameters set
"summaryFunction":"percent", the percent transform is applied to the source field to
create the derived field. Otherwise the source field is copied to the derived field.
For example, the following specification groups on the source field "fX" and copies its value to
the derived table as "X". It applies the percent transform to the source field "fY" to create
derived field "Y". Derived field "Z" is simply a copy of source field "fZ".
{
  "id": "percentified",
  "source": { "$ref": "data" },
  "type": "percent",
  "group": [
    { "input": {"$ref": "fX"}, "output": {"id": "X"} }
  ],
  "output": [
    { "input": {"$ref": "fY"}, "field": {"id": "Y"},
      "parameters": {"summaryFunction": "percent"} },
    { "input": {"$ref": "fZ"}, "field": {"id": "Z"} }
  ]
}
Integrate data transform
Given a table with (x, dy) pairs, generates a new table with intervals [y0, y1], such that
y1 - y0 = dy, and each row has y0 equal to the y1 of the previous row (or 0 for the first row).
Note: When using this data transform, do not use scale.categories to exclude or reorder
categories of the X-role.
Example (input on the left, output on the right):
Notice how, on the right, each bar applies the delta specified in the label, starting where the last
bar left off. At the end, a Total bar (white) is added which starts at 0 and reaches the total sum of
all the deltas.
Example transformation:

{
  "id": "wf",
  "source": {"$ref": "data"},
  "type": "integrate",
  "input": [
    {"role": "x", "field": {"$ref": "x"} },
    {"role": "y", "field": {"$ref": "dy"} },
    {"role": "yPercent", "field": {"$ref": "percent"} }
  ],
  "output": [
    {"role": "x", "field": {"id": "xc"},
      "parameters": {"addTotalCategory": "Total", "sortCategories": true} },
    {"role": "y", "field": {"id": "yc"} },
    {"role": "y0", "field": {"id": "y0"} },
    {"role": "y1", "field": {"id": "y1"} },
    {"role": "yPercent", "field": {"id": "yPercent"}},
    {"field": {"id": "copyOfF0"}, "input": {"$ref": "F0"} }
  ]
}
type
Always integrate.
input
Used to specify the input fields and their roles in the output function; in this case, both
"x" and "y" roles must be present. "yPercent" is optional. It specifies the percentage of
subset contribution in fill-o-meter chart.
output
All output roles are optional except x.
- x: Copy of the original input x. This output field allows two optional entries in its
  "parameters". One of them is "addTotalCategory", which gives the name of an
  additional category with the total value; when used, this option also adds a row to the
  output table with the total values. The other is "sortCategories", a boolean value, true
  by default: if true, the output data rows are sorted in the same order in which the x data
  field categories were specified; if false, each row stays in the same order in which it
  was defined in the input rows section. See the previous code for an example.
- y: Copy of the original input y.
- y0: The base of each waterfall interval (equal to where the previous y1 left off).
- y1: The target of each waterfall interval (equal to y0 plus the input delta).
- yPercent: The target of the subset contribution in each waterfall interval.
- type: An aesthetic helper. It stores a categorical value based on the trend and role of
  the bar. Basically, one of six categorical values: 0: normal bar/steady, 1: normal
  bar/going up, 2: normal bar/going down, 3: total-bar/at zero, 4: total-bar/positive,
  5: total-bar/negative.
- from: Another helper. It stores the x coordinate of the previous bar, or null if that bar
  was invalid (for example, those with a null y input role). In the case of the first bar, the
  value is null. This is usually used for creating lines linking the bars.

An output field with an input field and no role copies the value from the input table. This
is shown in the above code example where "copyOfF0" has values from the input "F0".
If the "addTotalCategory" option is used with the output x-role, these fields will have null
values in the output row for the total since there is no single input row associated with
that row.
Smooth data transform
{
  "id": "smooth",
  "source": { "$ref": "dDelimitedFileSource" },
  "type": "smooth.regression",
  "input": [
    {"role": "x", "field": { "$ref": "fX" }},
    {"role": "y", "field": { "$ref": "fY" }},
    {"role": "weight", "field": { "$ref": "fW" }}
  ],
  "output": [
    {"role": "x", "field": {"id": "kX"} },
    {"role": "y", "field": {"id": "kY"} },
    {"role": "accuracy", "field": {"id": "kLabel"} }
  ]
}
Weighted simple linear regression
A smooth takes (x, y) pairs and produces a smoothed version that fits a predictive model y = f(x).
type
smooth.regression defines a simple linear regression. smooth.kernel defines a loess
smooth that does a "locally weighted regression".
input
role : x, y, weight. Weight is an optional role. The weight field should be non-negative
and is used to calculate the smoothed value, but is not part of the calculation of the
window size (the degree of smoothing). As such it should not be used as a replication
weight, in the statistical sense.
output
parameters : Smooth-specific attribute parameters TBD.
input : None (not used for smooth).
role : x, a set of values spread evenly along the dimension; y, the predicted value for
each x; or accuracy, the R^2 accuracy measure for this smooth [Experimental].
Kernel density estimation
{
  "id": "kernel",
  "source": { "$ref": "dDelimitedFileSource" },
  "type": "density.kernel",
  "input": [
    {"role": "x", "field": { "$ref": "fX" }},
    {"role": "weight", "field": { "$ref": "fW" }}
  ],
  "output": [
    {"role": "x", "field": {"id": "kX"} },
    {"role": "y", "field": {"id": "kY"} }
  ]
}
A density estimation takes x and produces a kernel density estimate of the probability density
function y = f(x). The default kernel function is the Epanechnikov function.
type
density.kernel
will define a kernel density estimation.
input
role : x, weight. Weight is an optional role. The weight field should be non-negative
and is used to calculate the density, but is not part of the calculation of the window size
(the degree of smoothing). As such it should not be used as a replication weight, in the
statistical sense.
output
role : x, a set of values spread evenly along the dimension, and y, a non-negative
density for each x.
Normal distribution
{
  "id": "normaldistribution",
  "source": { "$ref": "dDelimitedFileSource" },
  "type": "distribution.normal",
  "input": [
    {"role": "x", "field": { "$ref": "fX" }},
    {"role": "weight", "field": { "$ref": "fW" }}
  ],
  "output": [
    {"role": "x", "field": {"id": "kX"} },
    {"role": "y", "field": {"id": "kY"} }
  ]
}
A normal distribution takes x and produces a normal distribution using the sample mean and the
sample standard deviation.
type
distribution.normal defines a normal distribution.
input
role : x, weight. Weight is an optional role. The weight field should be non-negative
and is used to calculate the distribution.
output
role : x, a set of values spread evenly along the dimension, and y, a non-negative
distribution value for each x.