Package ‘datadr’
October 2, 2016
Type Package
Title Divide and Recombine for Large, Complex Data
Version 0.8.6
Date 2016-09-22
Maintainer Ryan Hafen <[email protected]>
Description Methods for dividing data into subsets, applying analytical
methods to the subsets, and recombining the results. Comes with a generic
MapReduce interface as well. Works with key-value pairs stored in memory,
on local disk, or on HDFS, in the latter case using the R and Hadoop
Integrated Programming Environment (RHIPE).
License BSD_3_clause + file LICENSE
URL http://deltarho.org/docs-datadr
LazyLoad yes
LazyData yes
NeedsCompilation no
Imports data.table (>= 1.9.6), digest, codetools, hexbin, parallel,
magrittr, dplyr, methods
Suggests testthat (>= 0.11.0), roxygen2 (>= 5.0.1), Rhipe
RoxygenNote 5.0.1
Additional_repositories http://ml.stat.purdue.edu/packages
Author Ryan Hafen [aut, cre],
Landon Sego [ctb]
Repository CRAN
Date/Publication 2016-10-02 15:51:50
R topics documented:

datadr-package, addData, addTransform, adult, applyTransform, as.data.frame.ddf,
as.list.ddo, bsv, charFileHash, combCollect, combDdf, combDdo, combMean,
combMeanCoef, combRbind, condDiv, convert, ddf, ddf-accessors, ddo,
ddo-ddf-accessors, ddo-ddf-attributes, digestFileHash, divide, divide-internals,
drAggregate, drBLB, drFilter, drGetGlobals, drGLM, drHexbin, drJoin, drLapply,
drLM, drPersist, drQuantile, drRead.table, drSample, drSubset, flatten,
getCondCuts, getSplitVar, hdfsConn, kvApply, kvPair, kvPairs, localDiskConn,
localDiskControl, makeExtractable, mr-summary-stats, mrExec, print.ddo,
print.kvPair, print.kvValue, readHDFStextFile, readTextFileByChunk, recombine,
removeData, rhipeControl, rrDiv, setupTransformEnv, to_ddf, updateAttributes,
%>%, Index

datadr-package
datadr: Divide and Recombine for Large, Complex Data
Description
datadr: Divide and Recombine for Large, Complex Data
Details
http://deltarho.org/docs-datadr/
Author(s)
Ryan Hafen
Maintainer: Ryan Hafen <[email protected]>
Examples
help(package = datadr)
addData
Add Key-Value Pairs to a Data Connection
Description
Add key-value pairs to a data connection
Usage
addData(conn, data, overwrite = FALSE)
Arguments
conn
a kvConnection object
data
a list of key-value pairs (list of lists where each sub-list has two elements, the
key and the value)
overwrite
if data with the same key is already present in the data, should it be overwritten?
(does not work for HDFS connections)
Note
This is generally not recommended for HDFS as it writes a new file each time it is called, and can
result in more individual files than Hadoop likes to deal with.
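One way to reduce the file count is to batch many key-value pairs into a single addData call
rather than calling it once per pair. A minimal sketch (assuming a kvConnection conn like the
one created in the Examples below):

# build a list of several key-value pairs, then write them in one call
pairs <- lapply(1:3, function(i) list(as.character(i), iris[sample(150, 10), ]))
addData(conn, pairs)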
Author(s)
Ryan Hafen
See Also
removeData, localDiskConn, hdfsConn
Examples
## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)
addTransform
Add a Transformation Function to a Distributed Data Object
Description
Add a transformation function to be applied to each subset of a distributed data object
Usage
addTransform(obj, fn, name = NULL, params = NULL, packages = NULL)
Arguments
obj
a distributed data object
fn
a function to be applied to each subset of obj - see details
name
optional name of the transformation
params
a named list of objects external to obj that are needed in the transformation
function (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
Details
When you add a transformation to a distributed data object, the transformation is not applied
immediately, but is deferred until a function that kicks off a computation is called. These include
divide, recombine, drJoin, drLapply, drFilter, drSample, drSubset. When any of these are
invoked on an object with a transformation attached to it, the transformation will be applied in the
map phase of the MapReduce computation prior to any other computation. The transformation will
also be applied any time a subset of the data is requested. Although the data has not been physically
transformed after a call of addTransform, we can think of it conceptually as already being
transformed.
To force the transformation to be immediately calculated on all subsets use: drPersist(dat, output = ...).
The function provided by fn can accept either one or two parameters. If it accepts one parameter,
the value of a key-value pair is passed in. If it accepts two parameters, it is passed the key as the
first parameter and the value as the second parameter. The return value of fn is treated as a value of
a key-value pair unless the return type comes from kvPair.
When addTransform is called, it is tested on a subset of the data to make sure all of the global
variables and packages necessary to portably perform the transformation are available.
It is possible to add multiple transformations to a distributed data object, in which case they are
applied in the order supplied, but only one transform should be necessary.
The transformation function must not return NULL on any data subset, although it can return an
empty object of the correct shape to match other subsets (e.g. a data.frame with the correct columns
but zero rows).
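A minimal sketch of the one- and two-parameter forms of fn (using the same in-memory iris
division as in the Examples below):

d <- divide(iris, by = "Species")
# one parameter: fn receives only the value of each key-value pair
t1 <- addTransform(d, function(v) nrow(v))
t1[[1]]
# two parameters: fn receives the key and the value; returning a kvPair sets a new key
t2 <- addTransform(d, function(k, v) kvPair(paste(k, "n", sep = "-"), nrow(v)))
t2[[1]]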
Value
The distributed data object provided by obj, with the transformation included as one of the attributes
of the returned object.
Examples
# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies
# Note a tranformation is not present in the attributes
names(attributes(bySpecies))
## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------
# Create a function that will calculate the mean of each variable
# in a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))
# Test on a subset
colMean(bySpecies[[1]][[2]])
# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed
# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))
# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]
# The transformation is automatically applied when calling any data
# operation. For example, if we call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans
## A transform that operates on both keys and values
##----------------------------------------------------------------
# We can also create a transformation that uses both the keys and values
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
newKey <- paste(key, "firstRow", sep = "-")
newVal <- val[1,]
kvPair(newKey, newVal)
}
# Apply the transformation
recombine(addTransform(bySpecies, aTransform))
adult
"Census Income" Dataset
Description
"Census Income" dataset from UCI machine learning repository
Usage
adult
Format
(From UCI machine learning repository)
• age. continuous
• workclass. Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
Without-pay, Never-worked
• fnlwgt. continuous
• education. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc,
9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
• educationnum. continuous
• marital. Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
Married-spouse-absent, Married-AF-spouse
• occupation. Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty,
Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving,
Priv-house-serv, Protective-serv, Armed-Forces
• relationship. Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
• race. White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
• sex. Female, Male
• capgain. continuous
• caploss. continuous
• hoursperweek. continuous
• nativecountry. United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras,
Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua,
Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
• income. <=50K, >50K
• incomebin. 0 if income<=50K, 1 if income>50K
Source
(From UCI machine learning repository) Link: http://archive.ics.uci.edu/ml/datasets/Adult
Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics.
E-mail: [email protected] for questions.
Data Set Information: Extraction was done by Barry Becker from the 1994 Census database. A
set of reasonably clean records was extracted using the following conditions: ((AAGE>16) &&
(AGI>100) && (AFNLWGT>1) && (HRSWK>0))
References
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.
applyTransform
Apply transformation function(s)
Description
This is called internally in the map phase of datadr MapReduce jobs. It is not meant for use outside
of there, but is exported for convenience.
Usage
applyTransform(transFns, x, env = NULL)
Arguments
transFns
from the "transforms" attribute of a ddo object
x
a subset of the object
env
the environment in which to evaluate the function (should be instantiated from
calling setupTransformEnv) - if NULL, the environment will be set up for you
Examples
# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies
# Note a tranformation is not present in the attributes
names(attributes(bySpecies))
## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------
# Create a function that will calculate the mean of each variable
# in a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))
# Test on a subset
colMean(bySpecies[[1]][[2]])
# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed
# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))
# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]
# The transformation is automatically applied when calling any data
# operation. For example, if we call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans
## A transform that operates on both keys and values
##----------------------------------------------------------------
# We can also create a transformation that uses both the keys and values
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
newKey <- paste(key, "firstRow", sep = "-")
newVal <- val[1,]
kvPair(newKey, newVal)
}
# Apply the transformation
recombine(addTransform(bySpecies, aTransform))
as.data.frame.ddf
Turn ’ddf’ Object into Data Frame
Description
Rbind all the rows of a ’ddf’ object into a single data frame
Usage
## S3 method for class 'ddf'
as.data.frame(x, row.names = NULL, optional = FALSE,
keys = TRUE, splitVars = TRUE, bsvs = FALSE, ...)
Arguments
x
a ’ddf’ object
row.names
passed to as.data.frame
optional
passed to as.data.frame
keys
should the key be added as a variable in the resulting data frame? (if key is not
a character, it will be replaced with an md5 hash)
splitVars
should the values of the splitVars be added as variables in the resulting data
frame?
bsvs
should the values of bsvs be added as variables in the resulting data frame?
...
additional arguments passed to as.data.frame
Examples
d <- divide(iris, by = "Species")
as.data.frame(d)
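The optional arguments control which columns appear in the result; a brief sketch:

# drop the key and split-variable columns from the result
head(as.data.frame(d, keys = FALSE, splitVars = FALSE))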
as.list.ddo
Turn ’ddo’ / ’ddf’ Object into a list
Description
Turn ’ddo’ / ’ddf’ Object into a list
Usage
## S3 method for class 'ddo'
as.list(x, ...)
Arguments
x
a ’ddo’ / ’ddf’ object
...
additional arguments passed to as.list
Examples
d <- divide(iris, by = "Species")
as.list(d)
bsv
Construct Between Subset Variable (BSV)
Description
Construct between subset variable (BSV)
For a given key-value pair, get a BSV variable value by name (if present)
Usage
bsv(val = NULL, desc = "")
getBsv(x, name)
getBsvs(x)
Arguments
val
a scalar character, numeric, or date
desc
a character string describing the BSV
x
a key-value pair or a value
name
the name of the BSV to get
Details
Should be called inside the bsvFn argument to divide used for constructing a BSV list for each
subset of a division.
Author(s)
Ryan Hafen
See Also
divide, getBsvs, bsvInfo
Examples
irisDdf <- ddf(iris)
bsvFn <- function(dat) {
list(
meanSL = bsv(mean(dat$Sepal.Length), desc = "mean sepal length"),
meanPL = bsv(mean(dat$Petal.Length), desc = "mean petal length")
)
}
# divide the data by species
bySpecies <- divide(irisDdf, by = "Species", bsvFn = bsvFn)
# see BSV info attached to the result
bsvInfo(bySpecies)
# get BSVs for a specified subset of the division
getBsvs(bySpecies[[1]])
# get a single BSV value by name from a subset's value
d <- divide(iris, by = "Species",
  bsvFn = function(x) list(msl = bsv(mean(x$Sepal.Length))))
getBsvs(d[[1]]$value)
getBsv(d[[1]]$value, "msl")
charFileHash
Character File Hash Function
Description
Function to be used to specify the file where key-value pairs get stored for local disk connections, useful when keys are scalar strings. Should be passed as the argument fileHashFn to
localDiskConn.
Usage
charFileHash(keys, conn)
Arguments
keys
keys to be hashed
conn
a "localDiskConn" object
Details
You shouldn’t need to call this directly other than to experiment with what the output looks like or
to get ideas on how to write your own custom hash.
Author(s)
Ryan Hafen
See Also
localDiskConn, digestFileHash
Examples
# connect to empty localDisk directory
path <- file.path(tempdir(), "irisSplit")
unlink(path, recursive = TRUE)
conn <- localDiskConn(path, autoYes = TRUE, fileHashFn = charFileHash)
# add some data
addData(conn, list(list("key1", iris[1:10,])))
addData(conn, list(list("key2", iris[11:110,])))
addData(conn, list(list("key3", iris[111:150,])))
# see that files were stored by their key
list.files(path)
combCollect
"Collect" Recombination
Description
"Collect" recombination - collect the results into a local list of key-value pairs
Usage
combCollect(...)
Arguments
...
Additional list elements that will be added to the returned object
Details
combCollect is passed to the argument combine in recombine
Author(s)
Ryan Hafen
See Also
divide, recombine, combDdo, combDdf, combMeanCoef, combRbind, combMean
Examples
# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")
# Function to calculate the mean of the petal widths
meanPetal <- function(x) mean(x$Petal.Width)
# Combine the results using rbind
combined <- recombine(addTransform(bySpecies, meanPetal), combine = combCollect)
class(combined)
combined
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(meanPetal) %>%
recombine(combCollect)
combDdf
"DDF" Recombination
Description
"DDF" recombination - results into a "ddf" object, rbinding if necessary
Usage
combDdf(...)
Arguments
...
additional attributes to define the combiner (currently only used internally)
Details
combDdf is passed to the argument combine in recombine.
If the value of the "ddo" object that will be recombined is a list, then the elements in the list will
be collapsed together via rbind.
Author(s)
Ryan Hafen
See Also
divide, recombine, combCollect, combMeanCoef, combRbind, combDdo
Examples
# Divide the iris data
bySpecies <- divide(iris, by = "Species")
## Simple combination to form a ddf
##---------------------------------------------------------
# Add a transform that selects the petal width and length variables
selVars <- function(x) x[,c("Petal.Width", "Petal.Length")]
# Apply the transform and combine using combDdf
combined <- recombine(addTransform(bySpecies, selVars), combine = combDdf)
combined
combined[[1]]
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(selVars) %>%
recombine(combDdf)
## Combination that involves rbinding to give the ddf
##---------------------------------------------------------
# A transformation that returns a list
listTrans <- function(x) {
list(meanPetalWidth = mean(x$Petal.Width),
maxPetalLength = max(x$Petal.Length))
}
# Apply the transformation and look at the result
bySpeciesTran <- addTransform(bySpecies, listTrans)
bySpeciesTran[[1]]
# And if we rbind the "value" of the first subset:
out1 <- rbind(bySpeciesTran[[1]]$value)
out1
# Note how the combDdf method row binds the two data frames
combined <- recombine(bySpeciesTran, combine = combDdf)
out2 <- combined[[1]]
out2
# These are equivalent
identical(out1, out2$value)
combDdo
"DDO" Recombination
Description
"DDO" recombination - simply collect the results into a "ddo" object
Usage
combDdo(...)
Arguments
...
additional attributes to define the combiner (currently only used internally)
Details
combDdo is passed to the argument combine in recombine
Author(s)
Ryan Hafen
See Also
divide, recombine, combCollect, combMeanCoef, combRbind, combMean
Examples
# Divide the iris data
bySpecies <- divide(iris, by = "Species")
# Add a transform that returns a list for each subset
listTrans <- function(x) {
list(meanPetalWidth = mean(x$Petal.Width),
maxPetalLength = max(x$Petal.Length))
}
# Apply the transform and combine using combDdo
combined <- recombine(addTransform(bySpecies, listTrans), combine = combDdo)
combined
combined[[1]]
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(listTrans) %>%
recombine(combDdo)
combMean
Mean Recombination
Description
Mean recombination – Calculate the elementwise mean of a vector in each value
Usage
combMean(...)
Arguments
...
additional attributes to define the combiner (currently only used internally)
Details
combMean is passed to the argument combine in recombine
This method assumes that the values of the key-value pairs each consist of a numeric vector (with
the same length). The mean is calculated elementwise across all the keys.
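Conceptually, for equal-length numeric value vectors v1, ..., vk, the combined result is the
elementwise average (v1 + ... + vk) / k. A base-R sketch of the arithmetic:

vals <- list(c(1, 2), c(3, 4), c(5, 6))
Reduce(`+`, vals) / length(vals)  # c(3, 4)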
Author(s)
Ryan Hafen
See Also
divide, recombine, combCollect, combDdo, combDdf, combRbind, combMeanCoef
Examples
# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")
# Add a transformation that returns a vector of sums for each subset, one
# sum for each variable
bySpeciesTrans <- addTransform(bySpecies, function(x) apply(x, 2, sum))
bySpeciesTrans[[1]]
# Calculate the elementwise mean of the vector of sums produced by
# the transform, across the keys
out1 <- recombine(bySpeciesTrans, combine = combMean)
out1
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(function(x) apply(x, 2, sum)) %>%
recombine(combMean)
# This manual, non-datadr approach illustrates the above computation
# This step mimics the transformation above
sums <- aggregate(. ~ Species, data = iris, sum)
sums
# And this step mimics the mean recombination
out2 <- apply(sums[,-1], 2, mean)
out2
# These are the same
identical(out1, out2)
combMeanCoef
Mean Coefficient Recombination
Description
Mean coefficient recombination – Calculate the weighted average of parameter estimates for a
model fit to each subset
Usage
combMeanCoef(...)
Arguments
...
additional attributes to define the combiner (currently only used internally)
Details
combMeanCoef is passed to the argument combine in recombine
This method is designed to calculate the mean of each model coefficient, where the same model has
been fit to subsets via a transformation. The mean is a weighted average of each coefficient, where
the weights are the number of observations in each subset. In particular, drLM and drGLM functions
should be used to add the transformation to the ddo that will be recombined using combMeanCoef.
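In other words, if subset i yields coefficient estimate b_i from n_i observations, each combined
coefficient is sum(n_i * b_i) / sum(n_i). A base-R sketch for a single coefficient:

b <- c(1.2, 0.8)     # per-subset estimates
n <- c(10, 30)       # per-subset observation counts
sum(n * b) / sum(n)  # 0.9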
Author(s)
Ryan Hafen
See Also
divide, recombine, rrDiv, combCollect, combDdo, combDdf, combRbind, combMean
Examples
# Create an irregular number of observations for each species
indexes <- sort(c(sample(1:50, 40), sample(51:100, 37), sample(101:150, 46)))
irisIrr <- iris[indexes,]
# Create a distributed data frame using the irregular iris data set
bySpecies <- divide(irisIrr, by = "Species")
# Fit a linear model of Sepal.Length vs. Sepal.Width for each species
# using 'drLM()' (or we could have used 'drGLM()' for a generalized linear model)
lmTrans <- function(x) drLM(Sepal.Length ~ Sepal.Width, data = x)
bySpeciesFit <- addTransform(bySpecies, lmTrans)
# Average the coefficients from the linear model fits of each species, weighted
# by the number of observations in each species
out1 <- recombine(bySpeciesFit, combine = combMeanCoef)
out1
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(lmTrans) %>%
recombine(combMeanCoef)
# The following illustrates an equivalent, but more tedious approach
lmTrans2 <- function(x) t(c(coef(lm(Sepal.Length ~ Sepal.Width, data = x)), n = nrow(x)))
res <- recombine(addTransform(bySpecies, lmTrans2), combine = combRbind)
colnames(res) <- c("Species", "Intercept", "Sepal.Width", "n")
res
out2 <- c("(Intercept)" = with(res, sum(Intercept * n) / sum(n)),
"Sepal.Width" = with(res, sum(Sepal.Width * n) / sum(n)))
# These are the same
identical(out1, out2)
combRbind
"rbind" Recombination
Description
"rbind" recombination - Combine ddf divisions by row binding
Usage
combRbind(...)
Arguments
...
additional attributes to define the combiner (currently only used internally)
Details
combRbind is passed to the argument combine in recombine
Author(s)
Ryan Hafen
See Also
divide, recombine, combDdo, combDdf, combCollect, combMeanCoef, combMean
Examples
# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")
# Create a function that will calculate the standard deviation of each
# variable in a subset. The calls to 'as.data.frame()' and 't()'
# convert the vector output of 'apply()' into a data.frame with a single row
sdCol <- function(x) as.data.frame(t(apply(x, 2, sd)))
# Combine the results using rbind
combined <- recombine(addTransform(bySpecies, sdCol), combine = combRbind)
class(combined)
combined
# A more concise (and readable) way to do it
bySpecies %>%
addTransform(sdCol) %>%
recombine(combRbind)
condDiv
Conditioning Variable Division
Description
Specify conditioning variable division parameters for data division
Usage
condDiv(vars)
Arguments
vars
a character string or vector of character strings specifying the variables of the
input data across which to divide
Details
Currently each unique combination of values of vars constitutes a subset. In the future, specifying
shingles for numeric conditioning variables will be implemented.
Value
a list to be used for the "by" argument to divide
Author(s)
Ryan Hafen
References
• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large
complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.
See Also
divide, getSplitVars, getSplitVar
Examples
d <- divide(iris, by = "Species")
# equivalent:
d <- divide(iris, by = condDiv("Species"))
convert
Convert ’ddo’ / ’ddf’ Objects
Description
Convert ’ddo’ / ’ddf’ objects between different storage backends
Usage
convert(from, to, overwrite = FALSE)
Arguments
from
a ’ddo’ or ’ddf’ object
to
a ’kvConnection’ object (created with localDiskConn or hdfsConn) or NULL if
an in-memory ’ddo’ / ’ddf’ is desired
overwrite
should the data in the location pointed to in to be overwritten?
Examples
d <- divide(iris, by = "Species")
# convert in-memory ddf to one stored on disk
dl <- convert(d, localDiskConn(tempfile(), autoYes = TRUE))
dl
ddf
Instantiate a Distributed Data Frame (’ddf’)
Description
Instantiate a distributed data frame (’ddf’)
Usage
ddf(conn, transFn = NULL, update = FALSE, reset = FALSE, control = NULL,
verbose = TRUE)
Arguments
conn
an object pointing to where data is or will be stored for the ’ddf’ object - can be
a ’kvConnection’ object created from localDiskConn or hdfsConn, or a data
frame or list of key-value pairs
transFn
a function to be applied to the key-value pairs of this data prior to doing
any processing, that transforms the data into a data frame if it is not stored as
such
update
should the attributes of this object be updated? See updateAttributes for more
details.
reset
should all persistent metadata about this object be removed and the object created from scratch? This setting does not affect data stored in the connection
location.
control
parameters specifying how the backend should handle things if attributes are
updated (most-likely parameters to rhwatch in RHIPE) - see rhipeControl
and localDiskControl
verbose
logical - print messages about what is being done
Examples
# in-memory ddf
d <- ddf(iris)
d
# local disk ddf
conn <- localDiskConn(tempfile(), autoYes = TRUE)
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
dl <- ddf(conn)
dl
# hdfs ddf (requires RHIPE / Hadoop)
## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/tmp/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)
ddf-accessors
Accessor methods for ’ddf’ objects
Description
Accessor methods for ’ddf’ objects
Usage
splitRowDistn(x)
## S3 method for class 'ddo'
summary(object, ...)
## S3 method for class 'ddf'
summary(object, ...)
nrow(x)
NROW(x)
ncol(x)
NCOL(x)
## S4 method for signature 'ddf'
nrow(x)
## S4 method for signature 'ddf'
NROW(x)
## S4 method for signature 'ddf'
ncol(x)
## S4 method for signature 'ddf'
NCOL(x)
## S3 method for class 'ddf'
names(x)
Arguments
x
a ’ddf’ object
object
a ’ddf’/’ddo’ object
...
additional arguments
Examples
d <- divide(iris, by = "Species", update = TRUE)
nrow(d)
ncol(d)
length(d)
names(d)
summary(d)
getKeys(d)
ddo
Instantiate a Distributed Data Object (’ddo’)
Description
Instantiate a distributed data object (’ddo’)
Usage
ddo(conn, update = FALSE, reset = FALSE, control = NULL, verbose = TRUE)
Arguments
conn
an object pointing to where data is or will be stored for the ’ddf’ object - can be
a ’kvConnection’ object created from localDiskConn or hdfsConn, or a data
frame or list of key-value pairs
update
should the attributes of this object be updated? See updateAttributes for more
details.
reset
should all persistent metadata about this object be removed and the object created from scratch? This setting does not affect data stored in the connection
location.
control
parameters specifying how the backend should handle things if attributes are
updated (most-likely parameters to rhwatch in RHIPE) - see rhipeControl
and localDiskControl
verbose
logical - print messages about what is being done
Examples
kv <- kvPairs(kvPair(1, letters), kvPair(2, rnorm(100)))
kvddo <- ddo(kv)
kvddo
ddo-ddf-accessors
Accessor Functions
Description
Accessor functions for attributes of ddo/ddf objects. Methods also include nrow and ncol for ddf
objects.
Usage
kvExample(x)
bsvInfo(x)
counters(x)
splitSizeDistn(x)
getKeys(x)
hasExtractableKV(x)
## S3 method for class 'ddo'
length(x)
Arguments
x
a ’ddf’/’ddo’ object
Examples
d <- divide(iris, by = "Species", update = TRUE)
nrow(d)
ncol(d)
length(d)
names(d)
summary(d)
getKeys(d)
ddo-ddf-attributes
Managing attributes of ’ddo’ or ’ddf’ objects
Description
These are called internally in various datadr functions. They are not meant for use outside of there,
but are exported for convenience, and can be useful for better understanding ddo/ddf objects.
Usage
setAttributes(obj, attrs)
## S3 method for class 'ddf'
setAttributes(obj, attrs)
## S3 method for class 'ddo'
setAttributes(obj, attrs)
getAttribute(obj, attrName)
getAttributes(obj, attrNames)
## S3 method for class 'ddf'
getAttributes(obj, attrNames)
## S3 method for class 'ddo'
getAttributes(obj, attrNames)
hasAttributes(obj, ...)
## S3 method for class 'ddf'
hasAttributes(obj, attrNames)
Arguments
obj
’ddo’ or ’ddf’ object
attrs
a named list of attributes to set
attrName
name of the attribute to get
attrNames
vector of names of the attributes to get
...
additional arguments
Examples
d <- divide(iris, by = "Species")
getAttribute(d, "keys")
digestFileHash
Digest File Hash Function
Description
Function to be used to specify the file where key-value pairs get stored for local disk connections,
useful when keys are arbitrary objects. File names are determined using a md5 hash of the object.
This is the default argument for fileHashFn in localDiskConn.
Usage
digestFileHash(keys, conn)
Arguments
keys
keys to be hashed
conn
a "localDiskConn" object
Details
You shouldn’t need to call this directly other than to experiment with what the output looks like or
to get ideas on how to write your own custom hash.
Author(s)
Ryan Hafen
See Also
localDiskConn, charFileHash
Examples
# connect to empty localDisk directory
path <- file.path(tempdir(), "irisSplit")
unlink(path, recursive = TRUE)
conn <- localDiskConn(path, autoYes = TRUE, fileHashFn = digestFileHash)
# add some data
addData(conn, list(list("key1", iris[1:10,])))
addData(conn, list(list("key2", iris[11:110,])))
addData(conn, list(list("key3", iris[111:150,])))
# see that files were stored by their key
list.files(path)
divide
Divide a Distributed Data Object
Description
Divide a ddo/ddf object into subsets based on different criteria
Usage
divide(data, by = NULL, spill = 1000000, filterFn = NULL, bsvFn = NULL,
output = NULL, overwrite = FALSE, preTransFn = NULL,
postTransFn = NULL, params = NULL, packages = NULL, control = NULL,
update = FALSE, verbose = TRUE)
Arguments
data
an object of class "ddf" or "ddo" - in the latter case, need to specify preTransFn
to coerce each subset into a data frame
by
specification of how to divide the data - conditional (factor-level or shingles),
random replicate, or near-exact replicate (to come) – see details
spill
integer telling the division method how many lines of data should be collected
until spilling over into a new key-value pair
filterFn
a function that is applied to each candidate output key-value pair to determine
whether it should be (if returns TRUE) part of the resulting division
bsvFn
a function to be applied to each subset that returns a list of between subset variables (BSVs)
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
preTransFn
a transformation function (if desired) to be applied to each subset prior to division
- note: this is deprecated - instead use addTransform prior to calling divide
postTransFn
a transformation function (if desired) to apply to each post-division subset
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
update
should a MapReduce job be run to obtain additional attributes for the result data
prior to returning?
verbose
logical - print messages about what is being done
Details
The division methods this function will support include conditioning variable division for factors
(implemented – see condDiv), conditioning variable division for numerical variables through shingles, random replicate (implemented – see rrDiv), and near-exact replicate. If by is a vector of
variable names, the data will be divided by these variables. Alternatively, this can be specified by
e.g. condDiv(c("var1", "var2")).
Value
an object of class "ddf" if the resulting subsets are data frames. Otherwise, an object of class "ddo".
Author(s)
Ryan Hafen
References
• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large
complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.
See Also
recombine, ddo, ddf, condDiv, rrDiv
Examples
# divide iris data by Species by passing in a data frame
bySpecies <- divide(iris, by = "Species")
bySpecies
# divide iris data into random partitioning of ~30 rows per subset
irisRR <- divide(iris, by = rrDiv(30))
irisRR
# any ddf can be passed into divide:
irisRR2 <- divide(bySpecies, by = rrDiv(30))
irisRR2
bySpecies2 <- divide(irisRR2, by = "Species")
bySpecies2
# splitting on multiple columns
byEdSex <- divide(adult, by = c("education", "sex"))
byEdSex
byEdSex[[1]]
# splitting on a numeric variable
bySL <- ddf(iris) %>%
addTransform(function(x) {
x$slCut <- cut(x$Sepal.Length, 10)
x
}) %>%
divide(by = "slCut")
bySL
bySL[[1]]
divide-internals
Functions used in divide()
Description
Functions used in divide()
Usage
dfSplit(curDF, by, seed)
addSplitAttrs(curSplit, bsvFn, by, postTransFn = NULL)
Arguments
curDF, seed
arguments
curSplit, bsvFn, by, postTransFn
arguments
Note
These functions can be ignored. They are only exported to make their use in a distributed setting
more convenient.
drAggregate
Division-Agnostic Aggregation
Description
Aggregates data by cross-classifying factors, with a formula interface similar to xtabs
Usage
drAggregate(data, formula, by = NULL, output = NULL, preTransFn = NULL,
maxUnique = NULL, params = NULL, packages = NULL, control = NULL)
Arguments
data
a "ddf" containing the variables in the formula
formula
a formula object with the cross-classifying variables (separated by +) on the
right hand side (or an object which can be coerced to a formula). Interactions
are not allowed. On the left hand side, one may optionally give a variable name
in the data representing counts; in the latter case, the columns are interpreted as
corresponding to the levels of a variable. This is useful if the data have already
been tabulated.
by
an optional variable name or vector of variable names by which to split up
tabulations (i.e. tabulate independently inside of each unique "by" variable value).
The only difference between specifying "by" and placing the variable(s) in the
right hand side of the formula is how the computation is done and how the result
is returned.
output
a "kvConnection" object indicating where the output data should reside in the
case of by being specified (see localDiskConn, hdfsConn). If NULL (default),
output will be an in-memory "ddo" object.
preTransFn
an optional function to apply to each subset prior to performing tabulation. The
output from this function should be a data frame containing variables with names
that match those of the formula provided. Note: this is deprecated - instead use
addTransform prior to calling drAggregate.
maxUnique
the maximum number of unique combinations of variables to obtain tabulations
for. This is meant to help against cases where a variable in the formula has a
very large number of levels, to the point that it is not meaningful to tabulate and
is too computationally burdensome. If NULL, it is ignored. If a positive number,
only the top and bottom maxUnique tabulations by frequency are kept.
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely
parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Value
a data frame of the tabulations. When "by" is specified, it is a ddf with each key-value pair corresponding to a unique "by" value, containing a data frame of tabulations.
Note
The interface is similar to xtabs, but instead of returning a full contingency table, data is returned
in the form of a data frame only with rows for which there were positive counts. This result is more
similar to what is returned by aggregate.
Author(s)
Ryan Hafen
See Also
xtabs, updateAttributes
Examples
drAggregate(Sepal.Length ~ Species, data = ddf(iris))
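A sketch of the by argument (an illustration, assuming a binned variable added via
addTransform, following the pattern in the divide examples); tabulations are computed
independently within each Species:

d <- ddf(iris) %>%
  addTransform(function(x) {
    x$slCut <- cut(x$Sepal.Length, 3)
    x
  })
drAggregate(Sepal.Length ~ slCut, data = d, by = "Species")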
drBLB
Bag of Little Bootstraps Transformation Method
Description
Bag of little bootstraps transformation method
Usage
drBLB(x, statistic, metric, R, n)
Arguments
x
a subset of a ddf
statistic
a function to apply to the subset specifying the statistic to compute. Must have
arguments ’data’ and ’weights’ (see details). Must return a vector, where each
element is a statistic of interest.
metric
a function specifying the metric to be applied to the R bootstrap samples of each
statistic returned by statistic. Expects an input vector and should output a
vector.
R
the number of bootstrap samples
n
the total number of observations in the data
Details
It is necessary to specify weights as a parameter to the statistic function because for BLB to
work efficiently, it must resample each time with a sample of size n. To make this computationally
possible for very large n, we can use weights (see reference for details). Therefore, only methods
with a weights option can legitimately be used here.
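The required signature can be sketched as follows (an illustrative sketch assuming the adult
data described earlier; the statistic accepts data and weights and forwards the weights to a
fitting routine that supports them):

# a statistic usable with drBLB: accepts 'data' and 'weights'
stat <- function(data, weights)
  coef(glm(incomebin ~ hoursperweek, data = data,
           weights = weights, family = binomial()))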
Author(s)
Ryan Hafen
References
Kleiner, Ariel, et al. "A scalable bootstrap for massive data." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.4 (2014): 795-816.
See Also
divide, recombine
Examples
## Not run:
# BLB is meant to run on random replicate divisions
rrAdult <- divide(adult, by = rrDiv(1000), update = TRUE)
adultBlb <- rrAdult %>% addTransform(function(x) {
drBLB(x,
statistic = function(x, weights)
coef(glm(incomebin ~ educationnum + hoursperweek + sex,
data = x, weights = weights, family = binomial())),
metric = function(x)
quantile(x, c(0.05, 0.95)),
R = 100,
n = nrow(rrAdult)
)
})
# compute the mean of the resulting CI limits
# (this will take a little bit of time because of resampling)
coefs <- recombine(adultBlb, combMean)
matrix(coefs, ncol = 2, byrow = TRUE)
## End(Not run)
drFilter
Filter a ’ddo’ or ’ddf’ Object
Description
Filter a ’ddo’ or ’ddf’ object by selecting key-value pairs that satisfy a logical condition
Usage
drFilter(x, filterFn, output = NULL, overwrite = FALSE, params = NULL,
packages = NULL, control = NULL)
Arguments
x
an object of class ’ddo’ or ’ddf’
filterFn
function that takes either a key-value pair (as two arguments) or just a value (as
a single argument) and returns either TRUE or FALSE - if TRUE, that key-value
pair will be present in the result. See examples for details.
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in filterFn (most
should be taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Value
a ’ddo’ or ’ddf’ object
Author(s)
Ryan Hafen
See Also
drJoin, drLapply
Examples
# Create a ddf using the iris data
bySpecies <- divide(iris, by = "Species")
# Filter using only the 'value' of the key/value pair
drFilter(bySpecies, function(v) mean(v$Sepal.Width) < 3)
# Filter using both the key and value
drFilter(bySpecies, function(k,v) k != "Species=virginica" & mean(v$Sepal.Width) < 3)
drGetGlobals
Get Global Variables and Package Dependencies
Description
Get global variables and package dependencies for a function
Usage
drGetGlobals(f)
Arguments
f
function
Details
This traverses the parent environments of the supplied function and finds all global variables using
findGlobals and retrieves their values. All package function calls are also found and a list of
required packages is also returned.
Value
a list of variables (named by variable) and a vector of package names
Author(s)
Ryan Hafen
Examples
a <- 1
f <- function(x) x + a
drGetGlobals(f)
drGLM
GLM Transformation Method
Description
GLM transformation method – Fit a generalized linear model to each subset
Usage
drGLM(...)
Arguments
...
arguments you would pass to the glm function
Details
This provides a transformation function to be called for each subset in a recombination MapReduce
job that applies R’s glm method and outputs the coefficients in a way that combMeanCoef knows
how to deal with. It can be applied to a ddf with addTransform prior to calling recombine.
Value
An object of class drCoef that contains the glm coefficients and other data needed by combMeanCoef
Author(s)
Ryan Hafen
See Also
divide, recombine, rrDiv
Examples
# Artificially dichotomize the Sepal.Lengths of the iris data to
# demonstrate a GLM model
irisD <- iris
irisD$Sepal <- as.numeric(irisD$Sepal.Length > median(irisD$Sepal.Length))
# Divide the data
bySpecies <- divide(irisD, by = "Species")
# A function to fit a logistic regression model to each species
logisticReg <- function(x)
drGLM(Sepal ~ Sepal.Width + Petal.Length + Petal.Width,
data = x, family = binomial())
# Apply the transform and combine using 'combMeanCoef'
bySpecies %>%
addTransform(logisticReg) %>%
recombine(combMeanCoef)
drHexbin
HexBin Aggregation for Distributed Data Frames
Description
Create "hexbin" object of hexagonally binned data for a distributed data frame. This computation
is division agnostic - it does not matter how the data frame is split up.
Usage
drHexbin(data, xVar, yVar, by = NULL, xTransFn = identity,
yTransFn = identity, xRange = NULL, yRange = NULL, xbins = 30,
shape = 1, params = NULL, packages = NULL, control = NULL)
Arguments
data
a distributed data frame
xVar, yVar
names of the variables to use
by
an optional variable name or vector of variable names by which to group hexbin
computations
xTransFn, yTransFn
a transformation function to apply to the x and y variables prior to binning
xRange, yRange range of x and y variables (can be left blank if summaries have been computed)
xbins
the number of bins partitioning the range of xbnds
shape
the shape = yheight/xwidth of the plotting regions
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Value
a "hexbin" object
Author(s)
Ryan Hafen
References
Carr, D. B. et al. (1987) Scatterplot Matrix Techniques for Large N. JASA 83, 398, 424–436.
See Also
drQuantile
Examples
# create dummy data and divide it
dat <- data.frame(
xx = rnorm(1000),
yy = rnorm(1000),
by = sample(letters, 1000, replace = TRUE))
d <- divide(dat, by = "by", update = TRUE)
# compute hexbins on divided object
dhex <- drHexbin(d, xVar = "xx", yVar = "yy")
# dhex is equivalent to running on undivided data:
hexbin(dat$xx, dat$yy)
drJoin
Join Data Sources by Key
Description
Outer join of two or more distributed data object (DDO) sources by key
Usage
drJoin(..., output = NULL, overwrite = FALSE, postTransFn = NULL,
params = NULL, packages = NULL, control = NULL)
Arguments
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
postTransFn
an optional function to be applied to each final key-value pair after joining
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...
Input data sources: two or more named DDO objects that will be joined, separated by commas (see Examples for syntax). Specifically, each input object
should inherit from the ’ddo’ class. It is assumed that all input sources are of
same type (all HDFS, all localDisk, all in-memory).
Value
a ’ddo’ object stored in the output connection, where the values are named lists with names according to the names given to the input data objects, and values are the corresponding data. The
’ddo’ object contains the union of all the keys contained in the input ’ddo’ objects specified in ....
Author(s)
Ryan Hafen
See Also
drFilter, drLapply
Examples
bySpecies <- divide(iris, by = "Species")
# get independent lists of just SW and SL
sw <- drLapply(bySpecies, function(x) x$Sepal.Width)
sl <- drLapply(bySpecies, function(x) x$Sepal.Length)
drJoin(Sepal.Width = sw, Sepal.Length = sl, postTransFn = as.data.frame)
drLapply
Apply a function to all key-value pairs of a ddo/ddf object
Description
Apply a function to all key-value pairs of a ddo/ddf object and get a new ddo object back, unless a
different combine strategy is specified.
Usage
drLapply(X, FUN, combine = combDdo(), output = NULL, overwrite = FALSE,
params = NULL, packages = NULL, control = NULL, verbose = TRUE)
Arguments
X
an object of class "ddo" or "ddf"
FUN
a function to be applied to each subset
combine
optional method to combine the results
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose
logical - print messages about what is being done
Value
depends on combine
Author(s)
Ryan Hafen
See Also
recombine, drFilter, drJoin, combDdo, combRbind
Examples
bySpecies <- divide(iris, by = "Species")
drLapply(bySpecies, function(x) x$Sepal.Width)
drLM
LM Transformation Method
Description
LM transformation method – Fit a linear model to each subset
Usage
drLM(...)
Arguments
...
arguments you would pass to the lm function
Details
This provides a transformation function to be called for each subset in a recombination MapReduce
job that applies R’s lm method and outputs the coefficients in a way that combMeanCoef knows how
to deal with. It can be applied to a ddf with addTransform prior to calling recombine.
Value
An object of class drCoef that contains the lm coefficients and other data needed by combMeanCoef
Author(s)
Landon Sego
See Also
divide, recombine, rrDiv
Examples
# Divide the data
bySpecies <- divide(iris, by = "Species")
# A function to fit a multiple linear regression model to each species
linearReg <- function(x)
drLM(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = x)
# Apply the transform and combine using 'combMeanCoef'
bySpecies %>%
addTransform(linearReg) %>%
recombine(combMeanCoef)
drPersist
Persist a Transformed ’ddo’ or ’ddf’ Object
Description
Persist a transformed ’ddo’ or ’ddf’ object by making a deferred transformation permanent
Usage
drPersist(x, output = NULL, overwrite = FALSE, control = NULL)
Arguments
x
an object of class ’ddo’ or ’ddf’
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Details
When a transformation is added to a ddf/ddo via addTransform, the transformation is deferred
until some action is taken with the data (e.g. a call to recombine). See the documentation of
addTransform for more information about the nature of transformations.
Calling drPersist() on the ddo/ddf makes the transformation permanent (persisted). In the case of
a local disk connection (via localDiskConn) or HDFS connection (via hdfsConn), the transformed
data are written to disk.
Value
a ’ddo’ or ’ddf’ object with the transformation evaluated on the data
Author(s)
Ryan Hafen
See Also
addTransform
Examples
bySpecies <- divide(iris, by = "Species")
# Create the transformation and add it to bySpecies
bySpeciesSepal <- addTransform(bySpecies, function(x) x[,c("Sepal.Length", "Sepal.Width")])
# Note the transformation is 'pending' a data action
bySpeciesSepal
# Make the transformation permanent (persistent)
bySpeciesSepalPersisted <- drPersist(bySpeciesSepal)
# The transformation is no longer pending, but a permanent part of the new ddo
bySpeciesSepalPersisted
bySpeciesSepalPersisted[[1]]
drQuantile
Sample Quantiles for ’ddf’ Objects
Description
Compute sample quantiles for ’ddf’ objects
Usage
drQuantile(x, var, by = NULL, probs = seq(0, 1, 0.005), preTransFn = NULL,
varTransFn = identity, varRange = NULL, nBins = 10000, tails = 100,
params = NULL, packages = NULL, control = NULL, ...)
Arguments
x
a ’ddf’ object
var
the name of the variable to compute quantiles for
by
an optional variable name or vector of variable names by which to group quantile
computations
probs
numeric vector of probabilities with values in [0, 1]
preTransFn
a transformation function (if desired) to be applied to each subset prior to computing quantiles (here it may be useful for adding a "by" variable that is not present). Note: this transformation should not modify var (use varTransFn for that). Also note: this argument is deprecated - instead use addTransform prior to calling divide
varTransFn
transformation to apply to variable prior to computing quantiles
varRange
range of the variable (can be left blank if summaries have been computed)
nBins
how many bins should the range of the variable be split into?
tails
how many exact values at each tail should be retained?
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...
additional arguments
Details
This division-agnostic quantile calculation algorithm takes the range of the variable of interest and
splits it into nBins bins, tabulates counts for those bins, and reconstructs a quantile approximation
from them. nBins should not get too large, but larger nBins gives more accuracy. If tails is
positive, the first and last tails ordered values are attached to the quantile estimate - this is useful
for long-tailed distributions or distributions with outliers for which you would like more detail in
the tails.
Value
data frame of quantiles q and their associated f-value fval. If by is specified, then also a variable
group.
Author(s)
Ryan Hafen
See Also
updateAttributes
Examples
# break the iris data into k/v pairs
irisSplit <- list(
list("1", iris[1:10,]), list("2", iris[11:110,]), list("3", iris[111:150,])
)
# represent it as ddf
irisSplit <- ddf(irisSplit, update = TRUE)
# approximate quantiles over the divided data set
probs <- seq(0, 1, 0.005)
iq <- drQuantile(irisSplit, var = "Sepal.Length", tails = 0, probs = probs)
plot(iq$fval, iq$q)
# compare to the all-data quantile "type 1" result
plot(probs, quantile(iris$Sepal.Length, probs = probs, type = 1))
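When by is supplied, quantiles are computed within each group and the result gains a group variable (per the Value section above); a minimal sketch continuing the example:
# approximate quantiles of Sepal.Length within each species
iqBy <- drQuantile(irisSplit, var = "Sepal.Length", by = "Species",
  tails = 0, probs = probs)
head(iqBy)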
drRead.table
Data Input
Description
Reads a text file in table format and creates a distributed data frame from it, with cases corresponding to lines and variables to fields in the file.
Usage
## S3 method for class 'table'
drRead(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
skip = 0, fill = !blank.lines.skip, blank.lines.skip = TRUE, comment.char = "#",
allowEscapes = FALSE, encoding = "unknown", autoColClasses = TRUE,
rowsPerBlock = 50000, postTransFn = identity, output = NULL, overwrite = FALSE,
params = NULL, packages = NULL, control = NULL, ...)
## S3 method for class 'csv'
drRead(file, header = TRUE, sep = ",",
quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'csv2'
drRead(file, header = TRUE, sep = ";",
quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim'
drRead(file, header = TRUE, sep = "\t",
quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
## S3 method for class 'delim2'
drRead(file, header = TRUE, sep = "\t",
quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
Arguments
file
input text file - can either be a character string pointing to a file on local disk, or an
hdfsConn object pointing to a text file on HDFS (see output argument below)
header
this and the other parameters below are passed to read.table for each chunk being processed - see read.table for more info. Most have defaults, or appropriate defaults are set through format-specific functions such as drRead.csv and drRead.delim.
sep
see read.table for more info
quote
see read.table for more info
dec
see read.table for more info
skip
see read.table for more info
fill
see read.table for more info
blank.lines.skip
see read.table for more info
comment.char
see read.table for more info
allowEscapes
see read.table for more info
encoding
see read.table for more info
autoColClasses
should column classes be determined automatically by reading in a sample? This can sometimes be problematic because of strange ways R handles quotes in read.table, but keeping the default of TRUE is advantageous for speed.
rowsPerBlock
how many rows of the input file should make up a block (key-value pair) of
output?
postTransFn
a function to be applied after a block is read in to provide any additional processing before the block is stored
output
a "kvConnection" object indicating where the output data should reside. Must
be a localDiskConn object if input is a text file on local disk, or a hdfsConn
object if input is a text file on HDFS.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in postTransFn
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...
see read.table for more info
Value
an object of class "ddf"
Note
For local disk, the file is actually read in sequentially instead of in parallel. This is because of
possible performance issues when trying to read from the same disk in parallel.
Note that if skip is positive and/or header is TRUE, these lines are read in first, since they occur only once in the data; each block is then checked for these lines, which are removed if they appear.
Also note that if you supply "Factor" column classes, they will be converted to character.
Author(s)
Ryan Hafen
Examples
## Not run:
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
irisTextConn <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
a <- drRead.csv(csvFile, output = irisTextConn, rowsPerBlock = 10)
## End(Not run)
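The format-specific variants follow the same pattern; a minimal sketch using drRead.delim on a hypothetical tab-delimited copy of the data:
## Not run:
tsvFile <- file.path(tempdir(), "iris.tsv")
write.table(iris, file = tsvFile, sep = "\t", row.names = FALSE, quote = FALSE)
tsvConn <- localDiskConn(file.path(tempdir(), "irisTab"), autoYes = TRUE)
b <- drRead.delim(tsvFile, output = tsvConn, rowsPerBlock = 10)
## End(Not run)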
drSample
Take a Sample of Key-Value Pairs
Description
Take a sample of key-value pairs
Usage
drSample(x, fraction, output = NULL, overwrite = FALSE, control = NULL)
Arguments
x
a ’ddo’ or ’ddf’ object
fraction
fraction of key-value pairs to keep (between 0 and 1)
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Examples
bySpecies <- divide(iris, by = "Species")
set.seed(234)
sampleRes <- drSample(bySpecies, fraction = 0.25)
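A quick check of the result; length() reports how many key-value pairs were kept:
# number of subsets retained by the sample
length(sampleRes)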
drSubset
Subsetting Distributed Data Frames
Description
Return a subset of a "ddf" object to memory
Usage
drSubset(data, subset = NULL, select = NULL, drop = FALSE,
preTransFn = NULL, maxRows = 500000, params = NULL, packages = NULL,
control = NULL, verbose = TRUE)
Arguments
data
object to be subsetted - an object of class "ddf" or "ddo"; in the latter case, preTransFn must be specified to coerce each subset into a data frame
subset
logical expression indicating elements or rows to keep: missing values are taken
as false
select
expression, indicating columns to select from a data frame
drop
passed on to [ indexing operator
preTransFn
a transformation function (if desired) to be applied to each subset prior to division. Note: this argument is deprecated - instead use addTransform prior to calling divide
maxRows
the maximum number of rows to return
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose
logical - print messages about what is being done
Value
data frame
Author(s)
Ryan Hafen
Examples
d <- divide(iris, by = "Species")
drSubset(d, Sepal.Length < 5)
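A sketch of column selection, assuming select follows the unquoted-column semantics of base subset():
# keep only two columns of the matching rows
drSubset(d, Sepal.Length < 5, select = c(Sepal.Length, Petal.Length))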
flatten
"Flatten" a ddf Subset
Description
Add split variables and BSVs (if any) as columns to a subset of a ddf.
Usage
flatten(x)
Arguments
x
a value of a key-value pair
See Also
getSplitVars, getBsvs
Examples
d <- divide(iris, by = "Species")
# the column "Species" is no longer explicitly in the data
d[[1]]$value
# but it is preserved and can be added back in with flatten()
flatten(d[[1]]$value)
getCondCuts
Get names of the conditioning variable cuts
Description
This is used internally for conditioning-variable division. It has little use outside of that context, but is exported for convenience.
Usage
getCondCuts(df, splitVars)
Arguments
df
a data frame
splitVars
a vector of variable names to split by
Examples
# see how key names are obtained
getCondCuts(iris, "Species")
getSplitVar
Extract "Split" Variable(s)
Description
For a given key-value pair or value, get a split variable value by name, if present (split variables are
variables that define how the data was divided).
Usage
getSplitVar(x, name)
getSplitVars(x)
Arguments
x
a key-value pair or a value
name
the name of the split variable to get
Examples
d <- divide(iris, by = "Species",
bsvFn = function(x)
list(msl = bsv(mean(x$Sepal.Length))))
getSplitVars(d[[1]]$value)
getSplitVar(d[[1]]$value, "Species")
hdfsConn
Connect to Data Source on HDFS
Description
Connect to a data source on HDFS
Usage
hdfsConn(loc, type = "sequence", autoYes = FALSE, reset = FALSE,
verbose = TRUE)
Arguments
loc
location on HDFS for the data source
type
the type of data ("map", "sequence", "text")
autoYes
automatically answer "yes" to questions about creating a path on HDFS
reset
should existing metadata for this object be overwritten?
verbose
logical - print messages about what is being done
Details
This simply creates a "connection" to a directory on HDFS (which need not have data in it). To
actually do things with this data, see ddo, etc.
Value
a "kvConnection" object of class "hdfsConn"
Author(s)
Ryan Hafen
See Also
addData, ddo, ddf, localDiskConn
Examples
## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)
kvApply
Apply Function to Key-Value Pair
Description
Apply a function to a single key-value pair - not a traditional R "apply" function.
Usage
kvApply(kvPair, fn)
Arguments
kvPair
a key-value pair (a list with 2 elements or an object created with kvPair)
fn
a function
Details
Determines how a function should be applied to a key-value pair and then applies it: if the function
has two formals, it applies the function giving it the key and the value as the arguments; if the
function has one formal, it applies the function giving it just the value. The function is assumed to
return a value unless the result is a kvPair object. When the function returns a value the original
key will be returned in the resulting key-value pair.
This provides flexibility and simplicity for when a function is only meant to be applied to the value
(the most common case), but still allows keys to be used if desired.
Examples
kv <- kvPair(1, 2)
kv
kvApply(kv, function(x) x^2)
kvApply(kv, function(k, v) v^2)
kvApply(kv, function(k, v) k + v)
kvApply(kv, function(x) kvPair("new_key", x))
kvPair
Specify a Key-Value Pair
Description
Specify a key-value pair
Usage
kvPair(k, v)
Arguments
k
key - any R object
v
value - any R object
Value
an object of class "kvPair" - a list with two elements, the key and the value
See Also
kvPairs
Examples
kvPair("name", "bob")
kvPairs
Specify a Collection of Key-Value Pairs
Description
Specify a collection of key-value pairs
Usage
kvPairs(...)
Arguments
...
key-value pairs (lists with two elements)
Value
a list of objects of class "kvPair"
See Also
kvPair
Examples
kvPairs(kvPair(1, letters), kvPair(2, rnorm(10)))
localDiskConn
Connect to Data Source on Local Disk
Description
Connect to a data source on local disk
Usage
localDiskConn(loc, nBins = 0, fileHashFn = NULL, autoYes = FALSE,
reset = FALSE, verbose = TRUE)
Arguments
loc
location on local disk for the data source
nBins
number of bins (subdirectories) to put data files into - if anticipating a large
number of k/v pairs, it is a good idea to set this to something bigger than 0
fileHashFn
an optional function that operates on each key-value pair to determine the subdirectory structure for where the data should be stored for that subset, or can be
specified "asis" when keys are scalar strings
autoYes
automatically answer "yes" to questions about creating a path on local disk
reset
should existing metadata for this object be overwritten?
verbose
logical - print messages about what is being done
Details
This simply creates a "connection" to a directory on local disk (which need not have data in it). To
actually do things with this connection, see ddo, etc. Typically, you should just use loc to specify
where the data is or where you would like data for this connection to be stored. Metadata for the
object is also stored in this directory.
Value
a "kvConnection" object of class "localDiskConn"
Author(s)
Ryan Hafen
See Also
addData, ddo, ddf, localDiskConn
Examples
# connect to empty localDisk directory
conn <- localDiskConn(file.path(tempdir(), "irisSplit"), autoYes = TRUE)
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
irisDdf <- ddf(conn, update = TRUE)
irisDdf
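For a large number of key-value pairs, binning the files into subdirectories keeps directories manageable; a minimal sketch (path and bin count are illustrative):
# spread data files across 10 subdirectories
connBinned <- localDiskConn(file.path(tempdir(), "irisBinned"),
  nBins = 10, autoYes = TRUE)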
localDiskControl
Specify Control Parameters for MapReduce on a Local Disk Connection
Description
Specify control parameters for a MapReduce job run on a local disk connection. The available parameters are described below.
Usage
localDiskControl(cluster = NULL, map_buff_size_bytes = 10485760,
reduce_buff_size_bytes = 10485760, map_temp_buff_size_bytes = 10485760)
Arguments
cluster
a "cluster" object obtained from makeCluster to allow for parallel processing
map_buff_size_bytes
determines how much data should be sent to each map task
reduce_buff_size_bytes
determines how much data should be sent to each reduce task
map_temp_buff_size_bytes
determines the size of chunks written to disk in between the map and reduce
Note
If you have data on a shared drive that multiple nodes can access or a high performance shared
file system like Lustre, you can run a local disk MapReduce job on multiple nodes by creating a
multi-node cluster with makeCluster.
If you are using multiple cores and the input data is very small, map_buff_size_bytes needs to be
small so that the key-value pairs will be split across cores.
Examples
# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)
# create a local disk control object that specifies to use this cluster
# these operations run in parallel
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations
# convert in-memory ddf to local-disk ddf
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
bySpeciesLD <- convert(divide(iris, by = "Species"), ldConn)
# update attributes using parallel cluster
updateAttributes(bySpeciesLD, control = control)
# remove temporary directories
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)
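Following the Note above, a minimal sketch of shrinking the map buffer so that a small input is still split across cores (the byte value is illustrative):
cl2 <- parallel::makeCluster(2)
controlSmall <- localDiskControl(cluster = cl2, map_buff_size_bytes = 1024)
parallel::stopCluster(cl2)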
makeExtractable
Take a ddo/ddf HDFS data object and turn it into a mapfile
Description
Take a ddo/ddf HDFS data object and turn it into a mapfile
Usage
makeExtractable(obj, control = NULL)
Arguments
obj
object of class ’ddo’ or ’ddf’ with an HDFS connection
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
Examples
## Not run:
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
# try to extract values by key (this will result in an error)
# (HDFS can only lookup key-value pairs by key if data is in a mapfile)
hdd[["3"]]
# convert hdd into a mapfile
hdd <- makeExtractable(hdd)
# try again
hdd[["3"]]
## End(Not run)
mr-summary-stats
Functions to Compute Summary Statistics in MapReduce
Description
Functions that are used to tabulate categorical variables and compute moments for numeric variables through the MapReduce framework. Used in updateAttributes.
Usage
tabulateMap(formula, data)
tabulateReduce(result, reduce.values, maxUnique = NULL)
calculateMoments(y, order = 1, na.rm = TRUE)
combineMoments(m1, m2)
combineMultipleMoments(...)
moments2statistics(m)
Arguments
formula
a formula to be used in xtabs
data
a subset of a ’ddf’ object
result, reduce.values
inconsequential tabulateReduce parameters
maxUnique
the maximum number of unique combinations of variables to obtain tabulations for. This is meant to help against cases where a variable in the formula has a very large number of levels, to the point that it is not meaningful to tabulate and is too computationally burdensome. If NULL, it is ignored. If a positive number, only the top and bottom maxUnique tabulations by frequency are kept.
y, order, na.rm
inconsequential calculateMoments parameters
m1, m2
inconsequential combineMoments parameters
m
inconsequential moments2statistics parameters
...
inconsequential parameters
Examples
d <- divide(iris, by = "Species", update = TRUE)
summary(d)
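The moment helpers compose across chunks; a sketch assuming two chunks of the same variable (order = 2 moments yield the mean and variance):
# compute moments on two chunks, combine, then convert to statistics
m1 <- calculateMoments(iris$Sepal.Length[1:75], order = 2)
m2 <- calculateMoments(iris$Sepal.Length[76:150], order = 2)
moments2statistics(combineMoments(m1, m2))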
mrExec
Execute a MapReduce Job
Description
Execute a MapReduce job
Usage
mrExec(data, setup = NULL, map = NULL, reduce = NULL, output = NULL,
overwrite = FALSE, control = NULL, params = NULL, packages = NULL,
verbose = TRUE)
Arguments
data
a ddo/ddf object, or list of ddo/ddf objects
setup
an expression of R code (created using the R command expression) to be run
before map and reduce
map
an R expression that is evaluated during the map stage. For each task, this
expression is executed multiple times (see details).
reduce
a vector of R expressions with names pre, reduce, and post that is evaluated during the reduce stage, e.g. reduce = expression(pre = {...}, reduce = {...}, post = {...}). reduce is optional; if it is not specified, a default identity reduce is performed and the map output key-value pairs will be the result. Setting it to 0 will skip the reduce altogether.
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object. If a character string, it will be treated as a path to be passed to the
same type of connection as data - relative paths will be relative to the working
directory of that back end.
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
params
a named list of objects external to the input data that are needed in the map or
reduce phases
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
verbose
logical - print messages about what is being done
Value
"ddo" object - to keep it simple. It is up to the user to update or cast as "ddf" if that is the desired
result.
Author(s)
Ryan Hafen
Examples
# compute min and max Sepal Length by species for iris data
# using a random partitioning of it as input
d <- divide(iris, by = rrDiv(20))
mapExp <- expression({
lapply(map.values, function(r) {
by(r, r$Species, function(x) {
collect(
as.character(x$Species[1]),
range(x$Sepal.Length, na.rm = TRUE)
)
})
})
})
reduceExp <- expression(
pre = {
rng <- c(Inf, -Inf)
}, reduce = {
rx <- unlist(reduce.values)
rng <- c(min(rng[1], rx, na.rm = TRUE), max(rng[2], rx, na.rm = TRUE))
}, post = {
collect(reduce.key, rng)
})
res <- mrExec(d, map = mapExp, reduce = reduceExp)
as.list(res)
print.ddo
Print a "ddo" or "ddf" Object
Description
Print an overview of attributes of distributed data objects (ddo) or distributed data frames (ddf)
Usage
## S3 method for class 'ddo'
print(x, ...)
Arguments
x
object to be printed
...
additional arguments
Author(s)
Ryan Hafen
Examples
kv <- kvPairs(kvPair(1, letters), kvPair(2, rnorm(100)))
kvddo <- ddo(kv)
kvddo
print.kvPair
Print a key-value pair
Description
Print a key-value pair
Usage
## S3 method for class 'kvPair'
print(x, ...)
Arguments
x
object to be printed
...
additional arguments
Examples
kvPair(1, letters)
print.kvValue
Print value of a key-value pair
Description
Print value of a key-value pair
Usage
## S3 method for class 'kvValue'
print(x, ...)
Arguments
x
object to be printed
...
additional arguments
Examples
kvPair(1, letters)
readHDFStextFile
Experimental HDFS text reader helper function
Description
Experimental helper function for reading text data on HDFS into a HDFS connection
Usage
readHDFStextFile(input, output = NULL, overwrite = FALSE, fn = NULL,
keyFn = NULL, linesPerBlock = 10000, control = NULL, update = FALSE)
Arguments
input
a ddo / ddf connection to a text input directory on HDFS, created with hdfsConn
- ensure the text files are within a directory and that type = "text" is specified
output
an output connection such as those created with localDiskConn and hdfsConn
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
fn
function to be applied to each chunk of lines (input to function is a vector of
strings)
keyFn
optional function to determine the value of the key for each block
linesPerBlock
how many lines at a time to read
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
update
should a MapReduce job be run to obtain additional attributes for the result data
prior to returning?
Examples
## Not run:
res <- readHDFStextFile(
input = Rhipe::rhfmt("/path/to/input/text", type = "text"),
output = hdfsConn("/path/to/output"),
fn = function(x) {
read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
}
)
## End(Not run)
readTextFileByChunk
Experimental sequential text reader helper function
Description
Experimental helper function for reading text data sequentially from a file on disk and adding to
connection using addData
Usage
readTextFileByChunk(input, output, overwrite = FALSE, linesPerBlock = 10000,
fn = NULL, header = TRUE, skip = 0, recordEndRegex = NULL,
cl = NULL)
Arguments
input
the path to an input text file
output
an output connection such as those created with localDiskConn and hdfsConn
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak)
linesPerBlock
how many lines at a time to read
fn
function to be applied to each chunk of lines (see details)
header
does the file have a header
skip
number of lines to skip before reading
recordEndRegex
an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)
cl
a "cluster" object to be used for parallel processing, created using makeCluster
Details
The function fn should have one argument, which should expect to receive a vector of strings, each
element of which is a line in the file. It is also possible for fn to take two arguments, in which case
the second argument is the header line from the file (some parsing methods might need to know the
header).
Examples
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)
a <- readTextFileByChunk(csvFile,
output = myoutput, linesPerBlock = 10,
fn = function(x, header) {
colNames <- strsplit(header, ",")[[1]]
read.csv(textConnection(paste(x, collapse = "\n")), col.names = colNames, header = FALSE)
})
a[[1]]
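As the Details note, fn may also take a single argument when the header is not needed; a sketch that skips the header line instead (header and skip behavior assumed from the argument descriptions above):
myoutput2 <- localDiskConn(file.path(tempdir(), "irisTextNoHeader"), autoYes = TRUE)
b <- readTextFileByChunk(csvFile,
  output = myoutput2, linesPerBlock = 10, header = FALSE, skip = 1,
  fn = function(x) {
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  })
b[[1]]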
recombine
Recombine
Description
Apply an analytic recombination method to a ddo/ddf object and combine the results
Usage
recombine(data, combine = NULL, apply = NULL, output = NULL,
overwrite = FALSE, params = NULL, packages = NULL, control = NULL,
verbose = TRUE)
Arguments
data
an object of class "ddo" of "ddf"
combine
the method to combine the results. See, for example, combCollect, combDdf,
combDdo, combRbind, etc. If combine = NULL, combCollect will be used if
output = NULL and combDdo is used if output is specified.
apply
a function specifying the analytic method to apply to each subset, or a predefined apply function (see drBLB, drGLM, for example). NOTE: This argument
is now deprecated in favor of addTransform
output
a "kvConnection" object indicating where the output data should reside (see
localDiskConn, hdfsConn). If NULL (default), output will be an in-memory
"ddo" object
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup"
to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in the distributed
computing (most should be taken care of automatically such that this is rarely
necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be
taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose
logical - print messages about what is being done
Value
Depends on combine: this could be a distributed data object, a data frame, a key-value list, etc. See
examples.
Author(s)
Ryan Hafen
References
• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large
complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.
See Also
divide, ddo, ddf, drGLM, drBLB, combMeanCoef, combMean, combCollect, combRbind, drLapply
Examples
## in-memory example
##---------------------------------------------------------
# begin with an in-memory ddf (backed by kvMemory)
bySpecies <- divide(iris, by = "Species")
# create a function to calculate the mean for each variable
colMean <- function(x) data.frame(lapply(x, mean))
# apply the transformation
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# recombination with no 'combine' argument and no argument to output
# produces the key-value list produced by 'combCollect()'
recombine(bySpeciesTransformed)
# but we can also preserve the distributed data frame, like this:
recombine(bySpeciesTransformed, combine = combDdf)
# or we can recombine using 'combRbind()' and produce a data frame:
recombine(bySpeciesTransformed, combine = combRbind)
## local disk connection example with parallelization
##---------------------------------------------------------
# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)
# create the control object we'll pass into local disk datadr operations
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations
# create local disk connection to hold bySpecies data
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
# convert in-memory bySpecies to local-disk ddf
bySpeciesLD <- convert(bySpecies, ldConn)
# apply the transformation
bySpeciesTransformed <- addTransform(bySpeciesLD, colMean)
# recombine the data using the transformation
bySpeciesMean <- recombine(bySpeciesTransformed,
combine = combRbind, control = control)
bySpeciesMean
# remove temporary directories
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)
removeData
Remove Key-Value Pairs from a Data Connection
Description
Remove key-value pairs from a data connection
Usage
removeData(conn, keys)
Arguments
conn
a kvConnection object
keys
a list of keys indicating which k/v pairs to remove
Note
This is generally not recommended for HDFS as it writes a new file each time it is called, and can
result in more individual files than Hadoop likes to deal with.
Author(s)
Ryan Hafen
See Also
removeData, localDiskConn, hdfsConn
Examples
# connect to empty localDisk directory
conn <- localDiskConn(file.path(tempdir(), "irisSplit"), autoYes = TRUE)
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:90,])))
addData(conn, list(list("3", iris[91:110,])))
addData(conn, list(list("4", iris[111:150,])))
# represent it as a distributed data frame
irisDdf <- ddf(conn, update = TRUE)
irisDdf
# remove data for keys "1" and "2"
removeData(conn, list("1", "2"))
# look at result with updated attributes (reset = TRUE removes previous attrs)
irisDdf <- ddf(conn, reset = TRUE, update = TRUE)
irisDdf
rhipeControl
Specify Control Parameters for RHIPE Job
Description
Specify control parameters for a RHIPE job. See rhwatch for details about each of the parameters.
Usage
rhipeControl(mapred = NULL, setup = NULL, combiner = FALSE,
cleanup = NULL, orderby = "bytes", shared = NULL, jarfiles = NULL,
zips = NULL, jobname = "")
Arguments
mapred, setup, combiner, cleanup, orderby, shared, jarfiles, zips, jobname
arguments to rhwatch in RHIPE
Examples
## Not run:
# input data on HDFS
d <- ddf(hdfsConn("/path/to/big/data/on/hdfs"))
# set RHIPE / Hadoop parameters
# buffer sizes control how many k/v pairs are sent to map / reduce tasks at a time
# mapred.reduce.tasks is a Hadoop config parameter that controls # of reduce tasks
rhctl <- rhipeControl(mapred = list(
  rhipe_map_buff_size = 10000,
  mapred.reduce.tasks = 72,
  rhipe_reduce_buff_size = 1))
# divide input data using these control parameters
divide(d, by = "var", output = hdfsConn("/path/to/output"), control = rhctl)
## End(Not run)
rrDiv
Random Replicate Division
Description
Specify random replicate division parameters for data division
Usage
rrDiv(nrows = NULL, seed = NULL)
Arguments
nrows
number of rows each subset should have
seed
the random seed to use (experimental)
Details
The random replicate division method currently gets the total number of rows of the input data and
divides it by nrows to get the number of subsets. Then it randomly assigns each row of the input data
to one of the subsets, resulting in subsets with approximately nrows rows. A future implementation
will make each subset have exactly nrows rows.
Value
a list to be used for the "by" argument to divide
Author(s)
Ryan Hafen
References
• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large
complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.
See Also
divide, recombine, condDiv
Examples
# divide iris data into random subsets with ~20 records per subset
irisRR <- divide(iris, by = rrDiv(20), update = TRUE)
irisRR
# look at the actual distribution of number of rows per subset
plot(splitRowDistn(irisRR))
setupTransformEnv
Set up transformation environment
Description
This is called internally in the map phase of datadr MapReduce jobs. It is not meant for use outside
of there, but is exported for convenience. Given an environment and collection of transformations,
it populates the environment with the global variables in the transformations.
Usage
setupTransformEnv(transFns, env = NULL)
Arguments
transFns
from the "transforms" attribute of a ddo object
env
the environment in which to evaluate the transformations
Examples
# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies
# Note a transformation is not present in the attributes
names(attributes(bySpecies))
## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------
# Create a function that will calculate the mean of each variable
# in a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))
# Test on a subset
colMean(bySpecies[[1]][[2]])
# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed
# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))
# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]
# The transformation is automatically applied when calling any data
# operation. For example, if we call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans
## A transform that operates on both keys and values
##---------------------------------------------------------
# We can also create a transformation that uses both the keys and values
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
newKey <- paste(key, "firstRow", sep = "-")
newVal <- val[1,]
kvPair(newKey, newVal)
}
# Apply the transformation
recombine(addTransform(bySpecies, aTransform))
to_ddf
Convert dplyr grouped_df to ddf
Description
Convert dplyr grouped_df to ddf
Usage
to_ddf(x)
Arguments
x
a grouped_df object from dplyr
Examples
## Not run:
library(dplyr)
bySpecies <- iris %>%
group_by(Species) %>%
to_ddf()
## End(Not run)
updateAttributes
Update Attributes of a ’ddo’ or ’ddf’ Object
Description
Update attributes of a ’ddo’ or ’ddf’ object
Usage
updateAttributes(obj, control = NULL)
Arguments
obj
an object of class ’ddo’ or ’ddf’
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl
Details
This function looks for missing attributes related to a ddo or ddf (distributed data object or data
frame) object and runs MapReduce to update them. These attributes include "splitSizeDistn",
"keys", "nDiv", "nRow", and "splitRowDistn". These attributes are useful for subsequent computations that might rely on them. The result is the input modified to reflect the updated attributes,
and thus it should be used as obj <- updateAttributes(obj).
Value
an object of class ’ddo’ or ’ddf’
Author(s)
Ryan Hafen
References
Bennett, Janine, et al. "Numerically stable, single-pass, parallel statistics algorithms." Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.
See Also
ddo, ddf, divide
Examples
d <- divide(iris, by = "Species")
# some attributes are missing:
d
summary(d)
d <- updateAttributes(d)
# now all attributes are available:
d
summary(d)
%>%
Pipe data
Description
Pipe data from one datadr operation to another
Usage
lhs %>% rhs
Arguments
lhs
a data object
rhs
a function to apply to the data
Note
Pipes are great if you know the exact sequence of operations you would like to perform and fully
understand what the intermediate data structures will look like. But often with very large or complex
data sets it can be a good idea to do each step independently and examine the intermediate results.
This can be better for debugging, particularly when each step takes some time to evaluate.
Examples
# Suppose we wish to do the following:
bySpecies <- divide(iris, by = "Species")
bySpeciesTransformed <- addTransform(bySpecies, function(x) mean(x$Sepal.Length))
recombine(bySpeciesTransformed, combine = combRbind)
# We can do it more concisely using the pipe: '%>%'
divide(iris, by = "Species") %>%
addTransform(function(x) mean(x$Sepal.Length)) %>%
recombine(combRbind)