Package 'datadr'                                            October 2, 2016

Type Package
Title Divide and Recombine for Large, Complex Data
Version 0.8.6
Date 2016-09-22
Maintainer Ryan Hafen <[email protected]>
Description Methods for dividing data into subsets, applying analytical methods
      to the subsets, and recombining the results. Comes with a generic
      MapReduce interface as well. Works with key-value pairs stored in memory,
      on local disk, or on HDFS, in the latter case using the R and Hadoop
      Integrated Programming Environment (RHIPE).
License BSD_3_clause + file LICENSE
URL http://deltarho.org/docs-datadr
LazyLoad yes
LazyData yes
NeedsCompilation no
Imports data.table (>= 1.9.6), digest, codetools, hexbin, parallel, magrittr,
      dplyr, methods
Suggests testthat (>= 0.11.0), roxygen2 (>= 5.0.1), Rhipe
RoxygenNote 5.0.1
Additional_repositories http://ml.stat.purdue.edu/packages
Author Ryan Hafen [aut, cre], Landon Sego [ctb]
Repository CRAN
Date/Publication 2016-10-02 15:51:50

R topics documented:

    datadr-package
    addData
    addTransform
    adult
    applyTransform
    as.data.frame.ddf
    as.list.ddo
    bsv
    charFileHash
    combCollect
    combDdf
    combDdo
    combMean
    combMeanCoef
    combRbind
    condDiv
    convert
    ddf
    ddf-accessors
    ddo
    ddo-ddf-accessors
    ddo-ddf-attributes
    digestFileHash
    divide
    divide-internals
    drAggregate
    drBLB
    drFilter
    drGetGlobals
    drGLM
    drHexbin
    drJoin
    drLapply
    drLM
    drPersist
    drQuantile
    drRead.table
    drSample
    drSubset
    flatten
    getCondCuts
    getSplitVar
    hdfsConn
    kvApply
    kvPair
    kvPairs
    localDiskConn
    localDiskControl
    makeExtractable
    mr-summary-stats
    mrExec
    print.ddo
    print.kvPair
    print.kvValue
    readHDFStextFile
    readTextFileByChunk
    recombine
    removeData
    rhipeControl
    rrDiv
    setupTransformEnv
    to_ddf
    updateAttributes
    %>%
    Index
datadr-package          datadr

Description

datadr: Divide and Recombine for Large, Complex Data

Details

http://deltarho.org/docs-datadr/

Author(s)

Ryan Hafen

Maintainer: Ryan Hafen <[email protected]>

Examples

help(package = datadr)

addData                 Add Key-Value Pairs to a Data Connection

Description

Add key-value pairs to a data connection

Usage

addData(conn, data, overwrite = FALSE)

Arguments

conn       a kvConnection object
data       a list of key-value pairs (a list of lists, where each sub-list has
           two elements: the key and the value)
overwrite  if data with the same key is already present, should it be
           overwritten? (does not work for HDFS connections)

Note

This is generally not recommended for HDFS, as it writes a new file each time it
is called and can result in more individual files than Hadoop handles well.

Author(s)

Ryan Hafen

See Also

removeData, localDiskConn, hdfsConn

Examples

## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)

addTransform            Add a Transformation Function to a Distributed Data Object

Description

Add a transformation function to be applied to each subset of a distributed data
object

Usage

addTransform(obj, fn, name = NULL, params = NULL, packages = NULL)

Arguments

obj       a distributed data object
fn        a function to be applied to each subset of obj - see details
name      optional name of the transformation
params    a named list of objects external to obj that are needed in the
          transformation function (most should be taken care of automatically
          such that this is rarely necessary to specify)
packages  a vector of R package names that contain functions used in fn (most
          should be taken care of automatically such that this is rarely
          necessary to specify)

Details

When you add a transformation to a distributed data object, the transformation
is not applied immediately; it is deferred until a function that kicks off a
computation is called. These include divide, recombine, drJoin, drLapply,
drFilter, drSample, and drSubset. When any of these is invoked on an object with
a transformation attached, the transformation is applied in the map phase of the
MapReduce computation, prior to any other computation. The transformation is
also applied any time a subset of the data is requested. Thus, although the data
has not been physically transformed after a call to addTransform, we can think
of it conceptually as already transformed. To force the transformation to be
computed immediately on all subsets, use: drPersist(dat, output = ...).

The function provided by fn can accept either one or two parameters. If it
accepts one parameter, it is passed the value of a key-value pair. If it accepts
two parameters, it is passed the key as the first parameter and the value as the
second. The return value of fn is treated as the value of a key-value pair
unless it is constructed with kvPair.

When addTransform is called, the function is tested on a subset of the data to
make sure all of the global variables and packages necessary to portably perform
the transformation are available.
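To make the two accepted signatures concrete, here is a minimal illustrative
sketch (not from the package documentation; it assumes a 'ddf' of data frames
like the one created in the examples below):

# a one-parameter function receives only the value of each key-value pair
valOnly <- function(v) nrow(v)

# a two-parameter function receives the key and the value, and may return
# a complete key-value pair via kvPair()
keyVal <- function(k, v) kvPair(paste0(k, "-n"), nrow(v))

bySpecies <- divide(iris, by = "Species")
addTransform(bySpecies, valOnly)[[1]]
addTransform(bySpecies, keyVal)[[1]]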
It is possible to add multiple transformations to a distributed data object, in
which case they are applied in the order supplied, though typically only one
transform should be necessary.

The transformation function must not return NULL on any data subset, although it
can return an empty object of the correct shape to match other subsets (e.g. a
data.frame with the correct columns but zero rows).

Value

The distributed data object provided by obj, with the transformation included as
one of the attributes of the returned object.

Examples

# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies

# Note a transformation is not present in the attributes
names(attributes(bySpecies))

## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------

# Create a function that will calculate the mean of each variable in
# a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))

# Test on a subset
colMean(bySpecies[[1]][[2]])

# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)

# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed

# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))

# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]

# The transformation is automatically applied when calling any data
# operation. For example, if we call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans

## A transform that operates on both keys and values
##----------------------------------------------------------------

# We can also create a transformation that uses both the keys and values.
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
  newKey <- paste(key, "firstRow", sep = "-")
  newVal <- val[1,]
  kvPair(newKey, newVal)
}

# Apply the transformation
recombine(addTransform(bySpecies, aTransform))

adult                   "Census Income" Dataset

Description

"Census Income" dataset from the UCI machine learning repository

Usage

adult

Format

(From the UCI machine learning repository)

• age. continuous
• workclass. Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov,
  State-gov, Without-pay, Never-worked
• fnlwgt. continuous
• education. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm,
  Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th,
  Preschool
• educationnum. continuous
• marital. Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
  Married-spouse-absent, Married-AF-spouse
• occupation. Tech-support, Craft-repair, Other-service, Sales,
  Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
  Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
  Protective-serv, Armed-Forces
• relationship. Wife, Own-child, Husband, Not-in-family, Other-relative,
  Unmarried
• race. White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
• sex. Female, Male
• capgain. continuous
• caploss. continuous
• hoursperweek. continuous
• nativecountry.
  United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
  Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran,
  Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal,
  Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
  Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador,
  Trinadad&Tobago, Peru, Hong, Holand-Netherlands
• income. <=50K, >50K
• incomebin. 0 if income <=50K, 1 if income >50K

Source

(From the UCI machine learning repository)

Link: http://archive.ics.uci.edu/ml/datasets/Adult

Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon
Graphics. E-mail [email protected] for questions.

Data Set Information: Extraction was done by Barry Becker from the 1994 Census
database. A set of reasonably clean records was extracted using the following
conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

References

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School
of Information and Computer Science.

applyTransform          Apply transformation function(s)

Description

This is called internally in the map phase of datadr MapReduce jobs. It is not
meant for use outside of there, but is exported for convenience.

Usage

applyTransform(transFns, x, env = NULL)

Arguments

transFns  from the "transforms" attribute of a ddo object
x         a subset of the object
env       the environment in which to evaluate the function (should be
          instantiated by calling setupTransformEnv) - if NULL, the environment
          will be set up for you

Examples

# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies

# Note a transformation is not present in the attributes
names(attributes(bySpecies))

## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------

# Create a function that will calculate the mean of each variable in
# a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))

# Test on a subset
colMean(bySpecies[[1]][[2]])

# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)

# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed

# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))

# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]

# The transformation is automatically applied when calling any data
# operation.
# For example, if we call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans

## A transform that operates on both keys and values
##----------------------------------------------------------------

# We can also create a transformation that uses both the keys and values.
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
  newKey <- paste(key, "firstRow", sep = "-")
  newVal <- val[1,]
  kvPair(newKey, newVal)
}

# Apply the transformation
recombine(addTransform(bySpecies, aTransform))

as.data.frame.ddf       Turn 'ddf' Object into Data Frame

Description

Rbind all the rows of a 'ddf' object into a single data frame

Usage

## S3 method for class 'ddf'
as.data.frame(x, row.names = NULL, optional = FALSE, keys = TRUE,
  splitVars = TRUE, bsvs = FALSE, ...)

Arguments

x          a 'ddf' object
row.names  passed to as.data.frame
optional   passed to as.data.frame
keys       should the key be added as a variable in the resulting data frame?
           (if the key is not a character, it will be replaced with an md5 hash)
splitVars  should the values of the splitVars be added as variables in the
           resulting data frame?
bsvs       should the values of bsvs be added as variables in the resulting
           data frame?
...        additional arguments passed to as.data.frame

Examples

d <- divide(iris, by = "Species")
as.data.frame(d)

as.list.ddo             Turn 'ddo' / 'ddf' Object into a list

Description

Turn a 'ddo' / 'ddf' object into a list

Usage

## S3 method for class 'ddo'
as.list(x, ...)

Arguments

x    a 'ddo' / 'ddf' object
...  additional arguments passed to as.list

Examples

d <- divide(iris, by = "Species")
as.list(d)

bsv                     Construct Between Subset Variable (BSV)

Description

Construct a between subset variable (BSV), or, for a given key-value pair, get
a BSV variable value by name (if present)

Usage

bsv(val = NULL, desc = "")

getBsv(x, name)

getBsvs(x)

Arguments

val   a scalar character, numeric, or date
desc  a character string describing the BSV
x     a key-value pair or a value
name  the name of the BSV to get

Details

bsv should be called inside the bsvFn argument to divide, and is used for
constructing a BSV list for each subset of a division.

Author(s)

Ryan Hafen

See Also

divide, getBsvs, bsvInfo

Examples

irisDdf <- ddf(iris)
bsvFn <- function(dat) {
  list(
    meanSL = bsv(mean(dat$Sepal.Length), desc = "mean sepal length"),
    meanPL = bsv(mean(dat$Petal.Length), desc = "mean petal length")
  )
}

# divide the data by species
bySpecies <- divide(irisDdf, by = "Species", bsvFn = bsvFn)
# see BSV info attached to the result
bsvInfo(bySpecies)
# get BSVs for a specified subset of the division
getBsvs(bySpecies[[1]])

# retrieve a BSV by name from a subset's value
d <- divide(iris, by = "Species",
  bsvFn = function(x) list(msl = bsv(mean(x$Sepal.Length))))
getBsvs(d[[1]]$value)
getBsv(d[[1]]$value, "msl")

charFileHash            Character File Hash Function

Description

Function to be used to specify the file where key-value pairs get stored for
local disk connections, useful when keys are scalar strings. Should be passed
as the argument fileHashFn to localDiskConn.

Usage

charFileHash(keys, conn)

Arguments

keys  keys to be hashed
conn  a "localDiskConn" object

Details

You shouldn't need to call this directly other than to experiment with what the
output looks like or to get ideas on how to write your own custom hash.
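For instance, a custom file hash function might look something like the
following - a hypothetical sketch, assuming only that the function maps a
vector of keys to one file name per key, as charFileHash and digestFileHash do:

# bucket keys into files by their first character (hypothetical scheme)
myFileHash <- function(keys, conn) {
  paste0("bucket_", tolower(substring(as.character(keys), 1, 1)), ".Rdata")
}
## Not run:
conn <- localDiskConn(tempfile(), autoYes = TRUE, fileHashFn = myFileHash)
## End(Not run)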
Author(s)

Ryan Hafen

See Also

localDiskConn, digestFileHash

Examples

# connect to empty localDisk directory
path <- file.path(tempdir(), "irisSplit")
unlink(path, recursive = TRUE)
conn <- localDiskConn(path, autoYes = TRUE, fileHashFn = charFileHash)
# add some data
addData(conn, list(list("key1", iris[1:10,])))
addData(conn, list(list("key2", iris[11:110,])))
addData(conn, list(list("key3", iris[111:150,])))
# see that files were stored by their key
list.files(path)

combCollect             "Collect" Recombination

Description

"Collect" recombination - collect the results into a local list of key-value
pairs

Usage

combCollect(...)

Arguments

...  additional list elements that will be added to the returned object

Details

combCollect is passed to the argument combine in recombine

Author(s)

Ryan Hafen

See Also

divide, recombine, combDdo, combDdf, combMeanCoef, combRbind, combMean

Examples

# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")

# Function to calculate the mean of the petal widths
meanPetal <- function(x) mean(x$Petal.Width)

# Combine the results using combCollect
combined <- recombine(addTransform(bySpecies, meanPetal),
  combine = combCollect)
class(combined)
combined

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(meanPetal) %>%
  recombine(combCollect)

combDdf                 "DDF" Recombination

Description

"DDF" recombination - results into a "ddf" object, rbinding if necessary

Usage

combDdf(...)

Arguments

...  additional attributes to define the combiner (currently only used
     internally)

Details

combDdf is passed to the argument combine in recombine. If the value of the
"ddo" object that will be recombined is a list, then the elements in the list
will be collapsed together via rbind.

Author(s)

Ryan Hafen

See Also

divide, recombine, combCollect, combMeanCoef, combRbind, combDdo

Examples

# Divide the iris data
bySpecies <- divide(iris, by = "Species")

## Simple combination to form a ddf
##----------------------------------------------------------------

# Add a transform that selects the petal width and length variables
selVars <- function(x) x[,c("Petal.Width", "Petal.Length")]

# Apply the transform and combine using combDdf
combined <- recombine(addTransform(bySpecies, selVars), combine = combDdf)
combined
combined[[1]]

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(selVars) %>%
  recombine(combDdf)

## Combination that involves rbinding to give the ddf
##----------------------------------------------------------------

# A transformation that returns a list
listTrans <- function(x) {
  list(meanPetalWidth = mean(x$Petal.Width),
       maxPetalLength = max(x$Petal.Length))
}

# Apply the transformation and look at the result
bySpeciesTran <- addTransform(bySpecies, listTrans)
bySpeciesTran[[1]]

# And if we rbind the "value" of the first subset:
out1 <- rbind(bySpeciesTran[[1]]$value)
out1

# Note how the combDdf method row binds the two data frames
combined <- recombine(bySpeciesTran, combine = combDdf)
out2 <- combined[[1]]
out2

# These are equivalent
identical(out1, out2$value)

combDdo                 "DDO" Recombination

Description

"DDO" recombination - simply collect the results into a "ddo" object

Usage

combDdo(...)

Arguments
...  additional attributes to define the combiner (currently only used
     internally)

Details

combDdo is passed to the argument combine in recombine

Author(s)

Ryan Hafen

See Also

divide, recombine, combCollect, combMeanCoef, combRbind, combMean

Examples

# Divide the iris data
bySpecies <- divide(iris, by = "Species")

# Add a transform that returns a list for each subset
listTrans <- function(x) {
  list(meanPetalWidth = mean(x$Petal.Width),
       maxPetalLength = max(x$Petal.Length))
}

# Apply the transform and combine using combDdo
combined <- recombine(addTransform(bySpecies, listTrans), combine = combDdo)
combined
combined[[1]]

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(listTrans) %>%
  recombine(combDdo)

combMean                Mean Recombination

Description

Mean recombination - calculate the elementwise mean of a vector in each value

Usage

combMean(...)

Arguments

...  additional attributes to define the combiner (currently only used
     internally)

Details

combMean is passed to the argument combine in recombine.

This method assumes that the values of the key-value pairs each consist of a
numeric vector (with the same length). The mean is calculated elementwise
across all the keys.

Author(s)

Ryan Hafen

See Also

divide, recombine, combCollect, combDdo, combDdf, combRbind, combMeanCoef

Examples

# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")

# Add a transformation that returns a vector of sums for each subset, one
# sum for each variable
bySpeciesTrans <- addTransform(bySpecies, function(x) apply(x, 2, sum))
bySpeciesTrans[[1]]

# Calculate the elementwise mean of the vector of sums produced by
# the transform, across the keys
out1 <- recombine(bySpeciesTrans, combine = combMean)
out1

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(function(x) apply(x, 2, sum)) %>%
  recombine(combMean)

# This manual, non-datadr approach illustrates the above computation

# This step mimics the transformation above
sums <- aggregate(. ~ Species, data = iris, sum)
sums

# And this step mimics the mean recombination
out2 <- apply(sums[,-1], 2, mean)
out2

# These are the same
identical(out1, out2)

combMeanCoef            Mean Coefficient Recombination

Description

Mean coefficient recombination - calculate the weighted average of parameter
estimates for a model fit to each subset

Usage

combMeanCoef(...)

Arguments

...  additional attributes to define the combiner (currently only used
     internally)

Details

combMeanCoef is passed to the argument combine in recombine.

This method is designed to calculate the mean of each model coefficient, where
the same model has been fit to subsets via a transformation. The mean is a
weighted average of each coefficient, where the weights are the number of
observations in each subset. In particular, the drLM and drGLM functions should
be used to add the transformation to the ddo that will be recombined using
combMeanCoef.

Author(s)

Ryan Hafen

See Also

divide, recombine, rrDiv, combCollect, combDdo, combDdf, combRbind, combMean

Examples

# Create an irregular number of observations for each species
indexes <- sort(c(sample(1:50, 40), sample(51:100, 37), sample(101:150, 46)))
irisIrr <- iris[indexes,]

# Create a distributed data frame using the irregular iris data set
bySpecies <- divide(irisIrr, by = "Species")

# Fit a linear model of Sepal.Length vs.
# Sepal.Width for each species using 'drLM()' (or we could have used
# 'drGLM()' for a generalized linear model)
lmTrans <- function(x) drLM(Sepal.Length ~ Sepal.Width, data = x)
bySpeciesFit <- addTransform(bySpecies, lmTrans)

# Average the coefficients from the linear model fits of each species,
# weighted by the number of observations in each species
out1 <- recombine(bySpeciesFit, combine = combMeanCoef)
out1

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(lmTrans) %>%
  recombine(combMeanCoef)

# The following illustrates an equivalent, but more tedious approach
lmTrans2 <- function(x)
  t(c(coef(lm(Sepal.Length ~ Sepal.Width, data = x)), n = nrow(x)))
res <- recombine(addTransform(bySpecies, lmTrans2), combine = combRbind)
colnames(res) <- c("Species", "Intercept", "Sepal.Width", "n")
res
out2 <- c("(Intercept)" = with(res, sum(Intercept * n) / sum(n)),
          "Sepal.Width" = with(res, sum(Sepal.Width * n) / sum(n)))

# These are the same
identical(out1, out2)

combRbind               "rbind" Recombination

Description

"rbind" recombination - combine ddf divisions by row binding

Usage

combRbind(...)

Arguments

...  additional attributes to define the combiner (currently only used
     internally)

Details

combRbind is passed to the argument combine in recombine

Author(s)

Ryan Hafen

See Also

divide, recombine, combDdo, combDdf, combCollect, combMeanCoef, combMean

Examples

# Create a distributed data frame using the iris data set
bySpecies <- divide(iris, by = "Species")

# Create a function that will calculate the standard deviation of each
# variable in a subset. The calls to 'as.data.frame()' and 't()'
# convert the vector output of 'apply()' into a data.frame with a single row
sdCol <- function(x) as.data.frame(t(apply(x, 2, sd)))

# Combine the results using rbind
combined <- recombine(addTransform(bySpecies, sdCol), combine = combRbind)
class(combined)
combined

# A more concise (and readable) way to do it
bySpecies %>%
  addTransform(sdCol) %>%
  recombine(combRbind)

condDiv                 Conditioning Variable Division

Description

Specify conditioning variable division parameters for data division

Usage

condDiv(vars)

Arguments

vars  a character string or vector of character strings specifying the
      variables of the input data across which to divide

Details

Currently each unique combination of values of vars constitutes a subset. In
the future, specifying shingles for numeric conditioning variables will be
implemented.

Value

a list to be used for the "by" argument to divide

Author(s)

Ryan Hafen

References

• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S.
  (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat,
  1(1), 53-67.

See Also

divide, getSplitVars, getSplitVar

Examples

d <- divide(iris, by = "Species")
# equivalent:
d <- divide(iris, by = condDiv("Species"))

convert                 Convert 'ddo' / 'ddf' Objects

Description

Convert 'ddo' / 'ddf' objects between different storage backends

Usage

convert(from, to, overwrite = FALSE)

Arguments

from       a 'ddo' or 'ddf' object
to         a 'kvConnection' object (created with localDiskConn or hdfsConn) or
           NULL if an in-memory 'ddo' / 'ddf' is desired
overwrite  should the data in the location pointed to by 'to' be overwritten?
Examples

d <- divide(iris, by = "Species")
# convert in-memory ddf to one stored on disk
dl <- convert(d, localDiskConn(tempfile(), autoYes = TRUE))
dl

ddf                     Instantiate a Distributed Data Frame ('ddf')

Description

Instantiate a distributed data frame ('ddf')

Usage

ddf(conn, transFn = NULL, update = FALSE, reset = FALSE, control = NULL,
  verbose = TRUE)

Arguments

conn     an object pointing to where data is or will be stored for the 'ddf'
         object - can be a 'kvConnection' object created from localDiskConn or
         hdfsConn, or a data frame or list of key-value pairs
transFn  a function to be applied to the key-value pairs of this data prior to
         doing any processing, that transforms the data into a data frame if
         it is not stored as such
update   should the attributes of this object be updated? See updateAttributes
         for more details.
reset    should all persistent metadata about this object be removed and the
         object created from scratch? This setting does not affect data stored
         in the connection location.
control  parameters specifying how the backend should handle things if
         attributes are updated (most-likely parameters to rhwatch in RHIPE) -
         see rhipeControl and localDiskControl
verbose  logical - print messages about what is being done

Examples

# in-memory ddf
d <- ddf(iris)
d

# local disk ddf
conn <- localDiskConn(tempfile(), autoYes = TRUE)
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
dl <- ddf(conn)
dl

# hdfs ddf (requires RHIPE / Hadoop)
## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/tmp/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10,])))
addData(conn, list(list("2", iris[11:110,])))
addData(conn, list(list("3", iris[111:150,])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)

ddf-accessors           Accessor methods for 'ddf' objects

Description

Accessor methods for 'ddf' objects

Usage

splitRowDistn(x)

## S3 method for class 'ddo'
summary(object, ...)

## S3 method for class 'ddf'
summary(object, ...)

nrow(x)

NROW(x)

ncol(x)

NCOL(x)

## S4 method for signature 'ddf'
nrow(x)

## S4 method for signature 'ddf'
NROW(x)

## S4 method for signature 'ddf'
ncol(x)

## S4 method for signature 'ddf'
NCOL(x)

## S3 method for class 'ddf'
names(x)

Arguments

x       a 'ddf' object
object  a 'ddf'/'ddo' object
...     additional arguments

Examples

d <- divide(iris, by = "Species", update = TRUE)
nrow(d)
ncol(d)
length(d)
names(d)
summary(d)
getKeys(d)

ddo                     Instantiate a Distributed Data Object ('ddo')

Description

Instantiate a distributed data object ('ddo')

Usage

ddo(conn, update = FALSE, reset = FALSE, control = NULL, verbose = TRUE)

Arguments

conn     an object pointing to where data is or will be stored for the 'ddo'
         object - can be a 'kvConnection' object created from localDiskConn or
         hdfsConn, or a data frame or list of key-value pairs
update   should the attributes of this object be updated? See updateAttributes
         for more details.
reset    should all persistent metadata about this object be removed and the
         object created from scratch? This setting does not affect data stored
         in the connection location.
control  parameters specifying how the backend should handle things if
         attributes are updated (most-likely parameters to rhwatch in RHIPE) -
         see rhipeControl and localDiskControl
verbose  logical - print messages about what is being done

Examples

kv <- kvPairs(kvPair(1, letters), kvPair(2, rnorm(100)))
kvddo <- ddo(kv)
kvddo

ddo-ddf-accessors       Accessor Functions

Description

Accessor functions for attributes of ddo/ddf objects. Methods also include
nrow and ncol for ddf objects.

Usage

kvExample(x)

bsvInfo(x)

counters(x)

splitSizeDistn(x)

getKeys(x)

hasExtractableKV(x)

## S3 method for class 'ddo'
length(x)

Arguments

x  a 'ddf'/'ddo' object

Examples

d <- divide(iris, by = "Species", update = TRUE)
nrow(d)
ncol(d)
length(d)
names(d)
summary(d)
getKeys(d)

ddo-ddf-attributes      Managing attributes of 'ddo' or 'ddf' objects

Description

These are called internally in various datadr functions. They are not meant for
use outside of there, but are exported for convenience, and can be useful for
better understanding ddo/ddf objects.

Usage

setAttributes(obj, attrs)

## S3 method for class 'ddf'
setAttributes(obj, attrs)

## S3 method for class 'ddo'
setAttributes(obj, attrs)

getAttribute(obj, attrName)

getAttributes(obj, attrNames)

## S3 method for class 'ddf'
getAttributes(obj, attrNames)

## S3 method for class 'ddo'
getAttributes(obj, attrNames)

hasAttributes(obj, ...)

## S3 method for class 'ddf'
hasAttributes(obj, attrNames)

Arguments

obj        'ddo' or 'ddf' object
attrs      a named list of attributes to set
attrName   name of the attribute to get
attrNames  vector of names of the attributes to get
...        additional arguments

Examples

d <- divide(iris, by = "Species")
getAttribute(d, "keys")

digestFileHash          Digest File Hash Function

Description

Function to be used to specify the file where key-value pairs get stored for
local disk connections, useful when keys are arbitrary objects. File names are
determined using an md5 hash of the object. This is the default argument for
fileHashFn in localDiskConn.

Usage

digestFileHash(keys, conn)

Arguments

keys  keys to be hashed
conn  a "localDiskConn" object

Details

You shouldn't need to call this directly other than to experiment with what the
output looks like or to get ideas on how to write your own custom hash.
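As an illustration of the kind of hash involved, the digest package (which
datadr imports) computes md5 hashes like the ones used to build the file names;
the exact file names also involve storage conventions such as file extensions,
so treat this as approximate:

# md5 hash of a key, as computed by the digest package
digest::digest("key1", algo = "md5")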
Author(s)

Ryan Hafen

See Also

localDiskConn, charFileHash

Examples

# connect to empty localDisk directory
path <- file.path(tempdir(), "irisSplit")
unlink(path, recursive = TRUE)
conn <- localDiskConn(path, autoYes = TRUE, fileHashFn = digestFileHash)
# add some data
addData(conn, list(list("key1", iris[1:10,])))
addData(conn, list(list("key2", iris[11:110,])))
addData(conn, list(list("key3", iris[111:150,])))
# see the files, which are named using md5 hashes of the keys
list.files(path)

divide                  Divide a Distributed Data Object

Description

Divide a ddo/ddf object into subsets based on different criteria

Usage

divide(data, by = NULL, spill = 1000000, filterFn = NULL, bsvFn = NULL,
  output = NULL, overwrite = FALSE, preTransFn = NULL, postTransFn = NULL,
  params = NULL, packages = NULL, control = NULL, update = FALSE,
  verbose = TRUE)

Arguments

data         an object of class "ddf" or "ddo" - in the latter case, you need
             to specify preTransFn to coerce each subset into a data frame
by           specification of how to divide the data - conditional
             (factor-level or shingles), random replicate, or near-exact
             replicate (to come) - see details
spill        integer telling the division method how many lines of data should
             be collected until spilling over into a new key-value pair
filterFn     a function that is applied to each candidate output key-value pair
             to determine whether it should be (if it returns TRUE) part of the
             resulting division
bsvFn        a function to be applied to each subset that returns a list of
             between subset variables (BSVs)
output       a "kvConnection" object indicating where the output data should
             reside (see localDiskConn, hdfsConn). If NULL (default), output
             will be an in-memory "ddo" object.
overwrite    logical; should existing output location be overwritten? (also can
             specify overwrite = "backup" to move the existing output to _bak)
preTransFn   a transformation function (if desired) to be applied to each
             subset prior to division - note: this is deprecated - instead use
             addTransform prior to calling divide
postTransFn  a transformation function (if desired) to apply to each
             post-division subset
params       a named list of objects external to the input data that are needed
             in the distributed computing (most should be taken care of
             automatically such that this is rarely necessary to specify)
packages     a vector of R package names that contain functions used in fn
             (most should be taken care of automatically such that this is
             rarely necessary to specify)
control      parameters specifying how the backend should handle things
             (most-likely parameters to rhwatch in RHIPE) - see rhipeControl
             and localDiskControl
update       should a MapReduce job be run to obtain additional attributes for
             the result data prior to returning?
verbose      logical - print messages about what is being done

Details

The division methods this function will support include conditioning variable
division for factors (implemented - see condDiv), conditioning variable
division for numerical variables through shingles, random replicate
(implemented - see rrDiv), and near-exact replicate. If by is a vector of
variable names, the data will be divided by these variables. Alternatively,
this can be specified by e.g. condDiv(c("var1", "var2")).

Value

an object of class "ddf" if the resulting subsets are data frames. Otherwise,
an object of class "ddo".

Author(s)

Ryan Hafen

References

• http://deltarho.org
• Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S.
  (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat,
  1(1), 53-67.
See Also

recombine, ddo, ddf, condDiv, rrDiv

Examples

# divide iris data by Species by passing in a data frame
bySpecies <- divide(iris, by = "Species")
bySpecies

# divide iris data into random partitioning of ~30 rows per subset
irisRR <- divide(iris, by = rrDiv(30))
irisRR

# any ddf can be passed into divide:
irisRR2 <- divide(bySpecies, by = rrDiv(30))
irisRR2

bySpecies2 <- divide(irisRR2, by = "Species")
bySpecies2

# splitting on multiple columns
byEdSex <- divide(adult, by = c("education", "sex"))
byEdSex
byEdSex[[1]]

# splitting on a numeric variable
bySL <- ddf(iris) %>%
  addTransform(function(x) {
    x$slCut <- cut(x$Sepal.Length, 10)
    x
  }) %>%
  divide(by = "slCut")
bySL
bySL[[1]]

divide-internals        Functions used in divide()

Description

Functions used in divide()

Usage

dfSplit(curDF, by, seed)

addSplitAttrs(curSplit, bsvFn, by, postTransFn = NULL)

Arguments

curDF, seed                          arguments
curSplit, bsvFn, by, postTransFn     arguments

Note

These functions can be ignored. They are only exported to make their use in a
distributed setting more convenient.

drAggregate             Division-Agnostic Aggregation

Description

Aggregates data by cross-classifying factors, with a formula interface similar
to xtabs

Usage

drAggregate(data, formula, by = NULL, output = NULL, preTransFn = NULL,
  maxUnique = NULL, params = NULL, packages = NULL, control = NULL)

Arguments

data        a "ddf" containing the variables in the formula
formula     a formula object with the cross-classifying variables (separated
            by +) on the right hand side (or an object which can be coerced to
            a formula). Interactions are not allowed. On the left hand side,
            one may optionally give a variable name in the data representing
            counts; in the latter case, the columns are interpreted as
            corresponding to the levels of a variable. This is useful if the
            data have already been tabulated.
by          an optional variable name or vector of variable names by which to
            split up tabulations (i.e. tabulate independently inside of each
            unique "by" variable value). The only difference between specifying
            "by" and placing the variable(s) in the right hand side of the
            formula is how the computation is done and how the result is
            returned.
output      a "kvConnection" object indicating where the output data should
            reside in the case of by being specified (see localDiskConn,
            hdfsConn). If NULL (default), output will be an in-memory "ddo"
            object.
preTransFn  an optional function to apply to each subset prior to performing
            tabulation. The output from this function should be a data frame
            containing variables with names that match those of the formula
            provided. Note: this is deprecated - instead use addTransform prior
            to calling divide.
maxUnique   the maximum number of unique combinations of variables to obtain
            tabulations for. This is meant to help against cases where a
            variable in the formula has a very large number of levels, to the
            point that it is not meaningful to tabulate and is too
            computationally burdensome. If NULL, it is ignored. If a positive
            number, only the top and bottom maxUnique tabulations by frequency
            are kept.
params      a named list of objects external to the input data that are needed
            in the distributed computing (most should be taken care of
            automatically such that this is rarely necessary to specify)
packages    a vector of R package names that contain functions used in fn
            (most should be taken care of automatically such that this is
            rarely necessary to specify)
control     parameters specifying how the backend should handle things
            (most-likely parameters to rhwatch in RHIPE) - see rhipeControl
            and localDiskControl

Value

a data frame of the tabulations. When "by" is specified, it is a ddf with each
key-value pair corresponding to a unique "by" value, containing a data frame of
tabulations.

Note

The interface is similar to xtabs, but instead of returning a full contingency
table, data is returned in the form of a data frame only with rows for which
there were positive counts. This result is more similar to what is returned by
aggregate.

Author(s)

Ryan Hafen

See Also

xtabs, updateAttributes

Examples

drAggregate(Sepal.Length ~ Species, data = ddf(iris))

drBLB                   Bag of Little Bootstraps Transformation Method

Description

Bag of little bootstraps transformation method

Usage

drBLB(x, statistic, metric, R, n)

Arguments

x          a subset of a ddf
statistic  a function to apply to the subset specifying the statistic to
           compute. Must have arguments 'data' and 'weights' (see details).
           Must return a vector, where each element is a statistic of interest.
metric     a function specifying the metric to be applied to the R bootstrap
           samples of each statistic returned by statistic. Expects an input
           vector and should output a vector.
R          the number of bootstrap samples
n          the total number of observations in the data

Details

It is necessary to specify weights as a parameter to the statistic function
because, for BLB to work efficiently, it must resample each time with a sample
of size n. To make this computationally possible for very large n, we can use
weights (see reference for details). Therefore, only methods with a weights
option can legitimately be used here.

Author(s)

Ryan Hafen

References

Kleiner, Ariel, et al. "A scalable bootstrap for massive data." Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 76.4 (2014):
795-816.

See Also

divide, recombine

Examples

## Not run:
# BLB is meant to run on random replicate divisions
rrAdult <- divide(adult, by = rrDiv(1000), update = TRUE)

adultBlb <- rrAdult %>% addTransform(function(x) {
  drBLB(x,
    statistic = function(x, weights)
      coef(glm(incomebin ~ educationnum + hoursperweek + sex,
        data = x, weights = weights, family = binomial())),
    metric = function(x)
      quantile(x, c(0.05, 0.95)),
    R = 100,
    n = nrow(rrAdult)
  )
})

# compute the mean of the resulting CI limits
# (this will take a little bit of time because of resampling)
coefs <- recombine(adultBlb, combMean)
matrix(coefs, ncol = 2, byrow = TRUE)
## End(Not run)

drFilter                Filter a 'ddo' or 'ddf' Object

Description

Filter a 'ddo' or 'ddf' object by selecting key-value pairs that satisfy a
logical condition

Usage

drFilter(x, filterFn, output = NULL, overwrite = FALSE, params = NULL,
  packages = NULL, control = NULL)

Arguments

x         an object of class 'ddo' or 'ddf'
filterFn  a function that takes either a key-value pair (as two arguments) or
          just a value (as a single argument) and returns either TRUE or
          FALSE - if TRUE, that key-value pair will be present in the result.
          See examples for details.
output    a "kvConnection" object indicating where the output data should
          reside (see localDiskConn, hdfsConn).
          If NULL (default), output will be an in-memory "ddo" object.
overwrite logical; should existing output location be overwritten? (also can
          specify overwrite = "backup" to move the existing output to _bak)
params    a named list of objects external to the input data that are needed
          in the distributed computing (most should be taken care of
          automatically such that this is rarely necessary to specify)
packages  a vector of R package names that contain functions used in filterFn
          (most should be taken care of automatically such that this is rarely
          necessary to specify)
control   parameters specifying how the backend should handle things
          (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and
          localDiskControl

Value

a 'ddo' or 'ddf' object

Author(s)

Ryan Hafen

See Also

drJoin, drLapply

Examples

# Create a ddf using the iris data
bySpecies <- divide(iris, by = "Species")

# Filter using only the 'value' of the key/value pair
drFilter(bySpecies, function(v) mean(v$Sepal.Width) < 3)

# Filter using both the key and value
drFilter(bySpecies,
  function(k, v) k != "Species=virginica" & mean(v$Sepal.Width) < 3)

drGetGlobals            Get Global Variables and Package Dependencies

Description

Get global variables and package dependencies for a function

Usage

drGetGlobals(f)

Arguments

f  a function

Details

This traverses the parent environments of the supplied function, finds all
global variables using findGlobals, and retrieves their values. All package
function calls are also found, and a list of required packages is also
returned.

Value

a list of variables (named by variable) and a vector of package names

Author(s)

Ryan Hafen

Examples

a <- 1
f <- function(x) x + a
drGetGlobals(f)

drGLM                   GLM Transformation Method

Description

GLM transformation method - fit a generalized linear model to each subset

Usage

drGLM(...)

Arguments

...  arguments you would pass to the glm function

Details

This provides a transformation function to be called for each subset in a
recombination MapReduce job that applies R's glm method and outputs the
coefficients in a way that combMeanCoef knows how to deal with. It can be
applied to a ddf with addTransform prior to calling recombine.

Value

an object of class drCoef that contains the glm coefficients and other data
needed by combMeanCoef

Author(s)

Ryan Hafen

See Also

divide, recombine, rrDiv

Examples

# Artificially dichotomize the Sepal.Lengths of the iris data to
# demonstrate a GLM model
irisD <- iris
irisD$Sepal <- as.numeric(irisD$Sepal.Length > median(irisD$Sepal.Length))

# Divide the data
bySpecies <- divide(irisD, by = "Species")

# A function to fit a logistic regression model to each species
logisticReg <- function(x)
  drGLM(Sepal ~ Sepal.Width + Petal.Length + Petal.Width,
    data = x, family = binomial())

# Apply the transform and combine using 'combMeanCoef'
bySpecies %>%
  addTransform(logisticReg) %>%
  recombine(combMeanCoef)

drHexbin                HexBin Aggregation for Distributed Data Frames

Description

Create a "hexbin" object of hexagonally binned data for a distributed data
frame. This computation is division agnostic - it does not matter how the data
frame is split up.

Usage

drHexbin(data, xVar, yVar, by = NULL, xTransFn = identity,
  yTransFn = identity, xRange = NULL, yRange = NULL, xbins = 30, shape = 1,
  params = NULL, packages = NULL, control = NULL)

Arguments

data                a distributed data frame
xVar, yVar          names of the variables to use
by                  an optional variable name or vector of variable names by
                    which to group hexbin computations
xTransFn, yTransFn  a transformation function to apply to the x and y variables
                    prior to binning
xRange, yRange      range of the x and y variables (can be left blank if
                    summaries have been computed)
Usage drHexbin(data, xVar, yVar, by = NULL, xTransFn = identity, yTransFn = identity, xRange = NULL, yRange = NULL, xbins = 30, shape = 1, params = NULL, packages = NULL, control = NULL) Arguments data a distributed data frame xVar, yVar names of the variables to use by an optional variable name or vector of variable names by which to group hexbin computations xTransFn, yTransFn a transformation function to apply to the x and y variables prior to binning xRange, yRange range of x and y variables (can be left blank if summaries have been computed) 36 drJoin xbins the number of bins partitioning the range of xbnds shape the shape = yheight/xwidth of the plotting regions params a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically such that this is rarely necessary to specify) packages a vector of R package names that contain functions used in fn (most should be taken care of automatically such that this is rarely necessary to specify) control parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl Value a "hexbin" object Author(s) Ryan Hafen References Carr, D. B. et al. (1987) Scatterplot Matrix Techniques for Large N . JASA 83, 398, 424–436. See Also drQuantile Examples # create dummy data and divide it dat <- data.frame( xx = rnorm(1000), yy = rnorm(1000), by = sample(letters, 1000, replace = TRUE)) d <- divide(dat, by = "by", update = TRUE) # compute hexbins on divided object dhex <- drHexbin(d, xVar = "xx", yVar = "yy") # dhex is equivalent to running on undivided data: hexbin(dat$xx, dat$yy) drJoin Join Data Sources by Key Description Outer join of two or more distributed data object (DDO) sources by key drJoin 37 Usage drJoin(..., output = NULL, overwrite = FALSE, postTransFn = NULL, params = NULL, packages = NULL, control = NULL) Arguments output a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object. overwrite logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak) postTransFn an optional function to be applied to the each final key-value pair after joining params a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically such that this is rarely necessary to specify) packages a vector of R package names that contain functions used in fn (most should be taken care of automatically such that this is rarely necessary to specify) control parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl ... Input data sources: two or more named DDO objects that will be joined, separated by commas (see Examples for syntax). Specifically, each input object should inherit from the ’ddo’ class. It is assumed that all input sources are of same type (all HDFS, all localDisk, all in-memory). Value a ’ddo’ object stored in the output connection, where the values are named lists with names according to the names given to the input data objects, and values are the corresponding data. The ’ddo’ object contains the union of all the keys contained in the input ’ddo’ objects specified in .... 
Author(s) Ryan Hafen See Also drFilter, drLapply Examples bySpecies <- divide(iris, by = "Species") # get independent lists of just SW and SL sw <- drLapply(bySpecies, function(x) x$Sepal.Width) sl <- drLapply(bySpecies, function(x) x$Sepal.Length) drJoin(Sepal.Width = sw, Sepal.Length = sl, postTransFn = as.data.frame) 38 drLapply drLapply Apply a function to all key-value pairs of a ddo/ddf object Description Apply a function to all key-value pairs of a ddo/ddf object and get a new ddo object back, unless a different combine strategy is specified. Usage drLapply(X, FUN, combine = combDdo(), output = NULL, overwrite = FALSE, params = NULL, packages = NULL, control = NULL, verbose = TRUE) Arguments X an object of class "ddo" of "ddf" FUN a function to be applied to each subset combine optional method to combine the results output a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object. overwrite logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak) params a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically such that this is rarely necessary to specify) packages a vector of R package names that contain functions used in fn (most should be taken care of automatically such that this is rarely necessary to specify) control parameters specifying how the backend should handle things (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl verbose logical - print messages about what is being done Value depends on combine Author(s) Ryan Hafen See Also recombine, drFilter, drJoin, combDdo, combRbind drLM 39 Examples bySpecies <- divide(iris, by = "Species") drLapply(bySpecies, function(x) x$Sepal.Width) drLM LM Transformation Method Description LM transformation method – – Fit a linear model to each subset Usage drLM(...) Arguments ... arguments you would pass to the lm function Details This provides a transformation function to be called for each subset in a recombination MapReduce job that applies R’s lm method and outputs the coefficients in a way that combMeanCoef knows how to deal with. It can be applied to a ddf with addTransform prior to calling recombine. Value An object of class drCoef that contains the lm coefficients and other data needed by combMeanCoef Author(s) Landon Sego See Also divide, recombine, rrDiv Examples # Divide the data bySpecies <- divide(iris, by = "Species") # A function to fit a multiple linear regression model to each species linearReg <- function(x) drLM(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = x) # Apply the transform and combine using 'combMeanCoef' bySpecies %>% 40 drPersist addTransform(linearReg) %>% recombine(combMeanCoef) drPersist Persist a Transformed ’ddo’ or ’ddf’ Object Description Persist a transformed ’ddo’ or ’ddf’ object by making a deferred transformation permanent Usage drPersist(x, output = NULL, overwrite = FALSE, control = NULL) Arguments x an object of class ’ddo’ or ’ddf’ output a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object. overwrite logical; should existing output location be overwritten? 
           (also can specify overwrite = "backup" to move the existing output
           to _bak)
control    parameters specifying how the backend should handle things
           (most-likely parameters to rhwatch in RHIPE) - see rhipeControl and
           localDiskControl

Details

When a transformation is added to a ddf/ddo via addTransform, the
transformation is deferred until some action is taken with the data (e.g. a
call to recombine). See the documentation of addTransform for more information
about the nature of transformations. Calling drPersist() on the ddo/ddf makes
the transformation permanent (persisted). In the case of a local disk
connection (via localDiskConn) or HDFS connection (via hdfsConn), the
transformed data are written to disk.

Value

a 'ddo' or 'ddf' object with the transformation evaluated on the data

Author(s)

Ryan Hafen

See Also

addTransform

Examples

bySpecies <- divide(iris, by = "Species")

# Create the transformation and add it to bySpecies
bySpeciesSepal <- addTransform(bySpecies,
  function(x) x[,c("Sepal.Length", "Sepal.Width")])

# Note the transformation is 'pending' a data action
bySpeciesSepal

# Make the transformation permanent (persistent)
bySpeciesSepalPersisted <- drPersist(bySpeciesSepal)

# The transformation is no longer pending, but a permanent part of the new ddo
bySpeciesSepalPersisted
bySpeciesSepalPersisted[[1]]

drQuantile              Sample Quantiles for 'ddf' Objects

Description

Compute sample quantiles for 'ddf' objects

Usage

drQuantile(x, var, by = NULL, probs = seq(0, 1, 0.005), preTransFn = NULL,
  varTransFn = identity, varRange = NULL, nBins = 10000, tails = 100,
  params = NULL, packages = NULL, control = NULL, ...)

Arguments

x           a 'ddf' object
var         the name of the variable to compute quantiles for
by          an optional variable name or vector of variable names by which to
            group quantile computations
probs       numeric vector of probabilities with values in [0,1]
preTransFn  a transformation function (if desired) to be applied to each subset
            prior to computing quantiles (here it may be useful for adding a
            "by" variable that is not present) - note: this transformation
            should not modify var (use varTransFn for that) - also note: this
            is deprecated - instead use addTransform prior to calling divide
varTransFn  transformation to apply to the variable prior to computing
            quantiles
varRange    range of the variable (can be left blank if summaries have been
            computed)
nBins       how many bins should the range of the variable be split into?
tails       how many exact values at each tail should be retained?
params      a named list of objects external to the input data that are needed
            in the distributed computing (most should be taken care of
            automatically such that this is rarely necessary to specify)
packages    a vector of R package names that contain functions used in fn
            (most should be taken care of automatically such that this is
            rarely necessary to specify)
control     parameters specifying how the backend should handle things
            (most-likely parameters to rhwatch in RHIPE) - see rhipeControl
            and localDiskControl
...         additional arguments

Details

This division-agnostic quantile calculation algorithm takes the range of the
variable of interest, splits it into nBins bins, tabulates counts for those
bins, and reconstructs a quantile approximation from them. nBins should not get
too large, but larger nBins gives more accuracy.
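For intuition, here is a minimal sketch of the bin-and-reconstruct idea on an
ordinary in-memory vector (an illustration of the approach only, not the
package's actual implementation):

# tabulate counts in nBins bins over the variable's range
x <- iris$Sepal.Length
nBins <- 100
breaks <- seq(min(x), max(x), length.out = nBins + 1)
counts <- table(cut(x, breaks, include.lowest = TRUE))
mids <- head(breaks, -1) + diff(breaks) / 2
# cumulative proportions give approximate f-values for each bin midpoint
fval <- cumsum(counts) / sum(counts)
plot(fval, mids, xlab = "f-value", ylab = "approximate quantile")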
drQuantile              Sample Quantiles for 'ddf' Objects

Description
Compute sample quantiles for 'ddf' objects.

Usage
drQuantile(x, var, by = NULL, probs = seq(0, 1, 0.005), preTransFn = NULL,
  varTransFn = identity, varRange = NULL, nBins = 10000, tails = 100,
  params = NULL, packages = NULL, control = NULL, ...)

Arguments
x          a 'ddf' object
var        the name of the variable to compute quantiles for
by         an optional variable name or vector of variable names by which to group quantile computations
probs      numeric vector of probabilities with values in [0, 1]
preTransFn a transformation function (if desired) applied to each subset prior to computing quantiles (it may be useful here for adding a "by" variable that is not present) - note: this transformation should not modify var (use varTransFn for that); also note: this argument is deprecated - instead use addTransform prior to calling divide
varTransFn transformation to apply to the variable prior to computing quantiles
varRange   range of the variable (can be left blank if summaries have been computed)
nBins      how many bins the range of the variable should be split into
tails      how many exact values at each tail should be retained
params     a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically, so this is rarely necessary to specify)
packages   a vector of R package names that contain functions used in the transformation functions (most should be taken care of automatically, so this is rarely necessary to specify)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...        additional arguments

Details
This division-agnostic quantile calculation algorithm takes the range of the variable of interest, splits it into nBins bins, tabulates counts for those bins, and reconstructs a quantile approximation from them. nBins should not get too large, but a larger nBins gives more accuracy.

If tails is positive, the first and last tails ordered values are attached to the quantile estimate - this is useful for long-tailed distributions or distributions with outliers for which you would like more detail in the tails.

Value
A data frame of quantiles q and their associated f-values fval. If by is specified, the result also contains a variable group.

Author(s)
Ryan Hafen

See Also
updateAttributes

Examples
# break the iris data into k/v pairs
irisSplit <- list(
  list("1", iris[1:10, ]),
  list("2", iris[11:110, ]),
  list("3", iris[111:150, ]))

# represent it as a ddf
irisSplit <- ddf(irisSplit, update = TRUE)

# approximate quantiles over the divided data set
probs <- seq(0, 1, 0.005)
iq <- drQuantile(irisSplit, var = "Sepal.Length", tails = 0, probs = probs)
plot(iq$fval, iq$q)

# compare to the all-data quantile "type 1" result
plot(probs, quantile(iris$Sepal.Length, probs = probs, type = 1))
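Grouped quantiles work the same way. A minimal sketch, assuming the irisSplit ddf from the example above:

# one set of quantiles per species; the result gains a 'group' variable
iqBy <- drQuantile(irisSplit, var = "Sepal.Length", by = "Species", tails = 0)
head(iqBy)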
drRead.table            Data Input

Description
Reads a text file in table format and creates a distributed data frame from it, with cases corresponding to lines and variables to fields in the file.

Usage
## S3 method for class 'table'
drRead(file, header = FALSE, sep = "", quote = "\"'", dec = ".", skip = 0,
  fill = !blank.lines.skip, blank.lines.skip = TRUE, comment.char = "#",
  allowEscapes = FALSE, encoding = "unknown", autoColClasses = TRUE,
  rowsPerBlock = 50000, postTransFn = identity, output = NULL,
  overwrite = FALSE, params = NULL, packages = NULL, control = NULL, ...)

## S3 method for class 'csv'
drRead(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
  fill = TRUE, comment.char = "", ...)

## S3 method for class 'csv2'
drRead(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
  fill = TRUE, comment.char = "", ...)

## S3 method for class 'delim'
drRead(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
  fill = TRUE, comment.char = "", ...)

## S3 method for class 'delim2'
drRead(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
  fill = TRUE, comment.char = "", ...)

Arguments
file       input text file - either a character string pointing to a file on local disk, or an hdfsConn object pointing to a text file on HDFS (see the output argument below)
header     this and the other read parameters below are passed to read.table for each chunk being processed - see read.table for more information. Almost all have defaults, or appropriate defaults are set through other format-specific functions such as drRead.csv and drRead.delim.
sep, quote, dec, skip, fill, blank.lines.skip, comment.char, allowEscapes, encoding
           see read.table for more information
autoColClasses should column classes be determined automatically by reading in a sample? This can sometimes be problematic because of strange ways R handles quotes in read.table, but keeping the default of TRUE is advantageous for speed.
rowsPerBlock how many rows of the input file should make up a block (key-value pair) of output
postTransFn a function applied after a block is read in, to provide any additional processing before the block is stored
output     a "kvConnection" object indicating where the output data should reside. Must be a localDiskConn object if the input is a text file on local disk, or an hdfsConn object if the input is a text file on HDFS.
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
params     a named list of objects external to the input data that are needed in postTransFn
packages   a vector of R package names that contain functions used in postTransFn (most should be taken care of automatically, so this is rarely necessary to specify)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
...        see read.table for more information

Value
an object of class "ddf"

Note
For local disk, the file is read sequentially rather than in parallel, because of possible performance issues when trying to read from the same disk in parallel. If skip is positive and/or header is TRUE, these lines are read first, as they occur only once in the data; each subsequent block is then checked for these lines, which are removed if they appear. Also note that if you supply "Factor" column classes, they will be converted to character.

Author(s)
Ryan Hafen

Examples
## Not run:
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)
irisTextConn <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)
a <- drRead.csv(csvFile, output = irisTextConn, rowsPerBlock = 10)
## End(Not run)
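postTransFn runs on each block as it is read, which makes it a convenient place for light per-block cleanup. A minimal sketch, assuming the csvFile written in the example above (the output path is illustrative):

## Not run:
cleanConn <- localDiskConn(file.path(tempdir(), "irisClean"), autoYes = TRUE)
b <- drRead.csv(csvFile, output = cleanConn, rowsPerBlock = 25,
  postTransFn = function(x) {
    # keep Species as character in every stored block
    x$Species <- as.character(x$Species)
    x
  })
## End(Not run)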
drSample                Take a Sample of Key-Value Pairs

Description
Take a sample of key-value pairs.

Usage
drSample(x, fraction, output = NULL, overwrite = FALSE, control = NULL)

Arguments
x          a 'ddo' or 'ddf' object
fraction   fraction of key-value pairs to keep (between 0 and 1)
output     a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object.
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl

Examples
bySpecies <- divide(iris, by = "Species")
set.seed(234)
sampleRes <- drSample(bySpecies, fraction = 0.25)

drSubset                Subsetting Distributed Data Frames

Description
Return a subset of a "ddf" object to memory.

Usage
drSubset(data, subset = NULL, select = NULL, drop = FALSE,
  preTransFn = NULL, maxRows = 500000, params = NULL, packages = NULL,
  control = NULL, verbose = TRUE)

Arguments
data       the object to be subsetted - an object of class "ddf" or "ddo"; in the latter case, preTransFn must be specified to coerce each subset into a data frame
subset     logical expression indicating elements or rows to keep; missing values are taken as FALSE
select     expression indicating columns to select from a data frame
drop       passed on to the [ indexing operator
preTransFn a transformation function (if desired) applied to each subset prior to division - note: this argument is deprecated; instead use addTransform prior to calling divide
maxRows    the maximum number of rows to return
params     a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically, so this is rarely necessary to specify)
packages   a vector of R package names that contain functions used in the subsetting expressions (most should be taken care of automatically, so this is rarely necessary to specify)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose    logical - print messages about what is being done

Value
data frame

Author(s)
Ryan Hafen

Examples
d <- divide(iris, by = "Species")
drSubset(d, Sepal.Length < 5)
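The select argument can trim columns at the same time. A minimal sketch, under the assumption that select follows base subset() semantics (unquoted column names); not verified against the package source:

# keep only two columns of the rows that match the condition
drSubset(d, Sepal.Length < 5, select = c(Sepal.Length, Petal.Length))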
flatten                 "Flatten" a ddf Subset

Description
Add split variables and BSVs (if any) as columns to a subset of a ddf.

Usage
flatten(x)

Arguments
x          a value of a key-value pair

See Also
getSplitVars, getBsvs

Examples
d <- divide(iris, by = "Species")
# the column "Species" is no longer explicitly in the data
d[[1]]$value
# but it is preserved and can be added back in with flatten()
flatten(d[[1]]$value)

getCondCuts             Get names of the conditioning variable cuts

Description
This is used internally for conditioning-variable division. It does not have much use outside of that, but is exported for convenience.

Usage
getCondCuts(df, splitVars)

Arguments
df         a data frame
splitVars  a vector of variable names to split by

Examples
# see how key names are obtained
getCondCuts(iris, "Species")

getSplitVar             Extract "Split" Variable(s)

Description
For a given key-value pair or value, get a split variable value by name, if present (split variables are variables that define how the data was divided).

Usage
getSplitVar(x, name)
getSplitVars(x)

Arguments
x          a key-value pair or a value
name       the name of the split variable to get

Examples
d <- divide(iris, by = "Species",
  bsvFn = function(x) list(msl = bsv(mean(x$Sepal.Length))))
getSplitVars(d[[1]]$value)
getSplitVar(d[[1]]$value, "Species")

hdfsConn                Connect to Data Source on HDFS

Description
Connect to a data source on HDFS.

Usage
hdfsConn(loc, type = "sequence", autoYes = FALSE, reset = FALSE,
  verbose = TRUE)

Arguments
loc        location on HDFS for the data source
type       the type of data ("map", "sequence", "text")
autoYes    automatically answer "yes" to questions about creating a path on HDFS
reset      should existing metadata for this object be overwritten?
verbose    logical - print messages about what is being done

Details
This simply creates a "connection" to a directory on HDFS (which need not have data in it). To actually do things with this data, see ddo, etc.

Value
a "kvConnection" object of class "hdfsConn"

Author(s)
Ryan Hafen

See Also
addData, ddo, ddf, localDiskConn

Examples
## Not run:
# connect to empty HDFS directory
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10, ])))
addData(conn, list(list("2", iris[11:110, ])))
addData(conn, list(list("3", iris[111:150, ])))
# represent it as a distributed data frame
hdd <- ddf(conn)
## End(Not run)

kvApply                 Apply Function to Key-Value Pair

Description
Apply a function to a single key-value pair - not a traditional R "apply" function.

Usage
kvApply(kvPair, fn)

Arguments
kvPair     a key-value pair (a list with 2 elements or an object created with kvPair)
fn         a function

Details
Determines how a function should be applied to a key-value pair and then applies it: if the function has two formals, the function is given the key and the value as its arguments; if the function has one formal, the function is given just the value. The function is assumed to return a value unless the result is a kvPair object; when the function returns a plain value, the original key is retained in the resulting key-value pair. This provides flexibility and simplicity for when a function is only meant to be applied to the value (the most common case), but still allows keys to be used if desired.

Examples
kv <- kvPair(1, 2)
kv
kvApply(kv, function(x) x^2)
kvApply(kv, function(k, v) v^2)
kvApply(kv, function(k, v) k + v)
kvApply(kv, function(x) kvPair("new_key", x))
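A two-argument function can also derive the new key from the old one. A minimal sketch (the key and suffix are illustrative):

# build a new key from the old one and trim the value
kvApply(kvPair("setosa", iris[1:5, ]),
  function(k, v) kvPair(paste0(k, "_head"), head(v, 2)))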
kvPair                  Specify a Key-Value Pair

Description
Specify a key-value pair.

Usage
kvPair(k, v)

Arguments
k          key - any R object
v          value - any R object

Value
a list of class "kvPair"

See Also
kvPairs

Examples
kvPair("name", "bob")

kvPairs                 Specify a Collection of Key-Value Pairs

Description
Specify a collection of key-value pairs.

Usage
kvPairs(...)

Arguments
...        key-value pairs (lists with two elements)

Value
a list of objects of class "kvPair"

See Also
kvPair

Examples
kvPairs(kvPair(1, letters), kvPair(2, rnorm(10)))

localDiskConn           Connect to Data Source on Local Disk

Description
Connect to a data source on local disk.

Usage
localDiskConn(loc, nBins = 0, fileHashFn = NULL, autoYes = FALSE,
  reset = FALSE, verbose = TRUE)

Arguments
loc        location on local disk for the data source
nBins      number of bins (subdirectories) to put data files into - if anticipating a large number of k/v pairs, it is a good idea to set this to something bigger than 0
fileHashFn an optional function that operates on each key-value pair to determine the subdirectory structure for where the data should be stored for that subset; can also be specified as "asis" when keys are scalar strings
autoYes    automatically answer "yes" to questions about creating a path on local disk
reset      should existing metadata for this object be overwritten?
verbose    logical - print messages about what is being done

Details
This simply creates a "connection" to a directory on local disk (which need not have data in it). To actually do things with this connection, see ddo, etc. Typically you should just use loc to specify where the data is, or where you would like data for this connection to be stored. Metadata for the object is also stored in this directory.

Value
a "kvConnection" object of class "localDiskConn"

Author(s)
Ryan Hafen

See Also
addData, ddo, ddf, hdfsConn

Examples
# connect to empty localDisk directory
conn <- localDiskConn(file.path(tempdir(), "irisSplit"), autoYes = TRUE)
# add some data
addData(conn, list(list("1", iris[1:10, ])))
addData(conn, list(list("2", iris[11:110, ])))
addData(conn, list(list("3", iris[111:150, ])))
# represent it as a distributed data frame
irisDdf <- ddf(conn, update = TRUE)
irisDdf
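When keys are scalar strings, fileHashFn = "asis" stores each subset under a file named after its key, which makes the on-disk layout easy to inspect. A minimal sketch (the path is illustrative):

connAsis <- localDiskConn(file.path(tempdir(), "irisByKey"),
  fileHashFn = "asis", autoYes = TRUE)
# each key becomes a correspondingly named file in the directory
addData(connAsis, list(list("setosa", iris[iris$Species == "setosa", ])))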
localDiskControl        Specify Control Parameters for MapReduce on a Local Disk Connection

Description
Specify control parameters for a MapReduce job on a local disk connection. The currently supported parameters are listed in Arguments below.

Usage
localDiskControl(cluster = NULL, map_buff_size_bytes = 10485760,
  reduce_buff_size_bytes = 10485760, map_temp_buff_size_bytes = 10485760)

Arguments
cluster    a "cluster" object obtained from makeCluster, to allow for parallel processing
map_buff_size_bytes determines how much data should be sent to each map task
reduce_buff_size_bytes determines how much data should be sent to each reduce task
map_temp_buff_size_bytes determines the size of chunks written to disk between the map and reduce

Note
If you have data on a shared drive that multiple nodes can access, or a high-performance shared file system such as Lustre, you can run a local disk MapReduce job on multiple nodes by creating a multi-node cluster with makeCluster. If you are using multiple cores and the input data is very small, map_buff_size_bytes needs to be small so that the key-value pairs will be split across cores.

Examples
# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)

# create a local disk control object that specifies to use this cluster,
# so that these operations run in parallel
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations

# convert in-memory ddf to local-disk ddf
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
bySpeciesLD <- convert(divide(iris, by = "Species"), ldConn)

# update attributes using the parallel cluster
updateAttributes(bySpeciesLD, control = control)

# remove temporary directories
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)
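As the Note above suggests, with very small inputs the default 10 MB map buffer would send all key-value pairs to a single map task. A minimal sketch of shrinking it so subsets actually spread across cores:

cl2 <- parallel::makeCluster(2)
# 1 KB buffer: even tiny subsets get distributed over the cluster
smallControl <- localDiskControl(cluster = cl2, map_buff_size_bytes = 1024)
parallel::stopCluster(cl2)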
makeExtractable         Take a ddo/ddf HDFS data object and turn it into a mapfile

Description
Take a ddo/ddf HDFS data object and turn it into a mapfile.

Usage
makeExtractable(obj, control = NULL)

Arguments
obj        object of class 'ddo' or 'ddf' with an HDFS connection
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl

Examples
## Not run:
conn <- hdfsConn("/test/irisSplit")
# add some data
addData(conn, list(list("1", iris[1:10, ])))
addData(conn, list(list("2", iris[11:110, ])))
addData(conn, list(list("3", iris[111:150, ])))
# represent it as a distributed data frame
hdd <- ddf(conn)

# try to extract values by key (this will result in an error)
# (HDFS can only look up key-value pairs by key if the data is in a mapfile)
hdd[["3"]]

# convert hdd into a mapfile
hdd <- makeExtractable(hdd)
# try again
hdd[["3"]]
## End(Not run)

mr-summary-stats        Functions to Compute Summary Statistics in MapReduce

Description
Functions used to tabulate categorical variables and compute moments for numeric variables through the MapReduce framework. Used in updateAttributes.

Usage
tabulateMap(formula, data)
tabulateReduce(result, reduce.values, maxUnique = NULL)
calculateMoments(y, order = 1, na.rm = TRUE)
combineMoments(m1, m2)
combineMultipleMoments(...)
moments2statistics(m)

Arguments
formula    a formula to be used in xtabs
data       a subset of a 'ddf' object
result, reduce.values inconsequential tabulateReduce parameters
maxUnique  the maximum number of unique combinations of variables to obtain tabulations for. This is meant to guard against cases where a variable in the formula has a very large number of levels, to the point that it is not meaningful to tabulate and is too computationally burdensome. If NULL, it is ignored. If a positive number, only the top and bottom maxUnique tabulations by frequency are kept.
y, order, na.rm inconsequential calculateMoments parameters
m1, m2     inconsequential combineMoments parameters
m          inconsequential moments2statistics parameters
...        inconsequential parameters

Examples
d <- divide(iris, by = "Species", update = TRUE)
summary(d)

mrExec                  Execute a MapReduce Job

Description
Execute a MapReduce job.

Usage
mrExec(data, setup = NULL, map = NULL, reduce = NULL, output = NULL,
  overwrite = FALSE, control = NULL, params = NULL, packages = NULL,
  verbose = TRUE)

Arguments
data       a ddo/ddf object, or a list of ddo/ddf objects
setup      an expression of R code (created using the R command expression) to be run before map and reduce
map        an R expression that is evaluated during the map stage. For each task, this expression is executed multiple times (see details).
reduce     a vector of R expressions with names pre, reduce, and post that is evaluated during the reduce stage. For example, reduce = expression(pre = {...}, reduce = {...}, post = {...}). reduce is optional; if it is not specified, the map output key-value pairs will be the result. If reduce = TRUE, a default identity reduce is performed. Setting it to 0 will skip the reduce altogether.
output     a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object. If a character string, it will be treated as a path to be passed to the same type of connection as data - relative paths will be relative to the working directory of that backend.
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
params     a named list of objects external to the input data that are needed in the map or reduce phases
packages   a vector of R package names that contain functions used in the map or reduce expressions (most should be taken care of automatically, so this is rarely necessary to specify)
verbose    logical - print messages about what is being done

Value
a "ddo" object - to keep it simple. It is up to the user to update it or cast it as "ddf" if that is the desired result.

Author(s)
Ryan Hafen

Examples
# compute min and max Sepal.Length by species for the iris data,
# using a random partitioning of it as input
d <- divide(iris, by = rrDiv(20))

mapExp <- expression({
  lapply(map.values, function(r) {
    by(r, r$Species, function(x) {
      collect(
        as.character(x$Species[1]),
        range(x$Sepal.Length, na.rm = TRUE)
      )
    })
  })
})

reduceExp <- expression(
  pre = {
    rng <- c(Inf, -Inf)
  },
  reduce = {
    rx <- unlist(reduce.values)
    rng <- c(min(rng[1], rx, na.rm = TRUE), max(rng[2], rx, na.rm = TRUE))
  },
  post = {
    collect(reduce.key, rng)
  })

res <- mrExec(d, map = mapExp, reduce = reduceExp)
as.list(res)
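Leaving reduce unspecified gives a map-only job whose output is simply the collected map key-value pairs. A minimal sketch, assuming the object d from the example above:

# map-only: count rows per random subset, one output pair per input pair
mapOnly <- expression({
  for (i in seq_along(map.keys))
    collect(map.keys[[i]], nrow(map.values[[i]]))
})
resCounts <- mrExec(d, map = mapOnly)
as.list(resCounts)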
print.ddo               Print a "ddo" or "ddf" Object

Description
Print an overview of attributes of distributed data objects (ddo) or distributed data frames (ddf).

Usage
## S3 method for class 'ddo'
print(x, ...)

Arguments
x          object to be printed
...        additional arguments

Author(s)
Ryan Hafen

Examples
kv <- kvPairs(kvPair(1, letters), kvPair(2, rnorm(100)))
kvddo <- ddo(kv)
kvddo

print.kvPair            Print a key-value pair

Description
Print a key-value pair.

Usage
## S3 method for class 'kvPair'
print(x, ...)

Arguments
x          object to be printed
...        additional arguments

Examples
kvPair(1, letters)

print.kvValue           Print value of a key-value pair

Description
Print the value of a key-value pair.

Usage
## S3 method for class 'kvValue'
print(x, ...)

Arguments
x          object to be printed
...        additional arguments

Examples
kvPair(1, letters)

readHDFStextFile        Experimental HDFS text reader helper function

Description
Experimental helper function for reading text data on HDFS into an HDFS connection.

Usage
readHDFStextFile(input, output = NULL, overwrite = FALSE, fn = NULL,
  keyFn = NULL, linesPerBlock = 10000, control = NULL, update = FALSE)

Arguments
input      a ddo/ddf connection to a text input directory on HDFS, created with hdfsConn - ensure the text files are within a directory and that type = "text" is specified
output     an output connection, such as those created with localDiskConn and hdfsConn
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
fn         function to be applied to each chunk of lines (the input to the function is a vector of strings)
keyFn      optional function to determine the value of the key for each block
linesPerBlock how many lines at a time to read
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
update     should a MapReduce job be run to obtain additional attributes for the result data prior to returning?

Examples
## Not run:
res <- readHDFStextFile(
  input = Rhipe::rhfmt("/path/to/input/text", type = "text"),
  output = hdfsConn("/path/to/output"),
  fn = function(x) {
    read.csv(textConnection(paste(x, collapse = "\n")), header = FALSE)
  }
)
## End(Not run)

readTextFileByChunk     Experimental sequential text reader helper function

Description
Experimental helper function for reading text data sequentially from a file on disk and adding it to a connection using addData.

Usage
readTextFileByChunk(input, output, overwrite = FALSE,
  linesPerBlock = 10000, fn = NULL, header = TRUE, skip = 0,
  recordEndRegex = NULL, cl = NULL)

Arguments
input      the path to an input text file
output     an output connection, such as those created with localDiskConn and hdfsConn
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
linesPerBlock how many lines at a time to read
fn         function to be applied to each chunk of lines (see details)
header     does the file have a header?
skip       number of lines to skip before reading
recordEndRegex an optional regular expression that finds lines in the text file that indicate the end of a record (for multi-line records)
cl         a "cluster" object to be used for parallel processing, created using makeCluster

Details
The function fn should have one argument, which should expect to receive a vector of strings, each element of which is a line in the file. It is also possible for fn to take two arguments, in which case the second argument is the header line from the file (some parsing methods might need to know the header).

Examples
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)

myoutput <- localDiskConn(file.path(tempdir(), "irisText"), autoYes = TRUE)

a <- readTextFileByChunk(csvFile,
  output = myoutput, linesPerBlock = 10,
  fn = function(x, header) {
    colNames <- strsplit(header, ",")[[1]]
    read.csv(textConnection(paste(x, collapse = "\n")),
      col.names = colNames, header = FALSE)
  })
a[[1]]
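For multi-line records, recordEndRegex is meant to keep a record from being split across blocks. A minimal sketch, under the assumption that each record ends with a line reading END; the file contents, paths, and names are illustrative, and the exact blocking behavior is not verified here:

recFile <- file.path(tempdir(), "records.txt")
writeLines(c("a 1", "a 2", "END", "b 1", "END"), recFile)
recConn <- localDiskConn(file.path(tempdir(), "recordsKV"), autoYes = TRUE)
rb <- readTextFileByChunk(recFile, output = recConn, header = FALSE,
  linesPerBlock = 2, recordEndRegex = "^END$",
  fn = function(x) data.frame(line = x, stringsAsFactors = FALSE))
rb[[1]]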
recombine               Recombine

Description
Apply an analytic recombination method to a ddo/ddf object and combine the results.

Usage
recombine(data, combine = NULL, apply = NULL, output = NULL,
  overwrite = FALSE, params = NULL, packages = NULL, control = NULL,
  verbose = TRUE)

Arguments
data       an object of class "ddo" or "ddf"
combine    the method to combine the results. See, for example, combCollect, combDdf, combDdo, combRbind, etc. If combine = NULL, combCollect is used if output = NULL, and combDdo is used if output is specified.
apply      a function specifying the analytic method to apply to each subset, or a predefined apply function (see drBLB and drGLM, for example). NOTE: this argument is now deprecated in favor of addTransform.
output     a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object.
overwrite  logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to _bak)
params     a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically, so this is rarely necessary to specify)
packages   a vector of R package names that contain functions used in the transformation (most should be taken care of automatically, so this is rarely necessary to specify)
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose    logical - print messages about what is being done

Value
Depends on combine: this could be a distributed data object, a data frame, a key-value list, etc. See the examples.

Author(s)
Ryan Hafen

References
- http://deltarho.org
- Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.
See Also
divide, ddo, ddf, drGLM, drBLB, combMeanCoef, combMean, combCollect, combRbind, drLapply

Examples
## in-memory example
## ---------------------------------------------------------

# begin with an in-memory ddf (backed by kvMemory)
bySpecies <- divide(iris, by = "Species")

# create a function to calculate the mean for each variable
colMean <- function(x) data.frame(lapply(x, mean))

# apply the transformation
bySpeciesTransformed <- addTransform(bySpecies, colMean)

# recombination with no 'combine' argument and no 'output' argument
# produces the key-value list produced by 'combCollect()'
recombine(bySpeciesTransformed)

# but we can also preserve the distributed data frame, like this:
recombine(bySpeciesTransformed, combine = combDdf)

# or we can recombine using 'combRbind()' and produce a data frame:
recombine(bySpeciesTransformed, combine = combRbind)

## local disk connection example with parallelization
## ---------------------------------------------------------

# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)

# create the control object we'll pass into local disk datadr operations
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations

# create local disk connection to hold bySpecies data
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)

# convert in-memory bySpecies to local-disk ddf
bySpeciesLD <- convert(bySpecies, ldConn)

# apply the transformation
bySpeciesTransformed <- addTransform(bySpeciesLD, colMean)

# recombine the data using the transformation
bySpeciesMean <- recombine(bySpeciesTransformed,
  combine = combRbind, control = control)
bySpeciesMean

# remove temporary directories
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)

removeData              Remove Key-Value Pairs from a Data Connection

Description
Remove key-value pairs from a data connection.

Usage
removeData(conn, keys)

Arguments
conn       a kvConnection object
keys       a list of keys indicating which k/v pairs to remove

Note
This is generally not recommended for HDFS, as it writes a new file each time it is called and can result in more individual files than Hadoop likes to deal with.

Author(s)
Ryan Hafen

See Also
addData, localDiskConn, hdfsConn

Examples
# connect to empty localDisk directory
conn <- localDiskConn(file.path(tempdir(), "irisSplit"), autoYes = TRUE)
# add some data
addData(conn, list(list("1", iris[1:10, ])))
addData(conn, list(list("2", iris[11:90, ])))
addData(conn, list(list("3", iris[91:110, ])))
addData(conn, list(list("4", iris[111:150, ])))
# represent it as a distributed data frame
irisDdf <- ddf(conn, update = TRUE)
irisDdf

# remove data for keys "1" and "2"
removeData(conn, list("1", "2"))

# look at the result with updated attributes
# (reset = TRUE removes previous attrs)
irisDdf <- ddf(conn, reset = TRUE, update = TRUE)
irisDdf

rhipeControl            Specify Control Parameters for RHIPE Job

Description
Specify control parameters for a RHIPE job. See rhwatch for details about each of the parameters.
Usage
rhipeControl(mapred = NULL, setup = NULL, combiner = FALSE,
  cleanup = NULL, orderby = "bytes", shared = NULL, jarfiles = NULL,
  zips = NULL, jobname = "")

Arguments
mapred, setup, combiner, cleanup, orderby, shared, jarfiles, zips, jobname
           arguments to rhwatch in RHIPE

Examples
## Not run:
# input data on HDFS
d <- ddf(hdfsConn("/path/to/big/data/on/hdfs"))

# set RHIPE / Hadoop parameters
# buffer sizes control how many k/v pairs are sent to map / reduce tasks
# at a time; mapred.reduce.tasks is a Hadoop config parameter that
# controls the number of reduce tasks
rhctl <- rhipeControl(mapred = list(
  rhipe_map_buff_size = 10000,
  mapred.reduce.tasks = 72,
  rhipe_reduce_buff_size = 1))

# divide input data using these control parameters
divide(d, by = "var", output = hdfsConn("/path/to/output"),
  control = rhctl)
## End(Not run)

rrDiv                   Random Replicate Division

Description
Specify random replicate division parameters for data division.

Usage
rrDiv(nrows = NULL, seed = NULL)

Arguments
nrows      the number of rows each subset should have
seed       the random seed to use (experimental)

Details
The random replicate division method currently gets the total number of rows of the input data and divides it by nrows to get the number of subsets. Then it randomly assigns each row of the input data to one of the subsets, resulting in subsets with approximately nrows rows. A future implementation will make each subset have exactly nrows rows.

Value
a list to be used for the "by" argument to divide

Author(s)
Ryan Hafen

References
- http://deltarho.org
- Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67.

See Also
divide, recombine, condDiv

Examples
# divide iris data into random subsets with ~20 records per subset
irisRR <- divide(iris, by = rrDiv(20), update = TRUE)
irisRR
# look at the actual distribution of the number of rows per subset
plot(splitRowDistn(irisRR))
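For contrast, conditioning-variable division (condDiv, listed in See Also above) splits on the values of a variable instead of at random. A minimal sketch:

# one subset per species instead of random subsets of roughly equal size
irisCond <- divide(iris, by = condDiv("Species"), update = TRUE)
irisCond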
setupTransformEnv       Set up transformation environment

Description
This is called internally in the map phase of datadr MapReduce jobs. It is not meant for use outside of there, but is exported for convenience. Given an environment and a collection of transformations, it populates the environment with the global variables in the transformations.

Usage
setupTransformEnv(transFns, env = NULL)

Arguments
transFns   from the "transforms" attribute of a ddo object
env        the environment in which to evaluate the transformations

Examples
# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in-memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies

# Note a transformation is not present in the attributes
names(attributes(bySpecies))

## A transform that operates only on values of the key-value pairs
## ---------------------------------------------------------------

# Create a function that will calculate the mean of each variable in
# a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))

# Test on a subset
colMean(bySpecies[[1]][[2]])

# Add a transformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)

# Note how 'before transformation' appears in the descriptions of
# several of the attributes
bySpeciesTransformed

# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))

# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]

# The transformation is automatically applied when calling any data
# operation. For example, if we call 'recombine()' with 'combRbind'
# we get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans

## A transform that operates on both keys and values
## ---------------------------------------------------------------

# We can also create a transformation that uses both keys and values.
# It will select the first row of the value and append '-firstRow'
# to the key
aTransform <- function(key, val) {
  newKey <- paste(key, "firstRow", sep = "-")
  newVal <- val[1, ]
  kvPair(newKey, newVal)
}

# Apply the transformation
recombine(addTransform(bySpecies, aTransform))

to_ddf                  Convert dplyr grouped_df to ddf

Description
Convert a dplyr grouped_df to a ddf.

Usage
to_ddf(x)

Arguments
x          a grouped_df object from dplyr

Examples
## Not run:
library(dplyr)
bySpecies <- iris %>%
  group_by(Species) %>%
  to_ddf()
## End(Not run)

updateAttributes        Update Attributes of a 'ddo' or 'ddf' Object

Description
Update attributes of a 'ddo' or 'ddf' object.

Usage
updateAttributes(obj, control = NULL)

Arguments
obj        an object of class 'ddo' or 'ddf'
control    parameters specifying how the backend should handle things (most likely parameters passed to rhwatch in RHIPE) - see rhipeControl

Details
This function looks for missing attributes related to a ddo or ddf (distributed data object or data frame) object and runs MapReduce to update them. These attributes include "splitSizeDistn", "keys", "nDiv", "nRow", and "splitRowDistn". They are useful for subsequent computations that might rely on them. The result is the input modified to reflect the updated attributes, so it should be used as obj <- updateAttributes(obj).

Value
an object of class 'ddo' or 'ddf'

Author(s)
Ryan Hafen

References
Bennett, Janine, et al. "Numerically stable, single-pass, parallel statistics algorithms." Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.

See Also
ddo, ddf, divide

Examples
d <- divide(iris, by = "Species")
# some attributes are missing:
d
summary(d)
d <- updateAttributes(d)
# now all attributes are available:
d
summary(d)
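One concrete payoff of updated attributes: the summaries computed here let drQuantile run without an explicit varRange. A minimal sketch, assuming the object d from the example above:

# with summaries available, varRange can be left blank
sq <- drQuantile(d, var = "Sepal.Length", tails = 0)
head(sq)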
%>%                     Pipe data

Description
Pipe data from one datadr operation to another.

Usage
lhs %>% rhs

Arguments
lhs        a data object
rhs        a function to apply to the data

Note
Pipes are great if you know the exact sequence of operations you would like to perform and fully understand what the intermediate data structures will look like. But often with very large or complex data sets it can be a good idea to do each step independently and examine the intermediate results. This can be better for debugging, particularly when each step takes some time to evaluate.

Examples
# Suppose we wish to do the following:
bySpecies <- divide(iris, by = "Species")
bySpeciesTransformed <- addTransform(bySpecies,
  function(x) mean(x$Sepal.Length))
recombine(bySpeciesTransformed, combine = combRbind)

# We can do it more concisely using the pipe '%>%':
divide(iris, by = "Species") %>%
  addTransform(function(x) mean(x$Sepal.Length)) %>%
  recombine(combRbind)