GRTS for the Average Joe: A GRTS Sampler for Windows
Trent McDonald,
WEST, Inc. 2003 Central Avenue, Cheyenne, WY 82001 (307) 634-1756;
e-mail: [email protected]
Abstract: Generalized random tessellation stratified (GRTS) samples are useful spatial sampling
designs for a number of reasons. But, actually drawing a GRTS sample can be so complicated
that some practitioners opt for a simpler design. In this paper, I describe a computer program
designed to draw GRTS samples of discrete sample units that are located in either 1 or 2
dimensions. This program, S-Draw, is a Fortran-based application that will run on any
computer running a Windows operating system. The GRTS sampling program reads a sample
frame from a standard ASCII file, and writes the sample to another ASCII file. While the
existence of this program will not illuminate the theoretical details and justification of GRTS
samples, it will, I hope, make drawing a GRTS sample accessible to non-statistically inclined
researchers.
Keywords: generalized random tessellation stratified designs; programs; environmental
sampling; S-Draw
1. INTRODUCTION
For good reasons, generalized random tessellation stratified (GRTS) samples (Stevens and Olsen
1999; Stevens and Olsen 2003; Stevens and Olsen 2004) are gaining popularity as a sampling
scheme for large-scale long-term environmental surveys. GRTS samples with reverse
hierarchical ordering are designed such that for any sample size, say n, the first n units in the
sample will be spatially balanced (i.e., “spread out”). In fact, any contiguous set of n units in a
reverse hierarchically ordered GRTS sample constitutes a spatially balanced set of sample units
(Stevens and Olsen 2004). A spatially balanced GRTS sample makes it easy to both add units in
a way that does not compromise spatial balance and to maximize the overlap (co-location) of
multiple studies such that all sample sizes are spread out. The ease with which spatially balanced
units are added to single or multiple studies is the chief advantage of GRTS samples over the
next-most popular design, systematic sampling. Another advantage, although rarely realized in
practice, of GRTS samples is that they avoid the alignment problems and subsequent adverse
effects on estimates that can occur with systematic sampling (Stevens and Olsen 2003).
However, even though GRTS sampling is a good idea statistically, in practice the theory
behind GRTS samples and the details required to actually draw one are difficult to understand.
This difficulty, in part, stems from the flexibility of GRTS. The GRTS methodology can be
applied to discrete units located in 1-dimension, discrete units located in 2-dimensions, and
points in continuous 2-dimensional areas, all with either equal or unequal inclusion probabilities.
Even when GRTS sampling is understood conceptually, the programming requirements to
actually implement the procedure can be overwhelming.
In this paper, I describe a Windows-based computer program (that I call S-Draw) that
will draw a GRTS sample of discrete units located in 1 or 2 dimensions. This program can be
used to closely approximate a point sample in a continuous 2-dimensional area by defining a fine
grid of points over the area and inputting grid locations into the program as a discrete frame. My
hope is that this program will make drawing a GRTS sample accessible to the average scientist,
even if they do not fully understand the details of the methodology.
This paper is intended to describe the methods programmed into S-Draw, and as such I
do not cover the theoretical details of GRTS samples (see Stevens and Olsen 2004). In the next
section, I outline the methods used in the program and its basic capabilities. Various formats of
the discrete sample frame are also described. In Examples, I present a few examples for
illustration. I close with a short discussion of the program’s performance, planned enhancements,
availability, and where suggestions for future versions can be submitted.
2. METHODS
Due to the increased popularity of GRTS samples, and the fact that I was being called upon to
actually draw them, I desired an easy-to-use computer program that implemented the GRTS
methodology with the reverse hierarchical ordering described in Stevens and Olsen (2004). To
be useful, I wanted the program to handle very large sampling problems, run quickly, and have a
simple graphical user interface containing a minimum of input parameters. I also wanted a
command-line interface for the program (that I eventually called S-DrawB) that would facilitate
batch processing and simulation.
I chose to write such a program for Windows operating systems using the Fortran 95
language. I chose Fortran because of the speed with which it manipulates large arrays of
numbers, and because Fortran produces stand-alone applications that do not require another
program to run (such as R, S-Plus, or SAS). I used the well-established Fortran 95 compiler
available from Lahey Computer Systems, Inc. (version 7.1), and the Windows API routines
available in their Wisk library, to implement the program. I call the program S-Draw because
samples are often denoted by “S”, and this program draws S’s.
I designed the S-Draw program to draw samples of discrete units that are located in either
1 or 2 dimensions. An example of a discrete 1-dimensional unit is a river segment identified by
its distance from the mouth. Examples of discrete 2-dimensional units include grid cells
identified by the location of their centers, and river segments identified by the UTM (or latitude
and longitude) coordinates of their lower endpoints. The coordinates of all units in the
population to be sampled are input into S-Draw by specifying the name of a text file containing
the coordinates and other information, such as identifiers and weights. This text file I call the
sample frame. It contains one row per population unit, and varying numbers of fields (columns)
on each line depending on the type of sample being drawn. These fields are summarized in
Table 1. All fields on a line in the sample frame are separated by 1 or more spaces or commas.
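As an illustration of the frame format just described, the following is a minimal Python sketch of reading one frame line. It is not the S-Draw Fortran code: the function name `parse_frame_line` and its arguments are hypothetical, and it assumes fields appear in the order coordinates, weight, identifier and are separated only by spaces or commas.

```python
import re

def parse_frame_line(line, n_coords, has_weight, has_id):
    """Split one frame line into (coordinates, weight, identifier).

    Assumed field order: n_coords coordinates, then an optional weight,
    then an optional identifier. Missing items are returned as None.
    """
    # One or more spaces and/or commas act as a single separator.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    coords = [float(v) for v in fields[:n_coords]]
    pos = n_coords
    weight = None
    if has_weight:
        weight = float(fields[pos])
        pos += 1
    unit_id = fields[pos] if has_id else None
    return coords, weight, unit_id
```

For a 2-D frame with weights and identifiers, a call such as `parse_frame_line("512300.0, 4601250.5  2.5 SEG_17", 2, True, True)` returns the two coordinates, the weight, and the identifier as separate items.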
I also designed S-Draw to produce simple samples that do not have an explicit sampling
frame. If a sampling frame, in the form of a text file, is not specified, S-Draw will accept an
arbitrary population size and draw an equi-probable sample assuming units are located in 1-dimension.
In this case, units are assumed to be located at coordinates of 1, 2, …, up to the population’s
size, and units are identified by their 1-D coordinate.
Following specification of the input parameters, the first step taken by S-Draw is to map
the identity of all units in the population to a line segment in the interval (0,n] in a way that
preserves some spatial proximity. This is accomplished by mapping either the 1-D or 2-D
coordinates for all units onto the 1-D interval (0,n]. The mapping implemented in S-Draw is the
quadrant–recursive function suggested by Stevens and Olsen (2004). The bounding box for
populations of units located in 2-D consists of the minimum and maximum horizontal and
vertical coordinates. The bounding interval for populations of units located in 1-D consists of
the minimum and maximum coordinate. In the quadrant-recursive map, each unit’s coordinates
are converted into either a base-2 (for units located in 1 dimension) or base-4 (for units located
in 2 dimensions) number of the form x1.x2.x3. … .xK , where xi is a digit representing the quadrant
of the unit’s location at the ith stage of the recursive mapping, and K is the number of recursive
levels used in the hierarchical ordering scheme. If randomization is called for (the default in
S-Draw), the digits at each level of the hierarchical identifier are randomly permuted. That is, the
frame is randomized by randomly mapping the unique digits at the ith stage of the recursive map
onto themselves in a 1-to-1 way. If randomization is not called for, no permutation of digits is
done. After randomization, the new base-2 or base-4 representation for each unit’s location is
converted back to a base-10 number, and the entire frame is sorted in ascending order according
to this base-10 number. The order of units in the sorted frame is the order of units assigned to
line segments in the interval (0,n]. Because of the way digits in the base-2 or base-4
representations of each unit’s location are constructed and permuted, units that are close (in 1 or
2 dimensional space) also tend to be close in the 1-D order generated by the quadrant-recursive
map.
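The mapping described above can be sketched in Python as follows. This is an illustrative sketch, not the S-Draw implementation. It assumes one random relabelling of the digits 0 to 3 per level (permuting independently within each parent cell is another possible reading), it halves the rectangular bounding box rather than a square one, and the names `quadrant_digits` and `grts_order` are hypothetical.

```python
import random

def quadrant_digits(x, y, bbox, levels):
    """Base-4 address of (x, y): one quadrant digit per recursive level."""
    xmin, ymin, xmax, ymax = bbox
    digits = []
    for _ in range(levels):
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        right, upper = x >= xmid, y >= ymid
        digits.append(2 * upper + right)  # quadrant label 0..3
        # Shrink the box to the quadrant that contains the point.
        xmin, xmax = (xmid, xmax) if right else (xmin, xmid)
        ymin, ymax = (ymid, ymax) if upper else (ymin, ymid)
    return digits

def grts_order(points, levels, rng):
    """Return point indices sorted by their randomized base-4 address."""
    xs, ys = zip(*points)
    bbox = (min(xs), min(ys), max(xs), max(ys))
    # Assumption: one random 1-to-1 relabelling of the digits per level.
    perms = [rng.sample(range(4), 4) for _ in range(levels)]
    keyed = []
    for idx, (x, y) in enumerate(points):
        key = 0
        for level, d in enumerate(quadrant_digits(x, y, bbox, levels)):
            key = 4 * key + perms[level][d]  # base-4 digits -> base-10 value
        # The random tie-breaker shuffles units that share a lowest-level pixel.
        keyed.append((key, rng.random(), idx))
    return [idx for _, _, idx in sorted(keyed)]
```

Because the sort key is built from the most significant (coarsest) digit down, units in the same quadrant stay together in the resulting 1-D order whatever permutation is drawn.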
The parameter K of the quadrant-recursive mapping can be controlled by the pixelsize
parameter in S-Draw. Conceptually, pixelsize is the length of one side of a square quadrant at
the lowest level of the quadrant-recursive map. If pixelsize in S-Draw is set smaller than the
minimum distance between unit locations, quadrant-recursive mapping will continue until all
units occupy quadrants by themselves. In general, S-Draw sets
$$K = \left\lceil \log_2\!\left(\frac{\mathit{range}}{\mathit{pixelsize}}\right) \right\rceil = \left\lceil \frac{\ln(\mathit{range}/\mathit{pixelsize})}{\ln(2)} \right\rceil,$$
where range is the maximum extent of the population’s bounding box along a single dimension,
and the “ceiling” function $\lceil x \rceil$ returns the smallest integer greater than or equal to x. The true size
of the smallest quadrant in the recursive mapping is then $\mathit{range} / 2^K$. S-Draw randomizes the
order of all units in the same lowest-level pixel. This is equivalent to randomizing the order of
all units in the frame with the same base-2 or base-4 representation. To draw a simple random
sample using S-Draw, pixelsize can be set to a value larger than range.
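A short Python sketch of this relationship between pixelsize and the number of levels K follows. The function names are hypothetical, and clamping K at zero when pixelsize is at least as large as range is an assumption made here to match the simple random sample case, not a documented S-Draw behavior.

```python
import math

def recursion_levels(coord_range: float, pixelsize: float) -> int:
    """Number of recursive levels, K = ceil(log2(range / pixelsize))."""
    if coord_range <= 0 or pixelsize <= 0:
        raise ValueError("range and pixelsize must be positive")
    # Assumption: a pixel at least as large as the range means no subdivision.
    return max(0, math.ceil(math.log2(coord_range / pixelsize)))

def smallest_quadrant_side(coord_range: float, k: int) -> float:
    """True side length of a lowest-level quadrant, range / 2**K."""
    return coord_range / 2 ** k
```

For example, a range of 100 with pixelsize 1 gives K = 7, so the smallest quadrants are 100/128 = 0.78125 units on a side, slightly smaller than the requested pixelsize.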
In addition to regular quadrant-recursive mapping, S-Draw can process an arbitrary
hierarchical ordering that has been predefined and stored in the frame by the user. This feature
will allow users, for example, to use triangle-recursive mapping that defines sub-triangles inside
a bounding triangle by connecting the midpoints of each side. In fact, any hierarchical ordering
that produces an identifier of the form k1.k2.k3. … .kK, where ki is a number identifying the subregion of the unit’s location at the ith level of the hierarchy, can be used. Under this option, the
user must construct the hierarchical identifiers outside S-Draw and include them in the frame.
No recursive mapping is done inside S-Draw. Instead, the digits at each level of the hierarchy are
randomly permuted and reassigned to the same level, the new identifiers are converted from the
mixed-base numbers that were input to base-10 numbers, and the frame is sorted according to
this base-10 number. This has the effect of hierarchically sorting the frame based upon digits in
the first level, then digits in the second level within levels of the first, then digits in the third
level within levels of the first two, and so on. This option was included because for certain
problems it may be easier to construct the hierarchical identifiers k1.k2.k3. … .kK than it is to
construct the coordinates of individual units. This will generally only happen when a geographic
information system (GIS) is not available. For example, a spatially balanced sample of stream
segments in the United States could be drawn by assigning to all segments in the U.S. a
hierarchical identifier of the form state.county.watershed.segment, where state is the number of
the state in which the segment resides (i.e., 1, 2, … 50), county is the number of the county
within the state where the segment resides, watershed is the number of the segment’s watershed
within the county, and segment is the number of the segment within the watershed.
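The handling of pre-defined hierarchical identifiers might be sketched in Python as below. This is a sketch under stated assumptions, not S-Draw's code: labels are relabelled by one random permutation per level, and the tuples are sorted lexicographically, which gives the same order as converting the mixed-base identifier to a base-10 number and sorting on it. The function name `order_by_hierarchy` is hypothetical.

```python
import random

def order_by_hierarchy(ids, rng):
    """Sort frame rows by hierarchical identifier after per-level relabelling.

    ids is a list of equal-length tuples such as
    (state, county, watershed, segment). Returns row indices in sorted order.
    """
    n_levels = len(ids[0])
    relabel = []
    for level in range(n_levels):
        labels = sorted({row[level] for row in ids})
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random 1-to-1 map of the labels onto themselves
        relabel.append(dict(zip(labels, shuffled)))
    keys = [tuple(relabel[lv][row[lv]] for lv in range(n_levels)) for row in ids]
    # Lexicographic sort: first level, then second within first, and so on.
    return sorted(range(len(ids)), key=lambda i: keys[i])
```

Rows that share a first-level label (for instance, the same state) remain contiguous in the output, which is what preserves spatial proximity in the 1-D order.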
Following the order of units established by the random quadrant-recursive map or pre-defined
hierarchical identifiers, and permutation of units in the same pixel, units are assigned to
a line segment within (0,n] with a length that is a direct function of the unit’s sample weight
specified in the frame. If all weights in the frame are equal, or if weights are not given, the
length of each unit’s line segment is n/N, where N is population size, and an equi-probable
sample is drawn. If weights are not equal, the length of each unit’s line segment is set to
$\pi_i = n\,w_i / \sum_j w_j$, where $w_i$ is the weight for unit i. If any $\pi_i > 1.0$, they are set equal to
1.0 and the remaining $\pi_i$ are rescaled so that all $\pi_i$ sum to n.
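The following Python sketch illustrates one way to compute these segment lengths. It assumes the lengths are scaled to sum to the sample size n (as the rescaling statement requires, and as the equal-weight length n/N implies), and it repeats the cap-and-rescale step until no length exceeds 1. Whether S-Draw iterates in this way is an assumption, and the name `segment_lengths` is hypothetical.

```python
def segment_lengths(weights, n):
    """Line-segment lengths pi_i proportional to weight, capped at 1, summing to n."""
    if n > len(weights):
        raise ValueError("sample size cannot exceed population size")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    pi = [0.0] * len(weights)
    capped = set()
    while True:
        free = [i for i in range(len(weights)) if i not in capped]
        budget = n - len(capped)  # length left for the uncapped units
        total = sum(weights[i] for i in free)
        for i in free:
            pi[i] = budget * weights[i] / total
        over = [i for i in free if pi[i] > 1.0]
        if not over:
            return pi
        for i in over:  # cap at 1.0, then rescale the rest on the next pass
            pi[i] = 1.0
            capped.add(i)
```

With equal weights, four units, and n = 2, every unit receives length 0.5, which matches n/N.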
To draw the GRTS sample, a systematic sample of size n is drawn from the ordered line
segments on (0,n] by first choosing a random start, say m, between 0 and 1. Units that are
associated with line segments that contain one of the points in the sequence {m, m+1, m+2, …,
m+(n-1)} are then included in the sample. Because units that are close together in space tend to
have line segments in (0,n] that are close together, the systematic sample across (0,n] assures that
sample locations will be spatially spread out.
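The systematic selection step can be sketched in Python as follows. This is a sketch, not S-Draw's code. It assumes the segment lengths are supplied in frame order, sum to n, and are each at most 1, and it treats unit i's segment as the half-open interval ending at the running total of lengths. The name `systematic_sample` is hypothetical.

```python
import random
from itertools import accumulate

def systematic_sample(lengths, rng):
    """Return indices of units whose segments contain m, m+1, ..., m+(n-1)."""
    n = round(sum(lengths))
    uppers = list(accumulate(lengths))  # right end of each unit's segment
    m = rng.random()                    # random start between 0 and 1
    sample, unit = [], 0
    for k in range(n):
        point = m + k
        # Advance to the first segment whose right end reaches the point.
        while unit < len(uppers) - 1 and uppers[unit] < point:
            unit += 1
        sample.append(unit)
    return sample
```

With eight units of length 0.25 (so n = 2), the two selected units are four positions apart in the ordered frame, whatever the random start.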
If reverse hierarchical ordering (Stevens and Olsen 2004) is called for by the S-Draw user
(the default), S-Draw assigns the integers 1, 2, ..., n to units in the realized sample, and then
converts those numbers to either base-2 (units in 1-dimension) or base-4 (units in 2-dimensions)
numbers. When converted, S-Draw reverses the digits of these base-2 or base-4 numbers,
converts the reversed-digit numbers back to base-10, and sorts the sample according to this
base-10 number. If n is an integer power of 2, reverse hierarchical reordering of a 1-D sample forces
units from the first half of the bounding interval to be followed immediately by a unit from the
second half of the bounding interval. For 2-D samples and assuming n is a power of 4, the effect
of this reverse hierarchical ordering is to contiguously place one unit from each quadrant in the
re-ordered sample. That is, if n is a power of 4 any four adjacent units in a reordered 2-D sample
will consist of one unit from quadrant 1, one unit from quadrant 2, one unit from quadrant 3, and
one unit from quadrant 4. The same phenomenon is true at lower levels of the quadrant-recursive
map. That is, if n is a power of 4, units separated by exactly 4 positions in the
reordered 2-D sample will consist of one unit from the first sub-quadrant of a particular
quadrant, one from the second sub-quadrant of that same quadrant, one from the third
sub-quadrant of that same quadrant, and one from the fourth sub-quadrant of that same quadrant.
This ordering assures that any contiguous set of units in the sample will be spatially balanced
over the population. Simulation (Stevens and Olsen 2004) shows that this spatial balance is
present for general sample sizes, not just powers of 2 or 4.
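Reverse hierarchical ordering can be illustrated with the short Python sketch below. It is a sketch under assumptions rather than S-Draw's code. Positions are numbered 0 to n-1 instead of 1 to n, and each is padded to a fixed number of digits before reversal, because that convention reproduces the alternation between halves (or quadrants) described above. The function name is hypothetical.

```python
def reverse_hierarchical_order(n, base):
    """Return positions 0..n-1 sorted by their digit-reversed index.

    base is 2 for 1-D samples and 4 for 2-D samples.
    """
    # Smallest number of digits that can represent every position.
    width = 1
    while base ** width < n:
        width += 1

    def reversed_value(i):
        value = 0
        for _ in range(width):
            i, digit = divmod(i, base)  # peel off the least significant digit
            value = value * base + digit  # and make it the most significant
        return value

    return sorted(range(n), key=reversed_value)
```

For n = 8 and base 2 the result is [0, 4, 2, 6, 1, 5, 3, 7], so consecutive positions alternate between the first and second halves of the original order. For n = 16 and base 4 the first four positions are 0, 4, 8, and 12, one from each block of four.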
3. EXAMPLES
S-Draw is controlled by filling in the boxes in its graphical user interface (Figure 1), clicking the
appropriate radio or check boxes, and selecting “OK” to run. All parameters in S-Draw have
default values except sample size, population size, and input frame. The minimum set of
parameters needed to run S-Draw is either sample size and population size, or sample size and
input frame. The default output file is named “sample_yyyymmdd_hhmmss.txt” where yyyy is
the year that this particular run of S-Draw was started, mm is the month, dd is the day, hh is the
hour, mm is the minute, and ss is the second. When finished, this output file will contain a list of
all the input parameters, as well as a list of all units in the sample. The sample listing includes
coordinate(s) of the unit, actual first-order inclusion probability of the unit, and identifier for the
unit (either input or made up inside S-Draw). The sample listing is in reverse hierarchical order
if that option was checked. The seed for the sequence of random draws can be specified by the
user so that particular samples can be replicated on multiple runs. The random seed can be any
integer value between -2 billion and +2 billion, but if it is set to -1, a seed is constructed from the
computer’s clock and written to the output file.
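The default output name and the seed convention described above can be mimicked in Python as follows. The helper names are hypothetical, and the way a clock-based seed is built here (seconds since the epoch, reduced to the allowed range) is an assumption; the paper states only that the seed comes from the computer's clock.

```python
import time
from datetime import datetime

def default_output_name(started: datetime) -> str:
    """Build a name of the form sample_yyyymmdd_hhmmss.txt from the run start time."""
    return started.strftime("sample_%Y%m%d_%H%M%S.txt")

def resolve_seed(seed: int) -> int:
    """Return the seed to use; -1 requests a clock-based seed."""
    if seed == -1:
        # Assumption: derive the seed from the current time in seconds.
        return int(time.time()) % 2_000_000_000
    return seed
```

Recording the resolved seed alongside the sample, as S-Draw does in its output file, is what allows a clock-seeded draw to be replicated later.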
If sample size is set to 20 and population size is set to 100, S-Draw will produce a 1-D
GRTS sample of size 20 with reverse hierarchical ordering assuming the 100 units are located at
coordinates 1, 2, 3, …, 100. If sample size is 20, population size is 100, and pixelsize is 100,
S-Draw will produce a simple random sample of size 20 in reverse hierarchical order. In this case,
reverse hierarchical ordering has no real effect other than to shuffle the already randomly
ordered sample. If sample size is 20, population size is 100, pixelsize is 1, and the randomize
box is unchecked, S-Draw will produce a fixed-size systematic sample assuming units are
ordered from 1 to 100.
S-Draw is capable of drawing randomized or un-randomized variable probability
systematic samples (VPS) (Brewer and Hanif 1982; Sunter 1986; Stehman and Overton 1994;
McDonald 1996). Un-randomized VPS samples are produced by specifying sample weights in
an input frame file, and unchecking the randomize box. Randomized VPS designs are produced
by specifying sample weights in an input frame file, leaving the randomize box checked, and
specifying a pixelsize that is larger than the maximum difference between unit coordinates,
which are assumed to be 1, 2, …, N in the 1-D case. These parameters by-pass the
quadrant-recursive mapping and the frame is completely randomized. Both types of samples may be
output in reverse hierarchical order, although doing so for the randomized VPS design is
redundant.
In August and September of 2003, the U.S. Fish and Wildlife Service funded an aerial survey of
golden eagles (Aquila chrysaetos) in the western half of the United States. Aerial survey transects
were 100 km in length, and oriented east-west. A dense grid of potential transect start points
was constructed by overlaying the study area (Figure 2) with points spaced 2 km apart
north-south and 100 km apart east-west. Portions of transects in this initial frame that covered
Department of Defense lands, Department of Energy lands, “no fly” National Parks, large urban
areas, large bodies of water, and lands > 10,000 feet in elevation were removed. The resulting
list of potential transects contained 27,058 starting points. The frame file containing these points
consisted of one line per point, and the following fields on each line (separated by spaces):
STARTPNT_X, STARTPNT_Y, and LINE_ID. A target of 17,500 km of total transect was
desired which, based on the average clipped transect length, resulted in a desired sample size of
208 transects. This sample size was doubled to allow for an adequate list of alternate transects
that could be flown if conditions did not permit sampling of the original. An equi-probable
2-dimensional GRTS sample of size 416 was taken using S-Draw, where the first 208 transects in
the reverse hierarchically ordered sample were considered the original sample and the second
208 transects were considered alternates. The original and alternate sample of transects is
plotted in Figure 2. For this sample draw, the parameters of S-Draw were set as follows:
2-dimensional sample was checked, coordinates in the frame was checked, ID’s in the frame was
checked, sample size was set to 416, population size was blank, pixelsize was 1, randomize was
checked, output in reverse hierarchical order was checked, and input frame listed the name of
the text file containing coordinates and ID’s of all 27,058 points. The final report on this golden
eagle survey is available on-line at http://mountain-prairie.fws.gov/species/birds/golden_eagle/.
4. DISCUSSION
S-Draw uses dynamic memory allocation for all arrays. Consequently, the number of records in
the sampling frame is only limited by a computer’s memory. I tested S-Draw on my laptop that
contains a Pentium 4 processor and 512 MB of RAM memory. I generated random point
locations in the bounding box of [0,100000] × [0,100000] and asked S-Draw to take an
equi-probable GRTS sample. S-Draw was able to draw a 2-dimensional GRTS sample of size 500
from a population of 100,000 units in 4.2 seconds. For this sample, I requested 28 levels of
quadrant-recursive mapping and reverse hierarchical ordering. Drawing a GRTS sample of size
500 from 1,000,000 units with 17 levels of recursive mapping took 44.3 seconds. During this
run I noticed significant hard disk activity, implying that S-Draw was using slow virtual RAM
memory on the hard drive (I had 4 other programs open at the time including Word, S-Plus, and
Eudora). I believe S-Draw would have performed better on a machine with more RAM memory.
From knowledge of the algorithms, plus these and other tests I have not mentioned, I believe the
order of the program to be approximately O(N). That is, I would expect the program to complete
in approximately 4.45e-5N seconds, where N is population size.
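As a quick check on the linear-time claim, the two timings reported above can be converted to a cost per population unit. The arithmetic below uses only the figures already given and ignores the differing numbers of recursive levels in the two runs.

```latex
\[
\frac{4.2\ \text{s}}{100{,}000\ \text{units}} = 4.2\times 10^{-5}\ \text{s/unit},
\qquad
\frac{44.3\ \text{s}}{1{,}000{,}000\ \text{units}} = 4.43\times 10^{-5}\ \text{s/unit}.
\]
```

A tenfold increase in N produced a roughly tenfold increase in run time, and both per-unit costs are close to the stated coefficient, which is consistent with approximately O(N) behavior.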
In the future, it would be convenient if S-Draw were able to directly read GIS export files
(i.e., .e00 files) containing point coverages as the input sample frame. Enhancing S-Draw to
read .e00 files, and take a GRTS sample of the points therein, should not be difficult and is
planned. If S-Draw were capable of reading .e00 files, another logical extension would be to allow
S-Draw to draw true point samples inside a region bounded by polygons. Utilizing publicly
available routines, it is theoretically possible for S-Draw to directly read ArcGIS binary files,
which would by-pass exporting the frame to a text file. In addition to enhancing the frame files
that S-Draw is capable of reading, I plan to produce a Windows dynamic link library (.dll) that
contains the core GRTS sampling routines. When this is accomplished, it will be possible to call
the S-Draw Fortran code from within an S-Plus or R function. When this enhancement is added,
users will no longer be required to export their sampling frame data from S-Plus and R to a text
file.
I will send anyone who contacts me a copy of S-Draw free of charge. The program is also
available on the West-Inc web site, www.west-inc.com. Send any suggestions or bug reports to
[email protected]. Although I have tested it thoroughly on multiple problems, I offer no
warranties or guarantees regarding S-Draw’s performance.
REFERENCES
Brewer, K. R. W. and Hanif, M. (1982), Sampling with unequal probabilities, New York:
Springer-Verlag.
McDonald, T. L. (1996), "Analysis of finite population surveys: sample size and testing
considerations," Unpublished dissertation, Oregon State University.
Stehman, S. V. and Overton, W. S. (1994), "Environmental sampling and monitoring," in
Handbook of Statistics, Patil, G. P. and Rao, C. R. (eds.).
Stevens, Don L. and Olsen, Anthony R. (1999), "Spatially restricted surveys over time for
aquatic resources," Journal of Agricultural, Biological, and Environmental Statistics, 4,
415-428.
----- (2003), "Variance estimation for spatially balanced samples of environmental resources,"
Environmetrics, 14, 593-610.
----- (2004), "Spatially balanced sampling of natural resources," Journal of the American
Statistical Association, 99, 262-278.
Sunter, A. (1986), "Solutions to the problem of unequal probability sampling without
replacement," International Statistical Review, 54, 33-50.
Table 1: Order of fields in the sample frame file for all possible combinations of
input parameters to S-Draw. Characters are allowed on each line following the fields
listed, but they are ignored. Fields are separated by "white space", which is any
non-numeric character for numbers (including spaces and commas, but not the
decimal point and negative sign), and a space for the alphanumeric ID field. These
formats only apply when a frame file is specified.
Sample          Data included in frame                  Order of fields
structure       Coordinates  Sample weights  ID's       in frame file
1-D             Yes          Yes             Yes        x wgt id
1-D             Yes          Yes             No         x wgt
1-D             Yes          No              Yes        x id
1-D             Yes          No              No         x
1-D             No           Yes             Yes        wgt id
1-D             No           Yes             No         wgt
1-D             No           No              Yes        id
1-D             No           No              No         [# lines counted]
2-D             Yes          Yes             Yes        x y wgt id
2-D             Yes          Yes             No         x y wgt
2-D             Yes          No              Yes        x y id
2-D             Yes          No              No         x y
Pre-defined*                 Yes             Yes        k1 … kK wgt id
Pre-defined*                 Yes             No         k1 … kK wgt
Pre-defined*                 No              Yes        k1 … kK id
Pre-defined*                 No              No         k1 … kK
* K specified on the first line of the frame file
Figure 1: The user interface for S-Draw showing the input parameters.
Figure 2: A map of the GRTS sample locations for 100 km aerial transects flown during the
U.S. Fish and Wildlife Service’s golden eagle survey in August and September 2003. A grid of 27,058
potential starting points with spacing of 2 km north-south and 100 km east-west was constructed
and input into S-Draw. Twice the necessary number of transects was selected so that an
adequate list of alternate transects was obtained.