MATLAB Scripts to Partition Multivariate Sedimentary
Geochemical Data Sets
Nicklas G. Pisias
College of Oceanic and Atmospheric Sciences
Oregon State University
Corvallis OR 97331-5503
[email protected]
Richard W. Murray
Department of Earth Sciences
Boston University
Boston MA 02215
[email protected]
Introduction
Deep-sea sedimentation is important to
the geochemical cycles of many chemical
species. For example, the widespread
accumulation of siliceous and calcareous
biogenic marine sediments plays an
important part in the biogeochemical
cycling of Si, Ca, and C (e.g., Burdige,
2006).
Hydrothermal solutions and
sediments from active spreading centers
play an important role in the cycling of
Mn, Fe and other metals (e.g., Dymond,
1981; Dekov et al., 2010). Dust input
delivers key nutrients and micronutrients
such as Fe to surface waters of nutrient-depleted regimes (e.g., Boyd et al., 2010).
Thus, it is important to understand the
fluxes of elements in deep-sea sediment in
order to define their geochemical cycles
and to evaluate changes in these cycles
through time (Leinen and Pisias, 1984).
The introduction to sedimentary
chemistry of so-called “rapid” analytical
techniques for the analysis of major, trace,
and rare earth elements gained momentum
in the late 1960s and has continued to the
present day.
Early applications of
instrumental neutron activation analysis
(INAA) and X-ray fluorescence (XRF),
followed by flame and graphite furnace
atomic absorption (AA), have led to the
modern and widespread use of inductively
coupled plasma emission spectrometry and
mass spectrometry (ICP-ES and ICP-MS)
techniques. Whereas publications forty
years ago at best focused on a few
chemical elements, it is not unusual for
research contributions in the new
millennium to include data on 20 or more
elements, including full suites of rare earth
elements, and often complemented by an
array of radiogenic- or stable isotopes.
These analytical advances in
sedimentary chemistry have paralleled
those in other geochemical fields, notably
igneous petrochemistry. The development
of large data sets in the igneous
community led to the pathfinding
relational database, PetDB, which is
widely used by many researchers. Other
relational databases for the igneous
community (NAVDAT, etc.) are also very
powerful and allow easy compilation and
comparison of data between heretofore
disparate publications.
Building on these successes, and
Pisias and Murray,2010: MATLAB Scripts to Partition Multivariate Sedimentary Geochemical Data Sets
responding to the increased availability of
large datasets generated by modern
instrumentation, the SedDB database is
rapidly becoming populated with data
from key publications in all major ocean
basins of the world. The breadth of this
growing database highlights the need for a
consistent means by which multivariate
statistical treatments can be made available
to the community of researchers using
SedDB as well as those who may find such
approaches useful for their own standalone
projects. To this end, we here provide
three detailed MATLAB scripts that
address Q-mode factor analysis, linear
programming, and inverse modeling. As
cited below, these approaches have been
used successfully for the past 20-30 years.
Background on Multivariate Statistical Treatments
As noted by Leinen and Pisias (1984),
there are three steps to evaluating the role
of deep-sea sedimentation in geochemical
cycles within a given sample array: 1)
determination of the number of different
components that are responsible for the
data set being studied; 2) identification of
the composition of these different sources
(such as hydrothermal and biogenic
material); and 3) quantification of the
abundance of each of these components
in each sample of the data set being
studied.
These steps collectively are
referred to as the “partitioning problem”
(Leinen and Pisias, 1984).
Multivariate statistical or multivariate
modeling methods look at the "structure"
of multivariate data sets, such as: How are
different variables correlated? And: How
are samples from different locations and
times related?
By making these
determinations of correlations and
relationships, we can then learn about the
processes that produce the data set.
In this SedDB contribution we provide
three MATLAB scripts that can be used to
partition a multivariate geochemical data
set. Note that these methods have been
applied to other multivariate data sets
including microfossil data sets of species
relative abundances where the number of
species may range from 10 to 100 (e.g.,
Imbrie and Kipp, 1973; Pisias et al., 1997).
Q-Mode Factor Analysis
The first script performs what is
known as Q-mode factor analysis (e.g.,
Imbrie and van Andel, 1964; Imbrie and
Kipp, 1973; Klovan and Imbrie, 1971;
Leinen and Pisias, 1984). This script
includes the extensions of Q-mode factor
analysis of Klovan and Miesch (1976) and
Klovan (1981).
Q-mode factor analysis is used to first
simplify a multivariate data set by
describing it with a smaller set of
components (sometimes called end-members, which are artificial variables)
that, when identified and mapped, can be
used to look at processes that ultimately
produce the observed data. Q-mode factor
analysis helps address the first step in the
partitioning problem by providing an
estimate of the number of different
components that are contained in the
observed data set.
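The way Q-mode factor analysis reveals the number of components can be sketched with a small synthetic example. The block below is a Python/NumPy illustration, not the MATLAB script described in this contribution: the end-member compositions `D`, the abundances `A`, and the tolerance used to decide that an eigenvalue is zero are all hypothetical values chosen for illustration. It builds noise-free mixtures of three end-members, scales each sample to unit length, and counts the non-zero eigenvalues of the resulting cosine-theta similarity matrix.

```python
import numpy as np

# Hypothetical end-member compositions (rows) for five elements; the values
# are illustrative only and are not taken from the paper.
D = np.array([
    [50.0,  5.0,  1.0,  2.0, 10.0],
    [ 2.0, 40.0,  8.0,  1.0,  5.0],
    [ 1.0,  3.0, 30.0, 20.0,  2.0],
])

# Hypothetical abundances of the three end-members in six samples.
A = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
])

X = A @ D   # noise-free samples-by-elements data matrix

# Scale every sample (row) to unit length, so that W @ W.T holds the cosine
# of the angle between each pair of samples (cosine-theta similarity).
W = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_theta = W @ W.T

# Eigenvalues of the similarity matrix, largest first.
eigvals = np.linalg.eigvalsh(cos_theta)[::-1]

# Eigenvalues that are not numerically zero count the independent components.
n_components = int(np.sum(eigvals > 1e-10 * eigvals[0]))

# Cumulative percentage of the total sum of squares explained by each factor.
explained = 100.0 * np.cumsum(eigvals) / eigvals.sum()

print(n_components)
print(np.round(explained, 2))
```

With real, noisy data the trailing eigenvalues are small rather than exactly zero, so the number of components is judged from the cumulative percentage rather than from a strict zero test.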
With the extensions to Q-mode factor
analysis of Klovan and Miesch (1976)
and Klovan (1981), this technique further
provides estimates for the second and
third step of the partitioning problem,
namely, estimation of the composition of
the end-members and of the abundance of
the end-members in each sample.
However, because of the constraint in Q-mode
factor analysis that the end-members are algebraically orthogonal
(their vector dot products are zero), the
compositions of Q-mode end-members
commonly contain negative values. This is not acceptable for
geochemical studies.
A number of
studies have provided strategies to adjust
Q-mode end-members to address the
problem posed by the orthogonality
constraint (e.g., Leinen and Pisias, 1984;
Full et al., 1981), and the MATLAB
script provided here also includes this in
its calculations.
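Why orthogonality forces negative values can be seen in a few lines. The sketch below is a Python/NumPy illustration under stated assumptions: the data matrix `X` is hypothetical, and a plain singular value decomposition stands in for the factor calculation of the Q-mode script. For strictly positive data the leading orthogonal vector is strictly positive, so every other vector orthogonal to it must mix positive and negative entries.

```python
import numpy as np

# Hypothetical, strictly positive data: 4 samples by 3 elements.
X = np.array([
    [40.0, 10.0,  5.0],
    [10.0, 35.0,  8.0],
    [ 6.0, 12.0, 30.0],
    [20.0, 20.0, 15.0],
])

# Rows of Vt are mutually orthogonal unit vectors in element space; they play
# the role of orthogonal factor compositions in this illustration.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

# The sign of a singular vector is arbitrary, so orient the first one upward.
first = Vt[0] if Vt[0].sum() > 0 else -Vt[0]
second = Vt[1]

print(np.round(first, 3))    # all entries positive
print(np.round(second, 3))   # entries of both signs
print(abs(float(first @ second)) < 1e-12)
```

A physically meaningful end-member cannot contain a negative concentration, which is why the orthogonal factors have to be adjusted before they are interpreted as compositions.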
Constrained Least Squares
The second script partitions samples
in terms of a specified set of end-member
components. Unlike the Q-mode factor
analysis, this script requires the user to
specify the number of components
contained in each sample and the
composition of each of these end-members. In practice, the information
gained from the Q-mode factor analysis
(e.g., the number of
end-members that can be used to describe
the data set, and the approximate
composition of these end members) can
be used as a starting point for the
constrained least squares (CLS)
statistical treatment.
This MATLAB script is based on the
approach used by Dymond (1981).
Dymond solved the partitioning problem
using linear programming techniques
(e.g., Hadley, 1962) to estimate the
abundance of five end-members in
surface sediments from the Nazca Plate
in the Southeast Pacific. Because all
equations were linear, the model fit was
based on minimizing the sum of the
absolute values of the model residuals.
The MATLAB script presented here utilizes
a constrained least squares linear model
whereby the contribution from each end-member is constrained to be greater than
or equal to zero while minimizing the
sum of squares of the residuals. Much of
this CLS theory can be found in Rencher
(2002) and Menke (1989). Where the Q-mode factor analysis programs give
estimates of end member compositions,
this script requires that the end members
be specified. Such specification can be
determined by the user from data
gathered from the literature, and/or from
the results of the Q-mode factor analysis.
Given the composition of the end
members, constrained least squares
techniques are used to calculate the
abundance (percentage) of each end
member in each sample. This partitioning
can be described as a set of linear
equations of the form:
$$\sum_{j=1}^{p} a(i,j)\, d(j,k) = e(i,k), \qquad i = 1, 2, \ldots, N; \quad k = 1, 2, \ldots, m \tag{1}$$
where a(i,j) represents the contribution of
the j-th end member in the i-th sample,
d(j,k) is the concentration of the k-th
element in the j-th end member and e(i,k)
is the concentration of the k-th element in
the i-th sample. N is the number of
samples, m the number of elements, and
p the number of end members. In
general, the number of elements analyzed
in each sample is greater than the number
of end members and thus the set of
equations in (1) is overdetermined.
Because we wish to solve these equations
with the constraint that the contribution
of each end member in each sample is
greater than or equal to zero, the
equations are solved using a constrained
least squares approach. The equations
are solved so that:
$$\sum_{k=1}^{m} \left( e(i,k) - e'(i,k) \right)^2 \tag{2}$$
is minimized, where e'(i,k) are the
estimated elemental compositions from
the model. Traditional linear regression
techniques, such as those commonly
available in commercial
software packages, also minimize the sum of
squares of the residuals but cannot
be used here because they do not enforce the
non-negativity constraint on the a's. Note that these
equations are solved for each of the N
samples in the data set.
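The per-sample minimization of (2) under the constraint a(i,j) >= 0 can be sketched as follows. This is a Python/NumPy illustration and not the MATLAB script: the function name `nonneg_least_squares`, the end-member matrix `D`, and the sample values are hypothetical. Instead of an active-set solver, the sketch fits every subset of end-members by ordinary least squares and keeps the best fit with no negative coefficient, which gives the constrained optimum when the end-members are linearly independent but is only practical for a handful of end-members.

```python
import itertools
import numpy as np

def nonneg_least_squares(D, e):
    """Minimize sum_k (e[k] - (a @ D)[k])**2 subject to a >= 0 for one sample.

    D is p-by-m (end-members by elements) and e has length m. Returns the
    abundances a and the residual sum of squares.
    """
    p = D.shape[0]
    best_a = np.zeros(p)
    best_ss = float(e @ e)                  # fit that uses no end-member at all
    for size in range(1, p + 1):
        for subset in itertools.combinations(range(p), size):
            idx = list(subset)
            coef, *_ = np.linalg.lstsq(D[idx].T, e, rcond=None)
            if np.any(coef < 0):
                continue                    # violates the constraint, skip
            resid = e - coef @ D[idx]
            ss = float(resid @ resid)
            if ss < best_ss:
                best_ss = ss
                best_a = np.zeros(p)
                best_a[idx] = coef
    return best_a, best_ss

# Hypothetical end-member compositions d(j,k): 3 end-members, 4 elements.
D = np.array([
    [50.0,  5.0,  1.0,  2.0],
    [ 2.0, 40.0,  8.0,  1.0],
    [ 1.0,  3.0, 30.0, 20.0],
])

a_true = np.array([0.7, 0.3, 0.0])
e_clean = a_true @ D                                   # exact mixture
e_noisy = e_clean + np.array([0.4, -0.3, -1.5, -1.0])  # hypothetical noise

a_clean, ss_clean = nonneg_least_squares(D, e_clean)
a_noisy, ss_noisy = nonneg_least_squares(D, e_noisy)

# Unconstrained fit of the same noisy sample, for comparison only.
a_free, *_ = np.linalg.lstsq(D.T, e_noisy, rcond=None)
ss_free = float(np.sum((e_noisy - a_free @ D) ** 2))

print(np.round(a_clean, 4))
print(np.round(a_noisy, 4), np.round(a_free, 4))
```

For routine use a dedicated solver is preferable: the CLS script relies on the MATLAB function `lsqnonneg`, and `scipy.optimize.nnls` solves the same problem in Python.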
Total Inversion (TI)
This final script is based on the
program of Zhou and Kyte (1992) and
Kyte et al. (1993) and uses total inversion
to partition a multivariate data set into a
number of end member components.
Unlike the previous CLS script, which
assumes that the composition of end-members is fixed, the TI script partitions
the data set while allowing the
composition of each end-member to vary
slightly to maximize the partitioning fit
of any one sample. Such variation in end
member concentrations is more realistic
from several perspectives. First, the
actual end member may differ slightly
from published values (e.g., from a
nearby volcano), and, second, there may
be slight variation(s) in composition with
time. From the complete partitioning
run, the program calculates the same set
of fit statistics as CLS as well as the
mean and standard deviation of the end-member compositions.
The script is based on FORTRAN
code provided by Frank Kyte and his
then graduate student Lei Zhou. It solves
the partitioning problem using "total
nonlinear inversion techniques" outlined
by Tarantola and Valette (1982).
The basic equations used are:
A sample is taken as an m x 1 matrix
T[m x 1] where m is number of elements
measured. The partitioning problem then
equates this matrix as a product of the
end-member composition matrix C times
the contribution of each end-member in
the sample represented by matrix E.
Thus:
T[m x 1] = C[m x n] * E[n x 1]
where n is the number of end-members,
C[i,j] is the concentration of the i-th
element in the j-th end-member and E[j]
is the relative fraction of the j-th end-member in the sample.
We define the function f(x) = T - C * E, where x is the vector
$$x = [T_1, \ldots, T_m, E_1, \ldots, E_n, C_{11}, C_{12}, \ldots, C_{mn}]$$
and we wish to minimize the weighted
sum of squares:
$$s(x) = (x - x_o)^{t}\, C_o^{-1}\, (x - x_o)$$
where t denotes transpose, Co is the
covariance matrix of all parameters and
data in x (usually assumed to be
diagonal), and xo is the a priori (initial
guess) vector. The solution to this
equation is given by:
$$x = x_o + C_o F^{t} \left( F C_o F^{t} \right)^{-1} \left\{ F (x - x_o) - f(x) \right\}$$
where the matrix F contains the partial
derivatives of f(x), such that F[i,j] is the
partial derivative of f[i] with respect to x[j].
In the partitioning case, f[i] is the
equation for the i-th element or row in the
matrix T - C * E. The j-th term refers to
the j-th term in the vector x = [T1, ..., Tm,
E1, ..., En, C11, C12, ..., Cmn].
Description and Use of the Scripts
Q-mode Script
The script contains a main routine
that reads in the multivariate data set as
well as other information (variable names,
etc.) and seven function routines to
complete the calculations (Appendix 1).
To save memory, the program uses two
different algorithms to calculate the Q-mode factors. If the number of samples
is less than the number of variables, it
calculates the factor analysis matrices by
their definitions (Klovan and Imbrie,
1971). If the number of samples is
greater than the number of variables, then
the routine uses the “CABFAC” (Calgary
- Brown Factor Analysis) routine of
Klovan and Imbrie (1971). This keeps
the amount of computer memory needed
for the calculations to a minimum.
The main script is named
“qmodemain”. After starting the script,
the program requests:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix
(column one is the sample label, followed
by one column for each variable). The
program determines the size of the matrix
after reading the data.
4) Filename for file containing variable
labels. The labels can be text, one
label per line in the file.
5) Alpha level: the fraction of the
total data variance (in percent) to
be explained by the retained
eigenvectors.
6) Transformation desired: 0 = none; 1 = constant mean; 2 = percent max; 3 = log(x+1).
7) Number of factors to keep in
VARIMAX rotation.
The program determines the number
of variables and samples from the size of
the input data matrix. The program
assumes that the first column in the data
file is a numeric sample identifier. All
entries in the data input file must be
numeric.
In the variable label file, put each
variable label on a separate line of the file.
An example output is given in
Appendix 2.
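The memory-saving choice between the two algorithms and the role of the alpha level can be sketched together. The block below is a Python/NumPy illustration, not the Q-mode script: the function name `factors_for_alpha` and the random test matrix are hypothetical, and the assumption that the smaller cross-product matrix is the one decomposed in each case is an interpretation of the description above rather than a reading of the MATLAB code. It relies on the fact that W W^t (samples by samples) and W^t W (variables by variables) share the same non-zero eigenvalues.

```python
import numpy as np

def factors_for_alpha(W, alpha):
    """Return (k, eigenvalues), where k is the smallest number of eigenvectors
    whose cumulative share of the total sum of squares reaches alpha percent.

    W is the row-normalized samples-by-variables matrix.
    """
    n_samples, n_variables = W.shape
    if n_samples < n_variables:
        product = W @ W.T      # N-by-N, smaller when samples are few
    else:
        product = W.T @ W      # m-by-m, smaller when samples are many
    eigvals = np.linalg.eigvalsh(product)[::-1]    # largest first
    eigvals = np.clip(eigvals, 0.0, None)          # drop round-off negatives
    cumulative = 100.0 * np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, alpha)) + 1
    return min(k, eigvals.size), eigvals

# Hypothetical data: 8 samples, 4 variables, all positive.
rng = np.random.default_rng(0)
X = rng.random((8, 4)) + 0.1
W = X / np.linalg.norm(X, axis=1, keepdims=True)

k, eigvals = factors_for_alpha(W, alpha=95.0)
print(k, np.round(eigvals, 4))
```

In this example the decomposition works on a 4-by-4 matrix instead of an 8-by-8 one; for data sets with thousands of samples and a few tens of variables the saving is substantial.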
Constrained Least Squares Script
The script contains a main routine
that reads in the multivariate data set, the
composition of the end members, and
other information (variable names,
etc.) and two function routines to
complete the calculations (Appendix 3).
The constrained least squares calculation
is completed using the MATLAB
function lsqnonneg.
The main script is named
solvel2main. After starting the script, the
program asks for:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix
(column one is sample label followed
by column for each variable). The
program determines the size of matrix
after reading data.
4) Filename for file containing variable
labels. The labels can be text, one
label per line in file.
5) Filename for file that contains the
end-member compositions (D matrix).
The program determines the number
of variables and samples from the size of
the input data matrix. The program
assumes that the first column in the data
file is a numeric sample identifier. All
entries in the data input file must be
numeric.
In the variable label file, put each variable
label on a separate line of the file.
The end-member compositions are
entered into a data file one matrix
element per line.
Start with the
concentration of the first variable in the
first end-member, the concentration of
the first variable in end-member 2 etc. to
the concentration of the first variable in
the last end-member. Then enter the
concentration of the second variable in
end-member 1, concentration of second
variable in end-member 2, etc.
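The file ordering described above can be checked with a short sketch. This is a Python/NumPy illustration with hypothetical values, using an in-memory string in place of the end-member file; it shows how a one-value-per-line listing (variable 1 in every end-member, then variable 2 in every end-member, and so on) maps onto the matrix d(j,k) of equation (1).

```python
import io
import numpy as np

n_variables = 3   # m, known from the width of the data matrix

# Stand-in for the end-member file: one matrix element per line, listing
# variable 1 in end-members 1..p, then variable 2 in end-members 1..p, etc.
file_text = "50\n2\n5\n40\n1\n8\n"

values = np.loadtxt(io.StringIO(file_text))
n_end_members = values.size // n_variables        # p

# A row-major reshape puts variables down the rows and end-members across
# the columns, matching the order in which the values were listed.
by_variable = values.reshape(n_variables, n_end_members)

# Transpose to obtain d(j,k): row j is an end-member, column k a variable.
D = by_variable.T
print(D)
```

Here end-member 1 has the composition (50, 5, 1) and end-member 2 the composition (2, 40, 8). Listing the values end-member by end-member instead would silently scramble the compositions, so the order is worth verifying before a run.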
Example results are given in
Appendix 4.
Total Inversion (TI) Script
The script contains a main routine
that reads in the multivariate data set, the
composition of the end members, and
other information (variable names,
etc.) and two function routines to
complete the calculations (Appendix 5).
The main script is named
totalinvmain. After starting the script, the
program asks for:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix
(column one is sample label followed
by column for each variable). The
program determines the size of matrix
after reading data.
4) Filename for file containing variable
labels. The labels can be text, one
label per line in file.
5) Filename for file that contains the
end-member compositions (D matrix).
6) Filename with the data variance
estimates (diagonal of Co matrix).
The script determines the number of
variables and samples from the size of the
input data matrix. The script assumes
that the first column in the data file is a
numeric sample identifier. All entries in
the data input file must be numeric.
In the variable label file, put each variable
label on a separate line of the file.
The end-member compositions are
entered into a data file one matrix
element per line.
Start with the
concentration of the first variable in the
first end-member, the concentration of
the first variable in end-member 2 etc. to
the concentration of the first variable in
the last end-member. Then enter the
concentration of the second variable in
end-member 1, concentration of second
variable in end-member 2, etc.
The variance vector Co starts with the
variances of the variables, then the variances
of the estimated abundances of each end-member in the samples, then the variances
of the variable concentrations in the end-members, that is, in the same order as the
model parameter vector:
x = [T1, ..., Tm, E1, ..., En, C11, C12, ..., Cmn].
Experimentation is needed to assess
the impact of these estimates on the
model results.
Example results are given in
Appendix 6.
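The total inversion update and the ordering of the variance vector can be sketched for a single sample. The block below is a Python/NumPy illustration and not the totalinvmain script: the function names, the prior end-member matrix `C_prior`, the measured sample `T_obs`, the assumed error levels, and the fixed count of 20 iterations are all hypothetical. Because x appears on both sides of the solution formula, the sketch applies the formula repeatedly, starting from the a priori vector xo.

```python
import numpy as np

def unpack(x, m, n):
    """Split x = [T1..Tm, E1..En, C11, C12, ..., Cmn] into T, E and C."""
    T = x[:m]
    E = x[m:m + n]
    C = x[m + n:].reshape(m, n)     # row-major: C11, C12, ..., C1n, C21, ...
    return T, E, C

def residual(x, m, n):
    """f(x) = T - C * E."""
    T, E, C = unpack(x, m, n)
    return T - C @ E

def jacobian(x, m, n):
    """F[i, j] = partial derivative of f[i] with respect to x[j]."""
    _, E, C = unpack(x, m, n)
    dT = np.eye(m)                                  # d f_i / d T_i = 1
    dE = -C                                         # d f_i / d E_j = -C_ij
    dC = -np.kron(np.eye(m), E.reshape(1, n))       # d f_i / d C_ij = -E_j
    return np.hstack([dT, dE, dC])

def total_inversion_step(x, x0, c0_diag, m, n):
    """One update x_new = xo + Co Ft (F Co Ft)^-1 {F (x - xo) - f(x)}."""
    F = jacobian(x, m, n)
    CoFt = c0_diag[:, None] * F.T                   # Co @ F.T for diagonal Co
    rhs = F @ (x - x0) - residual(x, m, n)
    return x0 + CoFt @ np.linalg.solve(F @ CoFt, rhs)

m, n = 4, 2   # four elements, two end-members
C_prior = np.array([[50.0, 2.0], [5.0, 40.0], [1.0, 8.0], [2.0, 1.0]])
E_prior = np.array([0.5, 0.5])
T_obs = np.array([36.0, 15.2, 3.2, 1.6])            # hypothetical sample

x0 = np.concatenate([T_obs, E_prior, C_prior.ravel()])

# Diagonal of Co in the same order as x: data, abundances, end-member entries.
c0_diag = np.concatenate([
    (0.02 * T_obs) ** 2,               # assumed 2 percent analytical error
    np.full(n, 1.0),                   # abundances only loosely constrained
    (0.05 * C_prior.ravel()) ** 2,     # end-members allowed to vary by ~5 percent
])

x = x0.copy()
for _ in range(20):
    x = total_inversion_step(x, x0, c0_diag, m, n)

T_fit, E_fit, C_fit = unpack(x, m, n)
print(np.round(E_fit, 3))
print(np.linalg.norm(residual(x, m, n)))
```

Changing the entries of `c0_diag` shifts the adjustment between the data, the abundances, and the end-member compositions, which is the experimentation with the variance estimates recommended above.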
Acknowledgements
We thank NSF for supporting the SedDB
project (OCE04-53958 and OCE082619), and our colleagues Kirsten
Lehnert, Steve Goldstein, and Annika
Johansson, all of Lamont-Doherty Earth
Observatory, for their collaboration
during the development of the database.
We thank R. Scudder and A. Dunlea for
their comments on the manuscript.
RWM thanks his recent graduate students
(C. Ziegler, N. Martinez, and R. Scudder)
for their contributions and research that
helped usher in use of these MATLAB
scripts to his research group.
References
Boyd, P. W., Mackie, D.S., and Hunter,
K.A., 2010. Aerosol iron deposition
to the surface ocean--Modes of iron
supply and biological responses.
Marine Chemistry, 120, 128-143.
Burdige, D. J., 2006. Geochemistry of
Marine Sediments. Princeton
University Press. 609 pp.
Dekov V. M., Cuadros J., Kamenov G.
D., Weiss D., Arnold T., Basak C.,
and Rochette P., 2010. Metalliferous
sediments from the H. M. S.
Challenger voyage (1872-1876).
Geochimica et Cosmochimica Acta,
74, 5019-5038.
Dymond, J., The geochemistry of Nazca
Plate surface sediments: An
evaluation of hydrothermal, biogenic,
detrital and hydrogenous sources,
Geol. Soc. Am. Mem., 154, 133-174,
1981.
Full et al., Extended Qmodel – Objective
definition of external end members in
the analysis of mixtures,
Mathematical Geology, 13(4):331-344, 1981.
Imbrie, J. and van Andel, T., Vector
analysis of heavy-mineral data, Geol.
Soc. of Am. Bull., 75(11):1131-1156,
1964.
Imbrie, J. and Kipp, A new
micropaleontological method for
quantitative paleoclimatology:
Application to a Late Quaternary
Caribbean core,
Klovan, J. E., A generalization of
extended Q-Mode factor analysis to
data matrices with variable row sums,
Mathematical Geology, 13(3):217-224, 1981.
Klovan, J. E. and Imbrie, J., An
algorithm and FORTRAN-IV
program for large-scale Q-Mode
factor analysis and calculation of
factor scores, Mathematical Geology,
3(1):61-77, 1971.
Klovan, J. E. and Miesch, A. T., Extended
CABFAC and QMODEL
computer programs for Q-mode
factor analysis of compositional
data, Computers & Geosciences,
1(3):161-178, 1976.
Kyte, F. T., M. Leinen, G. R. Heath, and
L. Zhou, Cenozoic sedimentation
history of the central North Pacific:
Inferences from the elemental
geochemistry of Core LL44-GPC3,
Geochim. Cosmochim. Acta, 57,
1719-1740, 1993.
Leinen, M. and Pisias, N. G., An objective
technique for determining end-member compositions and for
partitioning sediments according to
their sources, Geochimica et
Cosmochimica Acta, 48:47-62, 1984.
Miesch, A. T., Q-mode factor analysis of
geochemical and petrologic data
matrices with constant row-sums,
USGS Professional Paper 574-G,
1976.
Tarantola, A. and Valette, B.,
Generalized nonlinear inverse
problem solved using the least
squares criterion, Rev. Geophys.
Space Phys., 20:219-232, 1982.