MATLAB Scripts to Partition Multivariate Sedimentary Geochemical Data Sets

Nicklas G. Pisias
College of Oceanic and Atmospheric Sciences, Oregon State University, Corvallis OR 97331-5503

Richard W. Murray
Department of Earth Sciences, Boston University, Boston MA 02215

Introduction

Deep-sea sedimentation is important to the geochemical cycles of many chemical species. For example, the widespread accumulation of siliceous and calcareous biogenic marine sediments plays an important part in the biogeochemical cycling of Si, Ca, and C (e.g., Burdige, 2006). Hydrothermal solutions and sediments from active spreading centers play an important role in the cycling of Mn, Fe, and other metals (e.g., Dymond, 1981; Dekov et al., 2010). Dust input delivers key nutrients and micronutrients such as Fe to surface waters of nutrient-depleted regimes (e.g., Boyd et al., 2010). Thus, it is important to understand the fluxes of elements in deep-sea sediment in order to define their geochemical cycles and to evaluate changes in these cycles through time (Leinen and Pisias, 1984).

The introduction to sedimentary chemistry of so-called “rapid” analytical techniques for the analysis of major, trace, and rare earth elements gained momentum in the late 1960s and has continued to the present day. Early applications of instrumental neutron activation analysis (INAA) and x-ray fluorescence (XRF), followed by flame and graphite furnace atomic absorption (AA), have led to the modern and widespread use of inductively coupled plasma emission spectrometry and mass spectrometry (ICP-ES and ICP-MS) techniques. Whereas publications forty years ago at best focused on a few chemical elements, it is not unusual for research contributions in the new millennium to include data on 20 or more elements, including full suites of rare earth elements, often complemented by an array of radiogenic or stable isotopes.
These analytical advances in sedimentary chemistry have paralleled those in other geochemical fields, notably igneous petrochemistry. The development of large data sets in the igneous community led to the pathfinding relational database, PetDB, which is widely used by many researchers. Other relational databases for the igneous community (NAVDAT, etc.) are also very powerful and allow easy compilation and comparison of data between heretofore disparate publications. Building on these successes, and responding to the increased availability of large data sets generated by modern instrumentation, the SedDB database is rapidly becoming populated with data from key publications in all major ocean basins of the world. The breadth of this growing database highlights the need for a consistent means by which multivariate statistical treatments can be made available to the community of researchers using SedDB as well as those who may find such approaches useful for their own standalone projects. To this end, we here provide three detailed MATLAB scripts that address Q-mode factor analysis, linear programming, and inverse modeling. As cited below, these approaches have been used successfully for the past 20-30 years.

Pisias and Murray, 2010: MATLAB Scripts to Partition Multivariate Sedimentary Geochemical Data Sets

Background on Multivariate Statistical Treatments

As noted by Leinen and Pisias (1984), there are three steps to evaluating the role of deep-sea sedimentation in geochemical cycles within a given sample array:
1) Determination of the number of different components that are responsible for the data set being studied;
2) Identification of the composition of these different sources (such as hydrothermal and biogenic material); and
3) Quantification of the abundances of each of these components in each sample of the data set being studied.
These steps collectively are referred to as the “partitioning problem” (Leinen and Pisias, 1984).
Multivariate statistical or multivariate modeling methods look at the "structure" of multivariate data sets, asking questions such as: How are different variables correlated? How are samples from different locations and times related? By determining these correlations and relationships, we can learn about the processes that produce the data set. In this SedDB contribution we provide three MATLAB scripts that can be used to partition a multivariate geochemical data set. Note that these methods have also been applied to other multivariate data sets, including microfossil data sets of species relative abundances where the number of species may range from 10 to 100 (e.g., Imbrie and Kipp, 1973; Pisias et al., 1997).

Q-Mode Factor Analysis

The first script performs what is known as Q-mode factor analysis (e.g., Imbrie and van Andel, 1964; Imbrie and Kipp, 1973; Klovan and Imbrie, 1971; Leinen and Pisias, 1984). This script includes the extensions of Q-mode factor analysis of Klovan and Miesch (1976) and Klovan (1981). Q-mode factor analysis is used first to simplify a multivariate data set by describing it with a smaller set of components (sometimes called end-members, which are artificial variables) that, when identified and mapped, can be used to look at the processes that ultimately produce the observed data. Q-mode factor analysis helps address the first step in the partitioning problem by providing an estimate of the number of different components that are contained in the observed data set. With the extensions to Q-mode factor analysis of Klovan and Miesch (1976) and Klovan (1981), this technique further provides estimates for the second and third steps of the partitioning problem, namely, estimation of the composition of the end-members and of the abundance of the end-members in each sample.
However, because of the constraint in Q-mode factor analysis that the end-members are algebraically orthogonal (their vector dot products are zero), the compositions of Q-mode end-members commonly contain negative values. This is not acceptable for geochemical studies. A number of studies have provided strategies to adjust Q-mode end-members to address the problem posed by the orthogonality constraint (e.g., Leinen and Pisias, 1984; Full et al., 1981), and the MATLAB script provided here also includes this adjustment in its calculations.

Constrained Least Squares

The second script partitions samples in terms of a specified set of end-member components. Unlike the Q-mode factor analysis, this script requires the user to specify the number of components contained in each sample and the composition of each of these end-members. In practice, the information gained from the Q-mode factor analysis (e.g., the number of end-members that can be used to describe the data set, and the approximate composition of these end-members) can be used as a starting point for the constrained least squares (CLS) statistical treatment.

This MATLAB script is based on the approach used by Dymond (1981). Dymond solved the partitioning problem using linear programming techniques (e.g., Hadley, 1962) to estimate the abundance of five end-members in surface sediments from the Nazca Plate in the Southeast Pacific. Because all equations were linear, the model fit was based on minimizing the sum of the absolute values of the model residuals. The MATLAB script presented here utilizes a constrained least squares linear model whereby the contribution from each end-member is constrained to be greater than or equal to zero while minimizing the sum of squares of the residuals. Much of this CLS theory can be found in Rencher (2002) and Menke (1989). Where the Q-mode factor analysis programs give estimates of end-member compositions, this script requires that the end-members be specified. Such specification can be determined by the user from data gathered from the literature, and/or from the results of the Q-mode factor analysis.

Given the composition of the end-members, constrained least squares techniques are used to calculate the abundance (percentage) of each end-member in each sample. This partitioning can be described as a set of linear equations of the form:

$$\sum_{j=1}^{p} a(i,j)\, d(j,k) = e(i,k), \quad i = 1, 2, \ldots, N, \quad k = 1, 2, \ldots, m \qquad (1)$$

where a(i,j) represents the contribution of the j-th end-member in the i-th sample, d(j,k) is the concentration of the k-th element in the j-th end-member, and e(i,k) is the concentration of the k-th element in the i-th sample. N is the number of samples, m the number of elements, and p the number of end-members. In general, the number of elements analyzed in each sample is greater than the number of end-members, so for each sample the set of equations in (1) has more equations than unknowns (it is overdetermined) and generally cannot be satisfied exactly. Because we wish to solve these equations with the constraint that the contribution of each end-member in each sample is greater than or equal to zero, the equations are solved using a constrained least squares approach. The equations are solved so that:

$$\sum_{k=1}^{m} \left( e(i,k) - e'(i,k) \right)^{2} \qquad (2)$$

is minimized, where e'(i,k) are the estimated elemental compositions from the model.
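The minimization of (2) subject to a(i,j) >= 0 can be illustrated for a single sample with a short sketch. This is not the authors' script: it is written in Python rather than MATLAB, it uses cyclic coordinate descent as a stand-in solver, and the function name `nnls_coordinate_descent` and all numerical values are hypothetical.

```python
def nnls_coordinate_descent(D, e, n_sweeps=500, tol=1e-12):
    """Minimize sum_k (e[k] - sum_j a[j] * D[j][k])**2 subject to a[j] >= 0.

    D[j][k]: concentration of element k in end-member j (p rows, m columns).
    e[k]:    concentration of element k in one sample (length m).
    Returns the abundances a and the sum of squared residuals.
    """
    p, m = len(D), len(e)
    a = [0.0] * p
    residual = list(e)  # sample minus model; the model is zero while a = 0
    norms = [sum(v * v for v in row) for row in D]
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(p):
            if norms[j] == 0.0:
                continue
            # Best value of a[j] with the other abundances held fixed,
            # clipped at zero to honor the non-negativity constraint.
            step = sum(D[j][k] * residual[k] for k in range(m)) / norms[j]
            new_aj = max(0.0, a[j] + step)
            delta = new_aj - a[j]
            if delta != 0.0:
                for k in range(m):
                    residual[k] -= delta * D[j][k]
                a[j] = new_aj
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return a, sum(r * r for r in residual)


# Hypothetical end-member compositions (rows) for three elements (columns).
D = [
    [10.0, 1.0, 0.5],  # end-member 1
    [1.0, 8.0, 0.5],   # end-member 2
    [0.5, 0.5, 6.0],   # end-member 3
]

# A synthetic sample mixed from 70% end-member 1 and 30% end-member 2.
sample = [0.7 * D[0][k] + 0.3 * D[1][k] for k in range(3)]
abundances, sse = nnls_coordinate_descent(D, sample)

# A sample whose unconstrained fit would need a negative abundance.
awkward = [D[0][k] - 0.1 * D[2][k] for k in range(3)]
clipped, clipped_sse = nnls_coordinate_descent(D, awkward)
```

For the synthetic mixture the recovered abundances should be close to 0.7, 0.3, and 0. For the second sample the third abundance is held at zero and the misfit is absorbed into the residual sum of squares instead.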
Traditional linear regression techniques, such as those available in many commercial software packages, minimize the sum of squares of the residuals without restricting the sign of the coefficients, and thus cannot be used because the non-negativity constraint on the a's is not guaranteed to be satisfied. Note that these equations are solved for each of the N samples in the data set.

Total Inversion (TI)

This final script is based on the program of Zhou and Kyte (1992) and Kyte et al. (1993) and uses total inversion to partition a multivariate data set into a number of end-member components. Unlike the previous CLS script, which assumes that the composition of the end-members is fixed, the TI script partitions the data set while allowing the composition of each end-member to vary slightly to maximize the partitioning fit of any one sample. Such variation in end-member concentrations is more realistic from several perspectives. First, the actual end-member may differ slightly from published values (e.g., from a nearby volcano), and, second, there may be slight variation(s) in composition with time. From the complete partitioning run, the program calculates the same set of fit statistics as CLS as well as the mean and standard deviation of the end-member compositions.

The script is based on FORTRAN code provided by Frank Kyte and his then graduate student Lei Zhou. It solves the partitioning problem using the "total nonlinear inversion techniques" outlined by Tarantola and Valette (1982). The basic equations used are as follows. A sample is taken as an m x 1 matrix T[m x 1], where m is the number of elements measured. The partitioning problem then equates this matrix to the product of the end-member composition matrix C and the contribution of each end-member in the sample, represented by matrix E. Thus:

T[m x 1] = C[m x n] * E[n x 1]

where n is the number of end-members, C[i,j] is the concentration of the i-th element in the j-th end-member, and E[j] is the relative fraction of the j-th end-member in the sample.
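The mixing relation T = C * E above can be illustrated with a small worked example. This is only a sketch in Python with made-up numbers and a hypothetical helper name `mix`. Note that C is stored here with one row per element and one column per end-member, which is the transpose of the d(j,k) layout used in equation (1).

```python
def mix(C, E):
    """Return T = C * E, where C has m rows (elements) and n columns (end-members)."""
    return [sum(c_ij * e_j for c_ij, e_j in zip(row, E)) for row in C]


# Hypothetical values: m = 3 elements, n = 2 end-members.
C = [
    [10.0, 1.0],  # element 1 in end-members 1 and 2
    [2.0, 8.0],   # element 2
    [0.5, 4.0],   # element 3
]
E = [0.25, 0.75]  # relative fractions of the two end-members in the sample

T_model = mix(C, E)            # composition predicted by the mixture
T_observed = [3.4, 6.4, 3.0]   # hypothetical measured composition
misfit = [t - tm for t, tm in zip(T_observed, T_model)]  # T - C * E
```

The total inversion adjusts T, E, and C together, within their assumed variances, so that this misfit is driven toward zero.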
We define the function:

f(x) = T - C * E,

where x is the vector [T1, ..., Tm, E1, ..., En, C11, C12, ..., Cmn], and we wish to minimize the weighted sum of squares:

$$s(x) = (x - x_o)^{t}\, C_o^{-1}\, (x - x_o)$$

where t denotes transpose, Co is the covariance matrix of all parameters and data in x (usually assumed to be diagonal), and xo is the a priori (initial guess) vector. The solution to this equation is given by:

$$x = x_o + C_o F^{t} \left( F C_o F^{t} \right)^{-1} \left\{ F (x - x_o) - f(x) \right\}$$

where the matrix F contains the partial derivatives of f(x), such that F[i,j] is the partial derivative of f[i] with respect to x[j], $F_{ij} = \partial f_i / \partial x_j$. Because x appears on both sides, the expression is applied iteratively, starting from xo. In the partitioning case, f[i] is the equation for the i-th element or row in the matrix T - C * E. The j-th term refers to the j-th term in the vector x = [T1, ..., Tm, E1, ..., En, C11, C12, ..., Cmn].

Description and Use of the Scripts

Q-mode Script

The scripts contain a main routine that reads in the multivariate data set as well as other information (variable names, etc.) and seven function routines to complete the calculations (Appendix 1). To save memory, the program uses two different algorithms to calculate the Q-mode factors. If the number of samples is less than the number of variables, it calculates the factor analysis matrices by their definitions (Klovan and Imbrie, 1971). If the number of samples is greater than the number of variables, the routine uses the “CABFAC” (Calgary - Brown Factor Analysis) routine of Klovan and Imbrie (1971). This keeps the amount of computer memory needed for the calculations to a minimum.

The main script is named “qmodemain”. After starting the script, the program requests:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix (column one is the sample label, followed by a column for each variable). The program determines the size of the matrix after reading the data.
4) Filename for file containing variable labels. The labels can be text, one label per line in the file.
5) Alpha level: the fraction of the total data variance (in percent) to be explained by the retained eigenvectors.
6) Transformation desired: 0 = none; 1 = constant mean; 2 = percent max; 3 = log(x+1).
7) Number of factors to keep in VARIMAX rotation.

The program determines the number of variables and samples from the size of the input data matrix. The program assumes that the first column in the data file is a numeric sample identifier. All entries in the data input file must be numeric. In the variable label file, put each variable label on a separate line. An example output is given in Appendix 2.

Constrained Least Squares Script

The scripts contain a main routine that reads in the multivariate data set, the composition of the end-members, and other information (variable names, etc.), and two function routines to complete the calculations (Appendix 3). The constrained least squares calculation is completed using the MATLAB function lsqnonneg.

The main script is named solvel2main. After starting the script, the program asks for:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix (column one is the sample label, followed by a column for each variable). The program determines the size of the matrix after reading the data.
4) Filename for file containing variable labels. The labels can be text, one label per line in the file.
5) Filename for file that contains the end-member compositions (D matrix).

The program determines the number of variables and samples from the size of the input data matrix. The program assumes that the first column in the data file is a numeric sample identifier. All entries in the data input file must be numeric. In the variable label file, put each data label on a separate line.
The end-member compositions are entered into a data file one matrix element per line. Start with the concentration of the first variable in the first end-member, then the concentration of the first variable in end-member 2, and so on to the concentration of the first variable in the last end-member. Then enter the concentration of the second variable in end-member 1, the concentration of the second variable in end-member 2, etc. Example results are given in Appendix 4.

Total Inversion (TI) Script

The scripts contain a main routine that reads in the multivariate data set, the composition of the end-members, and other information (variable names, etc.), and two function routines to complete the calculations (Appendix 5).

The main script is named totalinvmain. After starting the script, the program asks for:
1) Filename to save text output.
2) Title to label figures and output.
3) Filename for input data matrix (column one is the sample label, followed by a column for each variable). The program determines the size of the matrix after reading the data.
4) Filename for file containing variable labels. The labels can be text, one label per line in the file.
5) Filename for file that contains the end-member compositions (D matrix).
6) Filename with the data variance estimates (diagonal of Co matrix).

The script determines the number of variables and samples from the size of the input data matrix. The script assumes that the first column in the data file is a numeric sample identifier. All entries in the data input file must be numeric. In the variable label file, put each data label on a separate line.

The end-member compositions are entered into a data file one matrix element per line. Start with the concentration of the first variable in the first end-member, then the concentration of the first variable in end-member 2, and so on to the concentration of the first variable in the last end-member. Then enter the concentration of the second variable in end-member 1, the concentration of the second variable in end-member 2, etc.
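The ordering of the end-member file described above can be made concrete with a short sketch. It is written in Python rather than MATLAB, the helper name `read_end_members` and the file contents are hypothetical, and the number of end-members is passed in explicitly as an assumption; how the scripts themselves parse the file is not shown here.

```python
def read_end_members(values, n_end_members):
    """Reshape a flat list into D[j][k] (end-member j, variable k).

    The flat list is ordered as in the text: all end-members for variable 1,
    then all end-members for variable 2, and so on.
    """
    if len(values) % n_end_members != 0:
        raise ValueError("value count is not a multiple of the end-member count")
    n_variables = len(values) // n_end_members
    return [
        [values[k * n_end_members + j] for k in range(n_variables)]
        for j in range(n_end_members)
    ]


# Hypothetical file contents, one number per line: 3 variables, 2 end-members.
file_text = "10.0\n1.0\n2.0\n8.0\n0.5\n4.0\n"
values = [float(line) for line in file_text.split()]
D = read_end_members(values, n_end_members=2)
# D[0] holds end-member 1: [10.0, 2.0, 0.5]
# D[1] holds end-member 2: [1.0, 8.0, 4.0]
```

Writing the file in the opposite order (all variables for end-member 1 first) would silently scramble the compositions, so it is worth checking one end-member by eye after loading.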
The variance vector Co starts with the variances of the variables, then the variances of the estimated abundances of each end-member in the samples, then the variances of the variable concentrations in the end-members; that is, it follows the same order as the model parameter vector x = [T1, ..., Tm, E1, ..., En, C11, C12, ..., Cmn]. Experimentation is needed to assess the impact of these estimates on the model results. Example results are given in Appendix 6.

Acknowledgements

We thank NSF for supporting the SedDB project (OCE04-53958 and OCE082619), and our colleagues Kirsten Lehnert, Steve Goldstein, and Annika Johansson, all of Lamont Doherty Earth Observatory, for their collaboration during the development of the database. We thank R. Scudder and A. Dunlea for their comments on the manuscript. RWM thanks his recent graduate students (C. Ziegler, N. Martinez, and R. Scudder) for their contributions and research that helped usher in the use of these MATLAB scripts in his research group.

References

Boyd, P. W., Mackie, D. S., and Hunter, K. A., 2010. Aerosol iron deposition to the surface ocean--Modes of iron supply and biological responses. Marine Chemistry, 120, 128-143.
Burdige, D. J., 2006. Geochemistry of Marine Sediments. Princeton University Press. 609 pp.
Dekov, V. M., Cuadros, J., Kamenov, G. D., Weiss, D., Arnold, T., Basak, C., and Rochette, P., 2010. Metalliferous sediments from the H. M. S. Challenger voyage (1872-1876). Geochimica et Cosmochimica Acta, 74, 5019-5038.
Dymond, J., The geochemistry of Nazca Plate surface sediments: An evaluation of hydrothermal, biogenic, detrital and hydrogenous sources, Geol. Soc. Am. Mem., 154, 133-174, 1981.
Full et al., Extended Qmodel – Objective definition of external end members in the analysis of mixtures, Mathematical Geology, 13(4):331-344, 1981.
Imbrie, J. and van Andel, T., Vector analysis of heavy-mineral data, Geol. Soc. of Am. Bull., 75(11):1131-1156, 1964.
Imbrie, J. and Kipp, A new micropaleontological method for quantitative paleoclimatology: Application to a Late Quaternary Caribbean core,
Klovan, J. E., A generalization of extended Q-Mode factor analysis to data matrices with variable row sums, Mathematical Geology, 13(3):217-224, 1981.
Klovan, J. E. and Imbrie, J., An algorithm and FORTRAN-IV Program for large-scale Q-Mode factor analysis and calculation of factor scores, Mathematical Geology, 3(1):61-77, 1971.
Klovan, J. E. and Miesch, A. T., Extended CABFAC and QMODEL computer programs for Q-mode factor analysis of compositional data, Computers & Geosciences, 1(3):161-178, 1976.
Kyte, F. T., Leinen, M., Heath, G. R., and Zhou, L., Cenozoic sedimentation history of the central North Pacific: Inferences from the elemental geochemistry of Core LL44-GPC3, Geochim. Cosmochim. Acta, 57, 1719-1740, 1993.
Leinen, M. and Pisias, N. G., An objective technique for determining end-member compositions and for partitioning sediments according to their sources, Geochimica et Cosmochimica Acta, 48:47-62, 1984.
Miesch, A. T., Q-mode factor analysis of geochemical and petrologic data matrices with constant row-sums, USGS Professional Paper 574-G, 1976.
Tarantola, A. and Valette, B., Generalized nonlinear inverse problems solved using the least squares criterion, Rev. Geophys. Space Phys., 20:219-232, 1982.