A Dissertation entitled

Aggregating Form Accuracy and Percept Frequency to Optimize Rorschach Perceptual Accuracy

by
Sandra L. Horn

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Clinical Psychology

Gregory J. Meyer, Ph.D., Committee Chair
Jeanne Brockmyer, Ph.D., Committee Member
Joni L. Mihura, Ph.D., Committee Member
Jason P. Rose, Ph.D., Committee Member
Donald J. Viglione, Ph.D., Committee Member
Patricia R. Komuniecki, Ph.D., Dean, College of Graduate Studies

The University of Toledo
December 2015

Copyright 2015, Sandra L. Horn

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the express permission of the author.

An Abstract of
Aggregating Form Accuracy and Percept Frequency to Optimize Rorschach Perceptual Accuracy
by
Sandra L. Horn
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Clinical Psychology
The University of Toledo
December 2015

Exner's (2003) Comprehensive System and Meyer et al.'s (2011) Rorschach Performance Assessment System use Form Quality scores to assess the accuracy of perceptions on the Rorschach. However, Form Quality is a rather coarse classification method, as it offers just three options along a continuum of perceptual accuracy. There is currently no fully dimensional Rorschach score that can thoroughly and efficiently tap into both the frequency with which particular objects are reported while taking the test and the perceptual fit of those objects to the cards. This study explores the structure of a fit variable, Form Accuracy, in combination with a frequency variable, Percept Frequency, to make progress toward a new dimensional method of scoring perceptual accuracy that would improve the ability to identify distorted perceptual processes and impaired reality testing, and thus improve validity coefficients in the Rorschach-based identification of psychosis. Percept Frequency tables quantifying how often objects were reported while completing the Rorschach task were developed from six internationally collected samples from Argentina, Brazil, Italy, Japan, Spain, and the U.S.
Form Accuracy ratings were obtained from a database of 13,031 objects that had been rated an average of 9.9 times by different judges from eleven countries who were asked to rate the extent to which the object fit the contours of the inkblot at the location where it was seen. A criterion database containing 159 protocols and 3,897 scorable responses was then scored for Form Accuracy and Percept Frequency. Hierarchical Linear Modeling was used to complete structural analyses of Form Accuracy and Percept Frequency scores at the response level, and correlations of these variables were computed at the protocol level with a criterion measure assessing severity of disturbance based on psychiatric diagnoses.

Across different levels of aggregation, there was resounding evidence that the structure of each of the ten Rorschach cards and the sequence of first, second, third, or fourth responses given to a card played a large role in determining Form Accuracy and Percept Frequency scores. As such, these scores are strongly influenced by structural features of the Rorschach task that cannot be entirely attributed to stable characteristics of the test-taker. There were consistent clustering effects in the data due to the card number and due to the response within a card. Predicted scores for Form Accuracy and Percept Frequency were highest on Cards 5, 1, and 7, and they were lowest on Cards 9 and 6; scores were also lowered with each subsequent response within a card. Surprisingly, Percept Frequency scores did not correlate with the criterion measure of diagnostic severity, though Form Accuracy did have small correlations. Understanding the structural patterns of the fit and frequency data is an important undertaking in forming the foundation for future research on a dimensional Rorschach perceptual accuracy scoring system.

Dedicated to my parents, Leah and Terry, and to Grampy. You have my utmost gratitude for your unconditional love and support. You taught me math by helping with homework at the kitchen table and having me calculate measurements in the shop; you helped me develop a love for reading by taking me to pick out books at the library and pretending you didn't realize I was reading under the covers with a flashlight at night. Whatever the lesson for the day happened to be, you were teaching me the value of hard work and instilling in me a deep appreciation and yearning for education. You gave me the skills necessary to succeed and granted me the space to determine my own path in life; for this I am forever grateful.

Acknowledgements

I would like to first acknowledge my advisor, Dr. Gregory Meyer. My graduate education has been a long and emotional journey and I am forever grateful that I was able to travel this road with him as my mentor. This dissertation is one of many accomplishments that would not have been possible without his support and guidance. His dedication to my education and professional growth has been unwavering, and I feel incredibly lucky to have had an advisor who is so devoted to helping students learn, grow, and find their path.

I would also like to thank my committee members, Dr. Jeanne Brockmyer, Dr. Joni Mihura, Dr. Jason Rose, and Dr. Donald Viglione. They donated significant amounts of time and energy to me on this dissertation, and their insights and suggestions were spot-on, leading to a final product that I feel very proud of. It is an honor to have had their encouragement, feedback, and support.
I have felt overwhelming support from so many friends, family members, colleagues, and supervisors that it would be impossible to name everyone here. However, I wholeheartedly thank each and every one of you for the hugs, laughs, talks over dinners and beers, and phone calls that always seemed to come when I needed them most. I deeply appreciate all of your love, support, and encouragement that constantly enveloped me and kept me moving toward my goals. Without you all, I could not have endured the unavoidable ups and downs of graduate school and a dissertation. Thank you.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations and Rorschach Scores
I. Introduction
II. Review of the Literature
    Perceptual Accuracy, Reality Testing, and Psychosis
    Rorschach Form Quality (FQ)
        History of the development of FQ
        Comprehensive System (CS) scoring of FQ
        Rorschach Performance Assessment System (R-PAS) scoring of FQ
        Review of FQ validity
            FQ validity – differentiation of clinical groups
            FQ validity – criterion validity
            FQ validity – SCZI, PTI, TP-Comp, & EII
            FQ validity – malingering
        Limitations of FQ
    Rorschach Form Accuracy (FA)
        The development of FA
        Scoring of FA
        Review of FA validity
    Rorschach Frequency of Perceptions
        Popular responses
        Findings using Rorschach indices of response frequency
    Statement of the Problem
    Purpose of the Present Study
    Principle of Aggregation
    Research Questions
III. Method
    Participants
        Percept Frequency samples
            U.S. Sample
            Argentinean Sample
            Italian Sample
            Spanish Sample
            Japanese Sample
            Brazilian Sample
        Criterion Database
    Measures
        Percept Frequency samples measures
        Criterion Database measures
    Procedures
        Frequency tables construction
            Structure of the original FA and PF tables
            Coding the U.S. Sample
            Updating and adding variables to the FA and PF tables
        Criterion Database coding
            Coder training and interrater reliability
            Coding FA and PF
    Statistical Analyses
        Overview of planned analyses
        Hierarchical Linear Modeling (HLM)
        Supplemental analysis strategies
IV. Results
    Interrater Reliability
    Frequency Tables: Descriptives
    Criterion Database: Descriptives
    Criterion Database: HLM
        HLM models for FA
        HLM models for PFM
        HLM models for PFN1.5
    Supplemental Analysis Strategies
V. Discussion
    Updating the PF Tables
    Interrater Reliability
    Modeling the Criterion Database
    Modeling the Structure of FA
    Modeling the Structure of PFM
    Modeling the Structure of PFN1.5
    Summary of Variable Structures Across Modeling Techniques
    Strengths and Limitations of the Study
    Expected and Surprising Findings
    Conclusions
References

List of Tables

Table 1. New Response Objects Derived From the U.S. Frequency Sample
Table 2. Descriptive Statistics
Table 3. Mean Values by Card Number and R_InCard
Table 4. Statistical Summary of FA HLM Models
Table 5. Statistical Summary of PFM HLM Models
Table 6. Statistical Summary of PFN1.5 HLM Models
Table 7. Protocol-Level Descriptive Statistics
Table 8. Protocol-Level Cohen's d by Card and R_InCard
Table 9. Response-Level Cohen's d by Card and R_InCard

List of Figures
Figure 1. Card 3 location D3
Figure 2. Image of a butterfly
Figure 3. Image of a dumbbell
Figure 4. Image of a dragonfly
Figure 5. Card 3 location D2
Figure 6. Image of an anchor
Figure 7. Images of fishhooks
Figure 8. Card 3 location D3
Figure 9. Image of a bowtie
Figure 10. Image of an insect
Figure 11. Image of a werewolf
Figure 12. Card 3 location D2
Figure 13. Protocol-Level FA Means by Card Number
Figure 14. Protocol-Level PFM Means by Card Number
Figure 15. Protocol-Level PFN1.5 Means by Card Number
Figure 16. Protocol-Level FA Means by R_InCard
Figure 17. Protocol-Level PFM Means by R_InCard
Figure 18. Protocol-Level PFN1.5 Means by R_InCard
Figure 19. Protocol-Level Cohen's d by Card on FA, PFM, and PFN1.5
Figure 20. Protocol-Level Cohen's d by R_InCard on FA, PFM, and PFN1.5
Figure 21. Response-Level Cohen's d by Card on FA, PFM, and PFN1.5
Figure 22. Response-Level Cohen's d by R_InCard on FA, PFM, and PFN1.5

List of Abbreviations and Rorschach Scores

Rorschach Acronyms, Codes, and Indices

CS: Comprehensive System
R-PAS: Rorschach Performance Assessment System
FQ: Form Quality (CS; R-PAS)
+: Ordinary-Elaborated Form Quality (CS)
o: Ordinary Form Quality (CS; R-PAS)
u: Unusual Form Quality (CS; R-PAS)
–: Minus Form Quality (CS; R-PAS)
W: Whole Location (CS; R-PAS)
D: Common Detail Location (CS; R-PAS)
Dd: Unusual Detail Location (CS; R-PAS)
PTI: Perceptual Thinking Index (CS)
SCZI: Schizophrenia Index (CS)
TP-Comp: Thought and Perception Composite (R-PAS)
WDA%: Percentage of responses given to common (W or D) locations that have appropriate form use (i.e., FQ coding of +, o, or u) (CS)
WD-%: Percentage of responses given to common (W or D) locations that have distorted form use (i.e., FQ coding of –) (R-PAS)
X+%; FQo%: Percentage of responses that are common and have appropriate form use (i.e., FQ coding of + or o) (CS; R-PAS)
XA%: Percentage of responses that have appropriate form use (i.e., FQ coding of +, o, or u) (CS)
Xu%; FQu%: Percentage of responses that are uncommon and have appropriate form use (i.e., FQ coding of u) (CS; R-PAS)
X-%; FQ-%: Percentage of responses that have distorted form use (i.e., FQ coding of –) (CS; R-PAS)

Rorschach Perceptual Accuracy

FA: Form Accuracy
PA: Perceptual Accuracy
PF: Percept Frequency
PFM: The response-level mean of the object-level averages of the six countries' percentage-based frequency values, for values greater than or equal to 1.5%
PFN1.5: The response-level mean of the object-level counts of countries (range 0-6) that had a percentage-based frequency value greater than or equal to 1.5%

Chapter One

Introduction

The Rorschach Inkblot Task (commonly referred to as "the Rorschach") was introduced to the mental health professions by Hermann Rorschach (1921/1942), a psychiatrist with an artistic bent. Finding inspiration in Klecksographie (Blotto), a popular game at the time in which players made inkblots and then formed associations or told stories about the images, Rorschach began to formulate sophisticated hypotheses about how inkblot images could be used to investigate individual differences on psychological constructs (Exner, 2003). According to Exner, the initial studies and observations made by Rorschach and his colleagues pertained to the use of inkblots in identifying psychosis. Although Rorschach began developing stimuli for his inkblot experiments by designing more than 40 images, he soon selected a set of 15-16 blots for his early research and then settled on a final set of 12 images for later projects. However, when Rorschach sent the 12 blots to press, he had to reduce both their size and number due to limitations imposed by the publisher, and so he selected a set of 10 inkblots. The 10 blots designed and selected by Rorschach now comprise the standard set of Rorschach cards used in current clinical research and practice.

The Rorschach is a commonly used psychological assessment method (e.g., Camara, Nathan, & Puente, 2000; Clemence & Handler, 2001; Sundberg, 1961) in which a person is presented with the standard series of 10 inkblots and is asked to respond to each, answering the question, "What might this be?" The Comprehensive System (CS; Exner, 2003) has been the most commonly used administration and interpretation system for Rorschach assessment for decades (Mihura, Meyer, Dumitrascu, & Bombel, 2013), with 96% of a recently surveyed international sample of clinicians reporting they use the CS as their primary system when coding and interpreting the Rorschach (Meyer, Hsiao, Viglione, Mihura, & Abraham, 2013). The new Rorschach Performance Assessment System (R-PAS; Meyer, Viglione, Mihura, Erard, & Erdberg, 2011), with its primary foundations in the CS and the current published literature, has also been gaining traction since its publication.

The CS and R-PAS have roots in the work of other systematizers who strived over the years to develop, standardize, and validate various methods for obtaining and scoring Rorschach protocols. After Exner carefully reviewed the existing Rorschach systems, he published the basic CS foundations in 1974. Although he pulled a combination of elements from existing Rorschach systems, Exner (1974) also included some new methodological, scoring, and interpretation guidelines. Similarly, Meyer et al. (2011) completed an extensive review of the CS, previous systems, and the published literature when designing R-PAS.
With many familiar CS components and procedures, but also with some significant changes to the CS (e.g., a new normative sample; R-Optimized administration; the way variables are calculated and presented), R-PAS is presented as an evidence-based and internationally oriented system, with the authors focused on "…enhancing the psychometric and international foundation of the test, while allowing examiners to interpret the rich communication, imagery, and interpersonal behavior within that strong psychometric foundation" (Meyer et al., 2011; see Meyer & Eblin, 2012, for a brief overview). Many clinicians find value in the Rorschach as a method of gathering information about an individual that cannot be obtained using other popular assessment methods, and this is likely an important factor in the popularity of the Rorschach in clinical settings (McGrath, 2008).

Historically, the Rorschach has been labeled a projective test. Weiner (1998) wrote that "The basic theory of projective mechanisms holds that the possibility and probability of people attributing their internal characteristics to external objects and events is directly proportional to the lack of structure in these objects and events." Use of objective/projective terminology for describing personality tests, including the Rorschach (and other tests for that matter), has been challenged in recent years (e.g., Meyer & Kurtz, 2006; Viglione & Rivera, 2003, 2013). A strong argument for retiring the term "projective" as a test descriptor is that the term carries various meanings and connotations. One assumption that directly applies to many facets of Rorschach testing is that the test stimuli are ambiguous and that the task is completely unstructured. As pointed out by many (e.g., Weiner, 1998; Exner, 2003; Meyer et al., 2011), Rorschach cards contain complex structural elements (e.g., form, color, shading) that do provide some boundaries for the test-taker when completing the Rorschach task. However, the presence of some structure does not preclude test-takers from making use of the stimulus features in unique ways. Thus, the Rorschach stimuli offer clinicians and researchers an opportunity to explore psychological constructs in a systematic and replicable manner but without imposing strict regulations on the latitude of the test-taker.

Viglione and Rivera's (2003, 2013) discussion of performance-based assessment tests/methods explores the concept of the test-taker responding to the test stimuli (and testing situation) with more freedom of response than would be encountered on typical self-report measures, but with a variety of constraints and influences still present (e.g., critical bits of the inkblot, instructions, examiner variability, reason for referral, individual differences in level of projection, level of defensiveness, etc.). They agree that performance-based tasks such as the Rorschach are not purely "projective" in nature, but argue that they still offer rich behavioral information, in the form of induced observable (and oftentimes scorable) behavioral samples collected under controlled conditions, that may not be available from other sources during the assessment process. These rich samples of complex and real-life behaviors, which are initiated by the stimulus situation, are mediated by the person's personality.
Ideally, the behavioral and personality samples collected through the use of standardized performance-based assessment methods will generalize outside the microcosm of the task, and interpretations can be made about the person's behavior and personality in daily life.

Hermann Rorschach (1921/1942) saw the Rorschach as an intellectual endeavor, requiring the person to concentrate their attention on the inkblots, search their memory, compare the Rorschach images to images in their memory, and then verbalize a response that matches the mental image to the blot features. Exner (2003) and Meyer et al. (2011) posit that the Rorschach gives behavioral information about a person, but also contains information about the psychological and cognitive processes that generate the behaviors. Leichtman (1996) considers most theorists to have glossed over the development of a thorough conceptualization of Rorschach task demands; he describes the Rorschach as a task of visual representation in which "…participants actively search for what stimuli can be made into. What emerges is not any association, but an idea that arises from the effort to find a referent and that, in turn, plays a major role in shaping the medium further." In other words, the test-taker performs on the Rorschach in a way that is akin to an artist shaping clay — both begin with a raw material that is shaped or described in a way that allows it to serve as a representation of the real object. A clay form of a woman is an artistic expression or representation of a human; a woman seen on Card 3 of the Rorschach does not look exactly like a human, but some test-takers recognize a portion of that blot as woman-like and consider it a good representation of the true object.

Balcetis and Dunning (2006, 2007) reviewed research findings and presented their own new data showing that how people perceive the world around them is influenced by their internal state (i.e., their wishes and preferences). Relatedly, a series of studies showed how stimuli that are meaningful to a person draw the person's attention automatically (Koivisto & Revonsuo, 2007). This area of research stems from the New Look approach to perception (e.g., Bruner, 1957), in which a person's needs, motivation, and expectations were first seen as influences on perception. This line of research can be used in understanding the Rorschach, as Rorschach perception is not a purely "cold" or cognitive process; rather, it is also "hot," influenced by various factors including the dynamics, needs, and conflicts of an individual (Exner, 1989, 2003; Meyer et al., 2011). In other words, "hot" perception refers to processing visual stimuli under the influence of a motivational or affective state.

The importance of recognizing the distinction between types of information obtained through different methods is not limited to the field of personality assessment or even clinical psychology; as pointed out by McGrath (2008), social/personality psychologists make similar distinctions using terms such as "implicit" and "explicit," and "mental process" versus "mental experience." Whether the Rorschach is described as a "projective test," "performance-based method," "implicit test," etc., the most important matter for a user of the Rorschach is having a well-grounded understanding of how the test functions and the strengths and limitations of its use.
As will be reviewed, research has established that the Rorschach can be used to accurately identify psychosis in test-takers by employing scores that capture the accuracy of the test-takers' perceptions (e.g., Mihura et al., 2013). These scores and related indices have been constructed, evaluated, and revised over time and across Rorschach systems. Within the CS and R-PAS, Form Quality (FQ) is used to assess accuracy of perception on the Rorschach. However, FQ has some important limitations that will be detailed in Chapter 2. The R-PAS version of FQ (Meyer et al., 2011) was developed in an attempt to rectify some of the problems associated with the CS version of FQ, but early validity studies demonstrate additional room for improvement in the detection of psychosis using the Rorschach. Additionally, there is currently no fully dimensional Rorschach score (within the CS, R-PAS, or otherwise) that can thoroughly and efficiently tap into both the conventionality of response objects being spontaneously reported by people completing the task and the perceptual fit of those response objects to the cards at the location where they are perceived. Such a score could be an important factor in identifying distorted perceptual processes and impaired reality testing in the test-taker, and could thus improve validity coefficients in the Rorschach-based identification of psychosis.

Prior to beginning work on R-PAS as a formal system, Meyer and Viglione (2008) conceptualized and developed the Form Accuracy (FA) scoring category, which captures the accuracy of perceptual fit between a response object and the features of the inkblot in the location where the object was perceived; FA is assigned to responses by consulting FA tables, with possible object-level FA scores ranging from 1 to 5 (see Meyer et al., 2011, for an overview). Meyer et al. (2011) followed the development of FA with the initial development of an additional type of scoring category, Percept Frequency (PF), which indicates how frequently a perceived object is given as a response to the inkblot location used by the respondent. In the current study, the table of PF variables and values developed by Meyer et al. (2011) was expanded by adding data from a sixth country (the U.S.) to the existing specific object frequencies from five other countries (Argentina, Brazil, Italy, Japan, and Spain) in order to create international summary PF indices. The PF tables represent a cross-cultural index of how frequently response objects are identified in specific areas of specific cards on the Rorschach.

An archival database that includes Rorschach protocols and diagnostic information was then used as a criterion database to explore the structure and predictive capabilities of a selection of FA and PF variables, with the ultimate goal of better understanding the potential functionality of the Rorschach as a method for gathering information about the accuracy of people's perceptions. This research was also intended to aid in the broader push to identify ideal methods for combining FA and PF to form final Perceptual Accuracy (PA) indices and lookup tables. This is an important issue to explore so that standardized methods of scoring and interpreting PA scores can be applied to future research and, ideally, to future clinical practice, with the ultimate goal of more accurately identifying psychosis in test-takers.
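To make the two international summary PF indices concrete, the sketch below computes PFM and PFN1.5 for a single response, following the definitions given in the List of Abbreviations. The data structure and the handling of sub-threshold frequency values are illustrative assumptions, not the actual format of the PF tables.

```python
# A minimal sketch of the two summary Percept Frequency indices. Assumes each
# object in a response carries six percentage-based frequency values, one per
# country (Argentina, Brazil, Italy, Japan, Spain, U.S.). How sub-threshold
# values enter the PFM average is an assumption made for illustration.
THRESHOLD = 1.5  # percent

def pfm(objects: list[list[float]]) -> float:
    """Response-level mean of the object-level averages of the six countries'
    frequency values, for values greater than or equal to 1.5%."""
    object_means = []
    for freqs in objects:                      # one six-value list per object
        kept = [f for f in freqs if f >= THRESHOLD]
        object_means.append(sum(kept) / len(kept) if kept else 0.0)
    return sum(object_means) / len(object_means)

def pfn15(objects: list[list[float]]) -> float:
    """Response-level mean of the object-level counts of countries (0-6) with
    a frequency value greater than or equal to 1.5%."""
    counts = [sum(f >= THRESHOLD for f in freqs) for freqs in objects]
    return sum(counts) / len(counts)

# A hypothetical two-object response: one commonly reported object, one rare one.
response = [[4.2, 3.1, 0.4, 2.8, 5.0, 3.3],
            [0.2, 0.0, 1.6, 0.1, 0.0, 0.3]]
print(round(pfm(response), 2), round(pfn15(response), 2))  # 2.64 and 3.0
```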
Chapter Two

Review of the Literature

Perceptual Accuracy, Reality Testing, and Psychosis

The Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5; APA, 2013) is a compendium of psychological constructs that are organized into and represented as discrete psychological disorders. Many of the DSM-5 disorders fall within the spectrum of "Schizophrenia Spectrum and Other Psychotic Disorders," the common thread being the manifestation of psychosis. Although the classification system presented in the DSM-5 (as well as in previous versions of the DSM) is useful for facilitating communication about mental health and illness, such classification systems also pose problems when it comes to researching psychological constructs. As described by van Os and Tamminga (2007), in discussing the DSM-IV (APA, 1994):

    Although these categories are meant to refer to broadly defined psychopathological syndromes rather than biologically defined diseases that exist in nature, inevitably they undergo a process of reification and come to be perceived by many as natural disease entities…. they may also confuse the field by imposing arbitrary boundaries in genetic and treatment research and classifying patients into categories that upon closer examination have little to offer in terms of diagnostic specificity. (p. 861)

Rorschach research and clinical use are complicated by the classification problem described above, as the Rorschach is a performance-based task that provides the user with behavioral samples that are coded and interpreted as representations of psychological constructs. Such psychological constructs manifest as observed real-world behaviors that are labeled as symptoms, which are then organized into diagnostic categories. In other words, the Rorschach provides information at the level of psychological constructs (e.g., accuracy of perception), as opposed to symptoms (e.g., impaired reality testing) and diagnostic clusters (e.g., disorders involving psychosis). Various Rorschach researchers have pushed for clear and direct intuitive links between Rorschach variables and the constructs they are hypothesized to represent (e.g., McGrath, 2008; Meyer et al., 2011; Schafer, 1954; Weiner, 2003).

Even though the validity of the Rorschach has been intensely debated throughout the years since its development, and some Rorschach scores are not believed to be valid for interpretation, even the toughest critics of the Rorschach attest to the validity of "perceptual accuracy" scores (e.g., Dawes, 1999; Wood, Nezworski, & Garb, 2003; Wood, Garb, Nezworski, Lilienfeld, & Duke, 2015); these scores also serve as an example of variables with a clear relationship to the construct they are intended to assess (McGrath, 2008).

Rorschach Form Quality (FQ)

History of the development of FQ

Hermann Rorschach devised FQ as a way to describe whether the response object was appropriate for the contours of the inkblot used in the response (Exner, 2003). Rorschach, as well as many followers after his death, believed that the manner in which form was used in constructing a response delivered information about the person's perceptual accuracy or "reality testing" ability (Exner, 2003). According to guidelines for scoring FQ using the CS (Exner, 2003) and R-PAS (Meyer et al., 2011), an FQ score is assigned to every response that makes use of form.
In both systems, FQ scoring is guided by the use of published tables: Response objects are organized by card and by location within a card, and each listed object has a corresponding FQ score (e.g., Exner, 2003, Table A; Meyer et al., 2011, Chapter 6).

Prior to the development of R-PAS and the CS, the various Rorschach systematizers all agreed on the importance of FQ as a Rorschach score, but there was disparity in how each systematizer felt FQ should be coded. Beck, Beck, Levitt, and Molish (1961) and Hertz (1970) created two categories: "Good form" responses were indicated by "+" and "poor form" responses were indicated by "–", with the assignment of the + or – form quality scores based on how frequently a response was given for a specific location. Beck and Hertz published tables that, much like the current R-PAS and CS tables, indicated FQ scores for lists of response objects at specified locations. However, the tables constructed by Beck, Hertz, and Exner are not entirely the same; some location areas and FQ scores for identical objects at identical locations differ between the tables. In more recent years it has become apparent that many of Beck's FQ + or – score decisions were more subjective than originally thought (Kinder, Brubaker, Ingram, & Reading, 1982). That is, they were probably based more on Beck's judgment than on an actual tally of how frequently a response was given to a specific location. Hertz's table appears to have been constructed more systematically and objectively; it includes every unique response given by her large sample (n = 1,050) of children and adolescents.

Similar to Beck and Hertz, Klopfer also used + and – codes for FQ, though he did not publish frequency tables — he preferred that the scores be based on examiner judgment (Exner, 2003). Like Klopfer, Piotrowski (1957) and Rapaport, Gill, and Schafer (1946) did not develop frequency tables, though they approved of the concept of using frequencies of responses to determine the corresponding FQ scores.

When developing the CS, Exner (2003) considered interrater reliability of great importance for each score included in the system, as only FQ scores with acceptable interrater reliability were considered reasonable to use in validation studies. He also wanted to ensure that FQ clearly demonstrated "reality testing operations" (p. 121). However, Exner thought the two-category method of coding FQ as + or –, as used by Beck, Hertz, Klopfer, and others, resulted in far more limited information than a more complex system would; he saw meaningful variance in the quality of responses that received identical scores. When Mayman (1970) devised a six-category method of scoring, Exner hoped it would prove more diagnostically useful than the existing two-category methods. However, Exner's pilot study of agreement between coders using the Mayman method revealed discouraging results: Four trained coders independently coded 20 protocols for Mayman's FQ, and agreement among the coders ranged from 41-83%. Not wanting to discard the entire method, Exner revised it by dropping two of the categories and eliminating the subcategories of another (Exner, 2003). He also settled on a four-category system of ordinary-elaborated (+), ordinary (o), unusual (u), and minus (–) after deciding that FQ scores should be based on the frequency of a response.¹ Exner's simplified system resulted in higher agreement (87-95%) between the same four raters than was observed using Mayman's six-category approach.

¹ Exner's four-category system initially termed ordinary-elaborated FQ "superior (+)" and unusual FQ "weak (w)." In current CS scoring, ordinary-elaborated and ordinary FQ are typically combined.

In developing the R-PAS FQ tables, Meyer et al. (2011) wanted to retain the essence of FQ as a measure of accuracy of perception that can be used to identify distorted perceptual processes in the test-taker. Included in their operational definition of FQ is the idea that perceptual accuracy encompasses two elements: fit between the perceived object and the form features of the inkblot where it is seen, and the frequency with which that object is spontaneously reported by respondents completing the task. Thus, they incorporated both elements into their development of the R-PAS FQ reference tables.

Working from an initial set of 13,031 unique response objects that were compiled from previous FQ tables and sources, Meyer et al. (2011) developed the R-PAS FQ tables in stages of iterative refinement. The fit and frequency data were used to determine preliminary FQ designations, with the response objects that made use of form falling into the categories of ordinary (o), unusual (u), and minus (–). Determinations about fit were based on FA ratings that had been collected during the Rorschach FA Project (Meyer & Viglione, 2008). Each of the 13,031 FA response objects had been rated by five to 15 judges, and 129,230 ratings were obtained in total. The FA judges had been asked to rate the objects by answering the question "Can you see the response quickly and easily at the designated location?" Their ratings were made on a 5-point Likert-type scale, with the following answer categories:

1) "No. I can't see it at all. Clearly, it's a distortion."
2) "Not really. I don't really see that. Overall, it does not match the blot area."
3) "A little. If I work at it, I can sort of see that."
Exner’s Exner’s four category system initially termed ordinary–elaborated FQ as “superior (+)” and unusual FQ as “weak (w).” In current CS scoring, ordinary–elaborated and ordinary FQ are typically combined. 1 12 simplified system resulted in higher agreement (87-95%) between the same four raters than was observed using Mayman's six category approach. In developing the R-PAS FQ tables, Meyer et al. (2011) wanted to retain the essence of FQ as a measure of accuracy of perception that can be used to identify distorted perceptual processes of the test-taker. Included in their operational definition of FQ is the idea that perceptual accuracy encompasses two elements: Fit between the perceived object and the form features of the inkblot where it is seen, and the frequency with which that object is spontaneously reported by respondents completing the task. Thus, they incorporated both elements into their development of the R-PAS FQ reference tables. Working from an initial set of 13,031 unique response objects that were compiled from previous FQ tables and sources, Meyer et al. (2011) developed the R-PAS FQ tables in stages of iterative refinement. The fit and frequency data were used to determine preliminary FQ designations, with the response objects that made use of form falling into the categories of ordinary (o), unusual (u), and minus (–). Determinations about fit were based on FA ratings that had been collected during the Rorschach FA Project (Meyer & Viglione, 2008). Each of the 13,031 FA response objects had been rated by five to 15 judges, and 129,230 ratings were obtained in total. The FA judges had been asked to rate the objects by answering the question “Can you see the response quickly and easily at the designated location?” Their ratings were made on a 5-point Likert-type scale, with the following answer categories: 1) "No. I can't see it at all. Clearly, it's a distortion." 2) “Not really. I don't really see that. Overall, it does not match the blot area.” 3) “A little. If I work at it, I can sort of see that.” 13 4) “Yes. I can see that. It matches the blot pretty well.” 5) "Definitely. I think it looks exactly or almost exactly like that.” The objects were rated an average of 9.9 times by a pool of 569 judges who were from Brazil, China, Finland, Israel, Italy, Japan, Portugal, Romania, Taiwan, Turkey, and the United States. Meyer et al. (2011) had followed the development of FA with the initial development of the PF tables, which indicate how frequently a perceived object is given as a response to the inkblot location used by the respondent. The frequency data had been culled from five international datasets (Argentina, Brazil, Italy, Japan, and Spain). As a next step in their process, Meyer et al. (2011) reduced the number of objects to be classified in the R-PAS FQ tables to 5,060; each of the 5,060 objects were accompanied by an FA score, PF data, and the FQ scores assigned to the objects by other systematizers, which was primarily Exner though also included codes assigned by Beck and Hertz. The final R-PAS FQ code determinations were made after careful examination of all three sources of data. The authors first applied an algorithm to the data using the three sources of information then individually reviewed the FQ code determinations for objects that had seeming discrepancies between data sources (e.g., low FA but an FQ score of ordinary by Exner), making adjustments to the final FQ code determinations as necessary. 
When the finalized R-PAS FQ tables were compared to the CS tables, using tables that had been slightly revised by Exner's Rorschach Research Council, 39.9% of the objects had different FQ code designations (kappa = .375).

Comprehensive System (CS) scoring of FQ

According to CS guidelines, an FQ score is assigned to each Rorschach response that incorporates the use of form. For example, a response such as "A bunch of smoke" does not use form — the smoke can take any shape, and there is no shape description included in the language of the response. Therefore, such a response would not be assigned an FQ score. However, a response such as "A bunch of smoke — it looks like it is originating from this point down here, and it billows out as it rises" introduces form into the language of the response, and thus an FQ score would be assigned. Similarly, responses that contain objects with inherent form properties (e.g., "a bear"; "two women"; "a mountaintop") are assigned FQ scores. According to Exner (2003), "The FQ coding provides information about the 'fit' of the response, that is, does the area of the blot being used conform to the form requirements of the blot object specified?" (p. 120).

The CS method of coding FQ is considered a way of scoring the Rorschach for accuracy of perception (Exner, 2003). Although the articulated definition of CS FQ does not contain a reference to additional factors influencing FQ, the scores are not exclusively based on and indicative of the objective accuracy of the test-taker's perception; they are in part determined by the frequency of the percept for the specified location, by whether lines are imposed on the inkblot in forming the percept, and at times even by word choice. One could, however, make the argument that factors like the frequency of perceptions on the Rorschach do relate to accuracy of perceptions when considered from an ecological position: Is a person's objective misperception of a stimulus considered a misperception if it is normative within their culture? It may also be the case that the frequency of perceptions is a proxy for accuracy of fit on the Rorschach.

CS FQ scores are assigned using published tables for guidance. Within the FQ tables, response objects are organized by card and location within a card, and each listed object has a corresponding FQ score (e.g., Exner, 2003, Table A). Exner's (2003) CS FQ tables are based on data from a sample of 9,500 Rorschach protocols, consisting of 205,701 individual responses. From these responses, 5,018 items or item classes were reported in the tables. The "o", or ordinary FQ item, is defined by Exner (2003) as:

    The common response in which general form features are easily articulated to identify an object… If the item, or class of items, is designated in Table A as ordinary (o), and involves a W or D area,² this signifies that the object was reported in at least 2% (190 or more) of the 9,500 records, and involves blot contours that do exist and are reasonably consistent with the form of the reported object. There are 865 items or item classes designated as o for W or D locations. If the item listed as o involves a Dd location, this signifies that the area was used by at least 50 people (0.52%), that the object was reported by no fewer than two-thirds of those using the area, and involves blot contours that do exist. Table A includes 146 items classified as o for the Dd locations. (pp. 122-123)
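The frequency thresholds in this definition are explicit enough to restate in code. The sketch below models only the quoted counting rules; the contour-fit requirement is a separate, judgment-based criterion that is not modeled, and the function name and arguments are illustrative.

```python
# Sketch of the counting rules in Exner's (2003) definition of the ordinary (o)
# code, quoted above. Contour fit is a separate, judgment-based requirement that
# is not modeled here.
N_PROTOCOLS = 9_500  # protocols underlying the CS FQ tables

def meets_ordinary_frequency(location_type: str, object_count: int,
                             area_user_count: int = 0) -> bool:
    if location_type in ("W", "D"):
        # reported in at least 2% of the 9,500 records (190 or more)
        return object_count >= 0.02 * N_PROTOCOLS
    if location_type == "Dd":
        # area used by at least 50 people (0.52%), and the object reported by
        # no fewer than two-thirds of those using the area
        return area_user_count >= 50 and object_count >= (2 / 3) * area_user_count
    return False

print(meets_ordinary_frequency("D", 205))       # True: 205 >= 190
print(meets_ordinary_frequency("Dd", 38, 52))   # True: 52 >= 50 and 38 >= ~34.7
```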
An example of an ordinary FQ item given to the Card 3 location depicted in Figure 1 is "butterfly" (see Figure 2 to compare the actual object).

² W indicates a whole response, in which the person uses the entire blot in their response. D indicates a common detail response, in which an area of the blot is used that is common (i.e., used by at least 5% of subjects in the development sample, n = 3,000). Dd indicates an unusual detail response, in which an area of the blot is used that is uncommon (i.e., used by less than 5% of subjects in the standardization sample). See Exner (2003, pp. 76-79) for a review.

[Figure 1. Card 3 location D3.]
[Figure 2. Image of a butterfly.]
[Figure 3. Image of a dumbbell.]
[Figure 4. Image of a dragonfly.]

The "u", or unusual FQ item, is defined as "A low frequency response in which the basic contours involved are appropriate for the response. These are uncommon answers that are seen quickly and easily by the observer" (p. 122). The u responses in the tables occurred in less than 2% of persons for W and D areas; for Dd areas they occurred in fewer than 50 people but were judged by at least three raters who unanimously deemed the response objects quick and easy to see, and appropriate for the contours of the blot (Exner, 2003). An example of an unusual FQ item is "dumbbell" (see Figure 3) given to the location depicted in Figure 1.

The "–", or minus FQ item, is defined as:

    The distorted, arbitrary, unrealistic use of form in creating a response. The answer is imposed on the blot structure with total, or near total disregard for the contours of the area used. Often substantial arbitrary lines or contours will be created where none exist. (p. 122)

An example of a minus FQ item is "dragonfly" (see Figure 4) given to the location depicted in Figure 1.

Although "o," "u," and "–" are the three primary FQ designations within the CS, Exner also included a code for a subcategory of the ordinary response. The "+", or ordinary-elaborated category, is defined by Exner (2003) as:

    The unusually detailed articulation of form in responses that otherwise would be scored ordinary. It is done in a manner that tends to enrich the quality of the response without sacrificing the appropriateness of the form use. The + answer is not necessarily original or creative but, rather, it stands out by the manner in which form details are used and specified. (p. 122)

Ordinary-elaborated responses differ from ordinary responses in that they include extra elaboration of articulated features; they do not necessarily have better fit with the blot, and they occur with less frequency than do ordinary responses.

When using the CS, FQ is scored for a response by looking up the response object verbalized by the test-taker in the published FQ tables (see Exner, 2003, for a review). If the object is listed in the tables under the appropriate card and location, then the corresponding FQ score is assigned to the response. If the object is not listed in the FQ tables, then the examiner must attempt to extrapolate from the tables by looking for similar objects that might be listed (e.g., "cherry" if "apple" is not listed), or by looking at object listings for a location that is quite similar to that of the response. If no comparable objects are listed, and there are no acceptable object listings in similar locations, then the FQ score determination for the response relies on the examiner's judgment.
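In code form, this lookup-then-extrapolate chain might look like the following sketch. The tiny three-entry table is built from the butterfly/dumbbell/dragonfly examples above (the real Table A lists 5,018 items or item classes), and the criteria applied when no extrapolation is possible are described next.

```python
# Hedged sketch of the CS lookup step: FQ tables keyed by card, location, and
# response object. The three entries come from the examples above; everything
# else about the real Table A is not modeled.
FQ_TABLE = {
    (3, "D3", "butterfly"): "o",
    (3, "D3", "dumbbell"):  "u",
    (3, "D3", "dragonfly"): "-",
}

def lookup_fq(card: int, location: str, obj: str) -> str | None:
    """Return the tabled FQ code, or None when the coder must extrapolate
    (e.g., try "cherry" when "apple" is not listed) or fall back on judgment."""
    return FQ_TABLE.get((card, location, obj))

print(lookup_fq(3, "D3", "butterfly"))  # "o"
print(lookup_fq(3, "D3", "apple"))      # None -> extrapolate or use judgment
```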
In cases when an object is not listed and extrapolation is not possible, the response is scored as unusual if it meets the following criteria: (1) the response can be quickly and easily identified, (2) it does not involve distortion of the blot contours, (3) no arbitrary lines — imagined lines imposed on the blot — are used in the formation of the response, and (4) the person does not close a broken figure in the formation of the response. If any of these four criteria is not met, then the response is scored as minus.

When a response is composed of more than one object, which is a common occurrence, it is oftentimes the case that the FQ scores associated with the multiple response objects will differ according to the FQ tables. Consider the following example response to Card 3: "It looks like 2 people and there is a big butterfly flying in between them." The response has three distinct objects: the two people and the butterfly. When responses contain more than one object that is an important part of the response, the object with the lowest FQ score determines the FQ score assigned to the response; there is never more than one FQ score assigned to a single response. The lowest-FQ-score rule only applies to objects that are deemed important to the overall response; if a response object is not important to the overall response, the FQ of that object is not used in determining the FQ score of the response. When working from the CS materials, the distinction between important and secondary objects within a response is sometimes not clear, though published guidelines and example protocols can be helpful in learning how to make such distinctions.

Rorschach Performance Assessment System (R-PAS) scoring of FQ

R-PAS FQ is a function of how accurate the response is (i.e., how well the object or objects included in the response fit, on the basis of shape, the inkblot location that was used in constructing the response) and how common the response is (i.e., how frequently the object or objects reported by the test-taker occur in that particular location). The FQ scores are assigned using published tables for guidance, which are contained within Chapter 6 of the R-PAS manual (Meyer et al., 2011). The 5,060 objects included in the R-PAS tables are organized into sections based on card number and location within the card. Each card begins on a new page and is accompanied by a location chart, which identifies the location numbers for the standard location areas on the card. Within each of the location sections in the FQ tables, the objects are first arranged by card orientation (i.e., the position the card was held in when the response was delivered). Within orientation, the objects are alphabetized within clusters that are based on five categories according to the type of object (i.e., objects that are human/human-like; objects that could be either human/human-like or animal/animal-like; objects that are animal/animal-like; objects that are anatomical/biological; and all other types of objects). Many of the object listings also contain clarifying information and elaborations that are intended to help orient the coder to the listed percept. For example, the FQ listings for Card 3 location D3 (see Figure 1) include an entry for "Hearts (Anatomical; 2 in Dd29)". The appropriate FQ code for each object is listed next to the object entry in the tables.
Like in the CS, R-PAS has three FQ codes that can be assigned to responses that incorporate the use of form: ordinary (o), unusual (u), and minus (–). There is an additional category that is used when a response does not contain any objects that use form: none (n). The ordinary FQ code is described as "form fit that is both relatively frequent and accurate," and in general the responses are "…quickly and easily seen" (Meyer et al., 2011). There are a total of 1,078 ordinary objects in the R-PAS FQ tables. The unusual FQ code is described as "form fit that is of intermediate frequency or accuracy or both," and although the unusual response objects are generally encountered less often than the ordinary response objects and typically have less accurate fit, "…they are not grossly inconsistent with blot contours. At times FQu responses fit a particular location well, but the fit is not readily obvious so the object is not commonly reported" (Meyer et al., 2011). There are a total of 2,377 unusual object listings. The minus FQ code is described as "form fit that is infrequent and inaccurate"; these responses are "…infrequent, if not rare, and also inaccurate, distorted, or arbitrary. They are difficult to see or only grossly approximate the actual contours and shape of the blot areas" (Meyer et al., 2011). There are 1,605 objects listed that have minus designations.

Before consulting the R-PAS FQ tables, the coder must determine whether the response contains form. The final R-PAS FQ code of none (n) is applied when a "response does not contain an object with definite form or outline" (Meyer et al., 2011). Responses that are scored with the none designation are typically impressionistic responses based on shading and/or color features of the inkblot that do not include any objects that make use of form. Like in the CS, a response such as "a lot of blood" does not use form. The object itself – the blood – does not have inherent form demand (i.e., it can take any shape), and the respondent did not introduce form into the response language. However, if the respondent had instead reported (or had added to their original response of "a lot of blood") that the blood was "dripping down," "smeared across this section," "splattered across the card," etc., the response object would then be considered to have form demand and would receive an FQ score based on the FQ table listings. One might notice that the FQ tables list a variety of objects that do not have inherent form demand but into which form can be injected, as demonstrated by the example. Therefore, the listed FQ code is only applied when form is injected into the response; if form is not specified by the response language, the none code is applied, regardless of the object potentially being listed in the tables with an ordinary, unusual, or minus FQ code assigned to it.

If it is determined that form is present in the response, R-PAS FQ is scored by looking up the response object(s) verbalized by the test-taker in the published FQ tables (see Meyer et al., 2011, for complete instructions). If the response contains only one object and that object is listed in the tables under the appropriate card and location, then the corresponding FQ score is assigned to the response. If the object is not listed in the FQ tables, then the examiner must attempt to extrapolate from the tables.
The R-PAS manual offers three basic extrapolation principles to help guide coders: (1) using the FQ tables to perform systematic extrapolation is preferable to using independent judgments; (2) the shape and spatial orientation of the listed object must be consistent with the response object when extrapolating from the tables; and (3) ideal extrapolation coding captures the entire response as a collective percept, not the various individual elements of the response. When responses contain only one object and extrapolation is needed, Meyer et al. (2011) provide a procedure to follow that can involve up to four steps. Essentially, the coder should first search within the appropriate location for objects with a similar shape to the response object, emphasizing key perceptual features that help delineate the object. If an extrapolation is not obvious at this point then the coder should proceed to the next step. If the location used in the response is not listed in the tables, the coder should extrapolate by consulting object listings in similar location areas (e.g., a larger area that subsumes the location area used in the response). As the next step, looking up subcomponents of the object in the appropriate sublocations can help inform the extrapolated coding (e.g., looking up the FQ codes for wings and antennae in the sublocations of the butterfly percept, if butterfly is not listed in the appropriate location and/or near-location). Finally, the accumulated information should be reviewed, with more weight given to the earlier steps than to the later ones, before deciding on the final extrapolated FQ judgment. When a response is comprised of more than one object, the FQ coding procedures are slightly different and more complex. First, the coder must differentiate important from unimportant objects in the response. Meyer et al. (2011) describe important objects as "…the central or focal response objects of a multiple object response. Most often they are mentioned first, and they are typically asserted with more commitment and spontaneity than unimportant objects. It is rare for there to be more than three important objects in a response." The manual includes elaboration of this concept, as well as several examples to help coders differentiate important and unimportant response objects. This concept is crucial for coders to understand because it can have a strong impact on the FQ designations that are assigned to responses during the coding process. As a first step in coding multiple-object responses, the coder should consult the FQ tables to check whether the percept is listed in its entirety in the overarching location area. Typically, though, the coder must employ additional coding steps. As a next step, the important response objects should be looked up in the tables in the appropriate location areas, and the lowest FQ code (minus < unusual < ordinary) from across the important objects should be assigned to the response. If extrapolation is required because some or all of the important objects are not listed, the general extrapolation process is the same as for single-object responses. Once the listed and/or extrapolated scores have been determined for each individual important object, the lowest object-level FQ code is applied to the response.
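A minimal sketch of this lowest-code rule follows; the ordering is the one just stated (minus < unusual < ordinary), and the example codes are hypothetical rather than table values.

# The response receives the lowest FQ code found among its important objects.
FQ_ORDER = {"-": 0, "u": 1, "o": 2}  # minus < unusual < ordinary

def response_fq(important_object_codes):
    return min(important_object_codes, key=FQ_ORDER.get)

# E.g., "2 people" looked up as o and "butterfly" as u: the response is u.
print(response_fq(["o", "u"]))       # u
print(response_fq(["-", "o", "o"]))  # a single minus settles the response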
Beginning the coding of each multi-object response with the seemingly poorest-fitting important object can save coding time; if that object is determined to have a minus FQ code, the response will also be assigned a minus code due to the coding rule that the lowest FQ code from across the important objects is the one that is applied to the response.
Review of FQ validity.
FQ validity – differentiation of clinical groups.
Over the years FQ scores have been shown to significantly differ across groups, with non-psychotic clinical control groups (e.g., Berkowitz & Levine, 1953; Knopf, 1956), non-clinical control groups (e.g., Friedman, 1953; Rickers-Ovsiankina, 1938; Sherman, 1952), and a mixed group (Beck, 1938) responding with better FQ than groups of people with various forms of schizophrenia or psychotic disturbance; researchers have attributed this ability to differentiate between groups to varying capacities for reality testing. Since the publication of Exner's CS in 1974, researchers have continued to demonstrate the effectiveness of FQ in identifying psychosis (e.g., Mihura et al., 2013). In a study of perceptual and thought disturbance, data were collected from individuals falling within categories along a schizophrenia spectrum: (1) those with no Axis I or II diagnosis following SCID-I and SCID-II assessment, and no first- or second-degree relative with a schizophrenia diagnosis (noDx); (2) first-degree relatives of patients with diagnosed schizophrenia (relatives); (3) undergraduate students who scored at least two standard deviations above the mean on the Perceptual Aberration Scale, the Magical Ideation Scale, or the Physical Anhedonia Scale (PerMag/PhysAn); (4) individuals diagnosed with schizotypal personality disorder following assessment with the SCID-I and the Structured Interview for DSM-III-R Personality Disorders (SPD); (5) outpatients diagnosed with schizophrenia following SCID-I assessment (outpatients); and (6) inpatients diagnosed with schizophrenia following SCID-I assessment (inpatients) (Perry, Minassian, Cadenhead, Sprock, & Braff, 2003). Using CS procedures and scoring, the groups differed on the protocol-level average proportion of minus responses (X-%, or distorted form; the percentage of responses with an FQ coding of -), with the means generally falling in the expected pattern: the noDx (M = .25), relatives (M = .23), and SPD (M = .28) groups had the lowest proportions of minus responses, while the PerMag/PhysAn (M = .36), outpatient (M = .33), and inpatient (M = .37) groups had notably higher X-%. In a meta-analytic review of 48 adult samples using the CS that were published in the Journal of Personality Assessment from 1974-1985, X+% could differentiate clinical and control samples with a large effect size (d = 1.05) (Meyer, 2001). X+%, or conventional form use, refers to the percentage of responses that are common and have appropriate form use (i.e., FQ coding of + or o) (Exner, 2003).
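As a worked sketch of these protocol-level proportions, the snippet below computes X+%, Xu%, and X-% from a hypothetical list of response-level FQ codes; it simplifies by treating every response as form-based (FQ none responses and related scoring details are ignored).

# Hypothetical protocol of 10 response-level FQ codes.
protocol = ["o", "o", "+", "u", "-", "o", "u", "o", "-", "o"]

n = len(protocol)
x_plus_pct = sum(c in ("+", "o") for c in protocol) / n  # conventional form use
xu_pct = sum(c == "u" for c in protocol) / n             # unusual form use
x_minus_pct = sum(c == "-" for c in protocol) / n        # distorted form

print(x_plus_pct, xu_pct, x_minus_pct)  # 0.6 0.2 0.2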
In the sample of 9,500 protocols used to assemble the 2003 CS FQ tables, the average proportion of ordinary (and ordinary-elaborated) responses (X+%) is .74 for nonpatients, .64 for outpatients, and .52 for inpatients; the average proportion of unusual responses (Xu%, or unusual form use; the percentage of responses with an FQ coding of u) is .15 for nonpatients, .17 for outpatients, and .20 for inpatients; and based on the values reported for X+% and Xu%, the average proportion of minus responses (X-%, or distorted form; the percentage of responses with an FQ coding of -) is approximately .11 for nonpatients, .19 for outpatients, and .28 for inpatients (Exner, 2003). XA% was not used by Meyer (2001), but is included in studies discussed later in this review; XA%, or Extended Form Appropriate, refers to the percentage of responses that have appropriate form use (i.e., FQ coding of +, o, or u). It is held that X+% imparts information about how conventional the person is in their responding, with exceptionally high scores indicating an unusual level of commitment to conventionality or preoccupation with social acceptability, and low scores indicating unconventional ways of understanding the inkblots (Exner, 1991). High X-% scores are believed to indicate inaccurate and distorted perception of the blots. Thus, the description of FQ as a measure of "perceptual accuracy" does not refer to a simple measure of good eyesight, but rather to a measure that can detect a person's ability to recognize objects in a way that is both accurate and in line with social convention (Exner, 2003). Wagner (1998) used the term congruent to describe cases in which the response percept, representing a real-world object, has good fit with the form present in the inkblot. Additional studies not included in Meyer's (2001) meta-analysis are consistent with his summary of FQ as scored using the CS. In a sample comprised of patients from an inpatient psychiatric unit of a Veterans Affairs hospital, as well as from a private practice, individuals were categorized as either psychotic or nonpsychotic based on DSM-III diagnoses made by staff psychiatrists and clinical psychologists (Peterson & Horowitz, 1990). The groups differed on CS X+% (t = 6.33, p < .05) in the expected direction, though descriptive statistics were not reported. Mason, Cohen, and Exner (1985) observed significant group differences in CS X+% in a comparison of schizophrenic inpatients (M = 0.52; SD = 0.17), inpatients with depressive disorders (M = 0.68; SD = 0.12), and individuals with no psychiatric history (M = 0.83; SD = 0.06), the contrasts producing large effect sizes (|d| = 1.06 to 2.53). Interestingly, the authors noted that some patients with schizophrenia had learned to conceal their symptoms, producing barren Rorschach protocols. Some of these protocols lacked indicators of thought disorder, but still tended to contain poor FQ scores indicating inaccurate perception. Diagnostic category membership had been determined using the Research Diagnostic Criteria, which lists sets of criteria for functional disorders (Spitzer, Endicott, & Robins, 1978). In another study examining clinical samples, Kimhy et al.
(2007) compared CS Rorschach protocols of individuals considered to be at high risk for psychosis (assessed using the Structured Interview for Prodromal Symptoms and the Scale of Prodromal Symptoms), patients with recent-onset schizophrenia (assessed using the Diagnostic Interview for Genetic Studies, with onset within the past 2 years), and patients with chronic schizophrenia (onset 3 or more years earlier). The groups did not differ significantly on X+%, Xu%, or X-%, though all three groups did show elevated levels of perceptual distortion consistent with their diagnostic categories. X-% for the groups was higher than would be expected for nonpatients (high risk M = .36; recent onset M = .31; chronic M = .39, compared to a nonpatient M of .11 in Exner's (2007) sample of 450 adult nonpatients). Effect sizes between the nonpatient sample and Kimhy et al.'s samples were large (|d| = 2.76 to 3.77). Similarly, X+% was lower than for nonpatients: high risk M = .45; recent onset M = .50; chronic M = .45; nonpatient M = .68. Large effect sizes were again observed for contrasts comparing the nonpatient sample to Kimhy et al.'s samples (|d| = 1.64 to 2.10). Kimhy et al. interpreted the results as an indication that deficits in visual processing might be evidenced before a person meets full criteria for a psychotic disorder, and that the FQ indices might have detected an endophenotype of risk for the development of psychosis. In other words, FQ indices may capture a genetically driven observable trait that could be linked to psychosis proneness. Mihura et al. (2013) completed a systematic and extensive review of the validity of the 65 main CS variables, which included some of the validity studies mentioned throughout this literature review. After thoroughly examining the peer-reviewed published literature and coding and tabulating the information according to clearly outlined procedures, the authors ended up with a total of 1,156 Rorschach validity coefficients that targeted the CS scores' core constructs. They screened the meta-analytic data for publication and selection bias before presenting validity results and found nothing of concern. In one set of analyses, Mihura et al. (2013) explored whether CS variables could differentiate target psychiatric samples from nonpatient samples, but also whether they could differentiate the target psychiatric samples from other diagnostic samples. They found that both X+% and X-% could differentiate psychotic disorder samples from nonpatient samples (r = .57, p = .01; r = .61, p < .01), as well as from other diagnostic samples (r = .31, p < .01; r = .47, p < .01). However, in follow-up moderator analyses, neither X+% nor X-% could differentiate the nonpatient samples from the comparison psychiatric samples (p = .25; p = .06).
FQ validity – criterion validity.
In evaluating the performance of FQ indices in relation to other tasks using clinical samples, an important consideration is task difficulty with regard to cognitive load. Minassian, Granholm, Verney, and Perry (2004) reviewed literature indicating that pupil dilation can be used as a measure of attention allocation or cognitive effort. They state that several researchers have observed deficits in pupillary response in individuals with schizophrenia, but the deficit is observed only when the individual is engaging in tasks requiring high levels of cognitive effort; it is not typically found in low-demand tasks, where pupillary response is normal.
Minassian et al. administered the Rorschach (following CS procedures) and a 10-picture version of the Boston Naming Test, a task requiring the person to name objects depicted in simple line drawings, to a sample of 24 patients with schizophrenia and 15 nonpatient participants (assessed using the SCID-IV). Both groups showed less pupil dilation in response to the Boston Naming Test than to the Rorschach. Additionally, the groups did not differ in dilation for the Boston Naming Test, but did differ in dilation during the Rorschach. Taken together, the results indicate that the Rorschach required more cognitive effort from both groups than did the Boston Naming Test, and the groups differed in cognitive load during the Rorschach but not during the Boston Naming Test. In planned analyses, a negative moderate-sized correlation was detected between level of pupillary dilation and X-% on Rorschach Cards 9 and 10 (r = -.37, p < .05; r = -.39, p < .05, respectively), though the trend in the data was not significant across all 10 cards (r = -.31, p = .06). However, in accordance with Cohen's (1992) guidelines, the small sample size might have resulted in inadequate power to detect a true effect on the other cards. At least based on the correlations found for Cards 9 and 10, the results were described as consistent with existing hypotheses stating that individuals with schizophrenia are not able to process complex and demanding stimuli at an optimal level because of a limited fund of attentional resources, which is quickly taxed in such a task. The authors further suggest that attentional limitations combined with cognitive overload can explain, at least partially, the fragmented thinking and thought disturbance seen in people with schizophrenia. In a criterion-validity study of FQ, X+% and X-% were assessed in conjunction with scores on several criterion measures of perceptual accuracy (Neville, 1995). F+% was also used; this is a no-longer-used CS score that is quite similar to X+%, the difference between the two scores being that F+% is based only on pure-form responses (i.e., responses that do not make use of any determinants other than form; other determinants can include the use of shading, color, the perception of depth, or the description of movement in the response). The normal group of participants was an undergraduate and community sample (n = 42), and the clinical group consisted of people in treatment for schizophrenia at a community support program of a mental health center (n = 20). Each participant completed a Rorschach (CS procedures), the Gestalt Completion Test, the Hidden Figures Test, and a signal detection task developed by Neville. The Hidden Figures Test assesses a person's ability to locate relatively simple geometric figures within a complex geometric design. The Gestalt Completion Test assesses the ability to identify objects from incomplete figural information. The signal detection task consisted of 75 figures that were each presented for two seconds. After the presentation of an item, the participant was asked to identify which figure out of four options, if any, was the figure previously presented. For this task, scores were d' values, a standard metric based on the difference between hits and misses. The only significant correlation within the normal group between the FQ variables and criterion scores was X-% with d' from the signal detection task (r = .31, p < .05; Neville, 1995).
For the clinical group, there were no significant correlations between the Rorschach FQ variables and the criterion variables. Although these results are not encouraging, there are two potentially important considerations. First, the normal and clinical groups were assessed separately, whereas a combined sample would have led to more score variability and increased statistical power. Second, aggregated criterion scores were not used, which could have led to more stable and accurate scores (see the discussion of aggregation below); however, the tradeoff with aggregation is a loss of specificity in interpreting the findings. Given these considerations, the results may be misleading and deserve to be followed up with more research. In a criterion-validity Rorschach study using a child and adolescent sample (Smith, Bistis, Zahka, & Blais, 2007), CS indices of FQ were compared to performance on another task considered to represent perceptual accuracy, the Rey-Osterrieth Complex Figure (ROCF). The authors anticipated that good FQ on the Rorschach (high X+% and WDA%; low X-%) would align with more accurate copying of the ROCF. The hypothesis was supported by the correlations observed between the ROCF and WDA% (r = .56) as well as X-% (r = -.45); the correlation was not significant with X+% (r = .26), though a small sample size (n = 27) could once again be a factor in the nonsignificance (Cohen, 1992). In the reviews of CS literature completed by Mihura et al. (2013), results were also categorized according to the type of method used for the validity criterion measure, the categories being "introspectively assessed" and "externally assessed." Introspectively assessed criteria included only self-report questionnaires and fully structured interviews whose results do not permit alteration by clinician judgment. Externally assessed criteria included DSM diagnosis, observer/chart ratings, and various performance-based measures. Not entirely surprisingly, when the authors examined the averages of the effect sizes included in the study across all CS variables, they found stronger results when the criteria were externally assessed (Zr = .28, r = .27; 770 total findings) than when the criteria were introspectively assessed (Zr = .08, r = .08; 386 total findings). When the criteria were limited to those that were externally assessed, across included studies there was excellent validity support for X+% (r = .48, p < .01; 29 total findings), X-% (r = .49, p < .01; 34 total findings), and WDA% (r = .46, p < .01; 7 total findings), and moderate support for Xu% (r = .32, p = .04; 7 total findings). WDA% refers to the percentage of responses that have appropriate form use (i.e., FQ coding of +, o, or u), and is calculated from only those responses that are given to common (W or D) locations. Results were quite different when effects using introspectively assessed criteria were aggregated: the FQ indices had either a zero correlation (X+% r = .00, p < .01; 6 total findings), no significant correlation (X-% r = .03, p = .68; 4 total findings), or no findings in the literature that qualified for inclusion in the meta-analyses (WDA% and Xu%).
FQ validity – SCZI, PTI, TP-Comp, & EII.
Rorschach found that individuals with psychotic disorders tended to have poor form quality, and as discussed above, this finding has been replicated by numerous researchers throughout the years. In more recent literature, studies have included indices partially comprised of FQ scores.
Development of the Schizophrenia Index (SCZI) began in the 1970s, and the index was finalized as the SCZI in 1984 (Exner, 1984; Exner, 1986). In 1991 the SCZI was modified to reduce the occurrence of false positives observed with some clinical groups (Exner, 1991). The modified SCZI is comprised of 6 criteria used to determine a score that can range from zero to six. X+%, X-%, and raw FQ sums are among the criteria (Exner, 2003). There is an abundance of support for the ability of the SCZI to detect group differences between psychotic and non-psychotic individuals (see Jorgensen, Andersen, & Dam, 2000; Exner, 2003; Hilsenroth, Fowler, & Padawar, 1998). However, continued problems with false positives and a potentially misleading name for the index led to the development of an alternate index, the Perceptual Thinking Index (PTI; Exner, 2000). The PTI has five criteria that result in a score of zero to five (Exner, 2003). Among the criteria are XA% and WDA%, as well as X-% and several indices of cognitive slippage. XA%, or Extended Form Appropriate, refers to the percentage of responses that have appropriate form use (i.e., FQ coding of +, o, or u); WDA% is very similar to XA%, the difference being that it is calculated from only those responses that are given to common (W or D) locations (Exner, 2003). W (whole) location scores are assigned when the response is given using the entire contents of the inkblot. D (common detail area) location scores are given to responses that use a part of the inkblot; these locations each appeared in at least 5% of the protocols evaluated in establishing location codes (Exner, 2003). Viglione (1996) completed a criterion-validity study of the Rorschach using a sample of inpatient, outpatient, and nonpatient children and adolescents. Participants completed the Rorschach as well as a true-false interview designed to assess atypical beliefs in children, the Child Unusual Belief Scale. X-% had a moderate correlation with the criterion measure (Spearman rho = .45), as did the SCZI (Spearman rho = .36). Archer and Gordon (1988) also used an adolescent sample to explore the diagnostic validity of the Rorschach and MMPI in adolescent populations experiencing psychotic or depressive symptoms, with DSM-III discharge diagnoses based on treatment team clinical judgment (primarily determined by clinical history and the team's behavioral observations). Teens with schizophrenia had fewer accurate perceptions and more distorted percepts present in their Rorschach protocols (X+% = .46; X-% = .34) than did those with major depression (X+% = .52; X-% = .23), dysthymic disorder (X+% = .58; X-% = .20), personality disorder (X+% = .50; X-% = .27), or conduct disorder (X+% = .61; X-% = .21). The same trend was seen with SCZI scores. The PTI has been shown effective in discriminating patients with psychotic disorders from non-patients, as well as from patients who were diagnosed with a Cluster A, Cluster C, or Borderline Personality Disorder (CA/BPD) (Hilsenroth, Eudell-Simmons, DeFife, & Charnas, 2007). The psychotic-disordered group consisted of adult inpatients with a DSM-IV psychotic disorder intake diagnosis, which was based on treatment team consensus following a review of all available data. Non-patient protocols were selected from Exner's (2003) sample of 450. The CA/BPD group was represented by archival files from a university psychological clinic.
Files with a personality disorder diagnosis listed were masked and reviewed for the presence/absence of a personality disorder by 4 doctoral students. Cases identified as having a personality disorder were reviewed again and rated for all Cluster B symptom criteria using the DSM-IV. When the PTI was broken down into its component scores, the mean FQ scores followed a pattern that would be expected for these groups. The non-patient group had the healthiest scores (XA% = .85, WDA% = .87, X-% = .14, X+% = .61), followed by the CA/BPD group (XA% = .69, WDA% = .71, X-% = .29, X+% = .42), and the psychotic group had the most impaired performance (XA% = .60, WDA% = .62, X-% = .36, X+% = .45). The mean X+% in the CA/BPD group appears to be somewhat lower than in the psychotic group, contrary to what we would expect, but the difference is not statistically significant. At a dimensional level, lower XA%, WDA%, and X+% scores, and higher X-% scores, were related to greater diagnostic severity (i.e., higher levels of relative thought and perceptual impairment), producing large effect sizes (|r| = .47 to .64). Dao and Prevatt (2006) similarly found that the PTI was effective in discriminating inpatient individuals with schizophrenia-spectrum disorders (SSD) from inpatients with mood disorder without psychotic features (MD). Both groups had been administered the SCID-CV and the SCID-II at intake, and the primary diagnosis was assigned following consensus by the clinical social worker and psychiatrist after review of their independent interviews with the patients and the SCIDs. As would be expected, the SSD group had higher PTI scores (M = 2.9) than the MD group (M = 0.89). Perhaps more interesting in the context of the current study, all FQ indices also differed by group status. The SSD group (XA% = .55, WDA% = .57, X-% = .42) displayed poorer FQ than did the MD group (XA% = .72, WDA% = .75, X-% = .26). The effect sizes for the FQ indices and PTI were all of large magnitude (d = 1.07 to 1.62). There is also an extensive literature (e.g., Archer & Gordon, 1988; Archer & Krishnamurthy, 1997; Bannatyne, Gacono, & Greene, 1999; Blais, Hilsenroth, Castlebury, Fowler, & Baity, 2001; Ganellen, 1996; Garb, 1984; Meyer, 2000; Ritsher, 2004) aimed at comparing the validity and clinical utility of the Rorschach and the MMPI (and MMPI-2), most of which is beyond the scope of this paper. However, Dao, Prevatt, and Horne (2008) published a concise summary of important references and also examined the clinical utility and possible incremental validity of the Rorschach and MMPI-2 with regard to the detection of psychosis. Group comparisons were reported for inpatients with either primary psychotic disorder (PPD) or primary mood disorder without psychotic features (PMD). The sample was comprised of 236 patients, and analyses were completed using the primary admission diagnoses. In their comparison of Rorschach and MMPI-2 protocols, the groups differed on mean PTI (PPD = 2.95, PMD = 1.13; d = 1.22) as well as on all three PTI criteria that involve FQ indices (d = 0.92 to 1.30). Additionally, the authors concluded that the PTI was better at psychosis group discrimination than was the MMPI-2. In the CS meta-analyses completed by Mihura et al. (2013), the PTI had excellent validity support when externally assessed criteria (e.g., DSM diagnosis, observer/chart ratings, and various performance-based measures) were used that related to disturbed thinking and distorted perceptions (r = .39, p < .01; 30 total findings).
When introspectively assessed criteria were used, the PTI had a statistically significant but low level of validity support (r = .10, p < .01; 23 total findings). Additionally, the PTI differentiated target psychotic disorder samples from nonpatient samples (r = .72, p < .01) as well as from other psychiatric samples (r = .47, p < .01). In follow-up moderator analyses, the PTI was also able to differentiate the nonpatient samples from the comparison psychiatric samples (p < .01). Revisions were made to the PTI prior to the publication of R-PAS to make the index continuous and in the hope of improving its inter-rater reliability and validity (Viglione, Giromini, Gustafson, & Meyer, 2014). In the new Thought and Perception Composite (TP-Comp), the dichotomous PTI cut scores were replaced with a regression-based model that produced continuous scores, as opposed to integer scores. To calculate the regression model used in constructing TP-Comp, the authors used the original PTI as the predicted variable and loaded the individual variables that were used to calculate the PTI into the regression model as predictors. The beta-weights of the resulting regression equation were used to construct TP-Comp. PTI and TP-Comp were highly correlated in the independent validation sample (r = .87), and TP-Comp was shown to have higher inter-rater reliability and validity (Viglione et al., 2014). The Ego Impairment Index (EII) is another index that is partially comprised of FQ scores. The EII was developed (Perry & Viglione, 1991) as a Rorschach index of the degree of general psychological impairment experienced by the test-taker, and it has a strong empirical foundation as a measure of psychopathology severity and thought disturbance (Meyer et al., 2011). It is similar to the PTI, but it also contains components related to self and other representations, as well as indices of crude or disturbing thought content and imagery. The original EII is comprised of 5 criteria, and the final EII scores are determined by multiplying the scores for each of the criteria by weights that were determined through factor analysis. Among the 5 criteria are the sum of FQ- scores, and M-, the number of responses that contain human movement and were also scored FQ-. The EII underwent slight revision and was renamed the EII-2 (Viglione, Perry, & Meyer, 2003) when one of the component variables was revised (Viglione, Perry, Jansak, Meyer, & Exner, 2003). However, despite the differences in calculation, the EII and the EII-2 are extremely highly correlated with each other (r = .99; Viglione, Perry, & Meyer, 2003). In a meta-analysis examining the EII or EII-2 and its relationship to general psychological disturbance, 14 publications and a total of 13 samples met inclusion criteria (Diener, Hilsenroth, Shaffer, & Sexton, 2011). As had been predicted, higher EII scores were associated with greater levels of psychiatric severity (r = .29; p < .01). In the moderator analyses, it was determined that the type of criterion variable impacted the effect sizes.
After breaking down the analyses by type of criterion variable, it became clear that effect sizes were larger when the criterion variable was based on researcher ratings (r = .45, p < .01) than when it was based on clinician ratings (r = .19, p < .01), informant ratings (r = .18, p < .01), self-report ratings (r = .10, p = .07), or level of treatment or placement status (r = .11, p = .08); criterion variables consisting of performance-based measures (r = .37, p = .01) also had larger effect sizes than self-report ratings or level of treatment or placement status. A more recent study of the EII-2 was completed using a child sample in Tehran, Iran (Mohammadi, Hosseininasab, Borjali, & Mazandarani, 2013). The patient sample consisted of children who had been hospitalized with a diagnosis of childhood-onset schizophrenia (n = 10) and were under outpatient care at the time of the CS Rorschach administration, and a comparison sample consisted of "normal" children (n = 10). The diagnosis of childhood schizophrenia for members of the patient sample was verified by a child psychiatrist with administration of the Structured Clinical Interview for the DSM-IV-TR. The authors broke the EII-2 down into its components and examined each component individually. The sum of FQ- scores differed between the two samples (d = 2.62, p < .01), with a higher rate of FQ- scores in the patient group (M = 1.17, SD = 0.45) than in the normal group (M = 0.25, SD = 0.21). The authors also found statistically significant differences between groups on 3 of the 4 remaining subcomponents of the EII-2. The EII-2 was also used in a study of functional and social skills capacity in adult patients with schizophrenia or schizoaffective disorder (Moore, Viglione, Rosenfarb, Patterson, & Mausbach, 2013). Patients had psychiatrist-assigned chart diagnoses based on the DSM-IV and were considered stable at the time of the assessment. One to two weeks after completing a series of questionnaires, the Rorschach was administered using an early version of the R-PAS manual; FQ was coded using the CS FQ tables, as the R-PAS FQ tables were not yet available. Correlations with the EII-2 were observed in the expected direction for structured interview-based indices of positive symptoms (PANSS Positive r = .32, p < .05) and total symptoms (PANSS Total r = .31, p < .05), but there was not a significant correlation with negative symptoms (PANSS Negative r = .01, p > .05). Contrary to expectations, there was no correlation between the EII-2 and performance-based measures of everyday living skills (UPSA r = -.10, p = .40) or social skills capacity (SSPA r = -.00, p = .97). Interestingly, healthier EII-2 scores were associated with higher global cognitive ability (RBANS r = -.33, p < .05). When the number of FQ- responses was broken out from the EII-2 as a standalone score, there were no significant correlations with any of the criterion measures. In 2011, the EII-2 was again revised, becoming the EII-3 (Viglione, Perry, Giromini, & Meyer, 2011), and it is the EII-3 that is included in R-PAS. The EII-3 is based on three revisions: a change in the distribution of R due to R-Optimized administration, removal of food content from the coding process, and transformations of variables to make them follow a normal distribution as closely as possible. The same regression procedure that was used to construct the EII-2 was again used to calculate the weights applied to variables in the EII-3.
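This regression procedure, used for both TP-Comp and the EII-3, can be illustrated schematically: the existing index is regressed on its own component variables, and the fitted beta-weights define the new continuous composite. The sketch below uses randomly generated placeholder data; the variables, weights, and sample are hypothetical, not those used in constructing either index.

import numpy as np

# Fit OLS with the integer index as the criterion and its component
# variables as predictors; the coefficients define a continuous composite.
rng = np.random.default_rng(0)
components = rng.normal(size=(200, 5))           # 5 hypothetical component scores
index = np.round(np.clip(components.sum(axis=1) + rng.normal(size=200), 0, 5))

X = np.column_stack([np.ones(200), components])  # add an intercept column
betas, *_ = np.linalg.lstsq(X, index, rcond=None)

continuous_composite = X @ betas                 # continuous, not integer-valued
print(np.corrcoef(continuous_composite, index)[0, 1])  # high, but below 1.0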
The correlations between the EII-3 and the previous versions are strong (EII-2 r = .98; EII r = .95; Viglione et al., 2011). After publication of the R-PAS manual (Meyer et al., 2011), two studies were published that compared earlier Rorschach indices (the PTI and EII-2) to the R-PAS versions of the indices (the TP-Comp and EII-3). The first study examined the predictive validity of both the older and newer versions of the indices, and explored whether TP-Comp and the EII-3 had incremental validity over the PTI and EII-2 (Dzamonja-Ignjatovic, Smith, Jocic, & Milanovic, 2013). The authors also explored the degree of overlap observed between the TP-Comp and the EII-3. The samples of adult inpatients were drawn from the archives of a psychiatry institute in Serbia, and diagnoses were made by a psychiatrist and further vetted in case conferences prior to and independent of the Rorschach administrations. The final psychotic or schizophrenic sample (n = 100) was comprised of patients who were receiving antipsychotic medications at the time of the Rorschach administration; the nonpsychotic sample (n = 111) was comprised of patients with diagnoses of various anxiety states, depression without psychotic features, or mixed depression and anxiety, and they were receiving anxiolytics, antidepressants, or a combination of both at the time of testing. The Rorschach was administered and scored for FQ following CS guidelines. All four indices were able to effectively discriminate the two samples with p < .01 (PTI d = 1.77; TP-Comp d = 2.16; EII-2 d = 1.58; EII-3 d = 1.92). Using hierarchical logistic regression, the authors found that the TP-Comp and EII-3 provided a small increment over the PTI and EII-2 in predicting group membership; the PTI and EII-2 did not contribute any predictive power to the model that contained the TP-Comp and EII-3 as the first step. Dzamonja-Ignjatovic et al. (2013) also found that TP-Comp had a small amount of incremental validity over the EII-3, and conversely the EII-3 also had some incremental validity over TP-Comp in the prediction of psychotic disorder group membership. The reality-testing components (based on FQ) of the TP-Comp (d = 1.33, p < .01) and the EII-3 (d = 0.92, p < .01) were also able to differentiate the patient groups as standalone indices. The second study comparing the PTI and EII-2 to the TP-Comp and EII-3 was conducted to investigate the international adaptability of R-PAS, and explored the validity of the indices in Taiwan (Su et al., 2015). The sample consisted of culturally Taiwanese adults who were classified as nonpatients (n = 15), outpatients (n = 37), patients from a day-treatment program (n = 11), or inpatients (n = 27). The Rorschach was administered in Taiwanese Mandarin using a translated set of R-PAS administration instructions, and FQ was scored with the CS tables for the earlier indices (PTI and EII-2, and the FQ indices X-% and WDA%) and with the R-PAS tables for the more recent versions (TP-Comp and EII-3, and FQ-% and WD-%). Each of the R-PAS indices was highly correlated with its CS counterpart. As hypothesized, all eight Rorschach indices were also correlated with the criterion measures, and in the expected directions. Correlations with the Magical Ideation Scale, a self-report scale used to identify proneness to psychosis, had a range of |r| = .23 to .54. Correlations with the Positive and Negative Syndrome Scale, a semi-structured interview used to evaluate schizophrenia symptoms, had a range of |r| = .37 to .54.
Correlations with the single-item Clinical Global Impressions-Severity rating, with which clinicians assess overall mental health by comparing the patient to others in the population, had a range of |r| = .39 to .60. Finally, correlations with Diagnostic Severity, a 1-5 scale indicating the severity of the DSM-IV diagnosis(es) of each patient, had a range of |r| = .34 to .50. Using hierarchical regressions, Su et al. (2015) found that the R-PAS indices also incremented the CS indices in predicting each of the criterion measures, but the CS indices did not increment the R-PAS indices in predicting any of the criterion measures.
FQ validity – malingering.
Netter and Viglione (1994) examined FQ as part of a larger Rorschach malingering study. Malingering is an important consideration in the interpretation of FQ indices or other Rorschach indices containing FQ components. The CS, unlike many other measures used in clinical practice, does not include formal indicators to assess for valid responding. Thus, it is helpful to have such studies to provide evidence regarding how resistant Rorschach indices are to the malingering efforts of test-takers. After providing 20 participants with an external incentive to motivate successful malingering of schizophrenia (they were told they would receive their movie tickets only if they could fool the examiner), their protocols were compared to a control nonpatient group and to a group of patients with diagnosed schizophrenia. The malingering and nonpatient groups were screened with a demographics questionnaire and scored below the established cutoff on the Gorham Proverbs Test, a test that can be used to assess thinking disturbances in schizophrenia. The schizophrenia group was diagnosed by two psychiatrists using the DSM-III based on clinical interviews and histories, and diagnostic status was validated by scores exceeding the established cutoff on the Gorham Proverbs Test. Consistent with the authors' hypotheses, the schizophrenic group had more distorted FQ than the control group (X-% = .32 and .20, respectively), but the malingerers (X-% = .27) did not differ from the schizophrenic group. However, the modified version of X-% proposed and tested by Netter and Viglione did discriminate between the malingerers and the schizophrenic group (Modified X-% = .22 and .30, respectively). The specific scoring criteria for the modified scores were not published in the article, but the criteria were broadly based on strange testing behaviors considered somewhat unusual or uncommon for truly psychotic individuals. Ganellen, Wasyliw, and Haywood (1996) also explored the possibility that psychosis could be malingered on the Rorschach. A sample of 48 forensic patients completed assessments for fitness to stand trial and/or sanity at the time of the committed crime, most of the crimes carrying either a long prison sentence or the death penalty. As part of the assessment, each patient completed both the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1967) and the Rorschach. The MMPI validity scales were used to assign each patient to either the honest or the malingered group. When compared to patient samples and profiles of identified malingerers in the criminal justice system (using extra-test information), the malingered group produced MMPI profiles containing highly elevated levels of psychosis indicators and thus appeared to be exerting effort to produce pathological-looking records.
However, the honest and malingered groups did not differ on any of the FQ indices included in the study (X+%, Xu%, and X-%), nor did they differ on the SCZI or on a frequency-based indicator of highly conventional responding known as the Popular variable. In essence, the malingering group did not produce Rorschach protocols that contained psychotic indicators more frequently than did the honest group.
Limitations of FQ.
Although there is a strong base of literature supporting the use of FQ in assessing reality testing, there are some noteworthy limitations. CS FQ is largely based on the frequency of response objects within specific locations. Though it is not inherently problematic that CS FQ is in part determined by the location and frequency of the response object, this does represent a departure from the CS conceptual definition of FQ as indicating accuracy of perception on the Rorschach. Also, in Exner's CS, the FQ score assigned to a response is based in part on the response verbiage rather than the shape of the object. For example, consider the response of "anchor" to Card 3, the D2 location (see Figure 5): according to the CS FQ tables, "anchor" (see Figure 6 to compare the shape of the actual object) is a minus response, while "fishhook" (see Figure 7) is an unusual response. Although the shapes of the actual objects are quite similar, "fishhook" is assigned a higher CS FQ score than "anchor" when given to that Rorschach location, and, subjectively speaking, it does appear to have better fit with the blot. On Card 3, location D3, "Eye glasses" is a minus response while "sun glasses" is scored unusual; this might be partially due to the fact that the D3 location is red and that sunglasses have colored or darker lenses than eyeglasses. For Card 5, location W, "Crow" is an ordinary response, while "bird" is scored as either ordinary or unusual, depending on the location of the beak (Exner, 2003, Table A). The provided examples demonstrate that some response objects have quite similar shapes but nonetheless receive different CS FQ scores due to the frequency of word choice and conventionality; using less conventional words and/or more specific language when describing objects can result in lower CS FQ scores. Additionally, some response objects differ in FQ score due to non-form stimulus features of the inkblot (e.g., the color of the blot). Thus, it can be observed that FQ scoring following CS guidelines results in variation in scores due to factors other than the form fit of the response object to the location. Again, these observations are not inherently problematic, but they do indicate a departure in CS FQ scoring from the CS definition of FQ. Such conceptual deviations were noted in the development of R-PAS, and the authors of the R-PAS FQ tables went to great lengths to explore and document the interplay of fit and frequency data. In the end, they incorporated both into the R-PAS FQ tables as well as into the FQ variable descriptions (Meyer et al., 2011). They also looked into seeming inconsistencies, such as the examples above, and attempted to bring more consistency to the FQ tables in the R-PAS manual. However, the R-PAS FQ tables retained the CS system of trichotomizing FQ into ordinary, unusual, and minus responses. The natural next step in the advancement of FQ coding is to dimensionalize FQ, much as other indices have been dimensionalized (e.g., the PTI).
Figure 5. Card 3 location D2.
Figure 6. Image of an anchor.
Figure 7. Images of fishhooks.
Rorschach Form Accuracy (FA)
The development of FA.
Rorschach FA was developed as one component of Perceptual Accuracy (PA; Meyer & Viglione, 2008), the ultimate goal of the PA system being to improve the coding of perceptual accuracy on the Rorschach by addressing the limitations of CS FQ. FA was designed to quantify the goodness of fit between the features of the inkblots and the response object; it can be combined with response object frequency to form the PA scoring system, a planned alternative approach to FQ for quantifying accuracy of perception on the Rorschach. Development of FA began in 2001 (e.g., Meyer, Patton, & Henley, 2003) with the creation of a database of response objects identified by various Rorschach systematizers (including Exner, Beck, Hertz, Thomas, Beizmann, Rorschach, Bohm, Loosli-Usteri, Binder, Klopfer, and others). Subsequently, these response objects were rated by an international sample of judges on how accurately the objects can be perceived in the specified inkblot locations (Meyer & Viglione, 2008; Viglione, Meyer, Ptucha, Horn, & Ozbey, 2008). Once ratings were completed, they were averaged across judges for each response object, resulting in a final numeric score for each response object. Following are some of the details of the Rorschach FA Project (Meyer & Viglione, 2008). Each of the response objects was rated by five to 15 judges. In the end, ratings were obtained from a total of 569 judges, including undergraduate and graduate students, professionals in the field, and student-recruited community members. In addition, a convenience sampling approach generated ratings from non-university and non-professional adult participants in Turkey, Romania, Japan, and Taiwan. The total number of rated objects was 13,031, and 129,230 ratings were obtained in all. Judges were asked to rate the objects by answering the question "Can you see the response quickly and easily at the designated location?" The ratings were made on a 5-point Likert-type scale, with the following answer categories:
1) "No. I can't see it at all. Clearly, it's a distortion."
2) "Not really. I don't really see that. Overall, it does not match the blot area."
3) "A little. If I work at it, I can sort of see that."
4) "Yes. I can see that. It matches the blot pretty well."
5) "Definitely. I think it looks exactly or almost exactly like that."
Rating forms were provided, and these also included items for indicating the participant's background experience with the Rorschach, gender, age, and country of residence. Rating forms were made available in Portuguese (with slightly different forms for use in Portugal and Brazil), Japanese, Finnish, Traditional Chinese, and English. In general, judges were asked to rate about 250 objects each; the final 13,031 objects were rated an average of 9.9 times by judges who were from Brazil, China, Finland, Israel, Italy, Japan, Portugal, Romania, Taiwan, Turkey, and the United States. Because judges varied considerably in their use of the rating scale, raw scores were ipsatized by conversion to z-scores on a per-rater basis before the median score for each object was calculated. Subsequently, the ipsatized median scores were re-expressed on the original 1 to 5 distribution. An example of a relatively well-fitting FA item given to the Card 3 location depicted in Figure 8 is "bowtie" (see Figure 9 to compare the shape of the actual object); an FA score of 4.5 would be assigned to this response. A moderately well-fitting response object is "insect" (see Figure 10; FA = 2.8).
An example of a poor-fitting object for the depicted location is "werewolf" (see Figure 11; FA = 1.3), which illustrates a clear violation of form.
Figure 8. Card 3 location D3.
Figure 9. Image of a bowtie.
Figure 10. Image of an insect.
Figure 11. Image of a werewolf.
Scoring of FA.
During the development of FA, various scoring procedures were considered and piloted. In the end, final FA scoring guidelines were developed as detailed here and in Horn (2009). Each response within a protocol is assigned an FA score. Similar to scoring FQ using the CS or R-PAS tables, FA tables were developed that allow for objective assignment of FA scores for most responses. The FA tables are organized by card and by location within each card. The standard object locations are included in the tables, as well as some locations that are not standard CS or R-PAS locations but that occur with some frequency. Within a location section specific to one of the 10 cards, all rated objects from that card location are listed along with the corresponding FA score. As in the coding of FQ, when a response contains a single object that is listed in the table, the corresponding FA score is assigned to the response. When more than one object comprises the response, all objects important to the response are considered in assigning the FA score. If the gestalt of the response is listed in the table (e.g., "2 people" to the whole location of Card 7), the corresponding FA score is applied to the response. However, it is sometimes the case that there is more than one important object in the response and the gestalt is not listed. In such instances each important object is looked up in the FA tables separately and the lowest FA score is assigned to the response. The decision to assign the lowest FA score from among the important response objects was debated throughout the development of FA. Another consideration was to use the mean FA of the important objects within the response. However, one important asset of the Rorschach is its ability to identify pathology; pathology of perception is the construct of interest in the development of FA, and it seems most beneficial to identify perceptual lapses rather than an ability averaged across the response. Therefore, it was decided that using the lowest FA score from within a response provides the most useful information when assessing perceptual accuracy. There are times when objects are not listed in the FA tables; such instances require the examiner to attempt extrapolation from the tables when possible. If an unlisted response object is quite similar in shape to a listed object within the same location, the listed object can be used in assigning the FA score. Such decisions should be based exclusively on the shape of the object (e.g., if "apple" was not listed, "cherry" could be an appropriate extrapolation object), not on word choice or simple categorization of objects (e.g., if "ostrich" was not listed, "hawk" would not be an appropriate extrapolation object, even though they are both birds). Another technique for extrapolation is to look up the response object (or objects quite similar in shape) in a location that is very similar to that of the response location. If extrapolation is not possible using the aforementioned techniques, then the examiner must make a judgment about the FA score that seems appropriate for the response.
To aid in this decision, the examiner should make use of the criteria used by raters in the development of the tables (i.e., the descriptions of what warrants each FA score). Additionally, the examiner should reference the listed objects and their corresponding FA scores for the location(s) used in the response. This is done to acquire benchmarks for what raters considered well- versus poor-fitting objects for that specific blot area.
Review of FA validity.
FA was developed as one component of a new PA system for scoring the quality of perceptual behaviors on the Rorschach, the goal at the time being to improve upon the existing CS FQ system of scoring. It is now hoped that FA can also be used to improve the R-PAS FQ system of scoring. However, the concept of scoring the Rorschach for pure accuracy (or inaccuracy) of perception is not a new one; two previous Rorschach studies were found that examined very similar approaches to quantifying the perceptual accuracy of responses. Conducted more than 50 years ago, the studies seem to have been largely overlooked by researchers until recently. In one study, 100 adults were provided a list of 329 response objects reported to the whole area of all ten inkblots, and the adults rated the objects dichotomously as perceptible and fitting the inkblot reasonably well, or not (Walker, 1953). These ratings were aggregated to comprise reference tables for W responses. A total of 219 responses given to the W location were scored using the Walker (1953) tables and the Beck FQ tables. Using chi-square analyses to compare groups (normal responses vs. schizophrenic responses), the average of the obtained form accuracy ratings significantly differentiated responses given by a normal population (n = 122 responses) from those given by patients with paranoid schizophrenia (n = 97 responses). However, Beck's traditional method of scoring FQ did not. Kimball (1950) selected 10 W-location responses for each card from the Beck and Hertz FQ tables to be rated. Using a 6-point scale of goodness of fit, form accuracy ratings were given to each of the 100 responses by 4 sets of raters (total n = 103), with the sets grouped by amount of training and experience with the Rorschach. The form accuracy ratings varied widely across judges, with much of the variation assumed to be due to judges' projection and lack of clarity about where components of the whole percept were located or how they were positioned within the blot (Kimball, 1950). Key recommendations from Kimball's (1950) form accuracy methodology were applied to the FA Project (Meyer & Viglione, 2008) in that many of the objects to be rated had parenthesized location aids to orient the judge, and ratings of each object were made by several judges with varying amounts of experience with the Rorschach. Several preliminary studies have provided more recent information about the validity of FA compared to FQ. Findings were mixed in the original research exploring the relative validity of CS FQ and FA scores against criteria that assess the ability to correctly interpret the nature of interpersonal relationships and to correctly comprehend nonverbal communication (Horn, Meyer, Viglione, & Ozbey, 2008).
Criterion measures included: (1) the Interpersonal Perception Task, in which the test-taker watches a 20-minute video of 15 conversation scenes and selects the type of interaction depicted in each scene; (2) the Profile of Nonverbal Sensitivity–Face and Body, which consists of 40 two-second segments of the face or torso of a woman expressing a sentiment, with a multiple-choice question after each segment in which the test-taker is asked to describe the depicted woman's action; and (3) the Communication of Affect Receiving Ability Test, during which the test-taker watches 32 videotaped instances of a person (the "sender") watching a scene (each scene being sexual, scenic, unpleasant, or unusual) that is not visible to the test-taker. From the sender's reaction to each scene the test-taker has to identify the type of scene being viewed. The relative independence of these criterion measures introduced noise into the results, but both FA-derived and FQ-derived indices produced associations with the criterion measures; effect sizes ranged from small to moderate. In a second study, researchers examined how FA compared to CS FQ in a validation study using positive psychotic symptomatology as a criterion (Ozbey, Meyer, Viglione, Dean, & Horn, 2008). FA was slightly but nonsignificantly better than CS FQ in predicting a composite measure of disordered thinking in a sample of 61 long-term adult psychiatric patients. The protocol-level mean FA (the average of the FA scores assigned to each response within the protocol), the protocol-level mean of the lowest 25% of response-level FA scores, and FQ each predicted the disordered thinking composite with a moderate effect size (r = -.33, -.31, and -.34, respectively). When FA ratings were converted to FQ equivalents of -, u, and o at the response level and then converted into analogs of X-%, X+%, XA%, and WDA% at the protocol level, the FA versions of these scores showed slightly larger correlations with criterion measures than the FQ versions (e.g., for X-% and WDA%, the rs were .39 and -.38 for the FA-based scores and .35 and -.33 for the FQ-based scores). In a third study, researchers investigated the relative validity of CS FQ indices and FA scores in differentiating patients based on their general degree of psychiatric severity (Ptucha, Saltman, Filizetti, Viglione, & Meyer, 2008). Findings were mixed in that FQ-derived variables predicted diagnostic group (i.e., no diagnosis, non-psychotic, or psychotic) and diagnostic severity (a combination variable of diagnosis and patient status), while FA-derived variables predicted patient status (i.e., non-patient, outpatient, or inpatient). However, Ptucha et al. (2008) computed FA indices somewhat differently than did the other two FA studies described above. In a more recent study, Rorschach protocols and a variety of criterion measures were collected from 114 adult college students in a comparison of FA and CS FQ validity (Horn, 2009). Criterion variables represented a wide range of perceptual abilities, including elemental visual-spatial ability (the Judgment of Line Orientation; Benton, Sivan, deS.
Hamsher, Varney, & Spreen, 1983), the ability to unite a disparate perceptual field (Gestalt Completion Test and Snowy Pictures Test; Ekstrom, French, Harman, & Dermen, 1976), the ability to make inferences about the mental states of others (Eyes Test–Revised Version; Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001), as well as a test of complex interpersonal perception in which participants rated the IQ and personality characteristics of college students seen in 10 different videos (Carney, Colvin, & Hall, 2007). From the evaluation of convergent validity, Horn (2009) observed that protocol-level FA indices assessed basic perceptual processes, while CS FQ indices aligned more with tasks of interpersonal perception; effect sizes ranged from small to moderate. Results were interpreted as indicating that FA likely assesses perceptual accuracy at a more basic and concrete cognitive-processing level, while FQ seems to index a more complex, "warmer" type of processing in which the test-taker is accurate in detecting nuanced interpersonal and personality cues. Such results may be early indications of the importance of factors other than objective accuracy of fit when using the Rorschach to assess such a nuanced construct as perceptual accuracy.

Rorschach Frequency of Perceptions

As described in the overview of CS FQ, Exner embedded frequency of percepts into the development of the FQ tables: To be classified as ordinary, a percept must have been identified by at least 2% of persons in the FQ data pool for W and D areas, or by at least 50 persons (0.52%) in the pool who responded to Dd areas. For an item to have an FQ listing of unusual, the percept must have occurred in less than 2% of persons for W and D areas, or in fewer than 50 people for Dd areas (Exner, 2003). This rough benchmark reflects the impact of percept frequency on CS FQ determination, but frequency of percepts cannot be formally assessed separately from perceptual accuracy using the CS FQ tables, and the same is true for the R-PAS FQ tables. In the R-PAS FQ tables, the process of balancing fit with frequency in the determination of the final 3-level FQ code assignments was much more iterative and nuanced. A table in the technical manual chapter (see p. 424) summarizes the final FQ determinations with regard to fit and frequency levels, as well as with regard to CS FQ classification. In general, higher fit and frequency values correspond to higher R-PAS FQ codes, and most objects are classified as unusual when fit is high and frequency is low, or when fit is poor but frequency is high.

Popular responses. Exner (2003) identified 13 percepts that are labeled as Popular responses in the CS. Popular responses were "defined using a criterion that requires the answer to appear at least once in every three protocols," the sample consisting of approximately 7,000 protocols (pp. 129-130). Within the category of Popular, response frequencies still vary; Exner (p. 130) found that some Popular responses barely meet the criteria used in development of the score (e.g., animal to D1 location of Card 2; crab to D1 location of Card 10), while others are seen in more than four-fifths of protocols (e.g., human figures to D9 location of Card 3; animal figure to D1 location of Card 8). The CS Populars were also retained in R-PAS (Meyer et al., 2011).
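To make these frequency benchmarks concrete, the following is a minimal sketch of the frequency component only (the fit criteria that also enter into FQ are ignored here); the function names and simplified inputs are hypothetical illustrations, not part of the CS itself.

```python
def cs_frequency_level(pct_of_protocols: float, location_type: str) -> str:
    """Classify a percept's frequency under the CS benchmarks described above.

    For W and D areas the ordinary/unusual cutoff is 2% of the data pool;
    for Dd areas it is 50 persons, i.e., roughly 0.52% of the pool.
    (Hypothetical helper; FQ also depends on fit, which is not modeled here.)
    """
    cutoff = 2.0 if location_type in ("W", "D") else 0.52
    return "ordinary-frequency" if pct_of_protocols >= cutoff else "unusual-frequency"

def is_popular(pct_of_protocols: float) -> bool:
    """Popular responses appear at least once in every three protocols (~33.3%)."""
    return pct_of_protocols >= 100.0 / 3.0

# Example: a percept seen in 41% of protocols at a D location
print(cs_frequency_level(41.0, "D"), is_popular(41.0))  # ordinary-frequency True
```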
The Populars give a rough indication of the conventionality of a test-taker's perceptions, but the index represents only how frequently the person identifies the extreme end of what would be considered "common" responses. The score can also be misleading for clinicians: Weiner (1998) points out that two individuals might have the same number of Popular responses on a protocol but could be quite different from each other in capacity for recognizing conventional reality. Consider the case in which one individual gives the four most frequently seen Popular responses, while another individual also gives four Popular responses, but delivers the ones that are least common. On the surface, these two individuals appear to have the same capacity for identifying conventional reality, but they fall at different places along the spectrum of conventional responding.

Findings using Rorschach indices of response frequency. A number of imaging studies have explored Rorschach perception in recent years with special attention paid to frequency of responses. Asari et al. (2008) used fMRI scanning to identify cognitive regions that are activated in the production of unique versus conventional visual perceptions. The Rorschach was administered to 217 Japanese participants to form a control sample of protocols. The response percepts were tabulated and used to create frequency benchmarks: Responses were labeled "frequent" if they occurred in at least 2% of the control protocols, "infrequent" if they occurred in the control protocols but did not reach the 2% benchmark, and any responses not encountered in the control group were classified as "unique." Sixty-eight Japanese volunteers were screened for psychiatric or neurological illness using a structured interview, then delivered verbal responses to the Rorschach cards while in an fMRI scanner. They were prompted to deliver as many responses as they could within the three-minute presentation time for each card. Levels of activity in the right temporopolar regions of the brain at the time the response was vocalized differed based on the response being "unique" versus "frequent." Asari et al. reviewed literature suggesting that the right temporal pole is an important node where perceptual and emotional signals converge, and that it has also been implicated as part of the system that stores emotional and autobiographical memories. Olson, Plotzker, and Ezzyat (2007) suggest that this brain region is also an integral part of the activation of personal memories when a person becomes emotionally aroused.

Asari et al. (2010) used the same experimental and control samples but examined brain structure instead of brain function. They computed a Unique Response Ratio for each experimental participant, consisting of the sum of "unique" responses in the protocol divided by the total number of responses produced in the protocol. As in Asari et al. (2008), any responses not encountered in the control group were classified as "unique." Across participants, the mean of the protocol-level sum of "unique" responses was 11.6 (SD = 8.6) and the average number of total responses per protocol was 39.4 (SD = 17.6) (Asari et al., 2010). The volume concentration of both the bilateral amygdalae and the bilateral cingulate gyri correlated with the Unique Response Ratio (p < .05, p < .01, respectively).
When examined unilaterally, the volume concentrations of the right and left amygdala and the cingulate gyrus all had medium effect sizes in their relationship with the Unique Response Ratio (r = .34, .30, and .37, respectively). According to the literature review by Asari et al. (2010), the limbic region is critical in the processing of perceptual information, and more specifically in emotional processing, and the frequency with which brain structures are activated determines, at least in part, the degree to which those structures become enlarged. Thus, it was suggested that the positive association between the volume of limbic structures and unique perception supported the hypothesis that increased activity in the limbic system might underlie unique perceptions.

Statement of the Problem

Within the field of clinical psychology there are limited resources available for identifying psychotic perception. Research has established that the Rorschach can accurately identify psychosis in test-takers, but it is possible that new Rorschach scores could extend the utility of the Rorschach in identifying such characteristics. Within the CS (Exner, 2003) and R-PAS (Meyer et al., 2011), FQ is currently used to assess perceptual accuracy on the Rorschach and has been demonstrated to be a valid indicator in a robust literature. However, FQ does have some important limitations. Perhaps most importantly, FQ is not based solely on the shape of the response object and the frequency of the percept. In the development of the CS tables, clinician judgments played a central role. In the development of the R-PAS tables, the authors worked to remove inconsistencies and base the tables more solidly in empirical data, but many decisions still had to be made by the development team when the fit and frequency data did not clearly indicate a response object was both frequent and strong in fit, or both infrequent and poor in fit. Part of the struggle in constructing the FQ tables is certainly due to the trichotomization of FQ scores that is seen in both the CS and R-PAS, when in actuality, the fit and frequency of objects on the Rorschach are much more nuanced than a three-category system reflects. Dimensionalization of scores elsewhere on the test has been completed (e.g., the PTI was dimensionalized and became the TP-Comp), but dimensional FQ scores have not yet been determined and published. FA was developed (Meyer & Viglione, 2008) in an attempt to rectify some of the problems associated with scoring FQ, but there is not a Rorschach score, whether CS, R-PAS, or otherwise, that can thoroughly and efficiently tap into the conventionality of a protocol. It is believed that such a score could be an important factor in identifying distorted perceptual processes of the test-taker. It is now time for the development of an alternative scoring system for Rorschach perceptual accuracy to include a response frequency score that can ultimately be combined with the accuracy-of-fit score (FA) to form a dimensional Perceptual Accuracy (PA) score.

Purpose of the Present Study

It is anticipated that the overall PA project (Meyer & Viglione, 2008) will advance our understanding of perception, as well as result in a validated method of assessing reality testing using the Rorschach. Ideally, the final PA system will improve diagnostic accuracy, helping to correctly recognize perceptual aberrations and leading to fewer mistaken inferences of psychotic disorder.
FA was the first leg of the PA scoring system to be developed and it has been included in some validity evaluations; FA represents how well the shape of the response object fits the blot location. The present study began with expansion and refinement of Meyer et al.'s (2011) frequency tables and re-calculation of international indices of PF. These indices indicate how frequently each perceived object is given as a response to the location used by the respondent. The preliminary tables of PF values (Meyer et al., 2011) were developed by examining and averaging the specific object frequencies from five international datasets (i.e., Argentina, Brazil, Italy, Japan, and Spain); the tables were expanded by adding data from a U.S. sample, and the PF variables were re-calculated. The resulting international PF indices were intended to represent a cross-culturally generalizable index of how frequently response objects are identified in specific areas of specific cards on the Rorschach.

It was anticipated that PF would be an important component of the final PA scoring system because it serves as an indicator of conventionality. Respondents may see things in an accurate and typical way (i.e., high FA and high PF), an accurate but unusual way (i.e., high FA and low PF), an inaccurate but typical way (i.e., low FA and high PF), or an inaccurate and unusual way (i.e., low FA and low PF). It was expected that most people who take the Rorschach would have a mix of different types of responses, but that PA would allow for more precise identification of true distortions that are also atypical in the normal population.

In addition to compiling frequency-of-response information to form indices of PF, I explored the structure of FA and PF indices using an archival database that included Rorschach protocols and diagnostic information, as well as a Diagnostic Severity score that served as a criterion measure. Although the performance of FA alone had shown some promise in earlier research, it was believed that if FA were combined with PF to form PA, significant correlations with the criterion measure would be observed, and future research might demonstrate the ability of PA to detect problems in accuracy of perception across a wider range of the "accuracy of perception" spectrum than FQ. It is also hoped that PA might eventually lead to more accurate assessment of the kind of perceptual difficulties that impair functioning and interpersonal interactions. Before PA can be developed, it is essential to understand how FA and PF function independently in predicting constructs of interest, and to understand the structure of the various FA and PF indices across responses and cards within the Rorschach test. Without addressing such questions, it would remain unclear how to best combine FA and PF information within a protocol to maximize the performance of PA. By clarifying the structure and performance of FA and PF, it was hoped that standardized methods of scoring and interpreting PA scores could then be developed and applied to future research and, ideally, to future clinical practice.

For the current study, a criterion database was selected for exploration of the FA and PF indices. For each response within the database, the important response objects were identified, an FA score and PF scores were applied to each object, and the object-level scores were then averaged at the response level.
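As an illustration of this object-to-response aggregation, a minimal sketch in Python follows; the column names and the pandas-based layout are assumptions for illustration, not the actual procedure used in the study.

```python
import pandas as pd

# Hypothetical object-level coding: one row per scored object within a response.
objects = pd.DataFrame({
    "protocol_id": [1, 1, 1, 1],
    "response_id": [1, 1, 2, 2],
    "fa_score":    [4.2, 3.1, 2.0, 3.8],   # object-level Form Accuracy ratings
    "pf_pct":      [5.5, 1.8, 0.9, 12.4],  # object-level international PF percentages
})

# Average the object-level scores within each response to obtain
# response-level FA and PF indices, as described above.
response_level = (
    objects
    .groupby(["protocol_id", "response_id"], as_index=False)[["fa_score", "pf_pct"]]
    .mean()
)
print(response_level)
```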
The response-level FA and PF indices were then explored by modeling how card number, response within card, and the criterion variable contributed to the structure of each variable, and validity coefficients with the criterion measure were calculated.

Principle of Aggregation

The principle of aggregation holds that single measurements are less stable, more prone to measurement bias, and less accurate in portraying information than are summed or aggregated collections of measurements. Rushton, Brainerd, and Pressley (1983) hypothesize that the weak relationships between variables or measurements so commonly found in the psychology literature are partly the product of failure to apply the principle of aggregation to the methodology used in studies. Using aggregated data helps to average out random error associated with measurement, and multiple non-redundant measures of the same construct provide more substantial sampling of the behavior of interest (Rushton et al., 1983). Behavior can vary drastically as a product of the situation, and behavioral measurements are often based on single sources of information (Epstein, 1979). Therefore, the corresponding results are limited in terms of generalizability and possibility of accurate replication. However, it is also important to consider the range of the construct anticipated in participants as compared to the range of the construct expected to be covered by the measures; weak relationships between variables could result from simply having little variability in the sample compared to relatively great variability in the situations participants are exposed to (Epstein, 1979). After conducting four studies addressing aggregation hypotheses, Epstein (1979) found that aggregating data over a number of events led to more stable estimates of personality traits, and also revealed heteromethod convergent validity.

Epstein (1980) described four types of aggregation: aggregation over subjects, over stimuli or stimulus situations, over time, and over modes of measurement. Aggregation over subjects refers to testing many participants and averaging responses over the sample. This can be accomplished by using an appropriate sample size and appropriate types of data analysis. Aggregation over stimuli or stimulus situations refers to using a variety of stimuli and contexts in addressing the research question, making sure to include the range of stimuli to which the researcher is attempting to generalize results. This method helps to reduce the influence of situation-specific effects in the data. Aggregation over time and over modes of measurement refer to varying the trials or occasions of measurement, and to using multiple measures of the same construct, respectively.

These recommended aggregation practices were applied to various components of the research design and data analyses in the following study. For example, care was taken to select statistical analyses that can appropriately model both the hierarchical and the repeated-measures nature of Rorschach data (i.e., responses within cards within protocol). The sample used in this study was fairly large, and the Rorschach administrations occurred over a period of years by two different well-trained examiners. The criterion measure was reliably coded and each criterion data point was based on an aggregation of diagnoses, which were in turn based on the presenting clinical picture.
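The statistical logic behind the principle of aggregation can be illustrated with a short simulation; the values below are arbitrary and purely illustrative. Averaging many noisy measurements of the same underlying trait yields estimates that cluster far more tightly around the true value than any single measurement does.

```python
import random

random.seed(0)
true_trait = 3.0  # the stable characteristic being measured
error_sd = 1.5    # situational noise attached to any single measurement

def measure() -> float:
    """One noisy behavioral measurement of the trait."""
    return random.gauss(true_trait, error_sd)

single = [measure() for _ in range(1000)]
aggregated = [sum(measure() for _ in range(20)) / 20 for _ in range(1000)]

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# The spread of 20-item aggregates is roughly error_sd / sqrt(20)
# of the spread of single measurements.
print(f"SD of single measurements: {sd(single):.2f}")
print(f"SD of 20-item aggregates:  {sd(aggregated):.2f}")
```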
Finally, FQ is a rather coarse classification method as it is based on just three options along a continuum of perceptual accuracy; this study is focused on exploring the structure of FA and PF to make progress on a new dimensional method of scoring perceptual accuracy that will be based on aggregation of dimensional fit and frequency scores.

Research Questions

How frequently do various Rorschach responses occur? How can the frequency of perceptions be aggregated at the response and protocol levels? What does the structure of the FA and PF indices look like within card and across cards? How can PF indices be combined with FA indices to produce a dimensional scoring system for perceptual accuracy on the Rorschach?

Chapter Three

Method

Participants

Percept Frequency samples.

U.S. Sample. The Rorschach Performance Assessment System (R-PAS; Meyer et al., 2011) is an effort to evaluate the empirical evidence for Rorschach variables, many of them CS variables, and to develop a standard method of administration, scoring, and interpretation that retains what has empirical support. Norms for R-PAS were derived from an adult non-patient sample collected from multiple countries (Meyer et al., 2007). The 145 verbatim English-language protocols that are part of the R-PAS normative database comprised one of the data files used to identify the frequency with which various percepts are delivered as responses. An additional data file was employed, which contains Rorschach protocols for 127 college students from a university in Ohio (Horn, 2009). This combined U.S. Sample, containing a total of 262 protocols, was aggregated with five other samples during development of the final FA and PF tables.

Argentinean Sample. The individuals who comprise this sample were described as 506 well-functioning adult nonpatients from the area of Gran La Plata, Argentina (Lunazzi et al., 2007). This sample's file provided specific frequencies for all objects reported by subjects during the Rorschach test. Responses were aggregated along with the other frequency samples to form PF tabulations.

Italian Sample. This sample consists of 800 non-patient adult Rorschach protocols (Parisi, Pes, & Cicioni, 2005; Rizzo, Parisi, & Pes, 1980). The data file provided specific frequencies for all objects identified by 2% of the subjects or more (i.e., seen by 16 or more people).

Spanish Sample. A sample of 470 Spanish adult outpatient Rorschachs (Miralles Sangro, 1997) was also used in tabulating PF. The test-takers in this sample had presented to the Interaction and Personal Dynamic Institute in Spain requesting psychological evaluation. No sample members were of inpatient status at the time of the evaluation or were recommended for inpatient treatment following the evaluation. In total, the data file consisted of 10,562 responses and it provided specific frequencies for all objects reported during Rorschach administration.

Japanese Sample. This sample's data file includes the Rorschach protocols of 400 Japanese nonpatients (Takahashi, 2009). It provided specific frequencies for all objects seen by at least 1% of the sample.

Brazilian Sample. A total of 600 Rorschach protocols are included in this nonpatient Brazilian sample (Villemor-Amaral et al., 2008). The data file provided specific frequencies for all objects that were used as Rorschach response objects by the subjects.

Criterion Database.
Data for this adult mixed-status database were collected in Chicago through a hospital-based psychological testing program (Meyer, 1997; Meyer, Riethmiller, Brooks, Benoit, & Handler, 2000; Hoelzle & Meyer, 2008). Valid Rorschach protocols and MMPI-2 administrations were obtained from 362 patients as part of treatment or evaluation. Of the patients with valid Rorschachs, "…52% were psychiatric inpatients, 30% were psychiatric outpatients, 15% were general medical patients, and 3% were drawn from other settings" (Meyer, 1997). Diagnostic categorization of the sampled individuals was used to better understand the relative strengths and weaknesses of FA and PF indices on the Rorschach.

For the current study, a subset of the database was used, consisting of 212 Rorschach protocols that met criteria for R-Optimized modeling. R-Optimized administration includes instructions to the Rorschach test-taker that they should "give 2, or maybe 3 responses…" to each card; the test-taker is prompted for a second response if they give only one, and the card is removed after a test-taker delivers four responses to a card (Meyer et al., 2011). Given that the Criterion Database protocols were collected using CS administration procedures instead of R-Optimized instructions, R-Optimized modeling was applied to the database. Meyer et al. (2011) used a complex procedure for determining which responses to retain so as to closely match the distribution of responses in the Criterion Database to a target database that had been administered using R-Optimized administration procedures. They applied the same procedures described in the R-PAS Manual, and "…wanted the distribution of first, second, third, and fourth responses given to each card in our modeled sample to match the distribution of first, second, third, and fourth responses to each card in the target sample…." Of the 212 subjects, the final dataset used for the present study consisted of data from the 159 subjects who also had Diagnostic Severity scores available. The decision to use the Criterion Database was based on the sampled population: the Criterion Database contains data collected from a large clinical sample, and the criterion scores reflect the severity of their diagnosis(es), which closely approximates the clinical constructs of interest.

Measures

Percept Frequency samples measures. Match numbers are used as an indexing aid within the FA and PF tables and for coding Rorschach protocols. Each Rorschach object is assigned a unique match number in the FA and PF tables, and Rorschach protocols can then be coded with match numbers using the FA and PF tables. This process allows the researcher to then import other data from the tables into the Rorschach coding file. The FA and PF tables also contain Rorschach card numbers, location codes, angle of the card, and the object names. The card number is used to identify which Rorschach card the test-taker was responding to when delivering each Rorschach response. Rorschach location numbers are assigned to each card and are used to identify the various parts of the inkblot image; these codes indicate where the response object was located when it was used as part of a response. The angle of the card indicates the orientation in which the test-taker was holding the Rorschach card when constructing and delivering the response. Each response object is also named and listed in the FA and PF tables.
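Conceptually, a single row of these lookup tables can be pictured as follows; the field names and values in this sketch are hypothetical illustrations rather than the actual table layout.

```python
# A hypothetical row of the FA and PF lookup tables, keyed by match number.
# Field names and values are illustrative, not the actual table structure.
lookup_row = {
    "match_number": 10432,      # unique index for this object-in-location
    "card": 3,                  # Rorschach card number (I-X coded 1-10)
    "location": "D2",           # blot area used for the response
    "angle": 0,                 # card orientation when the response was given
    "object_name": "butterfly",
    "fa_score": 4.1,            # mean of the judges' fit ratings
    "pf_pct_us": 3.2,           # % of U.S. protocols containing the object
    # ... plus one percentage-based and one binary PF column per country ...
}
```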
The same object can be seen in different ways and in different locations, but each unique perception has a unique match number. For example, a butterfly could be seen on different cards, in different locations within the same card, or even in the same location on the same card but in a different orientation. Each unique type of "butterfly" response would have its own match number and listing in the FA and PF tables.

An object-level FA score is associated with each unique response object in the FA tables, with each object's FA value having been derived from an average of 9.9 rater judgments (Meyer & Viglione, 2008). The PF tables also contain a variety of object-level variables. The first set of PF variables are contained within the non-consolidated PF tables. These are within-country variables that indicate the percentage of protocols that contained each unique object. In other words, within each country's sample, the percentage of people who gave responses containing each unique response object was calculated and indexed within the lookup tables. The variable for each country indicates the specific frequencies for all objects reported by subjects during the Rorschach administration, with two exceptions: The Japanese Sample listings indicate specific frequencies for all objects identified by at least 1% of the sample, and the Italian Sample listings indicate specific frequencies for all objects identified by at least 2% of the subjects. These percentage-based variables were also converted into country-specific binary variables that indicate whether or not the percentage of protocols that contained each match number is greater than or equal to 1.5% of the protocols from that country. In other words, the PF tables indicate whether or not each unique percept (i.e., match number) was given by at least 1.5% of the participants from each country's sample.

A series of composite international PF variables were also computed at the object level for the PF tables. The first variable is the international version of the percentage-based variable, which is computed as the average of the non-missing values for the six country-specific percentage-based variables. More simply, it is the mean of the six within-country variables that indicate the percentage of protocols that contained each match number. It indicates on average how often a particular percept is reported across samples when it was identified by at least 1.5% of the participants in at least one of the samples. The binary PF variable was also converted to a composite international variable, which equals the sum of the six country-specific binary variables. Thus, it is a count of the number of samples in which the match number was found in at least 1.5% of the protocols. It has a possible range of 0-6.

In an effort to reduce the length and complexity of the FA and PF tables, Meyer et al. (2011) had consolidated many of the response objects into response-object categories within the tables. The consolidation decisions had been based on careful consideration of the response-object properties; consolidations occurred when there were multiple objects listed within a single card location and orientation, and the objects were similar to each other in shape and content and had similar FA ratings. For example, the initial tables contained separate listings for "anchor," "fishhook," and "hook" in the D2 location of Card III.
Upon consolidation, those three response objects were merged into a single response-object category listing: "hook or similar object (e.g., anchor)." The table consolidation process is described in more detail in the Procedures section.

To account for the consolidations, object-level variables were computed for each country based on the consolidated FA and PF tables. The first set of variables indicate the percentage of protocols from each country (if the percentage was greater than or equal to 1.5%) that contained each match number from the consolidated FA and PF tables. In order to compute each country's percentage-based variable for the consolidated tables, the percentage-based variable values from the unconsolidated tables were aggregated to match the consolidation of objects; this was accomplished by summing across the various object listings included within each consolidated listing. For example, in the unconsolidated tables, Card III location D2 (see Figure 12) contained separate listings for "anchor," "fishhook," and "hook," and each listing had a percentage-based PF score for each of the six countries. Upon consolidation, the PF scores for those objects were summed into the object listing for "hook or similar object (e.g., anchor)."

Figure 12. Card III location D2.

Binary representations of the consolidated percentage-based variables were also computed and listed within the consolidated PF tables. These within-country variables indicate whether or not the percentage of protocols that contained any match number within each consolidated category was greater than or equal to 1.5% of the protocols from that country. In other words, the consolidated PF tables indicate whether or not, within the given country's sample, any object that contributed to a specific consolidated listing was found in at least 1.5% of the protocols. Thus, this consolidated object-level score was a zero or a one for each country.

Two consolidated international object-level PF variables were also computed for the consolidated PF tables. The first variable is the mean of the six within-country variables that indicate the percentage of protocols that contained each match number after the listings were consolidated. Thus, it indicates on average how often a particular consolidated percept category is reported across samples. The percentage-based variable was also converted into a count-based composite international variable. It is a count of the number of samples (out of the six countries) in which any match number that contributed to the consolidated listing was found in at least 1.5% of the protocols from each given country. In other words, it is simply the sum of the six binary country-specific indicators for consolidated objects found in at least 1.5% of the protocols. The count variable has a possible range of 0-6.

Criterion Database measures. Each patient within the Criterion Database was assigned an ID number, which was used for indexing the data and merging data files after coding was completed. By using an ID number as an indexing variable, Rorschach coding could be completed blind to any information about the patient, including their diagnosis(es) or their Diagnostic Severity score. Patients were assigned diagnoses, which were later used to construct a Diagnostic Severity indicator. Initial billing diagnoses were recorded for each individual before testing began, and thus diagnoses were made independent of Rorschach data.
The billing diagnoses were assigned by the treating clinician or by a multidisciplinary inpatient treatment team. Diagnoses contained in the database include depressive disorders, psychotic disorders, personality disorders, anxiety disorders, bipolar disorders, and gender identity disorder. Medical patients with diabetes, pain management concerns, and organ transplant candidates were also included.

The diagnostic severity criterion variable is based on the 1-3 diagnoses obtained for each patient, which were then converted to a 5-point severity scale (Dawes, 1999; Meyer & Resnick, 1996). The severity scale was conceptually derived to quantify the degree of overall dysfunction associated with a diagnosis, with higher scores indicating higher levels of dysfunction (e.g., 1 = Adjustment Disorder with Depressed Mood; 3 = Major Depression, Recurrent, Severe, Non-psychotic; 5 = Schizoaffective Disorder). When developing the scale there was good agreement between the independent raters on severity ratings for 141 diagnostic codes (r = .84; 97.9% of ratings were within one point of each other). The highest diagnosis severity rating for each patient was used as the criterion measure.

Patients were also administered the Rorschach as part of their treatment or evaluation at the hospital. The Rorschach was administered using the CS, which was the most commonly used administration and interpretation system for Rorschach assessment (Exner, 2003). As dictated by CS administration guidelines, each patient was individually presented with the standard series of 10 inkblots and was asked to respond to each, answering the question, "What might this be?" Their responses were written down and later transcribed into an Excel database. Clarifications were collected from each patient after they completed the response phase of the Rorschach. Each patient's response within the database was accompanied by the response clarification, as well as the other information that is part of a typical CS record: the response number, the card number, the angle of the response, and response object location information.

Some additional Rorschach information was coded and calculated for this study. The variable R_InCard was assigned to each patient's responses to indicate the ordering of their responses within each card. The possible range for R_InCard, using the R-Optimized protocols, was 1-4. Match numbers were assigned to the objects that were included in each Rorschach response. By coding object-level match numbers into the Criterion Database, other indexed information in the FA and PF tables could be pulled from the tables into the Criterion Database through the use of syntax, saving manual coding time and reducing human errors in coding. Five data columns allowed up to five match numbers to be assigned to each response. However, not all response objects could be assigned a match number because the look-up table of 11,352 consolidated objects with corresponding match numbers is not exhaustive and people identify response objects that are not listed.

Object-level FA was pulled from the FA tables into the Criterion Database. It is the FA score assigned to each response object. Each response had up to five FA scores, corresponding to the match numbers that were coded to identify the various objects used in each response. Response-level FA scores were assigned to each Rorschach response within the Criterion Database by a human coder (rather than by match number).
Response-level FA was determined by coders after reading the response and clarification, considering any of the object-level FA scores associated with match numbers, and applying the coding rules discussed and practiced during the Rorschach coder training. The protocol-level mean of the response-level FA scores within a patient's protocol was calculated as well.

The two international object-level PF variables were also pulled into the Criterion Database from the consolidated PF tables. As with the object-level FA score assignments, the object-level PF variables were applied to all response objects with a listed match number, with a maximum of five objects and associated scores per response. Unlike FA, there was no coder judgment in assigning PF values; the observed frequencies were used. As a reminder, the first object-level PF variable is the percentage-based variable, which represents the mean of the six within-country variables that indicate the percentage of protocols that contained each match number after the listings were consolidated. The count-based international variable is a count of the number of samples (out of the six countries) in which the consolidated match number was found in at least 1.5% of the protocols, with a possible range of 0-6.

Response-level PF variables were also calculated for the Criterion Database. PFM (Percept Frequency Mean) is the response-level average of the object-level international percentage-based PF scores that were coded for each object within the response. PFN1.5 (Percept Frequency Number of samples >= 1.5%) is the response-level average of the object-level international count-based PF scores that were coded for each object within a response.

Procedures

Frequency tables construction.

Structure of the original FA and PF tables. The present study began with expanding and updating the Rorschach response-object PF tables, one step in developing the new Rorschach PF scores. The preliminary Microsoft Excel file of FA and PF values (Meyer et al., 2011) included specific object frequency information from five of the proposed international samples: Argentina, Brazil, Italy, Japan, and Spain. The Excel FA and PF tables were structured such that a row was assigned to each individual response object, with columns indicating country-specific frequency information. Each row was also assigned a unique match number, which functions as an index aid for the various response objects included in the tables. As described earlier, there were two types of PF values entered for each country within each response-object listing (i.e., within each row): (1) the percentage of protocols collected from the indicated country that contained the response object, with indexed values representing frequencies that were greater than or equal to 1.5% of the country's protocols, and (2) a binary value indicating whether or not the percentage-based frequency value for the indicated country was greater than or equal to 1.5% of the protocols.

Coding the U.S. Sample. The preliminary FA and PF tables were expanded by adding object-level PF information from the U.S. Sample. The U.S. Rorschach responses were contained in a set of Excel files. The files were structured so that the full Rorschach responses were represented in the rows; five columns contained the preliminary match numbers (i.e., the response-object identifiers) and response location information for up to five response objects. The U.S.
frequency coding was accomplished using a simplified version of the preliminary FA and PF tables as lookup tables. The only variables retained in the tables for this step of the project were match numbers, object names, and the necessary percept location information (i.e., Rorschach card number, percept location within the card, the angle of the card when the response was delivered, and whether the response included the use of non-inked areas). All existing FA and PF information was removed from the tables before using the tables for coding match numbers in the U.S. Sample.

A merged Excel file was created for the U.S. Sample that combined the 145 protocols that are part of the R-PAS normative database (Meyer et al., 2007) with the 127 protocols from college students from a university in Ohio (Horn, 2009). Initial match numbers had been assigned to the responses to identify the response objects, with up to five match numbers assigned to each response. SPSS was used to import object-level information into the Excel file, based on match number. Syntax was written that scanned the match numbers assigned to the U.S. responses, used the match numbers to locate the corresponding object information listed in the preliminary FA and PF tables (i.e., Rorschach card number, percept location within the card, the angle of the card when the response was delivered), and embedded that object information into the U.S. Excel file. All responses within the U.S. Excel file were then manually screened for accuracy of the listed match numbers (i.e., did the newly embedded response-object information match the actual card number, location, and response object that were indicated in the Excel file), as well as for the presence of unique response objects that were not already listed in the preliminary FA and PF tables. In rare instances, incorrect match numbers had been assigned to response objects within the U.S. Sample; these errors were corrected by inserting the correct object match numbers into the Excel file. In cases when a response contained an object without an associated match number having been assigned, either the correct match number was assigned or the response was flagged for further manual screening if no object match existed in the FA and PF tables.

After the U.S. Sample Excel file was coded and checked for accuracy, responses were extracted that had been flagged for further manual screening due to no available match number for one or more objects used in the response. Each unlisted object used in a response was assigned a unique match number, and the new match numbers were then assigned to all additional instances of the unique response objects within the U.S. Sample. Additionally, nine colleagues were asked to independently assign FA ratings to each of the new objects that occurred in at least 1.5% of the U.S. protocols. Because there were not enough ratings within each judge to form ipsatized scores for the newly rated objects, the median rating within each object was used to determine the final object-level FA rating for each object to be added to the FA and PF tables from the U.S. Sample.

Updating and adding variables to the FA and PF tables. Following the coding of the U.S. Sample and the identification of the new objects to be added to the FA and PF tables, the tables were updated to reflect the addition of the U.S. Sample's data. As a first step, the U.S.
Sample's data were imported from Excel to an SPSS database and a variable was created to index all of the match numbers (i.e., response objects) used within each protocol. Some protocols contained several similar responses (e.g., more than one response that incorporated a "butterfly" to location D3 of Card III in the upright orientation), and thus had more than one response with the same match numbers assigned. In such instances, the duplicate match numbers within a protocol were filtered out when tabulating frequency values; in other words, only the first instance of each match number within a protocol was included in the frequency tabulations. This prevented individual protocols from over-contributing to the frequency variables. The match numbers were then tabulated to create a count of U.S. protocols that used each match number. The count variable was then converted into a new variable, which represented the percentage of U.S. protocols that contained each response object. The U.S. percentage-based variable was indexed in the FA and PF tables, and it was also used to compute the U.S. binary variable, which indicates whether the U.S. percentage-based frequency value was greater than or equal to 1.5% of the total U.S. protocols.

After the new response objects and the U.S. frequency variables were added to the FA and PF tables, the tables contained the two frequency variables described earlier for each of the six PF Samples: (1) the percentage of protocols collected from the indicated country that contained the response object, with values representing specific frequencies greater than or equal to 1.5% of the country's protocols, and (2) a binary variable indicating whether the percentage-based frequency value for the indicated country was greater than or equal to 1.5% of the protocols. At this point, as described previously, two international frequency variables were computed for each listed object: (1) the average of the six countries' percentage-based frequency values, and (2) the count of countries (range 0-6) that had a percentage-based frequency value greater than or equal to 1.5%.

Given that response-object listings had been consolidated in an attempt to simplify the FA and PF tables, as discussed in the Measures section, the frequency data also needed to be consolidated for the tables to function properly as lookup tables. Therefore, SPSS syntax was written to calculate the new country-specific frequency variables for all object listings that had been consolidated. The first variable to be calculated reflects the sum of the individual percentage-based frequency values within a consolidated category. For example, if the consolidated category contained three separate object listings (as in the "anchor-fishhook-hook" example above), the consolidated variable equaled the sum of the three objects' frequency percentages (i.e., the sum of the three values that represent the percentage of protocols from the indicated country that contained each of the three response objects). The binary frequency variable was then calculated for each country to represent whether any response object within a consolidated category had a frequency greater than or equal to 1.5% of the country's protocols; in other words, the variable represented whether the consolidated category contained any object that was present within at least 1.5% of protocols from the indicated country.
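A minimal sketch of this consolidation arithmetic follows, with pandas standing in for the SPSS syntax actually used; the category names and frequency values are hypothetical.

```python
import pandas as pd

# Unconsolidated percentage-based PF values for one country, using the
# "anchor-fishhook-hook" example from the text with hypothetical numbers.
unconsolidated = pd.DataFrame({
    "consolidated_category": ["hook or similar object"] * 3,
    "object_name": ["anchor", "fishhook", "hook"],
    "pct_country": [0.4, 0.7, 0.9],
})

# Sum the individual percentages within each consolidated category...
consolidated = (
    unconsolidated
    .groupby("consolidated_category", as_index=False)["pct_country"]
    .sum()
)

# ...and, per the "any object" definition above, flag whether any single
# object in the category reached the 1.5% threshold on its own.
consolidated["binary_1_5"] = (
    unconsolidated.groupby("consolidated_category")["pct_country"]
    .max().ge(1.5).astype(int).values
)
print(consolidated)  # summed pct = 2.0, but binary_1_5 = 0 (no object >= 1.5%)
```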
Finally, the two international frequency variables were computed for each consolidated category listing: (1) the average of the six countries' consolidated percentage-based frequency values for all values greater than or equal to 1.5%, and (2) the count of countries (range 0-6) that had a consolidated percentage-based frequency value greater than or equal to 1.5%. For object listings that did not get consolidated into a category, the frequency values for the original object were retained. Within the FA and PF tables, PF ratings are listed as missing values for percentage-based frequency values that are less than 1.5%, and for counts of countries that are 0.

Criterion Database coding.

Coder training and interrater reliability. Prior to the current study, the author (S. Horn) was extensively trained in coding FA and PF, and also co-trained a research team on FA coding under the supervision of Gregory J. Meyer. The most extensive coding training utilized a database collected by Dean, Viglione, Perry, and Meyer (2007, 2008), consisting of Rorschach protocols from 61 adults who were receiving long-term residential treatment at the time of the assessment, either at a state psychiatric facility or in a state prison. This was the primary database employed by G. Meyer and S. Horn for calibrating their own scoring, as well as training a team of coders on FA coding procedures, establishing coding reliability, and ensuring calibration across a full coding team. The full coding team consisted of G. Meyer and S. Horn, fellow graduate student T. Ozbey, and three undergraduate research assistants. Coder training included several months of weekly team meetings where coding procedures were taught and reviewed, practice protocols were collectively coded, independently coded practice protocols were reviewed as a team, and coding procedure questions were addressed. Coding reliability for response-level FA was clearly established in this training database. Each of the 40 reliability protocols had been scored by at least two coders, and reliability was computed using a two-way random effects model ICC with an absolute agreement definition. Across the 841 responses (40 protocols), the response-level single-measure ICC = .74, indicating good to excellent interrater reliability (Cicchetti, 1994). Coding reliability was computed at the response level because it provided a conservative assessment of the reliability of the coding rules.

Prior to the current study, S. Horn also completed and conducted additional FA coding training and subsequent interrater reliability analyses using a database that contained complete Rorschach and criterion data for 110 college students at a small university in Ohio (Horn, 2009). Within the college student database, agreement ratings for response-level FA were obtained for 10 protocols coded by S. Horn and an independent coder, E. Crawford, who completed one-on-one training with S. Horn and was co-supervised by S. Horn and G. Meyer. Reliability for this training database was computed using a two-way random effects model ICC with an absolute agreement definition. Across the 245 responses (10 protocols), the response-level single-measure ICC = .82, indicating excellent interrater reliability (Cicchetti, 1994). Given S. Horn's clearly established coding reliability using 50 protocols within the training databases, it was determined that less extensive reliability coding would be needed in the current study, given her demonstrated proficiency in coding.
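For readers who wish to reproduce this type of reliability analysis, the following is a sketch using the pingouin library in Python rather than the software used in the study; the long-format layout, the variable names, and the rating values are assumptions for illustration.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format reliability data: every response rated by both coders.
ratings = pd.DataFrame({
    "response": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "coder":    ["A", "B"] * 5,
    "fa":       [4.0, 3.5, 2.0, 2.5, 5.0, 5.0, 1.0, 1.5, 3.0, 3.5],
})

icc = pg.intraclass_corr(data=ratings, targets="response",
                         raters="coder", ratings="fa")
# ICC2 is the single-measure, two-way random effects, absolute agreement
# form reported in the text.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```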
The Criterion Database was initially reviewed and had match numbers assigned to response objects by an independent coding team. All Rorschach responses were reviewed by S. Horn, and the associated match-number coding within the database was revised, as necessary, by S. Horn. To establish interrater reliability for the Criterion Database coding, individual coder training was provided to graduate student coder N. Bromley by S. Horn prior to coding the protocols. As N. Bromley was already familiar with the Rorschach, the training consisted of two hours of intensive one-on-one coding training. After an orientation to response-level FA coding and the FA and PF tables, practice responses were collaboratively coded for response-level FA, allowing for practice and clarification of concepts. After the training, N. Bromley felt comfortable independently completing the 10 reliability protocols. After reliability was computed using a two-way random effects model ICC with an absolute agreement definition, all coding disagreements on the reliability protocols were resolved between S. Horn and N. Bromley. All remaining protocols then underwent match-number coding review and revision by S. Horn, followed by the assignment of object-level FA and PF codes through the use of syntax, and response-level FA coding by S. Horn.

Coding FA and PF. As a first step in coding the Criterion Database, coders assigned match numbers to the responses. The coders were provided with an Excel file that contained all of the Rorschach responses, and they assigned up to five match numbers to each response to identify the important response objects. As described earlier, the criterion measure scores were not included in the Excel file and were not available to the coders; the Excel file contained only the Rorschach responses, the information needed to code them (e.g., card number, card orientation, location information), and indexing numbers that could be used to match Rorschach response coding back to the full Criterion Database. The coding was accomplished using a simplified version of the FA and PF tables. The only data available in the FA and PF tables for this step of the project were match numbers, object names, the necessary percept location information (i.e., Rorschach card number, percept location within the card, the angle of the card when the response was delivered, and whether the response included the use of non-inked areas), and object-level FA ratings; no PF information was available in the tables for coding the Criterion Database.

As described above, following the initial match-number coding completed by the coding team, all responses within the Criterion Database were manually screened by S. Horn for accuracy of the match numbers (i.e., the response-object identification), as well as for the presence of unique response objects that were not already listed in the FA and PF tables. SPSS syntax was written that imported information from the FA and PF tables into the Criterion Database Excel file, with the assigned match numbers serving as the index values for the various objects contained within the Rorschach responses. For each Rorschach response in the Criterion Database Excel file, the imported information included the object names, the object location information, the object-level FA ratings, and the two object-level consolidated international PF scores.
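The SPSS syntax itself is not reproduced here; as an illustration of the underlying lookup logic, an equivalent match-number merge in Python might look like the following (all frame contents are hypothetical).

```python
import pandas as pd

# Consolidated lookup table (one row per match number) and coded responses
# (up to five match numbers per response); both frames are hypothetical.
tables = pd.DataFrame({
    "match_number": [101, 102, 103],
    "object_name": ["bat", "butterfly", "mask"],
    "fa_score": [4.5, 4.3, 3.0],
    "pf_pct_intl": [22.1, 15.4, 2.2],
    "pf_count_intl": [6, 6, 2],
})
responses = pd.DataFrame({
    "response_id": [1, 1, 2],
    "match_number": [101, 103, 102],
})

# Pull the indexed FA and PF information into the response file by match
# number, mirroring the role of the SPSS import syntax described above.
coded = responses.merge(tables, on="match_number",
                        how="left", validate="many_to_one")
print(coded)
```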
Each response was read in full, followed by verification of the accuracy of the match numbers, the object names, the orientation of the card, and the object locations. The imported object-level FA and PF data for each response were also scanned for missing or unusual-looking values that might indicate errors in the tables. Response-level FA scores were also assigned by S. Horn during this stage.

In rare instances, incorrect match numbers had been assigned by coders to response objects within the Criterion Database. Such coding errors were corrected by inserting the correct object match numbers into the file and correcting the associated scores. A more common inaccuracy occurred when an important object was present in the response language but the object was not accompanied by a match number. Sometimes such instances of missing information resulted from a simple oversight when the data were initially coded for match numbers; however, in most cases this type of missing information occurred when a response contained an object that had been recently added to the FA and PF tables and thus had not been a listed/indexed object when the Criterion Database Excel file was initially coded. In such cases the correct match number was assigned as long as the object and match number were available in the most recent FA and PF tables; when a response contained an object without an associated match number having been assigned, and the object was not listed in the most recent FA and PF tables, the response object was flagged for further manual screening and FA score assignment, and the object-level PF ratings were left as missing values.

After the coding and verification stages were completed within the Criterion Database Excel file, the Rorschach data were imported into SPSS and the response-level PF scores (i.e., PFM and PFN1.5) were calculated for each Rorschach response. If a response had no objects that were listed in the FA and PF tables (and thus missing values for the object-level PF scores), PFM and PFN1.5 were assigned a value of 0.

Statistical Analyses

Overview of planned analyses. The goal was to understand the structure of the FA and PF variables through the use of HLM, and to determine their relationship with the criterion measure. These steps were part of the process of exploring how FA and PF indices might be combined to form PA scores. It was believed that once FA was combined with PF to form PA, correlations with the criterion measure would be stronger and PA would lead to more accurate assessment of the kind of perceptual difficulties that impair functioning and interpersonal interactions. This was an important issue to explore so that standardized methods of scoring and interpreting PA scores could potentially be applied to future research and, ideally, to future clinical practice.

Hierarchical Linear Modeling (HLM). Hierarchical Linear Modeling (HLM) was the proposed method for exploring the optimally weighted structure of the FA and PF variables relative to a Diagnostic Severity criterion. Using HLM, I planned to run regression analyses that would concurrently model response-level information within Rorschach cards, response number to the card, and people at higher hierarchical levels. HLM was selected as the most appropriate statistical approach to the data due to the nested nature of the variables, as well as the fact that consecutive responses to the Rorschach cards can be conceptualized as repeated measures.
It was anticipated that R_InCard and card number would likely factor into the optimal weighting of FA and PF scores to jointly predict Diagnostic Severity. In a broad overview of HLM, Garson (2013) summarized that HLM is a type of multilevel model, also broadly referred to as a Linear Mixed Model (LMM), and advised that HLM/LMM is an appropriate way to model data that violate assumptions of independent observations, as correlated error is accurately modeled in HLM. Assumptions of independence are often violated in general linear models (e.g., analysis of variance, correlation, regression) when observations are clustered by grouping variables that can also cause correlated error terms. Garson warns that the standard errors for prediction parameters that get computed through general linear modeling (e.g., beta values for regression equations) are inaccurate when the error terms are clustered by a grouping factor. Such inaccuracies in the computed standard errors (e.g., incorrect magnitude or direction of beta values for the predictor variables in a regression equation) can lead to very different conclusions about the relationships between variables than when using HLM. As described by Garson (2013), any time data are sampled, there could be a random effect of the sampling unit as a grouping variable, violating the assumption of independence of error terms in general linear modeling and OLS regression. Garson summarized the difference between the models as follows:

… Unlike OLS regression, linear mixed models take into account the fact that over many samples, different b coefficients for effects may be computed, one for each group. Conceptually, mixed models treat b coefficients as random effects drawn from a normal distribution of possible b's, whereas OLS regression treats the b parameters as if they were fixed constants (albeit within a confidence interval)… In summary, OLS regression and GLM assume error terms are independent and have equal error variances, whereas when data are nested or cross-classified by groups, individual-level observations from the same upper-level group will not be independent but rather will be more similar due to such factors as shared group history and group selection processes. While random effects associated with upper-level random factors do not affect lower-level population means, they do affect the covariance structure of the data. Indeed, adjusting for this is a central point of LMM models and is why linear mixed models are used instead of regression and GLM, which assume independence. (pp. 5-6)

When this concept is tied back to the interpretation of research results, the effect of inaccurate standard errors in general linear models is an inflation of the Type I error rate (i.e., concluding there is a relationship between variables when there is not).

Linear mixed modeling techniques have a broad array of language and terminology tied to them, with labels oftentimes varying by modeling technique, author, and field of study. Garson (2013) noted how terms for LMM models currently used in various disciplines include random intercept modeling, random coefficients modeling, random coefficients regression, random coefficient regression modeling, random effects modeling, mixed effects modeling, hierarchical linear modeling, linear mixed modeling, growth modeling, and longitudinal modeling.
According to Garson (2013), “In sociology, ‘multilevel modeling’ is common, alluding to the fact that regression intercepts and slopes at the individual level may be treated as random effects of a higher (ex., organizational) level. And in statistics, the term ‘covariance components models’ is often used, alluding to the fact that in linear mixed models one may decompose the covariance into components attributable to within-groups versus between-groups effects.” What links all of these models, despite the variety of names, is that each approach statistically accounts for the clustering of scores at the lowest level by at least one grouping variable when the prediction model is calculated.

Although it quickly became apparent that there is a wide variety of potential applications for HLM in psychological research, HLM is still a relatively new statistical approach, and guidelines for it are still being developed (Beaubien, Hamman, Holt, & Boehm-Davis, 2001; Raudenbush & Bryk, 2002). Additionally, few articles published within the social sciences employ HLM, and fewer still discuss the approach with the level of detail needed for HLM novices to fully digest the method and results. Luke (2004) and Hox (2010) provide comprehensive overviews of the statistical foundations and technical details of HLM approaches, though the texts are written for readers with intermediate to advanced knowledge of statistics, and they do not discuss the application of HLM within SPSS specifically. Heck, Thomas, and Tabata (2010) is an excellent example-based resource for conducting HLM within SPSS, especially with regard to understanding the menu options and specifications that are specific to the SPSS software. They also include their syntax in the text and provide a copy of the database used in their examples. Although his book is not entirely specific to SPSS, Garson (2013) provides a comprehensive overview of HLM written for readers with intermediate-level knowledge of statistics, and he includes straightforward summaries of output within the HLM example chapters.

Using HLM allowed for more accurate exploration of the Criterion Database because it can correctly model error terms that are correlated (rather than independent of each other) due to repeated measures (i.e., responses) occurring within the 10 sequentially administered cards. The modeling approach used for the current study, as well as the terminology used in describing the models and the results, closely follows conventions established by Garson (2013). I completed a series of linear mixed models that included hierarchical (i.e., nested) data, and that are therefore referred to as hierarchical linear models. The models were used to explore differences between groups as well as within them.

With nested data, the variables can be conceptualized as falling within different levels of the data. Level-1 is the lowest level of the data hierarchy; level-1 variables are nested within level-2 groupings, which are nested within level-3 groupings, and so on. These grouping variables are also referred to as cluster variables or subject variables. Variables can be defined in a variety of ways within HLM. The models include a dependent variable, also called the predicted variable. This variable must be a level-1 variable, meaning it occurs at the lowest level of measurement in the model. For the Criterion Database, level-1 variables occur at the Rorschach response level.
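As a concrete illustration of the contrast Garson draws, the sketch below shows schematic SPSS syntax for an ordinary regression, which treats every response as an independent observation, and for a minimal mixed model that instead gives each person a random intercept. The variable names are the hypothetical stand-ins introduced earlier, and the commands are a sketch rather than the specifications actually used in the analyses.

    * OLS regression: all responses treated as independent observations.
    REGRESSION
      /DEPENDENT FA
      /METHOD=ENTER R_InCard.

    * Linear mixed model: responses clustered within persons, with a
      random intercept per person so that correlated error is modeled.
    MIXED FA WITH R_InCard
      /FIXED=R_InCard | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /PRINT=SOLUTION TESTCOV.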
Predictor variables can include level-1 variables as well as variables from other levels of the hierarchy, and they can be entered as fixed effects and/or random effects. Fixed effects are effects that impact the intercept of the dependent variable (i.e., the mean of the dependent variable when all other predictors are set at zero) in the model. Fixed effects are generally thought of as variables whose values of interest are all represented in the data file, and predictors of interest should be included as fixed effects. Random effects are effects that impact the covariance structure of the data. Random effects are typically modeled for variables whose values can be considered a random sample from a larger population, and they are useful for accounting for excess variability in the dependent variable. An effect can also be specified as both fixed and random if it contributes to both the intercept and the covariance structure in the model.

Fixed effects are specified as factors or covariates. A factor is an independent categorical variable that defines groups of cases, and each unique group is assigned a fixed effect parameter estimate that indicates how group membership impacts the intercept of the dependent variable. Factors consist of different nominal levels, not to be confused with the cluster/grouping levels of the data hierarchy. The levels of a factor equate to the data values of the factor, and each level can have a different linear effect on the value of the dependent variable. For example, Rorschach card number could be specified as a fixed effect factor at level-2 of the data hierarchy, with factor levels 1-10 identifying the 10 cards that are the possible data values for the variable. A covariate is an independent dimensional scale variable, and changes in the value of a covariate should be linearly associated with changes in the value of the dependent variable. Scale predictors should be selected as covariates in the model because, within combinations of factor levels, values of covariates are assumed to be linearly correlated with values of the dependent variable. Fixed effect covariates are also assigned fixed effect parameter estimates that indicate how the value of the covariate impacts the intercept of the dependent variable.

Variables can also be defined as repeated effects. Repeated effects variables mark multiple observations of a single subject. Specification of repeated effects (i.e., repeated measure variables) is a way to relax the assumption of independence of the error terms. Subject variables (i.e., grouping variables) are used to define the individual subjects of the repeated measurements; by identifying subject variables, the error terms for each individual specified by the subject variable are treated as independent of those of other individuals in the model. The covariance structure applied in the model specifies the relationship between the levels of the repeated effects. A variety of covariance matrix structures are available, allowing for residual terms with a wide variety of variances and covariances.

To explore the structure of the Criterion Database, HLM modeling was used to predict the response-level FA score and PF scores (PFM and PFN1.5). The predictor variables included response number within a card (R_InCard), the specific card the response was delivered to, Diagnostic Severity, and the individual subject (the subject ID number).
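In SPSS's MIXED procedure these distinctions map directly onto the syntax: factors are listed after BY, covariates after WITH, random effects go on the RANDOM subcommand, and repeated effects go on the REPEATED subcommand with a subject and covariance type. The sketch below is a schematic illustration of that mapping, using the same hypothetical names as before; it is not one of the model specifications reported in the Results.

    * Factors (categorical) follow BY; covariates (scale) follow WITH.
    * RANDOM contributes to the covariance structure; REPEATED relaxes
      the independence assumption for responses within person-by-card.
    MIXED FA BY R_InCard CardNum WITH DxSeverity
      /FIXED=R_InCard CardNum DxSeverity | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /REPEATED=R_InCard | SUBJECT(ID*CardNum) COVTYPE(ID)
      /PRINT=SOLUTION TESTCOV.

Here COVTYPE(ID) requests a scaled-identity residual covariance matrix; COVTYPE(DIAG) would instead allow a separate variance for each response position, a distinction that becomes relevant in the models reported below.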
These were used as components in the various HLM models. Modeling of each variable began with null models, which are random intercept models. In the 2-level null model, the intercept of the dependent variable (level-1; i.e., FA, PFM, or PFN1.5) was predicted as a random effect of the identified grouping variable (level-2; i.e., card number or R_InCard), with no other predictors included in the model. For the 3-level null model, a second grouping variable (level-3; i.e., Diagnostic Severity) was added as a random effect predictor of the dependent variable's intercept. These models can also be considered one-way ANOVA models with random effects. The null models were used to determine whether the data demonstrate a hierarchical structure with both between-group and within-group variance.

Following the null models, predictors can be added to the model and additional structure can be specified. Various combinations and iterations of the predictor variables R_InCard, card number, Diagnostic Severity, and ID number were used to specify fixed effects (i.e., predictor equations) and random effects (i.e., variance not accounted for by the predictor equations). Fixed effects included both main effects and cross-level interaction effects. Revisions were also made to the covariance matrices when appropriate.

Of note, SPSS does not allow the reference category to be changed within the HLM syntax or dropdown specifications. Therefore, the R_InCard and card number variables were recoded, and the recoded variables were used in place of the originals for all HLM modeling. The recoding allowed the reference categories to be set to Card 1 (instead of Card 10) and to response 1 within card (instead of response 4 within card). By using the recoded variables, when interpreting the results, comparisons were made to responses on the first card instead of the last card, and to the first response within each card instead of the last. This makes for more intuitive interpretations of the results. It also makes the reference category the most frequent category, as all patients gave at least one response to each card; most did not give a fourth.

Supplemental analysis strategies. HLM was the initial approach used in modeling the data. However, relationships between variables were smaller than expected, limiting the usefulness of HLM. Therefore, supplemental strategies were employed to further explore the structure of, and relationships between, the variables. The strategies included simple correlation coefficients and tables, as well as graphical representations of the data. The initial descriptive statistics, as well as the HLM analyses, made use of the data at the response level. In completing the supplemental analyses, the data were first aggregated at the protocol level. For the protocol-level aggregation, each person received a protocol-level score computed as the mean of the response-level scores for FA and both PF variables. These mean scores were computed overall, individually for each card, and sequentially for each first through fourth response to a card.

Chapter Four

Results

Interrater Reliability

There were 250 responses in the 10 protocols independently coded by S. Horn and N. Bromley. The response-level single measure ICC was .75, indicating good to excellent interrater reliability (Cicchetti, 1994).
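For reference, the sketch below shows one way a single-measure ICC of this kind can be obtained in SPSS, assuming the two coders' scores for the double-coded responses are held in a pair of hypothetical columns (rater1, rater2). The dissertation does not report which ICC model was specified, so the two-way random, absolute-agreement choice shown here is an assumption.

    * Single-measure ICC for two coders; the model and type shown are
      assumptions, not the dissertation's reported specification.
    RELIABILITY
      /VARIABLES=rater1 rater2
      /ICC=MODEL(RANDOM) TYPE(ABSOLUTE) CIN=95.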
Frequency Tables: Descriptives

Within the U.S. sample (n = 262 protocols), I identified Rorschach responses that had no available match number for one or more objects in the response. Each of these previously unlisted objects was assigned a unique match number and its frequency was determined. FA ratings were assigned to each new object that occurred in at least 1.5% of the U.S. protocols. The five new objects and their associated data values are provided in Table 1.

Table 1

New Response Objects Derived From the U.S. Frequency Sample

Card                                                          Median      Frequency
Number   Location   Angle   Object                            FA Rating   (% U.S. Protocols)
1        W                  Airplane                          4.00        1.53
3        D1         v       Frog                              3.00        1.53
4        D4         v       Penguin Head                      4.00        3.44
5        W                  Shoes (2; toes pointing out)      2.00        1.91
9        W                  Goblet or Trophy Cup              3.00        1.53

Note. The Angle identifier "v" indicates the card was held at a 180-degree rotation.

Criterion Database: Descriptives

The Criterion Database contained 159 valid Rorschach protocols with accompanying Diagnostic Severity scores. Of the 3,979 responses in the database, 3,897 had complete response-level data. The 82 responses that were not assigned an FA score and were not matched with PF data most often lacked any type of form, although some were verbalizations that were not considered valid responses to the task (e.g., "Blue inkblots"; "Something on each side so it's symmetrical. That's good enough for that.").

Table 2 provides descriptive statistics for the variables used in the primary analyses. As shown in the table, Diagnostic Severity had the full range of possible values represented in the sample, with a moderately high mean score (M = 3.52, SD = 1.06). Response-level FA scores also covered the full possible range of values, with a mean of 3.32 (SD = 1.00).

As a reminder, PFM is the mean of the international percentage-based PF values across all the objects in a response (and those PF values are themselves means computed across all six of the PF databases when the frequencies were 1.5% or higher). It is the response-level average of the object-level scores. The descriptive statistics indicate that PFM scores ranged from 0 to 63.25. At the low end, participants gave responses that contained only objects that did not appear in any of the six countries at a frequency of 1.5% or higher. Recall that a value of zero was applied to all objects that had a frequency of less than 1.5% in a given sample because it was impractical to have all objects in each sample translated into English, and impossible in the case of the Italian sample. At the high end, at least one participant gave a response with a PFM score of 63.25, indicating that, on average, the objects in that single response were present in 63.25% of the protocols across samples. In other words, that response contained objects that more than half of the people in the comparison samples also saw. The mean response-level PFM score across all responses and protocols (M = 8.80, SD = 14.37) indicates that, on average, people delivered responses with objects that about 9% of people in the comparison samples also saw.

PFN1.5 is the mean of the international object-level count-based PF values within a response. It is the response-level average of the object-level count-based scores, and it indicates on average how many of the six samples contained the objects in a response at a frequency of 1.5% or more. The observed range for PFN1.5 was 0-6.
At the low end, people gave responses containing only objects that were observed in less than 1.5% of protocols across all six comparison samples. At the high end of the range, people gave responses containing only objects that were present in all six samples at a frequency of 1.5% or higher. The mean of the PFN1.5 variable was 2.37 (SD = 2.43): on average, people gave response objects that were present in 2.37 of the six samples at a frequency of 1.5% or higher.

Table 2

Descriptive Statistics for the Criterion Database

Variable               M      SD      Min    Max     Skew    Kurtosis
Diagnostic Severity    3.52   1.06    1.00   5.00    -0.11   -0.96
FA                     3.32   1.00    1.00   5.00    -0.34   -0.69
PFM                    8.80   14.37   0.00   63.25    2.02    3.36
PFN1.5                 2.37   2.43    0.00   6.00     0.39   -1.51

Table 3 provides mean values of the primary Rorschach variables for the Criterion Database, organized by card number and by R_InCard. As anticipated, the cards can be conceptualized as having different levels of complexity, and they show different mean scores for the Rorschach variables reported in the table. The means also vary according to which response within a card a person is on. For some cards, the average response had a fairly high level of fit (FA) and frequency (PFM and PFN1.5). For example, on Card 5 the response-level FA score (M = 3.79), as well as the PFM (M = 17.47) and PFN1.5 (M = 3.37) scores, are high compared to other cards. On Card 9, the scores for FA (M = 2.78), PFM (M = 1.32), and PFN1.5 (M = 1.05) are much lower. This indicates that, on average, people gave responses to Card 9 that had lower fit scores (FA) and contained less common response objects than the responses people tended to deliver on Card 5.

Of additional value, Table 3 demonstrates the variation in fit and object frequency as a function of which response within a card is being examined, both within cards and across cards. Response-level FA decreases, on average, with each additional response a person delivers within a card (R_InCard 1 M = 3.57; R_InCard 2 M = 3.23; R_InCard 3 M = 3.07; R_InCard 4 M = 2.94). The same pattern holds for PFM and PFN1.5 when examined across the levels of R_InCard. Within each card, the trend of fit and frequency scores decreasing with each subsequent response is highly consistent as well.
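Cell means of this kind are straightforward to produce in SPSS. The sketch below shows one plausible way to generate the breakdown reported in Table 3, again using the hypothetical variable names from the earlier sketches rather than the study's actual syntax.

    * Mean FA, PFM, and PFN1.5 broken down by card and by response
      number within card.
    MEANS TABLES=FA PFM PFN1_5 BY CardNum BY R_InCard
      /CELLS=MEAN COUNT.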
Table 3

Mean Values by Card Number and R_InCard for the Criterion Database

                                             Card Number
Variable  R_InCard      1      2      3      4      5      6      7      8      9     10   Total
FA        1          3.87   3.47   3.39   3.61   4.34   3.34   3.65   3.60   2.84   3.50   3.57
          2          3.50   3.38   3.45   3.12   3.48   3.03   3.18   3.07   2.86   3.22   3.23
          3          3.26   3.26   3.19   3.03   3.07   2.92   3.03   3.00   2.60   3.20   3.07
          4          2.99   3.06   3.05   2.80   2.99   2.72   2.93   3.00   2.53   3.09   2.94
          Total      3.56   3.38   3.36   3.29   3.79   3.13   3.35   3.26   2.78   3.29   3.32
PFM       1         15.13  10.16  27.35  13.35  28.85   6.29  12.62  19.39   1.51   4.78  14.04
          2          7.22   5.94  11.56   5.51  10.67   3.48   4.80   8.68   1.49   3.26   6.25
          3          5.63   5.56   6.85   3.22   4.33   2.10   2.35   7.34   0.65   3.08   4.18
          4          2.61   5.44   8.52   1.31   0.35   2.93   2.81   2.91   1.10   2.23   2.96
          Total      9.60   7.51  16.95   8.17  17.47   4.42   7.69  12.18   1.32   3.61   8.80
PFN1.5    1          4.12   3.05   3.54   3.10   4.68   1.92   3.96   2.85   1.17   2.99   3.15
          2          2.91   2.35   2.59   1.77   2.67   1.37   2.12   1.80   1.22   2.19   2.11
          3          1.85   2.16   1.88   1.39   1.92   0.77   1.54   1.14   0.52   2.14   1.56
          4          1.18   1.52   1.55   0.76   0.44   0.80   1.63   0.98   0.77   1.52   1.18
          Total      3.05   2.54   2.79   2.21   3.37   1.48   2.80   2.03   1.05   2.36   2.37
N         1           158    156    159    157    159    159    158    153    150    152   1561
          2           152    144    139    143    133    137    139    136    135    143   1401
          3            80     70     73     61     48     60     56     68     67     95    678
          4            28     25     20     19     16     20     16     32     24     57    257
          Total       418    395    391    380    356    376    369    389    376    447   3897

Criterion Database: HLM

HLM models for FA. FA Model 1 was the 2-level null model. The intercept of the FA scores (level-1) was specified as a random function of card number (the level-2 grouping variable). The only fixed effect specified was the level-1 intercept. The model fit was -2LL = 10905.25 (see Table 4 for a statistical summary of all FA HLM models), and the SPSS Type III Test of Fixed Effects table indicated a significant card number effect on FA scores (F = 1810.03, p < .05), signaling that constructing a multilevel model was an appropriate way to explore the structure of the data. In other words, there was significant between-card variation in FA. The SPSS Estimates of Covariance Parameters table also indicated that the clustering of FA scores by card number (as a level-2 random effect) accounted for a significant portion of the total variance (Estimate = 0.06, p < .05). The residual component signaled that a significant amount of FA score variance was not accounted for by the model (Estimate = 0.95, p < .05). Thus, there was evidence of unexplained within-card variation in FA scores.

FA Model 2 was the 3-level null model. As in the 2-level null model, there were no predictors at any level; the only specified fixed effect was the level-1 intercept. The intercept of FA (level-1) was modeled with an accounting of the card number effect (level-2) and the possible grouping effect by person (ID number at level-3). The model fit statistic was slightly higher (-2LL = 11074.26), indicating a worse fit than the 2-level null model. The test of fixed effects remained significant (F = 26502.56, p < .05), as expected, indicating variance in the intercept attributable to higher-order effects. In the SPSS Estimates of Covariance Parameters table, the within-person residual component (Estimate = 0.98, p < .05) was higher than in the 2-level null model, indicating more unexplained variance in FA than in Model 1. The between-card effects within person component (card number*ID number Estimate = 0.01, p = .78) was reduced to a nonsignificant level in Model 2 due to adding the ID number component as a level-3 grouping variable. The between-person effects accounted for a small amount of variance in FA (ID number component Estimate = .03, p < .05).
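As a sketch of what these null models look like in MIXED syntax (hypothetical variable names as before, not the study's verbatim syntax), the 2-level specification clusters responses by card alone, while the 3-level specification adds person-level and card-within-person random intercepts, the latter corresponding to the card number*ID number component reported above.

    * FA Model 1 sketch: 2-level null model, responses clustered by card.
    MIXED FA
      /FIXED=| SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(CardNum) COVTYPE(VC)
      /PRINT=SOLUTION TESTCOV.

    * FA Model 2 sketch: 3-level null model, adding person and
      card-within-person random intercepts.
    MIXED FA
      /FIXED=| SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /RANDOM=INTERCEPT | SUBJECT(ID*CardNum) COVTYPE(VC)
      /PRINT=SOLUTION TESTCOV.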
FA Model 3 was a 3-level Random Intercepts Model. A predictor variable, R_InCard (level-1), was added as a fixed factor to Model 2 to account for possible trends due to R_InCard. Compared to the null models, Model 3 had slightly better fit (-2LL = 10890.04). The test of fixed effects revealed significant main effects for the intercept (F = 18793.36, p < .05) and for R_InCard (F = 63.34, p < .05), indicating there was variance in the intercept of FA attributable to higher-order effects as well as to R_InCard. In the SPSS Estimates of Covariance Parameters table, the components indicated there was still variance in FA attributable to between-person effects (ID number component Estimate = 0.02, p < .05) as well as unexplained within-person variance (residual component Estimate = 0.92, p < .05). In this model, a regression equation for FA was built for each group rather than having a single regression equation across all groups (i.e., people). Because R_InCard was modeled as a fixed effect predictor with no corresponding R_InCard random effects, the slope coefficients were the same for each regression line (i.e., each ID number). In other words, the regression lines indicated how FA scores were impacted by R_InCard, with that impact (i.e., the slope coefficients) being consistent across groups (i.e., people), but with each person having a different FA intercept (i.e., a different FA mean across the responses within their protocol). The SPSS Estimates of Fixed Effects table gives estimates of individual parameters and was generated in response to having identified fixed effect predictors: the FA intercept and R_InCard as a fixed factor (level-1). The parameter estimate for the intercept of FA (Estimate = 3.57, p < .05) indicates the mean value of FA when all predictors are set at zero. The remaining parameter estimates indicated that, with R_InCard = 1 as the reference category, predicted FA was highest on the first response within a card (Estimate = 0.00; i.e., no change from 3.57) and slightly lower for each subsequent response within a card (second response Estimate = -0.33, p < .05; third response Estimate = -0.49, p < .05; fourth response Estimate = -0.61, p < .05).

FA Model 4 was a 3-level Random Intercepts Model with repeated measures. The specification of card number as a level-2 grouping variable was removed, and R_InCard (level-1) within card number (level-2) was specified as a repeated measure with a scaled-identity matrix covariance structure. This allowed for modeling the possible correlation of residual errors due to R_InCard within card number being a repeated measure within subject (ID number, at level-3). Compared to Model 3, Model 4 fit was essentially unchanged (-2LL = 10891.88). The main effects for the intercept (F = 18851.17, p < .05) and for R_InCard (F = 62.33, p < .05) remained significant. The Estimates of Fixed Effects showed parameter estimates for the FA intercept and for R_InCard that were also essentially unchanged. In the SPSS Estimates of Covariance Parameters table, the components indicated variance in FA attributable to between-person effects (ID number component Estimate = 0.02, p < .05) as well as significant within-person repeated measures variance (repeated measures Estimate = 0.94, p < .05).
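A sketch of this respecification (hypothetical names; schematic rather than verbatim): the card-level random intercept is dropped, and the residuals for the first through fourth responses within each person-by-card combination are instead governed by a REPEATED subcommand with a scaled-identity structure.

    * FA Model 4 sketch: R_InCard within card as a repeated measure.
    MIXED FA BY R_InCard
      /FIXED=R_InCard | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /REPEATED=R_InCard | SUBJECT(ID*CardNum) COVTYPE(ID)
      /PRINT=SOLUTION TESTCOV.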
FA Model 5 was also a 3-level Random Intercepts Model with repeated measures. Compared to Model 4, card number (level-2) was added as a fixed factor predictor variable to account for possible predictive trends due to card number. Compared to the previous FA models, Model 5 fit was substantially improved (-2LL = 10648.04). The main effects for the intercept (F = 18981.03, p < .05) and for R_InCard (F = 64.37, p < .05) remained significant, and card number entered the model as a main effect (F = 28.00, p < .05). Within the SPSS Estimates of Fixed Effects table, the parameter estimates for R_InCard displayed the same pattern as in Models 3 and 4, with predicted FA being lower for later responses within a card. The parameter estimates for card number indicated that, with card number = 1 as the reference category, predicted FA also differs for every card. In order, predicted FA was highest for Card 5 (Estimate = 0.19, p < .05), followed by Card 1 (Estimate = 0.00), Card 2 (Estimate = -0.20, p < .05), Card 3 (Estimate = -0.22, p < .05), Card 10 (Estimate = -0.23, p < .05), Card 7 (Estimate = -0.23, p < .05), Card 4 (Estimate = -0.29, p < .05), Card 8 (Estimate = -0.30, p < .05), Card 6 (Estimate = -0.45, p < .05), and Card 9 (Estimate = -0.79, p < .05). In the SPSS Estimates of Covariance Parameters table, the between-person effects remained (ID number component Estimate = 0.02, p < .05), along with slightly reduced repeated measures variance (repeated measures Estimate = 0.88, p < .05).

FA Model 6, like Models 4 and 5, was a 3-level Random Intercepts Model with repeated measures. In Model 6, a factor-factor cross-level interaction term (R_InCard*card number) was added to the list of fixed effects. The interaction term was used to model the effect of card number (level-2) on R_InCard (level-1) in predicting FA. More specifically, since both R_InCard and card number were specified as factors (not covariates), the interaction term was used to explore the possibility that each unique combination of factor levels might have a different linear effect on FA. Model fit was again improved (-2LL = 10563.04). The main effects for the intercept (F = 18071.57, p < .05), R_InCard (F = 68.08, p < .05), and card number (F = 11.53, p < .05) remained significant, and the R_InCard*card number interaction term entered the model as a small but statistically significant fixed effect (F = 3.18, p < .05). Within the SPSS Estimates of Fixed Effects table, although the values changed slightly, the parameter estimates for R_InCard displayed the same pattern as in Models 3, 4, and 5. Predicted FA was different for each level of R_InCard, with the value being highest for the first response and lower for each subsequent response within a card. Although card number also remained as a main effect, and all cards still had a unique parameter estimate, the parameter estimates demonstrated a slightly altered pattern compared to Model 5. Predicted FA was still highest for Card 5, followed by Card 1, and lowest for Cards 6 and then 9. However, the remaining cards showed a slightly different pattern when the parameter estimates were placed in descending order (i.e., Cards 7, 4, and 8 were now higher than 10, 2, and 3). In examining the interaction effect parameter estimates, predicted FA scores were set at the FA mean intercept (Estimate = 0) for 13 of the 40 combinations of R_InCard*card number. Eighteen others did not differ from that mean intercept at a statistically significant level (p ≥ .05).
The remaining nine interaction effect parameter estimates were statistically different from the mean intercept. Relative to the baseline estimates provided by the main effects, the interaction-based intercepts were higher for the 2nd response to Cards 3 and 9; the 3rd response to Cards 2, 3, and 9; and the 4th response to Cards 9 and 10; they were lower for the 2nd and 3rd responses to Card 5. The interaction effect parameter estimates can be interpreted as adjustments to the main effects based on the exact combination of factor levels. In general, FA did not decline as much as expected on subsequent responses to Cards 3 and 9, but it declined more than expected on Card 5, with "expected" defined by the marginal means from card number and R_InCard. This can be seen in the pattern of means in Table 3. In the SPSS Estimates of Covariance Parameters table, the between-person effects remained (Estimate = 0.02, p < .05), as did the repeated measures variance (Estimate = 0.86, p < .05). The addition of the interaction term is also the reason the pattern of estimates for card number changed. Though it may be a more precise model, it is also more complex to understand at a conceptual level when examining parameter estimates, due to the sheer number of effects that are modeled and the fact that the various fixed effects impact each other in the modeling process.

In FA Model 7, the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification. With a diagonal matrix, as with the scaled-identity matrix, residual covariances between occasions are assumed to be independent of each other (i.e., equal to 0.0), and this is typically used as the default specification for repeated measures. The difference between the two is that the diagonal matrix permits unequal variances and thus estimates a different FA variance for the 1st, 2nd, 3rd, and 4th response to a card. The notably lowered model fit statistic in Table 4 (-2LL = 10398.06) indicated an improvement in the model from allowing unequal variances across sequential responses within cards, with later responses within a card having higher variance estimates. As anticipated, the fixed effects (i.e., the main effects and the interaction term) remained significant. Within the SPSS Estimates of Fixed Effects table, the main effect of R_InCard retained the same pattern of estimates as in Models 3-6. The card number main effect also retained the same pattern of parameter estimates as in Model 6: When placed in order of descending estimates, the factor levels were Cards 5, 1, 7, 4, 8, 10, 2, 3, 6, then 9. Examination of the SPSS Estimates of Covariance Parameters table confirmed that the between-person effects remained (Estimate = 0.02, p < .05) and that the diagonal matrix specification was appropriate, as the covariance parameter estimates for all R_InCard*card number combinations were statistically significant (p < .05), supporting the assumption of no residual covariance between measurement occasions.
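In MIXED syntax, this change amounts to swapping the covariance keyword on the REPEATED subcommand. A minimal sketch of the Model 7 specification (hypothetical names, as before) is:

    * FA Model 7 sketch: COVTYPE(DIAG) allows a separate residual
      variance for each response position within a card.
    MIXED FA BY R_InCard CardNum
      /FIXED=R_InCard CardNum R_InCard*CardNum | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /REPEATED=R_InCard | SUBJECT(ID*CardNum) COVTYPE(DIAG)
      /PRINT=SOLUTION TESTCOV.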
For FA Model 8, Diagnostic Severity was added as a fixed effect covariate (level-3). The model would still be classified as a 3-level Random Intercepts Model for FA with repeated measures, but with an additional fixed effect predictor variable specified. The model fit statistic was relatively unchanged (-2LL = 10391.13). The fixed effects remained significant for the intercept (F = 2640.59, p < .05), R_InCard (F = 70.28, p < .05), card number (F = 11.55, p < .05), and the R_InCard*card number interaction term (F = 3.70, p < .05). Additionally, Diagnostic Severity entered the model as a main effect (F = 7.17, p < .05). Within the SPSS Estimates of Fixed Effects table, the intercept was slightly higher than in the earlier models (Estimate = 4.04, p < .05). The main effect of R_InCard retained the same pattern of estimates as in Models 3-7, with predicted FA being lower for later responses within a card (first response Estimate = 0; second response Estimate = -0.37, p < .05; third response Estimate = -0.61, p < .05; fourth response Estimate = -0.89, p < .05). The card number main effect also retained the same pattern of parameter estimates as in Models 6 and 7: In descending order, predicted FA was highest for Card 5 (Estimate = 0.47, p < .05), followed by Card 1 (Estimate = 0.00), Card 7 (Estimate = -0.22, p < .05), Card 4 (Estimate = -0.26, p < .05), Card 8 (Estimate = -0.27, p < .05), Card 10 (Estimate = -0.37, p < .05), Card 2 (Estimate = -0.40, p < .05), Card 3 (Estimate = -0.48, p < .05), Card 6 (Estimate = -0.53, p < .05), and Card 9 (Estimate = -1.03, p < .05). Diagnostic Severity was specified as a fixed effect covariate (i.e., a linear variable, as opposed to a nominal fixed effect factor), so its parameter estimate demonstrated a single linear effect of Diagnostic Severity on predicted FA scores (Estimate = -0.05, p < .05), with higher Diagnostic Severity scores predicting slightly lower FA scores, on average. The SPSS Estimates of Covariance Parameters table was essentially unchanged (ID number Estimate = 0.02, p < .05; all R_InCard*card number covariance parameters had p < .05).

In FA Model 9, two factor-covariate cross-level interaction terms (R_InCard*Diagnostic Severity and card number*Diagnostic Severity) were added to the list of fixed effects. The interaction terms were used to model the possible effects of R_InCard (level-1) and card number (level-2) on Diagnostic Severity (level-3) in predicting FA. In other words, the linear relationship between Diagnostic Severity and FA (i.e., the slope of Diagnostic Severity) could change for different levels of R_InCard and card number. The model fit statistic was again relatively unchanged (-2LL = 10385.32). The fixed effects remained significant for the intercept (F = 1864.54, p < .05), R_InCard (F = 7.10, p < .05), card number (F = 2.01, p < .05), and the R_InCard*card number interaction term (F = 3.71, p < .05). However, Diagnostic Severity dropped out of the model as a main effect (F = 3.62, p = .06), and neither of the new interaction terms was significant (R_InCard*Diagnostic Severity F = 0.27, p = .85; card number*Diagnostic Severity F = 0.56, p = .83). In essence, the slope of the previously seen linear relationship between Diagnostic Severity and FA was not altered according to different levels of R_InCard or card number. The model also appears to be over-specified, as a main effect was lost.
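A sketch of the fullest FA specification explored here (Model 9), with Diagnostic Severity entered after WITH as a covariate and the two factor-covariate cross-level interactions listed on FIXED; as before, the variable names are hypothetical and the syntax is schematic rather than verbatim. Dropping the last two interaction terms yields Model 8.

    * FA Model 9 sketch: covariate main effect plus factor-covariate
      cross-level interactions.
    MIXED FA BY R_InCard CardNum WITH DxSeverity
      /FIXED=R_InCard CardNum R_InCard*CardNum DxSeverity
             R_InCard*DxSeverity CardNum*DxSeverity | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /REPEATED=R_InCard | SUBJECT(ID*CardNum) COVTYPE(DIAG)
      /PRINT=SOLUTION TESTCOV.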
Table 4

Statistical Summary of FA HLM Models for the Criterion Database

                                                 Type III Tests of Fixed Effects
Model / Effect                    -2LL        Num df   Denom df         F        p
Model 1                         10905.25
  Intercept                                      1        9.97     1810.03    < .01
Model 2                         11074.25
  Intercept                                      1      152.61    26502.56    < .01
Model 3                         10890.04
  Intercept                                      1      278.79    18793.36    < .01
  R_InCard                                       3     3079.55       63.34    < .01
Model 4                         10891.88
  Intercept                                      1      276.96    18851.17    < .01
  R_InCard                                       3     3844.18       62.33    < .01
Model 5                         10648.04
  Intercept                                      1      276.14    18981.03    < .01
  R_InCard                                       3     3841.59       64.37    < .01
  Card Number                                    9     3775.92       28.00    < .01
Model 6                         10563.04
  Intercept                                      1      300.05    18071.57    < .01
  R_InCard                                       3     3844.16       68.08    < .01
  Card Number                                    9     3833.64       11.53    < .01
  R_InCard * Card Number                        27     3789.97        3.18    < .01
Model 7                         10398.06
  Intercept                                      1      243.17    17916.73    < .01
  R_InCard                                       3      653.15       69.90    < .01
  Card Number                                    9       95.45       11.56    < .01
  R_InCard * Card Number                        27      141.23        3.68    < .01
Model 8                         10391.13
  Intercept                                      1      162.49     2640.59    < .01
  R_InCard                                       3      647.19       70.28    < .01
  Card Number                                    9       95.07       11.55    < .01
  R_InCard * Card Number                        27      142.79        3.70    < .01
  Dx Severity                                    1      151.85        7.17      .01
Model 9                         10385.32
  Intercept                                      1      241.87     1864.54    < .01
  R_InCard                                       3      971.77        7.10    < .01
  Card Number                                    9      604.47        2.01      .04
  R_InCard * Card Number                        27      140.36        3.71    < .01
  Dx Severity                                    1      244.28        3.62      .06
  R_InCard * Dx Severity                         3      957.52        0.27      .85
  Card Number * Dx Severity                      9      639.76        0.56      .83

Note. The identifier "Dx Severity" refers to Diagnostic Severity.

HLM models for PFM. PFM Model 1 was the 2-level null model. The intercept of the PFM scores (level-1) was specified as a random function of card number (the level-2 grouping variable). The only fixed effect specified was the level-1 intercept. The model fit was -2LL = 31367.94 (see Table 5 for a statistical summary of all PFM HLM models), and the SPSS Type III Test of Fixed Effects table indicated a significant card number effect on PFM scores (F = 30.54, p < .05), signaling that, as with FA, constructing a multilevel model was an appropriate way to explore the structure of the PFM data. In other words, there was significant between-card variation in PFM. The SPSS Estimates of Covariance Parameters table also indicated that the clustering of PFM scores by card number (as a level-2 random effect) accounted for a significant portion of the total variance (Estimate = 25.41, p < .05). The residual component signaled that a significant amount of PFM score variance was not accounted for by the model (Estimate = 181.47, p < .05). Thus, there was evidence of unexplained within-card variation in PFM scores.

PFM Model 2 was the 3-level null model. As in the 2-level null model, there were no predictors at any level; the only specified fixed effect was the level-1 intercept. The intercept of PFM (level-1) was modeled with an accounting of the card number effect (level-2) and the possible grouping effect by person (ID number at level-3). The model fit statistic was higher (-2LL = 31825.05), indicating a worse fit than the 2-level null model. The test of fixed effects remained significant (F = 1204.44, p < .05), as expected, indicating variance in the intercept attributable to higher-order effects. In the SPSS Estimates of Covariance Parameters table, the residual component (Estimate = 200.72, p < .05) was higher than in the 2-level null model, indicating more unexplained within-person variance in PFM than in Model 1.
The variance in PFM attributable to between-card effects within person (card number*ID number component Estimate = 3.99, p = .34) was reduced to a non-significant level in Model 2, due to adding the ID number component as a level-3 grouping variable. The variance attributable to between-person effects was also non-significant (ID number component Estimate = 1.72, p = .17).

PFM Model 3 was a revised 2-level null model, in which the intercept of the PFM scores (level-1) was specified as a random function of ID number (the level-3 grouping variable). The only fixed effect specified was the level-1 intercept. The model fit (-2LL = 31826.00) was essentially unchanged compared to Model 2. The SPSS Type III Test of Fixed Effects table indicated variance in the intercept attributable to higher-order effects on PFM scores (F = 1197.99, p < .05), continuing to signal the need for a multilevel model. However, the SPSS Estimates of Covariance Parameters table indicated that PFM scores did not have significant variance accounted for by the ID number component as a level-3 random effect (Estimate = 2.00, p = .10).

PFM Model 4 was a revision of Model 2 (i.e., the 3-level null model). Model 4 was designed as a 3-level Random Intercepts Model for PFM. A predictor variable, R_InCard (level-1), was added as a fixed effect factor. Compared to Model 2, Model 4 had improved fit (-2LL = 31437.05). The test of fixed effects revealed significant main effects for the intercept (F = 559.87, p < .05) and for R_InCard (F = 137.76, p < .05), indicating there was variance in the intercept of PFM attributable to higher-order effects as well as to R_InCard. The SPSS Estimates of Fixed Effects table listed significant unique estimates for each parameter. The parameter estimate for the intercept of PFM (Estimate = 14.04, p < .05) indicates the mean value of PFM when all predictors are set at zero (i.e., on average, the objects seen in responses by these participants were seen by about 14% of others). The remaining parameter estimates indicated that, with R_InCard = 1 as the reference category, predicted PFM was highest on the first response within a card (Estimate = 0.00; i.e., no different from 14.04) and lower for each subsequent response within a card (second response Estimate = -7.78, p < .05; third response Estimate = -9.79, p < .05; fourth response Estimate = -10.89, p < .05). In the SPSS Estimates of Covariance Parameters table, the components indicated a small amount of the variance in PFM was attributable to between-card effects within person (card number*ID number component Estimate = 11.50, p < .05), but not to between-person effects (ID number component Estimate = 0.14, p = .90). The majority of the within-person variance in PFM remained unexplained (residual component Estimate = 175.60, p < .05).

PFM Model 5 was a 3-level Random Intercepts Model with repeated measures. The specification of card number as a level-2 random effects grouping variable was removed, and R_InCard (level-1) within card number (level-2) was specified as a repeated measure with a scaled-identity matrix covariance structure. This allowed for modeling the possible correlation of residual errors due to R_InCard within card number being a repeated measure within subject (ID number, at level-3). Compared to Model 4, the Model 5 fit statistic was slightly higher (-2LL = 31446.13), indicating a very small decline in model fit.
The main effects for the intercept (F = 558.65, p < .05) and for R_InCard (F = 133.28, p < .05) remained significant. The SPSS Estimates of Fixed Effects table retained the same pattern as in Model 4. In the SPSS Estimates of Covariance Parameters table, the components indicated no significant variance in PFM attributable to between-person effects (ID number component Estimate = 0.91, p = .37), but there was significant variance accounted for by the repeated measures (repeated measures Estimate = 186.20, p < .05).

PFM Model 6 was a revised 3-level Random Intercepts Model with repeated measures, in which card number (level-2) was added as a fixed effect factor. The model fit was clearly improved compared to Models 4 and 5 (-2LL = 30902.41 vs. ~31440). The main effects for the intercept (F = 603.81, p < .05) and for R_InCard (F = 142.19, p < .05) remained significant, and card number entered the model as a main effect (F = 65.12, p < .05). Within the SPSS Estimates of Fixed Effects table, the intercept was slightly higher at 14.91, though the parameter estimates for R_InCard displayed the same pattern as in Models 4 and 5, with predicted PFM being lower for each subsequent response within a card. The parameter estimates for card number indicated that, with card number = 1 as the reference category, predicted PFM also differs for every card. In order, predicted PFM was highest for Card 5 (Estimate = 7.26, p < .05, such that responses to this card contained objects seen by about 22% of other people [7.26 + 14.91 = 22.17]), followed by Card 3 (Estimate = 7.10, p < .05), Card 8 (Estimate = 2.51, p < .05), Card 1 (Estimate = 0), Card 4 (Estimate = -1.76, p = .05), Card 2 (Estimate = -2.27, p < .05), Card 7 (Estimate = -2.40, p < .05), Card 10 (Estimate = -5.48, p < .05), Card 6 (Estimate = -5.60, p < .05), and Card 9 (Estimate = -8.46, p < .05). In the SPSS Estimates of Covariance Parameters table, the repeated measures variance component remained (Estimate = 160.73, p < .05), and the between-person variance component became statistically significant (Estimate = 2.30, p < .05).

PFM Model 7 was another 3-level Random Intercepts Model with repeated measures, with a factor-factor cross-level interaction term (R_InCard*card number) added to the list of fixed effects. The interaction term was used to model the effect of card number (level-2) on R_InCard (level-1) in predicting PFM. Model fit was again clearly improved (-2LL = 30649.22). The main effects for the intercept (F = 552.34, p < .05), R_InCard (F = 155.92, p < .05), and card number (F = 21.03, p < .05) remained significant, and the R_InCard*card number interaction term entered the model as a fixed effect (F = 9.69, p < .05). Within the SPSS Estimates of Fixed Effects table, the parameter estimates for R_InCard displayed the same pattern as in Models 4, 5, and 6, with predicted PFM being lower for each subsequent response within a card. Although card number also remained as a main effect, the parameter estimates demonstrated a very slightly altered pattern compared to Model 6: When placed in order of descending estimates, the factor levels were Cards 5, 3, 8, 1, 4, 7, 2, 6, 10, then 9. The 40 interaction effect parameter estimates can be interpreted as adjustments to the main effects based on the exact combination of factor levels. As with the model predicting FA, 13 of these estimates were set to zero because they were redundant. Eleven others did not differ significantly from zero.
Relative to the marginal means set by card number and R_InCard, the interaction coefficients increased for the 2nd response to Cards 6, 9, and 10, and for the 3rd and 4th responses to Cards 2, 6, 9, and 10; they decreased for the 2nd and 3rd responses to Cards 3 and 5 and for the 4th response to Card 5. This pattern is broader than, and somewhat different from, that observed for FA, with PF values declining more rapidly than expected across responses to the two cards with the highest PF means (i.e., 5 and 3) and less rapidly than expected across responses to the three cards with the lowest PF means (9, 10, and 6). These trends can be seen in the means in Table 3. In the SPSS Estimates of Covariance Parameters table, the between-person effects remained (Estimate = 2.41, p < .05), as did the repeated measures variance (Estimate = 150.43, p < .05).

In PFM Model 8, the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification. The lowered model fit statistic (-2LL = 27815.62) indicated a notable improvement in the model from allowing the PFM variances to differ by R_InCard. The main effects for the intercept (F = 1114.10, p < .05), R_InCard (F = 155.65, p < .05), and card number (F = 79.62, p < .05), and the R_InCard*card number interaction term (F = 23.27, p < .05), remained significant, but with increased F values. Within the SPSS Estimates of Fixed Effects table, the main effect of R_InCard retained the same pattern of estimates as in Models 4-7. The card number parameter estimates also retained the same pattern as in Model 7: When placed in order of descending estimates, the factor levels were Cards 5, 3, 8, 1, 4, 7, 2, 6, 10, then 9. Examination of the SPSS Estimates of Covariance Parameters table revealed that the between-person effects returned to a non-significant level (Estimate = 0.07, p = .48) and that the diagonal matrix specification was appropriate, as the covariance parameter estimates for all R_InCard*card number combinations were statistically significant (p < .05), supporting the assumption of no residual covariance between measurement occasions.

PFM Model 9 was used to explore whether Diagnostic Severity contributed to the model as a fixed effect covariate (level-3). The model would still be classified as a 3-level Random Intercepts Model with repeated measures. The model fit statistic was essentially unchanged (-2LL = 27814.09). The fixed effects remained significant for the intercept (F = 492.38, p < .05), R_InCard (F = 155.82, p < .05), card number (F = 79.74, p < .05), and the R_InCard*card number interaction term (F = 23.28, p < .05). However, Diagnostic Severity did not enter the model as a main effect (F = 1.60, p = .21).

PFM Model 10 was another 3-level Random Intercepts Model with repeated measures, but with added fixed effects specifications for two factor-covariate cross-level interaction terms: R_InCard*Diagnostic Severity (level-1*level-3) and card number*Diagnostic Severity (level-2*level-3). The model fit statistic indicated a slight improvement in model fit (-2LL = 27794.10). Fixed effects remained significant for the intercept (F = 167.75, p < .05), R_InCard (F = 59.80, p < .05), card number (F = 10.27, p < .05), and the R_InCard*card number interaction term (F = 23.31, p < .05). As in Model 9, Diagnostic Severity did not enter the model as a main effect (F = 0.46, p = .50).
However, one of the two new cross-level interactions did enter the model as a small but statistically significant fixed effect: R_InCard*Diagnostic Severity (F = 3.57, p < .05). The interaction term was used to model the effect of R_InCard (level-1) on Diagnostic Severity in predicting PFM. More specifically, the interaction term was used to explore the possibility that within each unique factor level of R_InCard, Diagnostic Severity might have a different linear effect on PFM (i.e., a change in slope). Although the overall factor-covariate interaction term was statistically significant, none of the individual factor level parameter estimates for the interaction was significant; for each level of R_InCard, the R_InCard*Diagnostic Severity parameter estimate did not differ significantly from 0. The card number*Diagnostic Severity interaction did not enter the model (F = 1.23, p = .28).

PFM Model 11 was a simplification of Model 10, in which the significant fixed effects were retained and the non-significant effects were deleted from the model specification. This model is identical to Model 8 except for one additional fixed effect specification: the R_InCard*Diagnostic Severity interaction term. The model fit statistic (-2LL = 27804.91) is almost identical to that of Model 8. All specified main effects were significant (intercept F = 475.49, p < .05; R_InCard F = 65.41, p < .05; card number F = 79.70, p < .05; R_InCard*card number F = 23.34, p < .05; R_InCard*Diagnostic Severity F = 2.74, p < .05). Within the SPSS Estimates of Fixed Effects table (intercept Estimate = 15.34, p < .05), the parameter estimates for R_InCard retained the same pattern as in Model 8, with each subsequent response within a card having a lower predicted PFM (first response Estimate = 0; second response Estimate = -7.00, p < .05; third response Estimate = -10.54, p < .05; fourth response Estimate = -12.02, p < .05). The card number parameter estimates also retained the same pattern as in Model 8: In order, predicted PFM was highest for Card 5 (Estimate = 13.72, p < .05), followed by Card 3 (Estimate = 12.22, p < .05), Cards 8, 1, 4, and 7 (Estimate = 0), Card 2 (Estimate = -4.97, p < .05), Card 6 (Estimate = -8.84, p < .05), Card 10 (Estimate = -10.35, p < .05), and Card 9 (Estimate = -13.62, p < .05). The 40 R_InCard*card number interaction effect parameter estimates can be interpreted as adjustments to the main effects based on the exact combination of factor levels. Of the four R_InCard*Diagnostic Severity interaction effect parameter estimates, only one was significant (R_InCard = 2*Diagnostic Severity Estimate = -0.31, p < .05). It indicates that when the response was the second response within a card, each unit increase in Diagnostic Severity reduced predicted PFM by 0.31 units. Examination of the SPSS Estimates of Covariance Parameters table revealed that the between-person effects remained at a non-significant level (Estimate = 0.07, p = .46) and that the diagonal matrix specification was still appropriate, because the covariance parameter estimates for all R_InCard*card number combinations were statistically significant (p < .05), supporting the assumption of no residual covariance between measurement occasions.
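A sketch of Model 11's distinctive specification: the R_InCard*DxSeverity interaction is retained on FIXED while the DxSeverity main effect is omitted, which is why that term carries 4 numerator degrees of freedom in Table 5 (one severity slope per R_InCard level). The names are hypothetical and the syntax schematic, as before.

    * PFM Model 11 sketch: interaction retained without the covariate
      main effect, so each R_InCard level gets its own severity slope.
    MIXED PFM BY R_InCard CardNum WITH DxSeverity
      /FIXED=R_InCard CardNum R_InCard*CardNum
             R_InCard*DxSeverity | SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID) COVTYPE(VC)
      /REPEATED=R_InCard | SUBJECT(ID*CardNum) COVTYPE(DIAG)
      /PRINT=SOLUTION TESTCOV.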
Table 5

Statistical Summary of PFM HLM Models for the Criterion Database

                                                 Type III Tests of Fixed Effects
Model / Effect                    -2LL        Num df   Denom df         F        p
Model 1                         31367.94
  Intercept                                      1        9.99       30.54    < .01
Model 2                         31825.05
  Intercept                                      1      143.79     1204.44    < .01
Model 3                         31826.00
  Intercept                                      1      143.42     1197.99    < .01
Model 4                         31437.05
  Intercept                                      1      282.44      559.87    < .01
  R_InCard                                       3     2876.11      137.76    < .01
Model 5                         31446.13
  Intercept                                      1      274.19      558.65    < .01
  R_InCard                                       3     3836.06      133.28    < .01
Model 6                         30902.41
  Intercept                                      1      275.38      603.81    < .01
  R_InCard                                       3     3841.87      142.19    < .01
  Card Number                                    9     3778.10       65.12    < .01
Model 7                         30649.22
  Intercept                                      1      297.55      552.34    < .01
  R_InCard                                       3     3842.13      155.92    < .01
  Card Number                                    9     3847.23       21.03    < .01
  R_InCard * Card Number                        27     3794.09        9.69    < .01
Model 8                         27815.62
  Intercept                                      1      277.25     1114.10    < .01
  R_InCard                                       3      330.75      155.65    < .01
  Card Number                                    9      160.35       79.62    < .01
  R_InCard * Card Number                        27      254.01       23.27    < .01
Model 9                         27814.09
  Intercept                                      1      275.69      492.38    < .01
  R_InCard                                       3      330.74      155.82    < .01
  Card Number                                    9      160.29       79.74    < .01
  R_InCard * Card Number                        27      254.34       23.28    < .01
  Dx Severity                                    1      136.98        1.60      .21
Model 10                        27794.10
  Intercept                                      1      841.43      167.75    < .01
  R_InCard                                       3      710.81       59.80    < .01
  Card Number                                    9      386.11       10.27    < .01
  R_InCard * Card Number                        27      252.60       23.31    < .01
  Dx Severity                                    1      769.61        0.46      .50
  R_InCard * Dx Severity                         3      422.54        3.57      .01
  Card Number * Dx Severity                      9      314.89        1.23      .28
Model 11                        27804.91
  Intercept                                      1      250.41      475.49    < .01
  R_InCard                                       3      524.76       65.41    < .01
  Card Number                                    9      160.47       79.70    < .01
  R_InCard * Card Number                        27      254.49       23.34    < .01
  R_InCard * Dx Severity                         4      162.93        2.74      .03

Note. The identifier "Dx Severity" refers to Diagnostic Severity.

HLM models for PFN1.5. PFN1.5 Model 1 was the 2-level null model. The intercept of the PFN1.5 scores (level-1) was specified as a random function of card number (the level-2 grouping variable). The only fixed effect specified was the level-1 intercept. The model fit was -2LL = 17714.07 (see Table 6 for a statistical summary of all PFN1.5 HLM models), and the SPSS Type III Test of Fixed Effects table indicated a significant card number effect on PFN1.5 scores (F = 124.09, p < .05), signaling that multilevel modeling was an appropriate way to explore the structure of the PFN1.5 data. The SPSS Estimates of Covariance Parameters table also indicated that the clustering of PFN1.5 scores by card number (as a level-2 random effect) accounted for a significant portion of the total variance (Estimate = 0.44, p < .05). The residual component signaled that a significant amount of PFN1.5 score variance was not accounted for by the model (Estimate = 5.47, p < .05). Thus, there was evidence of unexplained within-card variation in PFN1.5 scores.

PFN1.5 Model 2 was the initial 3-level null model. As in the 2-level null model, there were no predictors at any level; the only specified fixed effect was the level-1 intercept. The intercept of PFN1.5 (level-1) was modeled with an accounting of the card number effect (level-2) and the possible grouping effect by person (ID number at level-3) as random effects. The model failed to converge and produced a warning that the final Hessian matrix was not positive definite even though all the convergence criteria were satisfied. This warning means that the best estimate for the variance of the random effect(s) is zero. A common cause of the Hessian matrix warning is a model specification that involves redundant covariance parameters, and a typical recommendation is to try a simpler covariance structure specification. Failure to specify a "Subject" variable on the "Random" subcommand line can also produce redundant covariance parameters, though that was not the cause of the problem in this model.
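A sketch of the simplification adopted in Model 3 below: the separate person-level random intercept that made the two RANDOM statements partially redundant is dropped, leaving only the card-within-person intercept. The names are hypothetical as before, with PFN1_5 standing in for PFN1.5.

    * PFN1.5 Model 3 sketch: one random intercept for card within
      person, with no separate person-level random intercept.
    MIXED PFN1_5
      /FIXED=| SSTYPE(3)
      /METHOD=REML
      /RANDOM=INTERCEPT | SUBJECT(ID*CardNum) COVTYPE(VC)
      /PRINT=SOLUTION TESTCOV.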
PFN1.5 Model 3 was a reattempted 3-level null model, but with a simplified covariance structure specification. As in Model 2, there were no predictors at any level; the only specified fixed effect was the level-1 intercept. The intercept of PFN1.5 (level-1) was modeled with an accounting of the card number effect (level-2) within person (ID number at level-3), but with no separate between-person effect specified as an individual random effect. The model fit statistic was higher (-2LL = 17968.39) than in Model 1, indicating a worse fit than the 2-level null model. The test of fixed effects remained significant (F = 3602.85, p < .05), as expected, indicating variance in the intercept attributable to higher-order effects. In the SPSS Estimates of Covariance Parameters table, the residual component (Estimate = 5.77, p < .05) indicated unexplained within-person variance in PFN1.5. The card number*ID number component (Estimate = 0.12, p = .25) was non-significant, indicating that the clustering of PFN1.5 scores by card number within person did not account for a significant portion of the total variance in PFN1.5 scores.

PFN1.5 Model 4 was a second 2-level null model. In contrast to Model 1, the intercept of the PFN1.5 scores (level-1) was specified as a random function of ID number (the level-3 grouping variable) instead of card number (level-2). Once again, the only fixed effect specified was the level-1 intercept. The model fit was -2LL = 17949.48, and the SPSS Type III Test of Fixed Effects table indicated a significant ID number effect on PFN1.5 scores (F = 2381.55, p < .05), still signaling that multilevel modeling was an appropriate way to explore the structure of the PFN1.5 data within person. The SPSS Estimates of Covariance Parameters table also indicated that the clustering of PFN1.5 scores by ID number (as a level-3 random effect) accounted for a significant portion of the total variance (Estimate = 0.14, p < .05). The residual component signaled that a significant amount of PFN1.5 score variance was not accounted for by person-level effects (Estimate = 5.75, p < .05). Thus, there was evidence of unexplained within-person variation in PFN1.5 scores.

PFN1.5 Model 5 was a revision of Models 2 and 3 (i.e., the 3-level null model). Model 5 was designed as a 3-level Random Intercepts Model for PFN1.5. A predictor variable, R_InCard (level-1), was added as a fixed effect factor. Compared to Models 3 and 4, Model 5 had clearly improved fit (-2LL = 17629.52 vs. 17968.39 and 17949.48). The test of fixed effects revealed significant main effects for the intercept (F = 1389.12, p < .05) and for R_InCard (F = 112.37, p < .05), indicating there was variance in the intercept of PFN1.5 attributable to higher-order effects as well as to R_InCard. The Estimates of Fixed Effects table listed significant unique estimates for each parameter. The parameter estimate for the intercept of PFN1.5 (Estimate = 3.15, p < .05) indicated the mean value of PFN1.5 when all predictors are set at zero. As such, the average response contains objects that are reported by 1.5% of the people in about 3 of the 6 international PF samples.
The remaining parameter estimates indicated that, when using R_InCard = 1 as the reference category, predicted PFN1.5 was highest on the first response within a card (Estimate = 0.00), and lower for each subsequent response within a card (second response Estimate = -1.04, p < .05; third response Estimate = -1.57, p < .05; fourth response Estimate = -1.93, p < .05). Thus, by the fourth response to a card, the average response contains objects that are reported by at least 1.5% of the people in about 1 of the 6 international PF samples (3.15 - 1.93 = 1.22). In the SPSS Estimates of Covariance Parameters table, the variance components indicated a small amount of the variance in PFN1.5 was attributable to between-person effects (ID number component Estimate = 0.09, p < .05), but not between-card effects within person (card number*ID number component Estimate = 0.12, p = .23). The majority of the within-person variance in PFN1.5 remained unexplained (residual component Estimate = 5.21, p < .05).

PFN1.5 Model 6 was a 3-level Random Intercepts Model with repeated measures. The specification of card number within person as a level-2 random effects grouping variable was removed, and R_InCard (level-1) within card number (level-2) was specified as a repeated measure with a scaled-identity matrix covariance structure. This allowed for modeling the possible correlation of residual errors due to R_InCard within card number being a repeated measure within subject (ID number, at level-3). As compared to Model 5, the Model 6 fit statistic was slightly higher (-2LL = 17631.04), indicating a very small decline in model fit. The main effects for the intercept (F = 1388.75, p < .05) and for R_InCard (F = 110.92, p < .05) remained significant. The Estimates of Fixed Effects table retained the same pattern as in Model 5. In the SPSS Estimates of Covariance Parameters table, the components indicated variance in PFN1.5 that was attributable to person-level effects (ID number component Estimate = 0.10, p < .05), as well as significant variance accounted for by the repeated measures (repeated measures Estimate = 5.32, p < .05).

PFN1.5 Model 7 was a revised 3-level Random Intercepts Model with repeated measures, in which card number (level-2) was added as an additional fixed effect factor. The model fit was notably improved as compared to Models 5 and 6 (-2LL = 17303.37 vs. ~17630). The main effects for the intercept (F = 1417.42, p < .05) and for R_InCard (F = 118.77, p < .05) remained significant, and card number entered the model as a main effect (F = 38.03, p < .05). Within the SPSS Estimates of Fixed Effects table, the intercept was 3.86 and the parameter estimates for R_InCard displayed the same pattern as in Models 5 and 6, with predicted PFN1.5 being lower for each subsequent response within a card. The parameter estimates for card number were computed using card number = 1 as the reference category. In order, estimates were highest for Cards 5, 1, and 3 (all three Estimates = 0.00 or p > .05), followed by Card 7 (Estimate = -0.34, p < .05), Card 2 (Estimate = -0.54, p < .05), Card 10 (Estimate = -0.57, p < .05), Card 4 (Estimate = -0.90, p < .05), Card 8 (Estimate = -1.02, p < .05), Card 6 (Estimate = -1.64, p < .05), and Card 9 (Estimate = -2.02, p < .05). In the SPSS Estimates of Covariance Parameters table, the repeated measures variance component remained significant (Estimate = 4.88, p < .05), as did the between-person variance component (Estimate = 0.11, p < .05).
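For readers who want to reproduce the general shape of this modeling sequence outside of SPSS, the sketch below fits rough analogues of the 2-level null model and the random intercepts model with R_InCard as a fixed factor, using Python's statsmodels. The file and column names (criterion_database.csv, pfn15, card, r_in_card, id) are hypothetical placeholders, and statsmodels does not reproduce the SPSS MIXED repeated measures covariance structures used here, so this is a simplified illustration rather than the analysis itself.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per scorable response, with
# columns pfn15 (response-level PFN1.5), card (1-10), r_in_card (1-4),
# and id (the person). The actual Criterion Database is not reproduced.
df = pd.read_csv("criterion_database.csv")

# Analogue of PFN1.5 Model 1: a 2-level null model with a random
# intercept for card number and no fixed predictors beyond the intercept.
null_fit = smf.mixedlm("pfn15 ~ 1", df, groups=df["card"]).fit(reml=False)
print(null_fit.summary())  # variance components parallel the SPSS
                           # Estimates of Covariance Parameters table

# Analogue of Model 5: response position within card as a categorical
# fixed factor, with a random intercept for person (the card-within-person
# level is omitted here for simplicity).
ri_fit = smf.mixedlm("pfn15 ~ C(r_in_card)", df,
                     groups=df["id"]).fit(reml=False)
print(ri_fit.summary())  # the C(r_in_card)[T.2..4] estimates should be
                         # negative, mirroring the pattern reported above
```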
PFN1.5 Model 8 was another 3-level Random Intercepts Model with repeated measures, and a factor-factor cross-level interaction term (R_InCard*card number) was added to the list of fixed effects. The interaction term was used to model the effect of card number (level-2) on R_InCard (level-1) in predicting PFN1.5. Model fit was again clearly improved (-2LL = 17212.51). The main effects for the intercept (F = 1327.89, p < .05), R_InCard (F = 122.32, p < .05), and card number (F = 13.86, p < .05) remained significant, and the R_InCard*card number interaction term entered the model as a fixed effect (F = 3.41, p < .05). Within the SPSS Estimates of Fixed Effects table, the parameter estimates for R_InCard displayed the same pattern as in Models 5, 6, and 7, with predicted PFN1.5 being lower for each subsequent response within a card. Although card number also remained as a main effect, the parameter estimates demonstrated a very slightly altered pattern as compared to Model 7: When placed in order of descending estimates, the factor levels were Card 5, 1, 7, 3, 4, 2, 10, 8, 6, then 9. The 40 interaction effect parameter estimates can be interpreted as ways to adjust the main effects based on the exact combination of factor levels. As with the previous dependent variables, 13 of these parameters were set to zero; an additional 17 were not significantly different from zero. For the remaining 10, relative to the marginal means set by card number and R_InCard, the interaction coefficients increased for the 2nd response to Card 9 and for the 3rd and 4th responses to Cards 2, 6, 9, and 10; they decreased for the 2nd response to Card 5. This pattern is somewhat similar to that observed for PFM: the number of data sets having the response object present for at least 1.5% of the respondents declined more rapidly than expected from the 1st to 2nd response on the card with the highest dataset PFN1.5 mean (Card 5), and less rapidly than expected for the 2nd, 3rd, and 4th responses to the card with the lowest PFN1.5 mean (Card 9) and for the 3rd and 4th responses to Cards 10, 2, and 6. These trends can be seen in the means in Table 3. In the SPSS Estimates of Covariance Parameters table, the between-person effects remained significant (Estimate = 0.11, p < .05) as well as the repeated measures variance (Estimate = 4.76, p < .05).

In PFN1.5 Model 9, the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification. The lowered model fit statistic (-2LL = 17048.34 vs. 17212.51) indicated a clear improvement in the model from allowing variances to differ by R_InCard. The main effects for the intercept (F = 1622.04, p < .05), R_InCard (F = 136.99, p < .05), and card number (F = 23.43, p < .05), as well as the R_InCard*card number interaction term (F = 4.68, p < .05), remained significant. Within the SPSS Estimates of Fixed Effects table (Intercept Estimate = 4.12, p < .05), the main effect of R_InCard retained the same pattern of estimates as in Models 5-8 (first response Estimate = 0.00; second response Estimate = -1.21, p < .05; third response Estimate = -2.25, p < .05; fourth response Estimate = -2.92, p < .05).
The card number parameter estimates also retained the same pattern as compared to Model 8: In order, predicted PFN1.5 was highest for Card 5 (Estimate = 0.56, p < .05), followed by Cards 1 and 7 (both Estimates = 0 or p > .05), Card 3 (Estimate = -0.58, p < .05), Card 4 (Estimate = -1.02, p < .05), Card 2 (Estimate = -1.07, p < .05), Card 10 (Estimate = -1.13, p < .05), Card 8 (Estimate = -1.27, p < .05), Card 6 (Estimate = -2.19, p < .05), and Card 9 (Estimate = -2.94, p < .05). Examination of the SPSS Estimates of Covariance Parameters table revealed that the between-person effects remained (Estimate = 0.10, p < .05) and that the diagonal matrix specification was appropriate: the covariance parameter estimates for all R_InCard*card number combinations were statistically significant (p < .05), supporting the decision to allow residual variances to differ across measurement occasions (with covariances between occasions fixed at zero under the diagonal specification).

PFN1.5 Model 10 was used to explore whether Diagnostic Severity contributes to the model as a fixed effect covariate (level-3). The model would still be classified as a 3-level Random Intercepts Model with Repeated Measures. The model fit statistic was essentially unchanged (-2LL = 17047.30). The fixed effects for the intercept (F = 206.64, p < .05), R_InCard (F = 137.21, p < .05), card number (F = 23.42, p < .05), and the R_InCard*card number interaction term (F = 4.68, p < .05) remained significant. However, Diagnostic Severity did not enter the model as a main effect (F = 1.05, p = .31).

PFN1.5 Model 11 was another 3-level Random Intercepts Model with repeated measures, but with added fixed effects specifications for two factor-covariate cross-level interaction terms: R_InCard*Diagnostic Severity (level-1*level-3) and card number*Diagnostic Severity (level-2*level-3). The model fit statistic was again essentially unchanged (-2LL = 17041.27). The fixed effects for the intercept (F = 166.59, p < .05), R_InCard (F = 12.43, p < .05), card number (F = 2.58, p < .05), and the R_InCard*card number interaction term (F = 4.69, p < .05) once again remained significant. However, Diagnostic Severity still did not enter the model as a main effect (F = 0.64, p = .42), and neither the newly specified R_InCard*Diagnostic Severity (F = 0.42, p = .74) nor the card number*Diagnostic Severity (F = 0.53, p = .86) interaction term was significant.
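Because these models are nested and were estimated with maximum likelihood, the -2LL drops reported above can be read informally as likelihood-ratio tests. A minimal sketch of that arithmetic for the Model 8 to Model 9 comparison follows; the degrees-of-freedom value of 3 is an assumption (the diagonal structure adds one residual variance per extra R_InCard position relative to the scaled-identity structure), and the chi-square reference is only approximate when the parameters being tested are variances.

```python
from scipy import stats

# Approximate likelihood-ratio test on the -2LL drop from PFN1.5 Model 8
# (scaled identity, -2LL = 17212.51) to Model 9 (diagonal, -2LL = 17048.34).
chi2 = 17212.51 - 17048.34   # difference in -2 log likelihood
df_diff = 3                  # assumed number of added variance parameters
p = stats.chi2.sf(chi2, df_diff)
print(f"LR chi-square = {chi2:.2f}, df = {df_diff}, p = {p:.3g}")
```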
Table 6

Statistical Summary of PFN1.5 HLM Models for the Criterion Database

Model 1, -2LL = 17714.07
  Intercept: Num df = 1, Denom df = 9.98, F = 124.09, p < .01
Model 2, -2LL = -- (model failed to converge)
  Intercept: --
Model 3, -2LL = 17968.39
  Intercept: Num df = 1, Denom df = 1397.76, F = 3602.85, p < .01
Model 4, -2LL = 17949.48
  Intercept: Num df = 1, Denom df = 153.56, F = 2381.55, p < .01
Model 5, -2LL = 17629.52
  Intercept: Num df = 1, Denom df = 282.80, F = 1389.12, p < .01
  R_InCard: Num df = 3, Denom df = 3050.93, F = 112.37, p < .01
Model 6, -2LL = 17631.04
  Intercept: Num df = 1, Denom df = 280.48, F = 1388.75, p < .01
  R_InCard: Num df = 3, Denom df = 3845.49, F = 110.92, p < .01
Model 7, -2LL = 17303.37
  Intercept: Num df = 1, Denom df = 281.06, F = 1417.42, p < .01
  R_InCard: Num df = 3, Denom df = 3843.50, F = 118.77, p < .01
  Card Number: Num df = 9, Denom df = 3778.51, F = 38.03, p < .01
Model 8, -2LL = 17212.51
  Intercept: Num df = 1, Denom df = 305.37, F = 1327.89, p < .01
  R_InCard: Num df = 3, Denom df = 3845.45, F = 122.32, p < .01
  Card Number: Num df = 9, Denom df = 3839.19, F = 13.86, p < .01
  R_InCard * Card Number: Num df = 27, Denom df = 3793.25, F = 3.41, p < .01
Model 9, -2LL = 17048.34
  Intercept: Num df = 1, Denom df = 200.49, F = 1622.04, p < .01
  R_InCard: Num df = 3, Denom df = 855.09, F = 136.99, p < .01
  Card Number: Num df = 9, Denom df = 123.81, F = 23.43, p < .01
  R_InCard * Card Number: Num df = 27, Denom df = 168.16, F = 4.68, p < .01
Model 10, -2LL = 17047.30
  Intercept: Num df = 1, Denom df = 153.14, F = 206.64, p < .01
  R_InCard: Num df = 3, Denom df = 854.94, F = 137.21, p < .01
  Card Number: Num df = 9, Denom df = 123.79, F = 23.42, p < .01
  R_InCard * Card Number: Num df = 27, Denom df = 168.59, F = 4.68, p < .01
  Dx Severity: Num df = 1, Denom df = 148.88, F = 1.05, p = .31
Model 11, -2LL = 17041.27
  Intercept: Num df = 1, Denom df = 192.50, F = 166.59, p < .01
  R_InCard: Num df = 3, Denom df = 1210.30, F = 12.43, p < .01
  Card Number: Num df = 9, Denom df = 564.19, F = 2.58, p = .01
  R_InCard * Card Number: Num df = 27, Denom df = 167.32, F = 4.69, p < .01
  Dx Severity: Num df = 1, Denom df = 194.41, F = 0.64, p = .42
  R_InCard * Dx Severity: Num df = 3, Denom df = 1183.67, F = 0.42, p = .74
  Card Number * Dx Severity: Num df = 9, Denom df = 664.04, F = 0.53, p = .86

Note. The identifier "Dx Severity" refers to Diagnostic Severity. Values under each model are the SPSS Type III Tests of Fixed Effects.

Supplemental Analysis Strategies

The majority of the supplemental analyses were completed using data that were aggregated at the protocol level, accomplished by calculating the mean of each response-level variable within each protocol. The previous descriptive statistics and the HLM analyses were completed using data at the response level. The protocol-level aggregation leads to descriptive statistics and analyses in which each person's scores are equally represented in the results; in the response-level analyses, a person with a greater number of responses would have more data contributing to the descriptive statistics and models than a person with fewer responses. Therefore, some minor changes are apparent in the descriptive statistics as compared to the previously reported results. Consistent with the descriptive statistics reported earlier, the Criterion Database contained 159 valid Rorschach protocols with accompanying Diagnostic Severity scores (M = 3.53, SD = 1.07) available. Descriptive statistics for the new protocol-level variables are reported in Table 7. As compared to response-level FA (M = 3.32, SD = 1.00), protocol-level FA has a similar mean but much less dispersion of scores (M = 3.34, SD = 0.26). This signifies more score fluctuation in FA between responses than between protocols. PFM (response-level M = 8.80, SD = 14.37; protocol-level M = 9.10, SD = 3.41) and PFN1.5 (response-level M = 2.37, SD = 2.43; protocol-level M = 2.43, SD = 0.63) have similar patterns.
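A minimal sketch of the protocol-level aggregation just described, again using hypothetical column names (fa, pfm, pfn15, id) for a response-level data frame:

```python
import pandas as pd

# Hypothetical response-level frame: one row per scorable response.
responses = pd.read_csv("criterion_database.csv")

# Protocol-level aggregation: the mean of each response-level variable
# within each protocol, so every person contributes exactly one row
# regardless of how many responses he or she gave.
protocol = (responses
            .groupby("id")[["fa", "pfm", "pfn15"]]
            .mean()
            .reset_index())
print(protocol[["fa", "pfm", "pfn15"]].describe())
```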
Table 7

Protocol-Level Descriptive Statistics for the Criterion Database

                  FA            PFM             PFN1.5
                  M     SD      M      SD       M     SD      N
Total             3.34  0.26     9.10   3.41    2.43  0.63    159
Card 1            3.59  0.47    10.00   7.48    3.21  1.65    159
Card 2            3.40  0.62     8.02   6.17    2.60  1.51    158
Card 3            3.36  0.71    18.73  14.53    2.95  1.46    159
Card 4            3.34  0.63     8.81  10.74    2.32  1.70    159
Card 5            3.89  0.63    19.61  13.96    3.61  1.73    159
Card 6            3.18  0.69     4.63   5.48    1.56  1.53    159
Card 7            3.39  0.64     8.37   7.07    2.99  1.77    159
Card 8            3.31  0.81    13.67  13.32    2.22  1.42    159
Card 9            2.79  0.64     1.45   1.34    1.15  1.20    157
Card 10           3.27  0.65     3.65   2.52    2.42  1.40    158
R_InCard 1        3.57  0.33    14.14   5.34    3.16  0.83    159
R_InCard 2        3.24  0.36     6.33   4.30    2.13  0.87    159
R_InCard 3        3.10  0.60     4.58   6.35    1.75  1.45    150
R_InCard 4        2.96  0.65     3.25   7.66    1.17  1.33    101

Note. All variables listed in the table represent protocol-level means of response-level variables; the means and standard deviations listed are across all protocols in the Criterion Database with relevant data.

Table 7 also includes protocol-level descriptive statistics for the variables broken down by card number and by R_InCard. The protocol-level means for FA, PFM, and PFN1.5 are also displayed graphically in Figures 13-18. The protocol-level mean of FA is highest for Cards 5 and 1, and lowest on Card 9, indicating that, on average, response objects have the best perceptual fit to Cards 5 and 1, and the worst fit on Card 9. Examination of protocol-level PFM reveals that subjects, on average, delivered responses containing the most popularly-reported response objects to Cards 5, 3, 8, and 1, while delivering the least common response objects to Card 9. It is also noteworthy that Cards 5, 3, and 8 had the largest standard deviations for PFM, while Card 9 had the lowest. The pattern of means and standard deviations indicates that people, on average, gave conventional responses to Cards 5, 3, and 8, but it was on those cards that there also was the most variation between people on the conventionality of their responses, at least with regard to the response objects they used in constructing their responses. Protocol-level PFN1.5 statistics demonstrated that subjects delivered responses containing objects commonly used in the most countries to Cards 5, 1, 7, and 3, while delivering responses containing objects that were common to the fewest countries to Card 9. In examining Table 7, the PFN1.5 statistics showed less pronounced patterns than the statistics for PFM.

Figure 13. Protocol-Level FA Means by Card Number.

Figure 14. Protocol-Level PFM Means by Card Number.

Figure 15. Protocol-Level PFN1.5 Means by Card Number.

Figure 16. Protocol-Level FA Means by R_InCard.

Figure 17. Protocol-Level PFM Means by R_InCard.

Figure 18. Protocol-Level PFN1.5 Means by R_InCard.

When Figures 16-18 are examined with regard to patterns in protocol-level scores averaged across people and organized according to R_InCard, a clear pattern emerges across variables: for FA, PFM, and PFN1.5, the mean protocol-level score decreases with each subsequent response within a card. With each subsequent response within a card, the objects used in constructing the responses have worse perceptual fit, are less-commonly-used objects, and are commonly used in fewer countries. Examination of the standard deviations reported in Table 7 reveals that FA scores also seem to have more variation with each subsequent response within a card.
For PFM and PFN1.5, the pattern is similar, with R_InCard 1 and 2 having lower standard deviations than R_InCard 3 and 4. However, this effect is partially driven by the reduced number of responses available for each subsequent R_InCard (e.g., there are more first responses within a card than fourth responses within a card). With fewer responses entering each protocol-level mean, those means are estimated less precisely, which inflates the spread of scores across protocols. Thus, with each subsequent R_InCard, the smaller number of contributing responses likely contributes to the larger estimate for the standard deviation.

To further explore differences in FA, PFM, and PFN1.5 based on card number and response within card, the means and standard deviations were used to compute Cohen's d scores (a computational sketch follows Table 9 below). At the protocol level, means and standard deviations from Table 7 were used to compute the d-values that are listed below in Table 8 and displayed in Figures 19 and 20. Card 1 and R_InCard 1 were used as the reference categories, and thus they have d-values of 0. The d-values associated with Cards 2-10 and R_InCard 2-4 reflect differences relative to Card 1 and R_InCard 1. When FA is examined, relative to the 1st response to a card, on average the 2nd response was 0.96 of a SD lower, the 3rd response was about one full SD lower, and the 4th response had average FA that was about 1.2 SDs lower than the 1st response. With regard to differences in FA based on card number, Card 5 was about half a standard deviation higher in FA than the reference value set by Card 1. Cards 2, 7, and 3 were about a third of a SD lower in FA than Card 1, while Cards 8, 4, 10, and 6 were about half a SD lower in FA than Card 1. Card 9 stood out by having average FA scores that were 1.4 SDs below Card 1 and thus about two full SDs below Card 5. As can be seen in the table and figures, the same patterns are fairly consistent across FA, PFM, and PFN1.5. The same general patterns also hold true for the data when Cohen's d scores are calculated based on response-level FA, PFM, and PFN1.5 means and standard deviations. Response-level d-values are listed below in Table 9 and displayed in Figures 21 and 22.

Table 8

Protocol-Level Cohen's d Comparing Each Card to Card 1 and Each R_InCard to Response 1 for the Criterion Database

              FA d    PFM d    PFN1.5 d
Card #
  1            0.00    0.00     0.00
  2           -0.35   -0.29    -0.39
  3           -0.39    0.79    -0.17
  4           -0.45   -0.13    -0.53
  5            0.55    0.90     0.24
  6           -0.71   -0.83    -1.04
  7           -0.36   -0.22    -0.13
  8           -0.44    0.35    -0.64
  9           -1.44   -1.94    -1.45
  10          -0.57   -1.27    -0.52
R_InCard
  1            0.00    0.00     0.00
  2           -0.96   -1.62    -1.21
  3           -1.01   -1.64    -1.24
  4           -1.24   -1.68    -1.84

Note. Card 1 and R_InCard 1 are used as the reference values. All values listed are based on protocol-level means and standard deviations of response-level variables; the means and standard deviations used to calculate d-scores are across all protocols in the Criterion Database with relevant data.

Figure 19. Protocol-Level Cohen's d Comparing Cards 2-10 to Card 1 on FA, PFM, and PFN1.5.

Figure 20. Protocol-Level Cohen's d Comparing R_InCard 2-4 to R_InCard 1 on FA, PFM, and PFN1.5.

Table 9

Response-Level Cohen's d Comparing Each Card to Card 1 and Each R_InCard to Response 1 for the Criterion Database

              FA d    PFM d    PFN1.5 d
Card #
  1            0.00    0.00     0.00
  2           -0.22   -0.20    -0.21
  3           -0.22    0.45    -0.11
  4           -0.31   -0.11    -0.33
  5            0.25    0.49     0.12
  6           -0.48   -0.51    -0.67
  7           -0.24   -0.17    -0.10
  8           -0.30    0.15    -0.41
  9           -0.92   -1.15    -0.94
  10          -0.29   -0.71    -0.28
R_InCard
  1            0.00    0.00     0.00
  2           -0.34   -0.54    -0.43
  3           -0.50   -0.73    -0.69
  4           -0.66   -0.89    -0.93

Note. Card 1 and R_InCard 1 are used as the reference values. All values listed are based on response-level means and standard deviations.
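The dissertation does not spell out the pooling formula behind these d values, but a standard pooled-SD computation reproduces the tabled values within rounding, as the sketch below illustrates with the Table 7 statistics for Card 5 versus the Card 1 reference.

```python
import numpy as np

def cohens_d(m1, s1, m2, s2):
    """Cohen's d with an equal-weight pooled standard deviation."""
    pooled_sd = np.sqrt((s1 ** 2 + s2 ** 2) / 2.0)
    return (m1 - m2) / pooled_sd

# Protocol-level FA for Card 5 (M = 3.89, SD = 0.63) vs. Card 1
# (M = 3.59, SD = 0.47); Table 8 reports d = 0.55 from unrounded inputs.
print(round(cohens_d(3.89, 0.63, 3.59, 0.47), 2))  # 0.54
```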
Figure 21. Response-Level Cohen's d Comparing Cards 2-10 to Card 1 on FA, PFM, and PFN1.5.

Figure 22. Response-Level Cohen's d Comparing R_InCard 2-4 to R_InCard 1 on FA, PFM, and PFN1.5.

In the HLM analyses, the predicted variables were FA, PFM, and PFN1.5. Although Diagnostic Severity was introduced to the models for each of the predicted variables, it was not possible to decipher the simple relationship between Diagnostic Severity and each of the predicted variables, as other variables were part of the structural models as well. Therefore, 2-tailed Pearson correlation coefficients were calculated between Diagnostic Severity and each of the Rorschach variables at the protocol level (i.e., overall protocol-level FA, PFM, and PFN1.5, as well as FA, PFM, and PFN1.5 broken out by card number and by R_InCard). Increases in Diagnostic Severity were hypothesized to correspond with decreases in FA, PFM, and PFN1.5 scores.

There were very few statistically-significant correlations between the Rorschach variables and Diagnostic Severity. Using Cohen's (1988) conventions and an alpha of .05, there were small correlations between Diagnostic Severity and protocol-level FA over all responses (r = -.16, p = .04), and for responses to Card 4 (r = -.16, p = .05). There were correlation coefficients that neared significance for responses to Card 6 (r = -.13, p = .09), and for the 1st (r = -.14, p = .08) and 2nd (r = -.15, p = .06) responses to each card. The correlations were in the expected direction, with higher Diagnostic Severity scores corresponding with lower FA scores. However, when the Holm, Larzelere, and Mulaik alpha correction procedure was used (see Howell, 2010) to adjust for the number of card-specific correlations (and thus null hypotheses) being tested, the correlation between FA scores on Card 4 and Diagnostic Severity was no longer significant (corrected alpha to surpass = .002, based on 30 correlation tests). Surprisingly, there were no statistically-significant correlations between Diagnostic Severity scores and the protocol-level PFM and PFN1.5 variables, even when using the more lenient alpha of .05. However, there were moderate correlations between the Rorschach fit and frequency variables, as would be anticipated. Response-level correlations between FA and PFM (r = .51, p < .01), and FA and PFN1.5 (r = .59, p < .01), were slightly smaller than the correlation between PFM and PFN1.5 (r = .69, p < .01). All correlations referenced here were also recomputed using the nonparametric alternative, Spearman's rho. Effect sizes and significance values were very similar to the Pearson correlation coefficient values, the direction of effects remained the same, and the same effects were determined to meet the threshold for statistical significance.
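For reference, the Holm procedure applied above orders the p values and tests the i-th smallest against alpha / (m - i + 1). A minimal sketch follows, which also shows why an observed p of .04 cannot survive with m = 30 tests: the smallest p value must beat .05/30, roughly the .002 threshold reported above.

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Holm's sequentially rejective procedure: test the i-th smallest
    p value against alpha / (m - i + 1), stopping at the first failure."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] > alpha / (len(pvals) - rank):
            break  # once one test fails, all larger p values fail too
        reject[idx] = True
    return reject

# One p of .04 among 30 tests: .04 > .05/30 ~ .0017, so nothing survives.
print(holm_reject([0.04] + [0.50] * 29).any())  # False
```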
Given that there were moderate correlations between FA, PFM, and PFN1.5, but there were no correlations between the frequency variables and Diagnostic Severity and only a small correlation between protocol-level FA and Diagnostic Severity, further follow-up data exploration seemed warranted. Thus, a new approach was taken to quantifying the fit and frequency information at the protocol level.

If rank ordered and quantified according to frequency, Rorschach response objects form a Zipf distribution: Very few objects are extremely frequent (i.e., the populars), a small proportion of objects occur with high enough frequency that they are assigned a PFM and PFN1.5 value in the lookup tables (i.e., occur in at least 1.5% of protocols in at least 1 sample), and the remaining objects are relatively unique and occur infrequently, thus creating a long tail in the distribution of objects by frequency. It was believed that the right-hand long tail of the object distribution might hold a lot of information that was not being well-represented in the previous analyses because of the way the Percept Frequency variables were tabulated, with all objects that had frequencies of less than 1.5% being weighted equally.

First, a new response-level variable was calculated: Form Inaccuracy (FI) was computed by subtracting FA scores (range of 1-5) from 6.0, resulting in an inverse of FA that still has a range of 1-5. Thus, higher FI scores are indicative of greater levels of inaccuracy in perception. As expected, FA and FI had a perfect correlation of -1.0. The Criterion Database was then sorted and filtered according to the response-level PFM scores. If a response had a PFM score of 0, the response was included in the following protocol-level computations: the mean of the response-level FA scores per protocol, and the sum of the response-level FI scores per protocol. The mean FA score, calculated only from responses with a PFM score of 0, was conceptualized as a way to represent the accuracy of each test-taker's perceptions on responses that contained only infrequent objects; it is the average of a person's FA scores from the right tail of the Zipf distribution. The sum of FI scores, again calculated only from responses with a PFM score of 0, was considered a way to quantify inaccurate fit while also inherently accounting for how often the person gave responses that contained only objects that are infrequent and reside in the tail of the Zipf distribution. In the end, neither of the new scores had an association with Diagnostic Severity (mean FA r = -.02, p = .79; sum FI r = .01, p = .87).
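A minimal sketch of the FI computation and the tail-only protocol summaries just described, using hypothetical column names for a response-level data frame:

```python
import pandas as pd

# Hypothetical response-level frame with columns fa (1-5), pfm, and id.
responses = pd.read_csv("criterion_database.csv")

# Form Inaccuracy: invert FA so higher scores mean worse perceptual fit,
# preserving the 1-5 range (FI correlates -1.0 with FA by construction).
responses["fi"] = 6.0 - responses["fa"]

# Keep only responses from the long tail of the object distribution
# (PFM = 0), then summarize per protocol: mean FA and summed FI.
tail = responses[responses["pfm"] == 0]
tail_summary = tail.groupby("id").agg(
    mean_tail_fa=("fa", "mean"),
    sum_tail_fi=("fi", "sum"),
)
print(tail_summary.head())
```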
Chapter Five

Discussion

FQ scoring systems and related indices have been constructed, evaluated, and revised over time and across Rorschach systems. Hermann Rorschach devised FQ as a way to describe whether the objects used in Rorschach responses were appropriate for the contours of the inkblot used in constructing the response (Exner, 2003). It was Rorschach's belief, which is shared by many others who use and research the Rorschach, that the manner in which form was used in constructing a response delivered information about the person's perceptual accuracy or "reality testing" ability (Exner, 2003). Though the validity of the Rorschach has been the topic of intense debate since its development, even the toughest current critics of the Rorschach attest to the validity of FQ (e.g., Dawes, 1999; Wood, Nezworski, & Garb, 2003; Wood, Garb, Nezworski, Lilienfeld, & Duke, 2015). Adding to their appeal, these scores also serve as an example of variables with a clear relationship to the construct they are intended to assess (McGrath, 2008).

The existing peer-reviewed literature has clearly and consistently demonstrated that the Rorschach can be used to accurately identify psychosis in test-takers by employing FQ scores and indices, as well as indices that are partially comprised of FQ information (e.g., Mihura et al., 2013), and this is attributed to FQ functioning as a gauge of the accuracy of the test-takers' perceptions.

Within the CS, the three primary FQ designations are ordinary (o), unusual (u), and minus (-). CS FQ comprises response goodness-of-fit with the inkblot, frequency of the percept, selection of words used to describe the percept, and use of arbitrary lines in forming the percept within the inkblot. However, goodness-of-fit is the concept most emphasized in Exner's (2003) descriptions of FQ. Partly due to the disparities between CS definitions of FQ and the actual elements that contributed to the FQ designations listed in the tables, researchers have revisited the topic of how best to capture the FQ construct of interest: a person's perceptual accuracy or "reality testing" ability (e.g., Meyer et al., 2011). Over the past few years, the argument has been made that factors like the frequency of perceptions on the Rorschach do in fact relate to the accuracy of a person's perceptions, when the concept of perceptual accuracy is considered from an ecological position. Should a person's objective misperception of a stimulus (i.e., low fit) be considered a misperception if it is a highly common misperception (i.e., high frequency)?

Prior to beginning work on R-PAS as a formal system, Meyer and Viglione (2008) conceptualized and developed the FA scoring category, which is a dimensional indicator of the accuracy of perceptual fit between a response object and the features of the inkblot. Meyer et al. (2011) followed the development of FA with the initial development of PF indices, which indicate the frequency with which the various response objects are used on the Rorschach. In developing the R-PAS FQ tables, Meyer et al. (2011) wanted to retain the essence of FQ as a measure of accuracy of perception that can be used to identify distorted perceptual processes of the test-taker. Following their existing line of research, they included two distinct elements in their operational definition of FQ: fit between the perceived object and the features of the inkblot, and the frequency with which the reported object occurs in the location used by the respondent. Consistent with their conceptualization, in an iterative review process the fit and frequency elements were both used in constructing the final R-PAS FQ reference tables. As in the CS, R-PAS designates three FQ codes that can be assigned to responses that incorporate the use of form: ordinary (o), unusual (u), and minus (-). Although the R-PAS version of FQ was developed in an attempt to rectify some of the problems associated with the CS version of FQ, early validity studies demonstrate additional room for improvement in the detection of psychosis using the Rorschach. There is currently no single fully dimensional Rorschach score (within the CS, R-PAS, or otherwise) that can thoroughly and efficiently tap into both the conventionality (i.e., the spontaneously given frequency) of response objects and the perceptual fit of those response objects to the cards.
Researchers have anticipated that an empirically-developed and dimensional score that comprises both goodness-of-fit information and frequency information (i.e., PA) could substantially improve our ability to detect distorted perceptual processes and impaired reality testing of the test-taker, and thus improve validity coefficients in the Rorschach-based identification of psychosis. It seemed a worthwhile investment to first explore how FA and PF function independently, and to understand the structure of the various FA and PF indices across responses and cards within the Rorschach. Without exploring the variables independently, it would be difficult to determine how best to combine FA and PF information within a protocol to maximize the performance of a new PA scoring system. By clarifying the structure and performance of FA and PF, it was hoped that standardized methods of scoring and interpreting PA scores could then be developed and applied to future research and, ideally, to future clinical practice.

Updating the PF Tables

In the current study, the preliminary lookup table of PF variables and values that was developed by Meyer et al. (2011) was expanded by examining the existing specific object frequencies from five international datasets (Argentina, Brazil, Italy, Japan, and Spain), and by adding data from a sixth country (the U.S.) before creating international summary PF indices. The two final PF indices serve as cross-cultural indicators of the conventionality of response objects, and they were developed and inserted into the lookup tables at the object level. The first PF variable is the mean of the six within-country variables that indicate the percentage of protocols that contained each match number. This variable is computed based on data from those countries that had objects reported by at least 1.5% of the participants in the sample. Thus, it roughly indicates on average how often a particular percept is reported across samples. When this percentage-based variable is applied to actual Rorschach responses and averaged across the objects within the response, it is referred to as PFM (Percept Frequency Mean). The object-level percentage-based variable was also converted into the second PF variable, which is a count of the number of samples (out of the six countries) in which the object was found in at least 1.5% of the protocols from each country. In other words, it is a sum of the binary country-specific variables that were used to indicate whether the response object was found in at least 1.5% of the protocols from that country. When this count-based variable is applied to actual Rorschach responses and averaged across the objects within the response, it is referred to as PFN1.5 (Percept Frequency Number of samples ≥ 1.5%). Object-level FA ratings were also retained in the lookup tables, with each object's FA value having been derived from an average of 9.9 rater judgements (Meyer & Viglione, 2008). After responses are coded for FA at the object level, response-level FA scores are determined. If the gestalt of the response percept is listed in the lookup tables, the corresponding FA score is applied to the response; if the gestalt is not listed, the lowest FA score from across the important objects used in the response is assigned as the response-level FA score.
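To make the object-level PF construction concrete, the sketch below computes both values for a single hypothetical object; the six percentages are invented, and NaN stands in for a country where the object fell below the 1.5% floor (and is therefore missing in the lookup tables, consistent with the averaging described above).

```python
import numpy as np

# Hypothetical within-country percentages for one response object; NaN
# marks a country where the object appeared in fewer than 1.5% of protocols.
pct = np.array([2.5, np.nan, 4.0, 1.8, np.nan, 3.1])

# Object-level percentage-based value: the mean across countries meeting
# the 1.5% floor. This value feeds the response-level PFM.
pf_mean = np.nanmean(pct)

# Object-level count-based value: the number of countries (0-6) meeting
# the floor. This value feeds the response-level PFN1.5.
pf_n15 = int(np.sum(~np.isnan(pct)))

print(pf_mean, pf_n15)  # 2.85 4
```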
After the frequency data were compiled for the U.S. Sample protocols, 5 unique objects with a percentage-based frequency of ≥ 1.5% in the U.S. Sample were identified that were not listed in the previous version of the lookup tables. Although it was initially surprising that more unique objects were not identified within the U.S. Sample with a frequency of ≥ 1.5%, this result can also be interpreted as evidence that the FA and PF projects nearly exhausted the list of objects that will be encountered on most protocols.

Interrater Reliability

Like Exner (2003), Meyer et al. (2011) considered interrater reliability of great importance for each score included in the R-PAS system. FQ has consistently demonstrated high levels of interrater reliability in the published literature. When Meyer and Viglione (2008) began developing the FA scoring system, they anticipated that FA would have interrater reliability on par with, or better than, that typically encountered in CS FQ scoring. In the existing FA research, interrater reliability has ranged from good to excellent, and the same is true for the current study (ICC = .75). One might predict that there would be more opportunity for disagreements on the proper FA code for objects and responses than on the FQ code. However, the FA scoring steps and guidelines are very similar to those used in FQ coding, decisions use lookup tables, and there are more objects contained within the FA lookup tables than in the CS or R-PAS FQ tables. Therefore, there would likely be fewer extrapolations and coder judgments required for FA coding than for CS or R-PAS FQ coding across the average protocol. This is an important consideration because if interrater reliability is low for a variable, more error is introduced into the scores and the validity coefficients will be reduced as a result. Though interrater reliability for PFM and PFN1.5 was not computed in this study, it can safely be assumed that it would mirror that of FA because the PF variable values were assigned through the use of syntax following coding of match numbers, which was part of the process of coding for FA. Coder judgment and extrapolation from the tables was not involved in assigning PF variable scores; either the specific object is listed in the lookup tables (with corresponding PF codes) or it is not. If an object is not listed, the PF variable scores for those objects are set to zero. The response-level PF variables are then also calculated through syntax as the mean of the object-level PF variables within the response.

Modeling the Criterion Database

I explored the structure of the response-level and protocol-level FA and PF indices using an archival database that included Rorschach protocols and a Diagnostic Severity score that served as a criterion measure. Diagnostic Severity was expressed on a 5-point scale, with higher scores indicating a higher degree of overall dysfunction associated with a diagnosis. The response-level FA and PF indices were explored by modeling how card number, response within card, and the criterion variable contributed to the structure of each variable, and protocol-level validity coefficients with the criterion measure were calculated in follow-up analyses.

Modeling the Structure of FA

The descriptive statistics for the response-level FA scores indicate that, across the Criterion Database sample, on average people gave responses that had a gestalt goodness-of-fit rating of 3.32 (SD = 1.00, range of 1-5).
This means that if the average response from the Criterion Database was shown to a group of judges who were asked, "Can you see the response quickly and easily at the designated location?", the consensus would fall between "A little. If I work at it, I can sort of see that" (a rating of 3) and "Yes. I can see that. It matches the blot pretty well" (a rating of 4).

A series of nine HLM models were constructed for predicting FA scores at the response level, following a modeling approach suggested by Garson (2013) in which null modeling is the first step, followed by building in additional model terms based on theory if there is indication that multilevel modeling is needed. The FA modeling began with two-level and three-level null models ("unconditional models"). The null models are random intercept models in which the intercept of the predicted variable (in this case, FA) is specified as a random effect of one or more grouping variables at higher levels, with no fixed effect predictors specified. Null modeling is used to establish a baseline model, and it also functions as a test of possible higher-order grouping effects; if higher-order grouping effects are present in the data (i.e., the covariance structure of the data is impacted by the grouping variable, due to clustering of effects, which creates correlated error), mixed modeling (e.g., HLM) of the data is indicated. The FA null modeling (FA Models 1 and 2) indicated clustering by card, supporting the application of HLM modeling procedures.

As a next step (Models 3-5), possible fixed effects (i.e., predictors that impact the intercept of FA) were specified, as well as a repeated measures effect on the covariance structure of the data. The FA intercept, R_InCard (level-1 fixed factor as a main effect), and card number (level-2 fixed factor as a main effect) were all statistically significant predictors in the modeling. Additionally, there was statistical support and a theory-based rationale for including a repeated measures specification in the models. Repeated measurements made on the same unit (e.g., the same person responding to a stimulus multiple times) exhibit clustering effects. Within HLM, the level-1 repeated measurements (i.e., the FA scores) can be modeled as clustered within higher-order observation units (i.e., within R_InCard, which is in turn within card number). Next (Models 6-7), a possible fixed effect interaction term was added to the model specifications and the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification. Fixed effect interaction terms are used to model the possibility that each unique combination of factor levels might have a different linear effect on the predicted variable. The R_InCard*card number interaction term (cross-level factor*factor interaction) entered as a small but statistically significant effect, and the decision to use a diagonal matrix specification for the covariance structure, which allowed FA variances to differ across the first to fourth response, had statistical support and led to improved model fit. In the final models (Models 8-9), Diagnostic Severity was introduced into the model specifications.
In Model 8, Diagnostic Severity had a small but statistically significant main effect (level-3 fixed effect covariate) on the predicted FA intercept, and the other model parameters remained significant (i.e., the fixed effects for the FA intercept, R_InCard, card number, and the R_InCard*card number interaction term; the repeated measures variance for R_InCard within card number). In Model 9, additional possible fixed effect interaction terms were added to the model specifications: the R_InCard*Diagnostic Severity and card number*Diagnostic Severity interaction terms (cross-level factor*covariate interactions) were not statistically significant.

Model 8 proved to be the best model for understanding the structure of FA. The fit statistic was low compared to the other models (indicating improved fit), all specified effects were statistically significant, and the patterns within the fixed effect parameter estimates were largely consistent with the simpler models, indicating stability in the effects. Based on R_InCard parameter estimates, predicted FA was lower for each subsequent response within a card. Based on Cohen's d values computed at the response level, relative to the 1st response to a card, on average the 2nd response was about 3/10 of a SD lower, the 3rd response was about 5/10 of a SD lower, and the 4th response had average FA that was about 7/10 of a SD lower than the 1st response. Predicted FA was also different for each factor level of card number: When placed in order of descending parameter estimates, the factor levels were Card 5, 1, 7, 4, 8, 10, 2, 3, 6, and 9. Based on Cohen's d values computed at the response level, Card 5 was about a quarter of a standard deviation higher in FA than the reference value set by Card 1. Cards 7, 4, 8, 10, 2, and 3 were about a third of a SD lower in FA than Card 1, and Card 6 was about half a SD lower in FA than Card 1. Card 9 stood out by having average FA scores that were almost a full SD below Card 1 and about 1.2 SDs below Card 5. The R_InCard*card number interaction effect parameter estimates also contributed small adjustments to predicted FA based on the exact combinations of R_InCard and card number factor levels. Additionally, there was a small linear effect of Diagnostic Severity (fixed effect covariate) on predicted FA scores, with higher Diagnostic Severity scores predicting slightly lower FA scores, on average. Lastly, there was statistical support for specifying repeated measurements of FA within R_InCard within card number.

Modeling the Structure of PFM

The mean of the response-level PFM score across all responses and protocols (M = 8.80, SD = 14.37) indicates that, on average, people delivered responses with objects that about 9% of people in the comparison samples also saw. Note, however, that the SD is larger than the M and the distribution has a floor of zero, indicating that this is a variable with a positively skewed distribution (skew = 2.02). At the high end of the range of observed scores (0 to 63.25), at least one person delivered a response in which their objects, on average, were present in 63.25% of the protocols across samples. In other words, at the extreme, responses contained objects that more than half of the people in the comparison samples also saw. Response-level PFM was modeled with a series of 11 HLM models.
The PFM modeling began with two-level and three-level null models (PFM Models 1-3), and the models collectively indicated that there were higher-order effects on PFM, that a multilevel model (HLM) was appropriate, and that there was clustering of PFM error variance by card number. Possible fixed effects were specified in the next steps, as well as a repeated measures effect on the covariance structure of the data (Models 4-6). The PFM intercept, R_InCard (level-1 fixed factor as a main effect), and card number (level-2 fixed factor as a main effect) were all statistically significant predictors in the modeling. There was also statistical support for specifying repeated measurements of PFM within R_InCard within card number. Next (Models 7-8), a fixed effect interaction term was specified as a possible predictor of PFM in addition to the previous main effects. The R_InCard*card number interaction term (cross-level factor*factor interaction) entered as a statistically significant effect. Additionally, the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification, which again led to improved model fit. In the last PFM models (Models 9-11), Diagnostic Severity was introduced into the model specifications. In Model 9, Diagnostic Severity failed to enter the model as a main effect (level-3 fixed effect covariate). In Model 10, additional possible fixed effect interaction terms were added to the model specifications: R_InCard*Diagnostic Severity (level-1*level-3) and card number*Diagnostic Severity (level-2*level-3). Both terms are cross-level factor*covariate interaction terms. Diagnostic Severity still failed to enter the model as a main effect, and card number*Diagnostic Severity was not a significant interaction. However, there was a small but statistically significant fixed effect for the interaction term of R_InCard*Diagnostic Severity in predicting PFM. Model 11 was a simplification of Model 10 in which the significant effects were retained in the specifications, but nonsignificant predictors were dropped from the model.

Model 11 was the best structural model for predicting response-level PFM scores. The fit statistic was lower than in previous models and all specified effects were statistically significant. Also, as with FA, there was good consistency in fixed effect parameter estimate patterns throughout the modeling of PFM. Entirely consistent with the prediction models for FA, predicted PFM scores were lower for each subsequent response within a card. When the response-level d values were assessed, relative to the 1st response to a card, on average the 2nd response was about 1/2 of a SD lower in PFM, the 3rd response was about 3/4 of a SD lower, and the 4th response had average PFM that was about 9/10 of a SD lower than the 1st response. Predicted PFM also varied based on the factor level of card number. In order, predicted PFM was highest for Card 5, followed by Card 3, Cards 8, 1, 4, and 7, Card 2, Card 6, Card 10, and Card 9. When d was computed using response-level data, PFM values for Cards 5 and 3 were about half a SD above the reference value of Card 1; Cards 8, 4, 7, and 2 were within 2/10 of a SD of Card 1; Card 6 was about half a SD below Card 1; and Cards 10 and 9 were about 3/4 to 1 SD below Card 1. The R_InCard*card number interaction effect parameter estimates also contributed small adjustments to predicted PFM based on the exact combinations of R_InCard and card number factor levels.
Of the four R_InCard*Diagnostic Severity interaction effect parameter estimates, only one was significant (R_InCard = 2*Diagnostic Severity). It indicates that if the response was the second response within a card, for each unit of increase on Diagnostic Severity, predicted PFM is reduced by 0.31 units, which is a small change. In conjunction with the p value of .03 for this effect and the fact that there was not a consistent pattern in the interaction across other response positions, this small degree of change has to be considered tentative. Finally, there was also statistical support for specifying repeated measurements of PFM within R_InCard within card number.

Modeling the Structure of PFN1.5

The response-level PFN1.5 scores ranged from 0-6, with an average score of 2.37 (SD = 2.43). On average, people gave response objects that were present in 2.37 samples at a frequency of 1.5% or higher. Modeling of the response-level PFN1.5 scores was accomplished using 11 HLM models. The two-level and three-level null models (PFN1.5 Models 1-4) collectively indicated that there were higher-order effects on PFN1.5, that a multilevel model (HLM) was appropriate, and that there was clustering of PFN1.5 error variance by card number. Possible fixed effects were specified in the next steps, as well as a repeated measures effect on the covariance structure of the data (Models 5-7). The PFN1.5 intercept, R_InCard (level-1 fixed factor as a main effect), and card number (level-2 fixed factor as a main effect) all entered the modeling as significant predictors. As with FA and PFM, there was also statistical support for specifying repeated measurements of PFN1.5 within R_InCard within card number. Next (Models 8-9), an R_InCard*card number interaction term was specified as a possible fixed effect predictor of PFN1.5, and it entered as a statistically significant effect. Additionally, the scaled-identity matrix covariance structure was replaced with a diagonal matrix covariance structure for the repeated measures specification, leading to better model fit. In the last PFN1.5 models (Models 10-11), Diagnostic Severity was introduced into the model specifications. In Model 10, Diagnostic Severity failed to enter the model as a main effect (level-3 fixed effect covariate). In Model 11, the additional possible fixed effect interaction terms were added to the model specifications: R_InCard*Diagnostic Severity (level-1*level-3) and card number*Diagnostic Severity (level-2*level-3). Diagnostic Severity still failed to enter the model as a main effect, and neither of the new interaction terms was statistically significant in the prediction of response-level PFN1.5 scores.

Model 9 was the best structural model for predicting response-level PFN1.5 scores; the model fit statistic was lower than in previous models, and the model was not over-specified in that all effects were statistically significant. There was once again good consistency in fixed effect parameter estimate patterns throughout the modeling sequence. As observed in the FA and PFM models, the predicted response-level PFN1.5 scores were lower for each subsequent response within a card, and the scores also varied based on the factor level of card number.
Based on response-level effect sizes, relative to the 1st response to a card, on average the 2nd response had PFN1.5 values that were about 4/10 of a SD lower, the 3rd response was about 7/10 of a SD lower, and the 4th response had an average PFN1.5 value that was about 9/10 of a SD lower than for the 1st response. With respect to card number, in order, predicted PFN1.5 was highest for Card 5, followed by Cards 1 and 7, Card 3, Card 4, Card 2, Card 10, Card 8, Card 6, and Card 9. According to response-level Cohen's d values, Card 5 was about 1/10 of a SD higher than the reference value for Card 1; Cards 7 and 3 were about 1/10 of a SD lower than Card 1; Cards 4, 2, 10, and 8 were about 2/10 to 4/10 of a SD lower than Card 1; Card 6 was about 7/10 of a SD lower; and Card 9 was about 9/10 of a SD lower than Card 1. The R_InCard*card number interaction effect parameter estimates also contributed small adjustments to predicted PFN1.5 scores. Lastly, there was statistical support for specifying repeated measurements of PFN1.5 within R_InCard within card number.

Summary of Variable Structures Across Modeling Techniques

Analyses of the Criterion Database were completed using a total of 159 protocols, collectively containing 3,897 responses with form demand. At the response level, across the variables of Diagnostic Severity, FA, PFM, and PFN1.5, the scores covered the full range of possible values (except for PFN1.5, which still had a large range), and they also had high degrees of variability across responses within the Criterion Database. The full ranges and large standard deviations were useful to consider before examining the results of the analyses because they indicate good spread of scores, and based on the theory behind the current research it was anticipated that FA, PFM, and PFN1.5 all relate to reality testing ability and would correlate with Diagnostic Severity. Therefore, the lack of range restriction increases the possible effect sizes and power of the analyses, and reduces the chance of making Type II errors (i.e., failing to reject the null hypothesis when the null hypothesis is false).

Across the various analyses completed at different levels of aggregation, there was resounding evidence that the structure of the cards and the Rorschach task itself produce deviations in goodness-of-fit and frequency scores that cannot be entirely attributed to stable characteristics of the test-taker. When regression equations were computed independently for each person predicting FA, PFM, and PFN1.5 (specified through the random effects commands), there were very consistent clustering effects in the data due to card number and due to response within card. Understanding the structural patterns of the fit and frequency data is an important undertaking in forming the foundation for future research on Rorschach perceptual accuracy scoring.

Card number and R_InCard main effects, as well as their interaction, accounted for a significant portion of the score variance in FA, PFM, and PFN1.5. Across all 3 scores, the predicted scores are lowered with each subsequent response within a card, as a main effect across cards. Additionally, the factor level of card number (i.e., which card the response is being given to) also impacts the intercept of the 3 variables, but the pattern is a bit less consistent than is seen with R_InCard.
Generally speaking, the main effects of card number across the different sets of models indicate that FA, PFM, and PFN1.5 predicted scores tend to be highest for Card 5, Card 1, and Card 7 (in decreasing score order); they tend to be lowest on Card 9, followed by Card 6. PFM scores showed a slight deviation in this pattern, with a spike in PFM scores on Card 3 due to the popular responses to locations D1, D9, and W, and the extremely common response object of bow or butterfly at the D3 location. The patterns observed in the HLM modeling were also evidenced in the response-level and protocol-level descriptive statistics. The protocol-level statistics reinforced the pattern of FA, PFM, and PFN1.5 scores decreasing with each subsequent response within a card, and this can be interpreted to mean that with each subsequent response within a card, the objects used in constructing the responses have worse perceptual fit, are less-commonly-used objects, and are commonly used in fewer countries. Interestingly, the protocol-level standard deviations also revealed that FA, PFM, and PFN1.5 have more variation on later responses within a card; R_InCard 1 and 2 have tighter distributions than R_InCard 3 and 4 across FA, PFM, and PFN1.5. In other words, people's scores tend to scatter more on later responses within cards. However, reduced sample size with each subsequent response within a card likely accounts for a portion of this effect.

The HLM modeling accounts for clustering of scores within people, but the descriptive statistics do not. Therefore, it can be concluded that there were structural consistencies in the data across the FA and PF indices that were present within people as well as across people. There was also substantially more unexplained within-person variance in scores than between-person variance in scores in the HLM models. This is evidence that the patterns of scores within person, organized by card number and by R_InCard, were stable and pronounced. However, there is also a lot of residual within-person variance that has not yet been accounted for by the modeling.

It was anticipated that Diagnostic Severity would correlate with mean FA, PFM, and PFN1.5 at the protocol level, and that the variables might have stronger or weaker relationships with Diagnostic Severity as an effect of card number or response within card. Surprisingly, Diagnostic Severity had very little association with the fit and frequency scores. There were small correlations between Diagnostic Severity and protocol-level FA over all responses, and for responses to Card 4; correlation coefficients were near significance for FA on Card 6, and for the 1st and 2nd responses to each card. The correlations were in the expected direction, with higher Diagnostic Severity scores corresponding with lower FA scores. However, when an adjusted alpha was used to account for running multiple tests of exploratory correlations, the correlation with FA for responses to Card 4 no longer reached significance. There were no statistically-significant correlations between Diagnostic Severity scores and the protocol-level PFM and PFN1.5 variables. This lack of correlation also occurred when mean FA and the sum of FI were calculated for the tail of the object distribution, thus isolating the variables for responses with only unique objects that occur in less than 1.5% of protocols. This finding also coincides with the absence of Diagnostic Severity as a significant effect in most of the PFM and PFN1.5 HLM models.
Within the HLM models, there was a small linear effect of Diagnostic Severity on predicted response-level FA scores, with each unit of increase on Diagnostic Severity predicting a slight reduction in FA score (-0.05 units). In the HLM modeling of PFM, if the response was the second response within a card, predicted PFM was reduced by a very small amount (-0.31 units) for each unit of increase on Diagnostic Severity.

Strengths and Limitations of the Study

Although HLM proved to be a very useful technique for exploring and understanding the structure of FA, PFM, and PFN1.5, I was not able to specify the models such that Diagnostic Severity was the dependent variable, with predictor variables including FA, PFM, PFN1.5, card number, and R_InCard. This is due to the requirement in Hierarchical Linear Modeling techniques that the dependent variable be a level-1 variable (i.e., it has variance at the lowest level of the structure present within the data). However, HLM did prove useful and appropriate for the structural modeling of the fit and frequency variables, and supplemental techniques were used to help broaden the data exploration.

The Diagnostic Severity criterion measure has strong interrater reliability and is an interesting approach to quantifying a clinical construct. However, it is a 5-point scale, which is a somewhat blunt criterion measure, and it is based on ratings of billing diagnoses that were assigned before patients underwent assessment. Billing diagnoses are based on chart review and are used to allow the hospital to bill for services. They are therefore quite tentative diagnoses, and they are oftentimes revised after the patient completes the in-person evaluation. It should also be noted that the specific sample of patients used was composed of individuals who had complex clinical presentations, which was the reason they had been referred for assessment, making it quite possible that billing diagnoses might have lower agreement with final diagnoses than is typically the case. Diagnostic Severity is also not specific to reality testing ability; it was conceptually derived to quantify the degree of overall dysfunction associated with a diagnosis, with higher scores indicating higher levels of dysfunction. The scores cannot be parsed into a domain of perceptual distortion. Although Diagnostic Severity scores were distributed across the possible range, there were no non-patients in the sample, which would have broadened the range of level of dysfunction and thus increased the power of the analyses. Some of the benefits of using the Criterion Database were the relatively large sample size; the nature of the sample as a clinical sample, with a variety of diagnoses having been applied to the patients in the sample; and the fact that the Rorschach protocols had been modeled for R-Optimization, with R-Optimized administration being the new standard for Rorschach assessment using R-PAS.

FA clearly had more encouraging results than the PF variables with regard to aligning with the criterion variable. Coder judgment and extrapolation from the tables was not involved in assigning PF variable scores, though it is a component of FA scoring. It is possible that the extrapolation and coder judgment process is an important element in the final fit scores (i.e., FA) aligning with Diagnostic Severity, and this may help explain why FA clearly outperformed PFM and PFN1.5. Of note, the PFM variable was also a cruder measure than it may seem at first glance.
Of note, the PFM variable was also a cruder measure than it may seem at first glance. Response-level PFM scores averaged the PF object-level percentage-based scores only when the PF value for the object was 1.5% or higher. If three important objects were included in a response, but only two of the objects had PF values of 1.5% or higher, the response-level PFM score was the mean of two of the three objects because the third object had a missing value for the object-level score. If a response contained one or more objects that incorporated form, but none of the objects had PF values of 1.5% or higher, the response-level PFM score was assigned a value of zero. These coding decisions should lead to an upward bias in PFM. However, this was done because it would not have been reasonable to assign a value of 0 to all objects with PFs less than 1.5%. This concept also applies to the computation of PFM scores in the FA and PF lookup tables. The final international object-level score reflects the percentage of protocols containing each object, averaged across countries. However, the country-specific scores are missing values in the tables if the object occurs in less than 1.5% of protocols. Therefore, the calculation of the international percentage-based score also has an upward bias. Response-level PFM also evidenced a positively skewed distribution (skew = 2.02), and the variable was not transformed prior to being modeled. PFN1.5 may also be limited in that it is based on a single dichotomy; each unique object was either present in at least 1.5% of the within-country samples or it was not. It is possible that some other cut-point (e.g., 1%, 2%, 4%, or 10%) would have been more discriminating.
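As a concrete restatement of the PFM averaging rule described above, the following hedged sketch (the function name and data representation are illustrative, not the study's code) averages only the object-level PF values that meet the 1.5% table threshold, treats sub-threshold objects as missing, and assigns 0 when a form-based response has no qualifying object.

```python
from typing import Optional, Sequence

def response_pfm(object_pfs: Sequence[Optional[float]],
                 threshold: float = 1.5) -> float:
    """Mean of the object-level PF values at or above threshold;
    0.0 when a form-based response has no qualifying object."""
    qualifying = [pf for pf in object_pfs
                  if pf is not None and pf >= threshold]
    if not qualifying:
        return 0.0  # form was used, but no object met the cut-point
    return sum(qualifying) / len(qualifying)

# Three objects, one below the table threshold (stored as missing):
# PFM is the mean of the two qualifying values, (12.0 + 3.0) / 2 = 7.5.
print(response_pfm([12.0, 3.0, None]))  # 7.5
print(response_pfm([None, None]))       # 0.0
```

Because sub-threshold objects are dropped from the numerator and denominator alike, the mean can only be pulled upward, which is the upward bias described above; the same logic, applied per country before averaging, biases the international percentage-based score as well.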
Expected and Surprising Findings

Based on previous research and clinical use of the Rorschach, it was anticipated that the Rorschach cards would differ to a degree in average FA, PFM, and PFN1.5 scores, as it is commonly professed that some cards are "easier" or "more difficult" than others (e.g., Meyer & Viglione, 2008). It was also anticipated that R_InCard would have an impact on the fit and frequency scores, with scores decreasing over responses within a card. Prior to this research, one theory held by the FA and PF systematizers and their students was that, within each card, people might deliver their subjectively good and more obvious responses first, then deliver their less obvious and more tentative responses, which even the test-taker might realize do not have strong perceptual fit. It was surprising that Diagnostic Severity did not have stronger associations with FA scores, and that it had no association with PFM and PFN1.5. With regard to R_InCard, though differences in scores were expected as an effect of response within card, it was not anticipated that R_InCard 1 and 2 would stand out as the more important responses for differentiating between people on FA. This conclusion is supported by the trends seen in the data, with near-significant small correlations between FA and Diagnostic Severity observed for the first and second responses within a card, and the lack of correlation between FA and Diagnostic Severity for the third and fourth responses within a card. The near-significant correlations between FA and Diagnostic Severity observed for Cards 4 and 6 were not anticipated either, because they are mid-level cards: they are not the obvious "easy" or "hard" cards to do well on with regard to FA, PFM, and PFN1.5. However, the correlations did not meet significance when using two-tailed adjusted alphas, and they are also small, so it is not clear whether such results would replicate in other samples. Although the degree of association between Diagnostic Severity and response-level FA does not change as a function of card number or R_InCard (i.e., the interactions were not significant), the FA protocol-level analyses do suggest that Cards 4 and 6, and the first and second responses within a card, may help differentiate levels of diagnostic severity to a small degree.

Following the analyses, further attempts were made to understand the patterns seen with the Diagnostic Severity scores. In the meta-analyses by Mihura et al. (2013), CS X+% and X-% differentiated psychotic disorder samples from comparison psychiatric samples with medium effect sizes (r = .31, p < .01; r = .47, p < .01), suggesting that it would be possible to obtain similar effect sizes when exploring FA, PFM, and PFN1.5. However, in Meyer et al. (2011), Diagnostic Severity scores did not have significant correlations with R-PAS FQ indices (non-significant p values: FQo% r = -.14; FQ-% r = .13; WD-% r = .15), but correlations were significant for three of four CS FQ indices (significant p values: FQo% r = -.26; FQ-% r = .29; WD-% r = .28). This indicates that Diagnostic Severity was not a strong criterion for FQ-%, which is the primary CS and R-PAS FQ variable of interest in studying reality testing and psychosis. If the moderate effect sizes seen in the meta-analyses are considered a good indication of the effect sizes that could be anticipated when differentiating patients with psychotic disorders from patients without them using the Rorschach, it becomes apparent that the criterion variable in validity studies must correlate highly with the true construct of interest for the analyses to have enough power to yield significant results without an extremely large sample. Given that the CS FQ indices had small correlations with Diagnostic Severity, and the R-PAS FQ indices did not have significant correlations with Diagnostic Severity (though the effect size magnitudes were .13-.15 for three of the four indices), it seems quite possible that the Diagnostic Severity scores were too coarse and too non-specific to the construct of interest to yield higher associations with the FA, PFM, and PFN1.5 variables.

Finding a criterion measure that allows for accurate and dimensional measurement of perceptual accuracy is an incredibly difficult task, and similar problems have arisen in previous Rorschach research. For example, in Horn (2009), across six performance-based measures of accuracy of perception that ranged from very basic neuropsychological accuracy tests to very complex interpersonal perception tasks, there was minimal association between criterion measure scores. This made it much more difficult to interpret the moderate correlations observed between the FA and FQ indices and a few of the criterion measures. However, it was ultimately concluded that FA seems to indicate a more emotionally removed and colder cognitive style of perceptual accuracy, while Form Quality seems to encompass a warmer and more emotionally involved style of accuracy.
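A small worked illustration of the power argument above, using assumed values rather than the study's data: if a score's true correlation with the construct were .45 but the criterion correlated only .50 with that construct, the expected observed correlation would attenuate to roughly .45 × .50 ≈ .22, and the sample needed for adequate power would grow several-fold.

```python
# Approximate N needed to detect a correlation r with 80% power at
# two-tailed alpha = .05, via the Fisher z approximation. All values
# below are hypothetical, chosen only to illustrate attenuation.
from math import atanh, ceil
from scipy.stats import norm

def n_for_power(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

r_true = 0.45              # assumed true score-construct correlation
criterion_validity = 0.50  # assumed criterion-construct correlation
r_observed = r_true * criterion_validity  # about .22 after attenuation

print(n_for_power(r_true))      # about 37 participants
print(n_for_power(r_observed))  # about 153 participants
```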
Conclusions

In essence, the Rorschach cards have a great deal of perceptual structure present in the inkblots, and people attend to that structure to a high degree, regardless of their level of psychopathology. Therefore, there is much more consistency in fit and frequency scores between people than within people when factors such as card number and response position within the card are considered. That being said, there is still a high level of unexplained residual variance in fit and frequency scores within people.

The structure of FA, PFM, and PFN1.5 by card number and by response within card has been detailed. This account of the structure can benefit researchers, and eventually clinicians, as it will help inform expectations for fit and frequency performance on the Rorschach. Knowing the structural contributions to the scores allows researchers to account for those effects in test-taker performance, and in the future this should ideally reduce erroneous assumptions about clinical constructs like reality testing ability that arise from score variance explained in substantial part by elements of the cards rather than characteristics of the test-taker. For instance, relatively poor perceptual accuracy on Card 9 is to be expected; it says something about that particular inkblot, not necessarily about the person generating the response.

Next steps in this line of research could include a follow-up criterion validity study of the frequency scores to see whether the lack of association between the criterion measure and the Rorschach scores replicates. Additionally, there could be further exploration of how best to adapt FA (and potentially PFM and/or PFN1.5, if future results are highly promising) to improve the FQ system of scoring perceptual accuracy on the Rorschach. Interesting approaches to future research could include isolating just the first and second responses to each card and investigating Form Accuracy and Percept Frequency scores for those initial responses. Researchers could also explore test-takers' reaction times in delivering responses, and how reaction time might moderate relationships between variables. For example, if reaction time is extremely fast, the person may be delivering a very typical and obvious response; if reaction time is more delayed, the person may be delivering more unique responses that might have stronger associations with a criterion variable. Interestingly, though, social and cognitive psychologists might suggest that responses with longer reaction times could also indicate a response search process that includes higher levels of social desirability. Another possible approach is to remove from the analyses the responses containing objects that are spikes in the frequency distribution (e.g., on Cards 1, 3, 5, and 7), which would reduce the impact of extremely typical responses, as sketched below. Similarly, a count variable could be created for the absence of those extremely typical responses within a protocol.
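The spike-removal idea could be prototyped along the following lines; the spike cut-off and the column holding each response's highest object-level PF value are assumptions for illustration, not part of the study's procedures.

```python
# Sketch of removing "spike" responses (those built on extremely
# typical objects) and counting, per protocol, how many such responses
# are present. Threshold and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical response-level data
SPIKE_PF = 25.0                    # assumed "extremely typical" cut-off

is_spike = df["max_object_pf"] >= SPIKE_PF

trimmed = df[~is_spike]            # analyses without spike responses

# Count spike responses per protocol; a zero flags a protocol that
# lacks all of the extremely typical percepts.
spike_count = is_spike.groupby(df["protocol_id"]).sum()
lacks_typical = (spike_count == 0).astype(int)
```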
The perceptual structure present in the Rorschach cards and the inherent structure of the sequential test-taking process contribute substantial variance to final fit and frequency scores. Form Accuracy does demonstrate some construct validity when modeled and correlated with Diagnostic Severity, and it is hypothesized that the association is due to Form Accuracy functioning as an indicator of reality testing ability, and thus psychosis. Although the structure of FA, PFM, and PFN1.5, and their relationships to Diagnostic Severity, were successfully modeled with consideration of the structure within the cards and within the Rorschach test-taking process, the high levels of unexplained residual variance indicate that substantial person-specific information contributes to meaningful score differences between people. Knowing the structural contributions to the scores will allow researchers to account for effects produced by the structure of the Rorschach task and differentiate them from effects produced by person-specific variables, ultimately moving the field closer to improved assessment of reality testing ability and detection of psychosis.

References

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.
Archer, R. P., & Gordon, R. A. (1988). MMPI and Rorschach indices of schizophrenic and depressive diagnoses among adolescent inpatients. Journal of Personality Assessment, 52, 276-287. doi: 10.1207/s15327752jpa5202_9
Archer, R. P., & Krishnamurthy, R. (1997). MMPI-A and Rorschach indices related to depression and conduct disorder: An evaluation of the incremental validity hypothesis. Journal of Personality Assessment, 69, 517-533. doi: 10.1207/s15327752jpa6903_7
Asari, T., Konishi, S., Jimura, K., Chikazoe, J., Nakamura, N., & Miyashita, Y. (2008). Right temporopolar activation associated with unique perception. NeuroImage, 41, 145-152. doi: 10.1016/j.neuroimage.2008.01.059
Asari, T., Konishi, S., Jimura, K., Chikazoe, J., Nakamura, N., & Miyashita, Y. (2010). Amygdalar enlargement associated with unique perception. Cortex, 46, 94-99. doi: 10.1016/j.cortex.2008.08.001
Balcetis, E., & Dunning, D. (2006). See what you want to see: Motivational influences on visual perception. Journal of Personality and Social Psychology, 91, 612-625. doi: 10.1037/0022-3514.91.4.612
Balcetis, E., & Dunning, D. (2007). Cognitive dissonance and the perception of natural environments. Psychological Science, 18, 917-921. doi: 10.1111/j.1467-9280.2007.02000.x
Bannatyne, L. A., Gacono, C. B., & Greene, R. L. (1999). Differential patterns of responding among three groups of chronic, psychotic, forensic outpatients. Journal of Clinical Psychology, 55, 1553-1565. doi: 10.1002/(SICI)1097-4679(199912)55:12<1553::AID-JCLP12>3.0.CO;2-1
Baron-Cohen, S., Wheelwright, S., Hill, J., Raste, Y., & Plumb, I. (2001). The "Reading the Mind in the Eyes" Test Revised Version: A study with normal adults, and adults with Asperger Syndrome or high-functioning autism. Journal of Child Psychology and Psychiatry, 42, 241-251. doi: 10.1111/1469-7610.00715
Beaubien, J. M., Hamman, W. R., Holt, R. W., & Boehm-Davis, D. A. (2001). The application of hierarchical linear modeling (HLM) techniques to commercial aviation research. Proceedings of the 11th annual symposium on aviation psychology. Columbus, OH: The Ohio State University Press.
Beck, S. J. (1938). Personality structure in schizophrenia: A Rorschach investigation in 81 patients and 64 controls. Nervous & Mental Disorders Monograph Series, 63.
Beck, S. J., Beck, A., Levitt, E., & Molish, H. (1961). Rorschach's test. I: Basic processes (3rd ed.). New York, NY: Grune & Stratton.
Benton, A. L., Sivan, A. B., deS. Hamsher, K., Varney, N. R., & Spreen, O. (1983). Benton Judgment of Line Orientation (Forms H & V and record forms). Lutz, FL: Psychological Assessment Resources.
Berkowitz, M., & Levine, J. (1953). Rorschach scoring categories as diagnostic "signs." Journal of Consulting Psychology, 17, 110-112. doi: 10.1037/h0062113
Blais, M. A., Hilsenroth, M. J., Castlebury, F., Fowler, C. J., & Baity, M. R. (2001). Predicting DSM-IV cluster B personality disorder criteria from MMPI-2 and Rorschach data: A test of incremental validity. Journal of Personality Assessment, 76, 150-168. doi: 10.1207/S15327752JPA7601_9
Bruner, J. S. (1957). On perceptual readiness. Psychological Review, 64, 123-152. doi: 10.1037/h0043805
Camara, W. J., Nathan, J. S., & Puente, A. E. (2000). Psychological test usage: Implications in professional psychology. Professional Psychology: Research and Practice, 31, 141-154. doi: 10.1037/0735-7028.31.2.141
Carney, D. R., Colvin, C. R., & Hall, J. A. (2007). A thin slice perspective on the accuracy of first impressions. Journal of Research in Personality, 41, 1054-1072. doi: 10.1016/j.jrp.2007.01.004
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284-290. doi: 10.1037/1040-3590.6.4.284
Clemence, A. J., & Handler, L. (2001). Psychological assessment on internship: A survey of training directors and their expectations for students. Journal of Personality Assessment, 76, 18-47. doi: 10.1207/S15327752JPA7601_2
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. doi: 10.1037/0033-2909.112.1.155
Dao, T. K., & Prevatt, F. (2006). A psychometric evaluation of the Rorschach Comprehensive System's Perceptual Thinking Index. Journal of Personality Assessment, 86, 180-189. doi: 10.1207/s15327752jpa8602_07
Dao, T. K., Prevatt, F., & Horne, H. L. (2008). Differentiating psychotic patients from nonpsychotic patients with the MMPI-2 and Rorschach. Journal of Personality Assessment, 90, 93-101. doi: 10.1080/00223890701693819
Dawes, R. M. (1999). Two methods for studying the incremental validity of a Rorschach variable. Psychological Assessment, 11(3), 297-302. doi: 10.1037/1040-3590.11.3.297
Dean, K. L., Viglione, D. J., Perry, W., & Meyer, G. J. (2007). A method to optimize the response range while maintaining Rorschach Comprehensive System validity. Journal of Personality Assessment, 89, 149-161. doi: 10.1080/00223890701468543
Dean, K. L., Viglione, D. J., Perry, W., & Meyer, G. J. (2008). Correction to: "A method to optimize the response range while maintaining Rorschach Comprehensive System validity". Journal of Personality Assessment, 90, 2. doi: 10.1080/00223890701845542
Diener, M. J., Hilsenroth, M. J., Shaffer, S. A., & Sexton, J. E. (2011). A meta-analysis of the relationship between the Rorschach Ego Impairment Index (EII) and psychiatric severity. Clinical Psychology & Psychotherapy, 18, 464-485. doi: 10.1002/cpp.725
Dzamonja-Ignjatovic, T., Smith, B. L., Jocic, D. D., & Milanovic, M. (2013). A comparison of new and revised Rorschach measures of schizophrenic functioning in a Serbian clinical sample. Journal of Personality Assessment, 95, 471-478. doi: 10.1080/00223891.2013.810153
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Manual for kit of factor-referenced cognitive tests. Princeton, NJ: Educational Testing Service.
Epstein, S. (1979). The stability of behavior: I. On predicting most of the people much of the time. Journal of Personality and Social Psychology, 37, 1097-1126. doi: 10.1037/0022-3514.37.7.1097
Epstein, S. (1980). The stability of behavior: II. Implications for psychological research. American Psychologist, 35, 790-806. doi: 10.1037/0003-066X.35.9.790
Exner, J. E. (1974). The Rorschach: A comprehensive system: Vol. 1. New York, NY: Wiley & Sons.
Exner, J. E. (1984). More on the schizophrenia index. Alumni newsletter. Bayville, NY: Rorschach Workshops.
Exner, J. E. (1986). The Rorschach: A comprehensive system: Vol. 1. Basic foundations (2nd ed.). New York, NY: Wiley & Sons.
Exner, J. E., Jr. (1989). Searching for projection in the Rorschach. Journal of Personality Assessment, 53, 520-536. doi: 10.1207/s15327752jpa5303_9
Exner, J. E., Jr. (1991). The Rorschach: A comprehensive system: Vol. 2. Interpretation (2nd ed.). New York, NY: Wiley & Sons.
Exner, J. E., Jr. (2000). A primer for Rorschach interpretation. Asheville, NC: Rorschach Workshops.
Exner, J. E., Jr. (2003). The Rorschach: A comprehensive system: Vol. 1. Basic foundations and principles of interpretation (4th ed.). New York, NY: Wiley & Sons.
Exner, J. E., Jr. (2007). A new U.S. adult nonpatient sample. Journal of Personality Assessment, 89(S1), S154-S158. doi: 10.1080/00223890701583523
Friedman, H. (1953). Perceptual regression in schizophrenia: An hypothesis suggested by the use of the Rorschach test. Journal of Projective Techniques, 17, 171-185. doi: 10.1080/08853126.1953.10380477
Ganellen, R. J. (1996). Comparing the diagnostic efficiency of the MMPI, MCMI-II, and Rorschach: A review. Journal of Personality Assessment, 67, 219-243. doi: 10.1207/s15327752jpa6702_1
Ganellen, R. J., Wasyliw, O. E., & Haywood, T. W. (1996). Can psychosis be malingered on the Rorschach? An empirical study. Journal of Personality Assessment, 66, 65-80. doi: 10.1207/s15327752jpa6601_5
Garb, H. N. (1984). The incremental validity of information used in personality assessment. Clinical Psychology Review, 4, 641-655. doi: 10.1016/0272-7358(84)90010-2
Garson, G. D. (Ed.). (2013). Hierarchical linear modeling: Guide and applications. Los Angeles, CA: Sage Publications.
Hathaway, S. R., & McKinley, J. C. (1967). MMPI manual (revised ed.). New York, NY: Psychological Corporation.
Heck, R. H., Thomas, S. L., & Tabata, L. N. (2010). Multilevel and longitudinal modeling with IBM SPSS. New York, NY: Routledge.
Hertz, M. R. (1970). Frequency tables for scoring Rorschach responses with code charts, normal and rare details, F+ and F– responses, and popular and original responses (5th ed.). Cleveland, OH: The Press of Case Western Reserve University.
Hilsenroth, M. J., Eudell-Simmons, E. M., DeFife, J. A., & Charnas, J. W. (2007). The Rorschach Perceptual-Thinking Index (PTI): An examination of reliability, validity, and diagnostic efficiency. International Journal of Testing, 7(3), 269-291. doi: 10.1080/15305050701438033
Hilsenroth, M. J., Fowler, J. C., & Padawer, J. R. (1998). The Rorschach Schizophrenia Index (SCZI): An examination of reliability, validity, and diagnostic efficiency. Journal of Personality Assessment, 70, 514-534. doi: 10.1207/s15327752jpa7003_9
Hoelzle, J. B., & Meyer, G. J. (2008). The factor structure of the MMPI-2 Restructured Clinical (RC) Scales. Journal of Personality Assessment, 90, 443-455. doi: 10.1080/00223890802248711
Horn, S. L. (2009). Rorschach perception: Multimethod validation of Form Accuracy. Unpublished master's thesis, University of Toledo, Ohio.
Horn, S. L., Meyer, G. J., Viglione, D. J., & Ozbey, G. T. (2008, March). The validity of the Rorschach Human Representational Variable using Form Accuracy. In G. J. Meyer (Chair), Assessing perceptual accuracy on the Rorschach using Form Accuracy ratings versus Form Quality scores. Symposium conducted at the annual meeting of the Society for Personality Assessment, New Orleans, LA.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Wadsworth.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Jorgensen, K., Andersen, T. J., & Dam, H. (2000). The diagnostic efficiency of the Rorschach Depression Index and Schizophrenia Index: A review. Assessment, 7, 259-280. doi: 10.1177/107319110000700306
Kimball, A. J. (1950). Evaluation of form-level in the Rorschach. Journal of Projective Techniques, 14, 219-244. doi: 10.1080/08853126.1950.10380327
Kimhy, D., Corcoran, C., Harkavy-Friedman, J. M., Ritzler, B., Javitt, D. C., & Malaspina, D. (2007). Visual form perception: A comparison of individuals at high risk for psychosis, recent onset schizophrenia and chronic schizophrenia. Schizophrenia Research, 97, 25-34. doi: 10.1016/j.schres.2007.08.022
Kinder, B., Brubaker, R., Ingram, R., & Reading, E. (1982). Rorschach form quality: A comparison of the Exner and Beck systems. Journal of Personality Assessment, 46, 131-138. doi: 10.1207/s15327752jpa4602_4
Knopf, I. J. (1956). Rorschach summary scores in differential diagnosis. Journal of Consulting Psychology, 20, 99-104. doi: 10.1037/h0049120
Koivisto, M., & Revonsuo, A. (2007). How meaning shapes seeing. Psychological Science, 18, 845-849. doi: 10.1111/j.1467-9280.2007.01989.x
Leichtman, M. (1996). The nature of the Rorschach task. Journal of Personality Assessment, 67, 478-493. doi: 10.1207/s15327752jpa6703_4
Luke, D. (2004). Multilevel modeling. Thousand Oaks, CA: Sage Publications.
Lunazzi, H. A., Urrutia, M. I., de La Fuente, M. G., Elias, D., Fernandez, F., & de La Fuente, S. (2007). Rorschach Comprehensive System data for a sample of 506 adult nonpatients from Argentina. Journal of Personality Assessment, 89(S1), S7-S12. doi: 10.1080/00223890701582806
Mason, B. J., Cohen, J. B., & Exner, J. E., Jr. (1985). Schizophrenic, depressive, and nonpatient personality organizations described by Rorschach factor structures. Journal of Personality Assessment, 49, 295-305. doi: 10.1207/s15327752jpa4903_16
Mayman, M. (1970). Reality contact, defense effectiveness, and psychopathology in Rorschach form level scores. In B. Klopfer, M. Meyer, & F. Brawer (Eds.), Developments in the Rorschach technique. III: Aspects of personality structure (pp. 11-46). New York, NY: Harcourt Brace Jovanovich.
McGrath, R. E. (2008). The Rorschach in the context of performance-based personality assessment. Journal of Personality Assessment, 90, 465-475. doi: 10.1080/00223890802248760
Meyer, G. J. (1997). On the integration of personality assessment methods: The Rorschach and MMPI. Journal of Personality Assessment, 68, 297-330. doi: 10.1207/s15327752jpa6802_5
Meyer, G. J. (2000). Incremental validity of the Rorschach prognostic rating scale over the MMPI ego strength scale and IQ. Journal of Personality Assessment, 74, 356-370. doi: 10.1207/S15327752JPA7403_2
Meyer, G. J. (2001). Evidence to correct misperceptions about Rorschach norms. Clinical Psychology: Science and Practice, 8, 389-396. doi: 10.1093/clipsy/8.3.389
Meyer, G. J., & Eblin, J. J. (2012). An overview of the Rorschach Performance Assessment System (R-PAS). Psychological Injury and Law, 5, 107-121. doi: 10.1007/s12207-012-9130-y
Meyer, G. J., Erdberg, P., & Shaffer, T. W. (2007). Toward international normative reference data for the Comprehensive System. Journal of Personality Assessment, 89(S1), S201-S216. doi: 10.1080/00223890701629342
Meyer, G. J., Hsiao, W., Viglione, D. J., Mihura, J. L., & Abraham, L. M. (2013). Rorschach scores in applied clinical practice: A survey of perceived validity by experienced clinicians. Journal of Personality Assessment, 95, 351-365. doi: 10.1080/00223891.2013.770399
Meyer, G. J., & Kurtz, J. E. (2006). Advancing personality assessment terminology: Time to retire "Objective" and "Projective" as personality test descriptors. Journal of Personality Assessment, 87, 223-225. doi: 10.1207/s15327752jpa8703_01
Meyer, G. J., Patton, W. M., & Henley, C. (2003, March). A comparison of form quality tables from Exner, Beck, and Hertz. Paper presented at the annual meeting of the Society for Personality Assessment, San Francisco, CA.
Meyer, G. J., & Resnick, G. D. (1996). Assessing ego impairment: Do scoring procedures make a difference? Paper presented at the 15th International Research Conference, Boston, MA.
Meyer, G. J., Riethmiller, R. J., Brooks, R. D., Benoit, W. A., & Handler, L. (2000). A replication of Rorschach and MMPI-2 convergent validity. Journal of Personality Assessment, 74, 175-215. doi: 10.1207/S15327752JPA7402_3
Meyer, G. J., Viglione, D. J., Mihura, J. L., Erard, R. E., & Erdberg, P. (2011). Rorschach Performance Assessment System: Administration, coding, interpretation, and technical manual. Toledo, OH: Rorschach Performance Assessment System.
Meyer, G. J., & Viglione, D. J. (2008, March). Overview of the Form Accuracy rating project and general findings. In G. J. Meyer (Chair), Assessing perceptual accuracy on the Rorschach using Form Accuracy ratings versus Form Quality scores. Symposium conducted at the annual meeting of the Society for Personality Assessment, New Orleans, LA.
Mihura, J. L., Meyer, G. J., Dumitrascu, N., & Bombel, G. (2013). The validity of individual Rorschach variables: Systematic reviews and meta-analyses of the Comprehensive System. Psychological Bulletin, 139, 548-605. doi: 10.1037/a0029406
Minassian, A., Granholm, E., Verney, S., & Perry, W. (2004). Pupillary dilation to simple vs. complex tasks and its relationship to thought disturbance in schizophrenia patients. International Journal of Psychophysiology, 52, 53-62. doi: 10.1016/j.ijpsycho.2003.12.008
Miralles Sangro, F. (1997). Location tables, Form Quality, and Popular responses in a Spanish sample of 470 subjects. In I. B. Weiner (Ed.), Rorschachiana XXII: Yearbook of the International Rorschach Society (pp. 38-66). Ashland, OH: Hogrefe & Huber.
Mohammadi, M. R., Hosseininasab, A., Borjali, A., & Mazandarani, A. A. (2013). Reality testing in children with childhood-onset schizophrenia and normal children: A comparison using the Ego Impairment Index on the Rorschach. Iranian Journal of Psychiatry, 8, 44-50.
Moore, R. C., Viglione, D. J., Rosenfarb, I. S., Patterson, T. L., & Mausbach, B. T. (2013). Rorschach measures of cognition relate to everyday and social functioning in schizophrenia. Psychological Assessment, 25, 253-263. doi: 10.1037/a0030546
Netter, B. E. C., & Viglione, D. J., Jr. (1994). An empirical study of malingering schizophrenia on the Rorschach. Journal of Personality Assessment, 62, 45-57. doi: 10.1207/s15327752jpa6201_5
Neville, J. W. (1995). Validating the Rorschach measures of perceptual accuracy. (Doctoral dissertation, University of Arkansas, 1993). Dissertation Abstracts International, 55, 4128B.
Olson, I. R., Plotzker, A., & Ezzyat, Y. (2007). The enigmatic temporal pole: A review of findings on social and emotional processing. Brain, 130, 1718-1731. doi: 10.1093/brain/awm052
Ozbey, G. T., Meyer, G. J., Viglione, D. J., Dean, K., & Horn, S. L. (2008, March). The validity of the Rorschach Perceptual Thinking Index and Ego Impairment Index-2 using Form Accuracy. In G. J. Meyer (Chair), Assessing perceptual accuracy on the Rorschach using Form Accuracy ratings versus Form Quality scores. Symposium conducted at the annual meeting of the Society for Personality Assessment, New Orleans, LA.
Parisi, S., Pes, P., & Cicioni, R. (2005). Tavole di localizzazione Rorschach, Volgari e R+ statistiche [Rorschach location tables, Popular responses, and R+ statistics]. Disponibili presso l'Istituto.
Perry, W., Minassian, A., Cadenhead, K., Sprock, J., & Braff, D. (2003). The use of the Ego Impairment Index across the schizophrenia spectrum. Journal of Personality Assessment, 80, 50-57. doi: 10.1207/S15327752JPA8001_13
Perry, W., & Viglione, D. J. (1991). The Ego Impairment Index as a predictor of outcome in melancholic depressed patients treated with tricyclic antidepressants. Journal of Personality Assessment, 56, 487-501. doi: 10.1207/s15327752jpa5603_10
Peterson, C. A., & Horowitz, M. (1990). Perceptual robustness of the nonrelationship between psychopathology and popular responses on the Hand Test and the Rorschach. Journal of Personality Assessment, 54, 415-418. doi: 10.1207/s15327752jpa5401&2_38
Piotrowski, Z. (1957). Perceptanalysis: A fundamentally reworked, expanded, and systematized Rorschach method. New York, NY: Macmillan.
Ptucha, K., Saltman, C., Filizetti, K., Viglione, D. J., & Meyer, G. J. (2008, March). Differentiating psychiatric severity using Form Accuracy and Form Quality. In G. J. Meyer (Chair), Assessing perceptual accuracy on the Rorschach using Form Accuracy ratings versus Form Quality scores. Symposium conducted at the annual meeting of the Society for Personality Assessment, New Orleans, LA.
Rapaport, D., Gill, M., & Schafer, R. (1946). Diagnostic psychological testing: The theory, statistical evaluation, and diagnostic application of a battery of tests (Vol. 2). Chicago, IL: The Yearbook Publishers.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Rickers-Ovsiankina, M. (1938). The Rorschach test as applied to normal and schizophrenic subjects. British Journal of Medical Psychology, 17, 227-257. doi: 10.1111/j.2044-8341.1938.tb00296.x
Ritsher, J. B. (2004). Association of Rorschach and MMPI psychosis indicators and schizophrenia spectrum diagnoses in a Russian clinical sample. Journal of Personality Assessment, 83, 46-63. doi: 10.1207/s15327752jpa8301_05
Rizzo, d. C., Parisi, S., & Pes, P. (1980). Manuale per la raccolta, localizzazione e siglatura delle interpretazioni Rorschach [Manual for the collection, localization, and coding of Rorschach interpretations]. Rome: Kappa.
Rorschach, H. (1942). Psychodiagnostics: A diagnostic test based on perception. Bern, Switzerland: Hans Huber. (Original work published in German in 1921)
Rushton, J. P., Brainerd, C. J., & Pressley, M. (1983). Behavioral development and construct validity: The principle of aggregation. Psychological Bulletin, 94, 18-38. doi: 10.1037/0033-2909.94.1.18
Schafer, R. (1954). Psychoanalytic interpretation in Rorschach testing. New York, NY: Grune & Stratton.
Sherman, M. (1952). A comparison of formal and content factors in the diagnostic testing of schizophrenia. Genetic Psychology Monographs, 46, 183-234.
Smith, S. R., Bistis, K., Zahka, N. E., & Blais, M. A. (2007). Perceptual-organizational characteristics of the Rorschach task. The Clinical Neuropsychologist, 21, 789-799. doi: 10.1080/13854040600800995
Spitzer, R. L., Endicott, J., & Robins, E. (1978). Research Diagnostic Criteria: Rationale and reliability. Archives of General Psychiatry, 35, 773-782. doi: 10.1001/archpsyc.1978.01770300115013
Su, W., Viglione, D. J., Green, E. E., Tam, W. C., Su, J., & Chang, Y. (2015). Cultural and linguistic adaptability of the Rorschach Performance Assessment System as a measure of psychotic characteristics and severity of mental disturbance in Taiwan. Psychological Assessment, May, 1-13. doi: 10.1037/pas0000144
Sundberg, N. D. (1961). The practice of psychological testing in clinical services in the United States. American Psychologist, 16, 79-83. doi: 10.1037/h0040647
Takahashi (2009). [English translation of Rorschach object frequency counts in nonpatients from Japan]. Unpublished raw data.
van Os, J., & Tamminga, C. (2007). Deconstructing psychosis. Schizophrenia Bulletin, 33, 861-862. doi: 10.1093/schbul/sbm066
Viglione, D. J., Jr. (1996). Data and issues to consider in reconciling self-report and the Rorschach. Journal of Personality Assessment, 67, 579-587. doi: 10.1207/s15327752jpa6703_12
Viglione, D. J., Giromini, L., Gustafson, M., & Meyer, G. J. (2014). Developing continuous variable composites for Rorschach measures of thought problems, vigilance, and suicide risk. Assessment, 21, 42-49. doi: 10.1177/1073191112446963
Viglione, D. J., Meyer, G. J., Ptucha, K., Horn, S. L., & Ozbey, G. T. (2008, July). Initial validity data for the form accuracy project from three studies. In G. J. Meyer (Chair), Advancing the assessment of perceptual accuracy using form quality and form accuracy, Part 2: Validity data for the Brazilian and U.S. projects. Symposium presented at the XIXth Congress of the International Rorschach Society, Leuven, Belgium, July 23.
Viglione, D., Perry, W., Giromini, L., & Meyer, G. (2011). Revising the Rorschach Ego Impairment Index to accommodate recent recommendations about improving Rorschach validity. International Journal of Testing, 11, 349-364. doi: 10.1080/15305058.2011.589019
Viglione, D. J., Perry, W., Jansak, D., Meyer, G., & Exner, J. J. (2003). Modifying the Rorschach human experience variable to create the human representational variable. Journal of Personality Assessment, 81, 64-73. doi: 10.1207/S15327752JPA8101_06
Viglione, D. J., Perry, W., & Meyer, G. (2003). Refinements in the Rorschach Ego Impairment Index incorporating the Human Representational variable. Journal of Personality Assessment, 81, 149-156. doi: 10.1207/S15327752JPA8102_06
Viglione, D. J., & Rivera, B. (2003). Assessing personality and psychopathology with projective tests. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Vol. 10. Assessment psychology (1st ed., pp. 531-553). New York, NY: Wiley & Sons.
Viglione, D. J., & Rivera, B. (2013). Performance assessment of personality and psychopathology. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Vol. 10. Assessment psychology (2nd ed., pp. 600-621). Hoboken, NJ: Wiley & Sons.
Villemor-Amaral, Yazigi, Nascimento, Primi, Semer, & Petrini (2008). [English translation of Rorschach object frequency counts in nonpatients from Brazil]. Unpublished raw data.
Wagner, E. E. (1998). Perceptual integrations and the "normal" Rorschach percept. Perceptual and Motor Skills, 86, 296-298. doi: 10.2466/pms.1998.86.1.296
Walker, R. G. (1953). An approach to standardization of Rorschach form-level. Journal of Projective Techniques, 17, 426-436. doi: 10.1080/08853126.1953.10380508
Weiner, I. B. (1998). Principles of Rorschach interpretation. Mahwah, NJ: Lawrence Erlbaum Associates.
Weiner, I. B. (2003). Principles of Rorschach interpretation (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Wood, J. M., Garb, H. N., Nezworski, M. T., Lilienfeld, S. O., & Duke, M. C. (2015). A second look at the validity of widely used Rorschach indices: Comment on Mihura, Meyer, Dumitrascu, and Bombel (2013). Psychological Bulletin, 141, 236-249. doi: 10.1037/a0036005
Wood, J. M., Nezworski, M. T., & Garb, H. N. (2003). What's right with the Rorschach? The Scientific Review of Mental Health Practice, 2, 142-146. doi: 10.1037/t03306-000