SABRE: A Game-Based Testbed for Studying Team Behavior

Alice M. Leung, David E. Diller, William Ferguson
BBN Technologies
10 Moulton Street, Cambridge, MA 02138
617-873-5617, 617-873-3390, 617-873-2208
[email protected], [email protected], [email protected]

Keywords: Game-based Experimentation, Teamwork, Neverwinter Nights

ABSTRACT: We describe a flexible, authorable game-based testbed that has been used to study team behavior, together with a sample scenario designed for meaningful game-play rather than realism. Although this scenario features the use of fictional resources in pursuit of a goal under non-realistic conditions, the game experience is immersive, encouraging a "suspension of disbelief" that allows the participants to demonstrate realistic behaviors. The scenario is designed to be used in experiments investigating general aspects of teamwork such as group decision-making, resource management, and information sharing, as well as more context-specific behaviors such as negotiating, accommodating mission-irrelevant requests for assistance, and handling insults from in-game characters. This testbed has been developed primarily to study the effect of personality and culture on behavior and performance in a cooperative team mission, with the objective of facilitating the development of human behavior models that account for culture and personality. Beyond its use as a research tool, several features of the testbed, such as its capabilities for capturing participant behaviors, its data-analysis aids, and its scenario-authoring tools, could also make it useful for developing training for effective teams.

1. Introduction

A game can be defined as a competition with explicit rules. Participants in a game strive to accomplish an arbitrary goal, employing strategy and taking advantage of luck. Although participants typically play a game for fun, the military has a long history of using games as training tools. For many of the same reasons that games can be good tools for training people, they can also be good tools for studying people. The converse point is true as well: any general tool for studying human behavior could also be used to understand how people learn, and thus could help to improve training. We describe a game-based testbed called SABRE (Situation Authorable Behavior Research Environment), designed for behavior research, and a specific SABRE scenario used for a pilot experiment examining communications and decision-making in teams. Like SABRE, many training applications must strike a balance, making scenarios realistic enough to provide practice of specific skills, yet general enough to be accessible to a range of students who may have different levels of expertise and disparate learning needs. Aspects of the testbed's design and usage that have parallels for training applications are highlighted in the paper.

1.1 Why Use Games to Study Behavior?

As has often been observed, one motivation for using games in training is that since people will voluntarily spend time and effort playing games, they will be similarly motivated to engage in game-based training. Caught up in such an immersive and compelling activity, the participants will be excited to learn. From the point of view of a researcher trying to understand human behavior, it is not so important that participants enjoy being in a study, although it does make recruiting participants easier.
However, the immersive and motivating aspects of game-play are important, because participants are likely to behave more naturally when they are caught up in what they are doing. Further, while people under observation may tend to respond in ways that they believe will please the observer or will present themselves in a positive light, a game environment might help reduce a participant's overriding awareness of being observed. And researchers can use the game world to make their experimental manipulations less obvious. The game's narrative, immersive environment may encourage participants to focus on game-play, suspend disbelief, and be less conscious that they are taking part in an experiment.

Game-based environments occupy an intermediate ground between experiences in the real world and non-interactive media such as printed text. For both training applications and behavior research experiments, taking part in simulated live-action scenarios offers superior realism but significant expense, while pencil-and-paper exercises offer the reverse. Adequately designed computer-mediated game environments are certainly much easier to control, replicate, manipulate, and record than the real world. The same capabilities that satisfy the behavior researcher's need for reproducibility and direct data capture can support the training developer's efforts to demonstrate and measure training effectiveness.

1.2 Games and Real World Behavior

When evaluating the use of a game as an environment for training or experimentation, a critical question is whether the game has an appropriate level of fidelity. This question must be answered for each application by considering the behaviors and skills of interest. The more abstract the behaviors, the less realistic detail is required from the game. Very generalized behaviors, such as handling the unexpected or managing information, are widely applicable in many contexts and thus can be observed (or practiced) in game settings that present a simplified, less realistic world. There are situations, however, where high-fidelity simulations are of critical importance. Specialized training and mission rehearsal environments that study the performance of specific skills in a near-real setting should provide highly realistic situations. However, less realistic, game-based systems do offer a straightforward way to utilize rich environments in which to study a range of behaviors. Furthermore, game-based simulations are increasingly providing levels of fidelity matching all but the most expensive high-fidelity simulators [1].

Yet a common concern when using games for experimentation or training is that since games are not "real," participants may behave significantly differently when playing a game than they would in a real situation. Many observed differences between in-game and real-world behaviors result from games being a low-risk, exploratory environment. Obviously, the consequences of one's game character being injured are much less significant than a real-world injury, so participants are more likely to take chances. Other differences result from common game conventions and expected behaviors. For example, people may approach and talk to strangers in a game world because they have played previous games and have learned that talking to strangers is a typical way to obtain necessary information in games, whereas they might be reluctant to do so in real life.
To minimize differences between in-game and real-world behaviors, the consequences of in-game behaviors can be increased. In a training setting, trainers can impress upon participants that the lessons of the training game are important for their mission readiness, and caution them that game performance will be graded. Similarly, a behavior researcher might set up a game to have more significant consequences by linking real-world rewards to game performance. However, the strategy of adding real-world consequences is unlikely to be effective unless it is accompanied by game or scenario design that adds realistic in-game consequences. For example, an in-game injury could result in a participant being unable to take additional action for the remainder of the exercise, depriving his team of manpower and perhaps making it impossible for them to accomplish their goal. Setting up the game rules realistically to reward or punish behaviors of interest will reduce the differences between in-game and real-world behaviors.

Even after taking steps to reduce participants' tendencies to behave unrealistically in games, researchers will need to consider carefully how to interpret behavioral data. Behaviors can be divided into three categories based on how they should be interpreted. Possibly the smallest set of behaviors is the type that can be translated literally. In order for a behavior to qualify for literal interpretation, the participant must have the same in-game reasons to prefer a particular decision as they would in real life. In other words, participants would make the same choices and perform the same behaviors in the real world and in the game environment when faced with identical situations. However, it can be difficult to justify interpreting an in-game behavior literally, given the problem of understanding all the factors which contribute to the motivation and reasoning involved in in-game decisions.

Many more behaviors can be interpreted as demonstrating relative tendencies among participants. Even if the game situation causes participants to tend towards unrealistic behavior, if this influence applies uniformly, in-game differences between participants can reflect real-world differences. Behaviors that demonstrate a person's tendency to take risks, tolerance for uncertainty, aggressiveness, or sensitivity to personal status can all be interpreted as showing an individual's relative tendencies compared to the general tested population. Thus, someone who consistently makes important in-game decisions based on less information than the average player might be considered to have demonstrated impulsiveness. Of course, interpreting a behavior relatively requires the researcher to have comparable data from a number of individuals.

Finally, some behaviors observed in a game can be considered specific instances of general behaviors that are likely to apply in many contexts. In contrast to behaviors that must be interpreted relatively, such general behaviors are not as dependent on a comparison set. In contrast to behaviors that are interpreted literally, they are less specific to a particular scenario. Some examples of this type of behavior are a style of communication, a preference for a certain team organization, and a tendency toward making extreme or moderate choices.
Thus, a researcher might conclude that an individual whose interactions with teammates included a larger proportion of directives or unsolicited advice, during a game scenario involving an unrealistic mystery about a missing car, was nonetheless demonstrating that player's general communication style.

1.3 Game Selection: Which Game to Use?

One commonly cited reason for using a game-based system is that modifying an existing commercial game can be a way to deliver a better application with less development effort [2]. There is a wide range of game-based environments on which to build applications, and it makes sense to survey existing game environments to see whether there are candidates that meet the envisioned application requirements. For existing surveys of game technology for experimentation and training, see [3] and [4].

A key criterion in game evaluation is the range of possible actions and interactions supported by the game. Large-scale differences among games include the level or scale of decisions (for example, the strategic movement of armies versus the tactics of clearing a room) and the degree of support for teamwork and communication among participants. The next major consideration is the ease of game modification or authoring, since most "serious" applications will require significant changes to an entertainment product. For the majority of research or training applications, games with better authorability will facilitate faster scenario development and more flexible use [5].

Games also differ significantly in both the length of time needed to learn to play them and the natural time frame for game-play. Some games can be learned in minutes and completed within an hour, while others may take days just to learn. For experimentation, anticipated access to participants and their expected availability will put a maximum limit on these times. Training applications must also fit within a curriculum's time budget. Further software considerations include game availability and anticipated future availability, licensing usage rights and costs, and the availability of support for application use or modification. Additionally, some commercial games have a short shelf life, making it likely that any research or training tool built on them will eventually depend on an obsolete product.

2. SABRE: Situation Authorable Behavior Research Environment

2.1 Motivation/Goals for the SABRE Testbed

Even though adopting an existing game can decrease application-development effort, a significant amount of effort remains to change a game into an experimentation platform or training application. The SABRE project was designed to provide a flexible, customizable experimentation tool for behavior research while taking advantage of an existing commercial game platform. We chose a game that provided an immersive, interactive, virtual world in order to greatly reduce the effort required to conduct new behavioral research experiments. Game-play did not need to be a high-fidelity reproduction of a military situation, but did need to reasonably represent situations that elicited behaviors and performance at generalized tasks with military relevance. In addition to being an environment for collecting human-behavior data, SABRE was also envisioned as an environment in which to demonstrate and test synthetic entities whose behavior would be modeled on human data. The particular behavior domain of interest involves cultural effects on team communication and decision-making.
A central technical issue in using a commercial game, rather than a custom application developed for experiments, was how to instrument the game for data collection. Teachers who have incorporated games into their curricula have observed that the non-linear nature of some games, and the games' typical inability to support detailed reconstruction of what happened during play, can make it difficult to determine what a student experienced or learned [6]. Support for collecting detailed, meaningful data about what a participant did during a game is even more important for behavioral research, which must be able to correlate particular behaviors with overall game/task performance. Additionally, the data recorded must be in a format amenable to analysis at the level of behavioral intent or performance, rather than at the low level of keystrokes or mouse movements. For example, a researcher might want to know how many times during an hour a participant entered a building without informing his teammates of his location, or what percentage of containers a participant opened while checking for booby-traps. Thus, the game used in a testbed must be capable of supporting high-level data capture itself, or must be modifiable to work within an external data-capture system.

Similarly, the game itself will often need to be modified to satisfy other experimentation requirements, whether by changing the game code or by working with built-in game-customizing facilities such as APIs. Ideally, existing scenario-authoring tools, guidelines, and script libraries would make it easier for experimenters to tailor the game experience. It is desirable for the game's goals, resources, events, and world maps to be customizable, enabling researchers to focus their experiments on particular behaviors or contexts of interest. Further, the game should include or facilitate the implementation of a tutorial or familiarization session.

Beyond generating and supporting data collection from the game portion of an experiment, SABRE had to include a number of other components needed by researchers. For example, researchers often need to administer pre- and post-experiment questionnaires and be able to correlate a participant's responses with their in-game behavior and performance data. Researchers also need to be able to handle experiment administration, such as matching participant or team numbers with experimental conditions. Finally, SABRE needed to include capabilities for aggregating raw data into a description of meaningful behaviors, and for exporting data into standard formats. All these capabilities could also be of use to trainers who want to observe and analyze student performance.
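To make this style of aggregation concrete, the following minimal sketch, in Python, derives the building-entry measure mentioned above from a stream of low-level logged events. The event format, field names, and the 30-second "announcement" window are invented for illustration; they are not SABRE's actual schema.

# Minimal sketch: aggregating low-level logged events into a high-level
# behavior measure. Event format and 30-second window are invented.
from datetime import datetime, timedelta

events = [
    {"time": datetime(2005, 1, 1, 10, 0, 2), "player": "P1", "type": "CHAT"},
    {"time": datetime(2005, 1, 1, 10, 0, 5), "player": "P1", "type": "ENTER_BUILDING"},
    {"time": datetime(2005, 1, 1, 10, 9, 0), "player": "P1", "type": "ENTER_BUILDING"},
]

def unannounced_entries(events, player, window=timedelta(seconds=30)):
    # Count building entries not preceded by a chat message from the
    # same player within the given time window.
    chats = [e["time"] for e in events
             if e["player"] == player and e["type"] == "CHAT"]
    count = 0
    for e in events:
        if e["player"] == player and e["type"] == "ENTER_BUILDING":
            if not any(timedelta(0) <= e["time"] - t <= window for t in chats):
                count += 1
    return count

print(unannounced_entries(events, "P1"))  # -> 1 (the second entry was unannounced)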
2.2 Game Selection

When selecting a game on which to build SABRE we wanted one that supported a wide variety of possible game scenarios, and thus the largest set of contexts for examining behavior. We needed an environment that allowed for extensive modifiability; thus the existence of authoring tools was of great value. Additionally, we were more interested in cognition and less in motor coordination and reaction time, and we wanted a game that could support numerous kinds of actions and behaviors, including team communications. For those reasons we narrowed our selection to role-playing games. While role-playing games generally feature less realistic tactical actions than the commercially popular first-person-shooter genre, they offer more scope for deliberation, and their scenarios can be slower paced, leaving more time for communication and decision-making prior to action.

The particular role-playing game selected was Neverwinter Nights™. This game satisfied the testbed requirements, since it can be used to simulate cooperative team tasks and it facilitates scenario authoring and customization. The built-in game-editing tools allow users to customize the size and contents of the game world, including synthetic-character behavior and dialogs, and to create customized items. Furthermore, there is an extensive API and a scripting language for additional modifiability. The game has a well-established user community, and is old enough (released in 2002) that it does not require cutting-edge computer hardware, yet it is still actively supported with periodic software updates. These advantages have made the game a popular choice among other researchers [7], [8], [9].

In Neverwinter Nights players can move around the world, pick up and use items, go into buildings, read maps and signs, and interact with other characters. They can communicate with other players through free-form typed text, or engage synthetic characters through dialog menus. Players can use a journal and map, as well as a variety of items within the environment. They may also attack targets, but do not have much detailed control over fighting actions. Because Bioware, Inc., the developer of Neverwinter Nights, has encouraged and supported user authoring of custom game content, the SABRE testbed was able to take advantage of publicly available custom graphics to change from the game's default medieval-fantasy setting to a modern cityscape [10]. Third-party software was also employed to integrate the game with a relational database for the handling of data. Although the game-engine code itself is not accessible, the built-in scripting API made it possible to log many types of player actions in the database. Additionally, on request, Bioware Inc. was able to add time stamping and complete player-text-window logging capabilities to a scheduled software update.

2.3 Testbed Architecture and Design

Our game choice influenced the overall testbed architecture and design. Like many multi-player games, Neverwinter Nights uses a client-server architecture in which the server is responsible for maintaining the game state and the client primarily supports the user interface for viewing and interacting with the game world. Thus, most game information must be logged on the server side. However, the game also generates some log files on the client side, requiring the testbed to include capabilities for merging this information into the central database. As there is game information that is only relevant to a single player and is therefore stored at the local client, the testbed cannot collect complete data from just the server. However, as shown in Figure 1, the majority of the testbed components (such as the scenario modification tools and analysis toolkit) are installed on the server. The third-party tool NWNX2 [11] is a wrapper for the game server and enables database operations from within the game scripts. In addition to serving as a data repository for recorded game behaviors and responses to out-of-game questionnaires, the database is also used to store the questionnaire contents and information used to direct the game scenario.
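The sketch below suggests how a client-side log daemon of this kind might merge time-stamped client log lines into the central database. Python's standard sqlite3 module stands in for the MySQL logging database, and the log-line format is invented; this illustrates the data flow, not SABRE's actual implementation.

# Sketch of a client-side log daemon: merge a game client's log file
# into the central logging database. sqlite3 stands in for MySQL here.
import sqlite3

db = sqlite3.connect("sabre_log.db")
db.execute("""CREATE TABLE IF NOT EXISTS client_events
              (client TEXT, stamp TEXT, message TEXT)""")

def merge_client_log(client_id, path):
    # Read a client's log file and insert each time-stamped line into
    # the central events table. Assumed line format: "<timestamp>|<message>".
    with open(path) as f:
        for line in f:
            stamp, _, message = line.rstrip("\n").partition("|")
            db.execute("INSERT INTO client_events VALUES (?, ?, ?)",
                       (client_id, stamp, message))
    db.commit()

# e.g., merge_client_log("client3", "client3_game.log")  # names invented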
[Figure 1: Testbed Architecture. Server side: Scenario Modification Tools, Experiment Construction, Analysis Toolkit, Logging DB (MySQL), NWNX2, and Game Server. Client side: Game Clients, Game Client Log Files, and Log Daemon, with game events passing between clients and server.]

3. Scenario Design

In order to verify the testbed as a mechanism for experimentation and to validate its utility as a means of conducting team research into culture and personality, we conducted a pilot experiment examining the impact of culture and personality on team decision-making, coordination, and performance. For this pilot experiment we developed a scenario for group planning and task execution, building in metrics for effectiveness and efficiency. Because the task was envisioned as a reasonable exercise of general skills and behaviors rather than a high-fidelity simulation of a realistic situation, we were able to balance elements of the scenario to better support the design of the experiment. This scenario was designed partly as a demonstration of how SABRE could elicit a wide range of possible team and individual behaviors that might be affected by culture or personality. This led us to examine a larger number of hypotheses than one might typically examine in an experiment.

In the basic experiment, team members are assigned roles (e.g., patrol leader, weapons specialist) depending on the experimental condition, and the team is given the high-level task of locating and acquiring caches of weapons hidden within a town. The team is provided with equipment to help with the task (sensors of varying capabilities designed to help locate weapons caches, and tools for opening doors and crates) and must decide how to allocate those resources. Additionally, team members have collaboration tools, allowing information to be shared between individuals and locations to be flagged or marked within the virtual environment. Performance is team-based, with participants able to increase the team score by completing tasks (which have rewards) while managing costs and penalties. The participants are guided through a series of planning tasks before beginning the search. The scenario also includes encounters with synthetic entities who populate the town, as shown in Figure 2. These characters interact with the participants, providing tips about cache locations, requesting assistance, behaving rudely, or negotiating to let the participants enter a building.

[Figure 2: Screenshot of a player interacting with a townsperson]

3.1 Eliciting and Measuring Behaviors with the Scenario

A principal design goal was to provide motivation and rewards for teamwork, but to allow players significant choice over the amount, timing, and type of interaction. Thus, participants could demonstrate their preference for independent or interdependent activity in response to team goals. We provided motivation for teamwork by assigning each participant a distinct role with different responsibilities, providing limited numbers of cache sensors to be shared among the team, and controlling the flow of information from the townspeople so that no one participant received a complete set of tips. The players were thus motivated to share information and coordinate their actions in order to maximize team performance. Because we were interested in examining communication-style preferences among team members, we limited team-wide broadcasting and instead required point-to-point communication. Another design guideline was the desire to take advantage of a game environment's capability for gathering multiple data points from a single person.
Some vignettes, such as negotiating with building residents, were repeated with minor variation a number of times within the scenario so that a participant's average response could be calculated. Additionally, to ensure that all participants had to react to certain encounters, such as requests for assistance, offers of information, or insults, the synthetic entities were programmed to seek out each player a minimum number of times and force certain situations to occur.

The scenario was also set up to minimize certain types of behaviors that were irrelevant to our cultural and personality hypotheses. For example, because we were not interested in studying aggression in general, the server was configured so that players were unable to attack each other, many game objects were implemented to be immune to attacks, and most synthetic entities were programmed to simply flee if attacked. However, we were interested in aggression as a response to insults, so we included starting a fight as a possible response in those vignettes. Although the scenario was designed to reduce motivation for outlier behaviors like attacking, we included data collection for these actions.

3.2 Design of Goals, Rewards, and Penalties

An experimental scenario must balance game elements in order to elicit behaviors of interest. The task should not be so difficult that some participants completely fail, nor so easy that some participants perform perfectly. Often it is useful to manipulate a game's scoring system so that the players demonstrate how they weigh various costs and benefits through their choice of game strategy. These costs and benefits must be clearly laid out in the game and understood by the participants so that they can make decisions informed by the situation's expected utilities. In our pilot scenario the team is awarded points for recovering caches and penalized for opening empty containers (as this can be avoided by first using the sensors to test the container) or setting off traps. Participants must also pay a cost for entering buildings, and may also spend money to bribe or assist townspeople. Caches inside buildings are harder to find, and may be trapped, but earn a higher reward. The team starts the game with some number of points, and their ending total is used as a measure of performance. The reward, cost, and penalty levels and the number of caches were adjusted to avoid ceiling and floor effects. Additionally, because we wanted to study risk-taking behavior, the scenario was designed so that the expected utility of a low-risk strategy (concentrating on exterior caches) was the same as that of a high-risk strategy (focusing on interior caches).

Compared to commercial games, in which rewards are dispensed to maximize entertainment, a behavior-research game must ensure that its reward structure does not unnecessarily bias player decisions. One example of this is the way that traps were implemented in the scenario. Because initial experiences in a new environment heavily influence later decisions, we wanted to avoid having players experience a negative consequence early in the game. To ensure this, traps were not pre-set to occur at fixed geographical locations, but instead were generated to occur during an encounter following the team's successful recovery of several interior caches.
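As an arithmetic illustration of this balancing, the sketch below uses invented numbers (not the pilot scenario's actual values) to show how reward, cost, and trap parameters can be chosen so that the interior and exterior strategies have equal expected utility per cache attempt.

# Illustrative balance check; all values are invented. The design goal is
# that the risky interior strategy and the safe exterior strategy have the
# same expected value per cache attempt.
exterior_reward = 50      # points per exterior cache

interior_reward = 100     # higher reward per interior cache
entry_cost      = 20      # cost to enter a building
trap_chance     = 0.25    # chance of triggering a trap inside
trap_penalty    = 120     # penalty when a trap goes off

ev_exterior = exterior_reward
ev_interior = interior_reward - entry_cost - trap_chance * trap_penalty

print(ev_exterior, ev_interior)   # 50 50.0 -> the strategies are balanced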
3.3 Managing Free Play

In scenarios for experimentation there are competing desires: to constrain behavior in order to produce situations amenable to quantitative analysis and hypothesis testing, yet to provide realistic, less artificial situations in which participants can play out behaviors. Scenario design must control free play without making the game seem too linear, and without destroying a participant's sense that his actions and decisions affect the game's outcome. While this kind of control can be achieved by using a researcher as a confederate inside the game to actively guide or influence the participants, we wanted to maximize reproducibility and ease of experiment administration by including automated controlling elements in the scenario.

Our pilot experiment scenario used several approaches to direct participants and increase data quality. One strategy was to structure the consequences of many decisions to lead towards convergent outcomes. For example, whether a participant treated a rude townsperson respectfully or escalated tensions until a scuffle broke out, the townsperson would eventually exit the scene and the player would resume their mission. Similarly, if a participant was unable to convince a resident to allow them into a building, they could use the team's lockpick to eventually gain access. These convergent outcomes made it easier to compare actions across participants and across teams, since each group's situation remained comparable regardless of their decisions.

Another approach to controlling free play was to use built-in "gates" to enforce mandatory decisions. For example, during the planning phase of the game, the team had to distribute a limited number of sensors and tools among its members. They were allowed into the storeroom to pick up the items only after they had indicated that they understood this task. Then they were not allowed to leave the area until each item was in someone's possession. Both these gates were enforced by "smart" doors in the game that remained locked until the condition was met, and that provided feedback messages about the remaining task if the players tried to open the door early. The scenario also used periodic status updates to remind the participants of their overall goal and to motivate them by showing the current score and time remaining.
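A minimal sketch of such a gate, written in Python for readability (in SABRE this logic lives in Neverwinter Nights scripts, and the item names here are invented): the door reports itself locked, with a feedback message about the remaining task, until every shared item has an owner.

# Sketch of a "smart door" gate: stay locked, with feedback, until every
# shared item has been claimed by some team member.
def on_door_open_attempt(equipment):
    unclaimed = [name for name, owner in equipment.items() if owner is None]
    if unclaimed:
        return "locked", "Distribute all equipment first: " + ", ".join(unclaimed)
    return "open", "The door unlocks."

equipment = {"cache sensor": "P1", "lockpick": None}
print(on_door_open_attempt(equipment))  # ('locked', 'Distribute all equipment first: lockpick')
equipment["lockpick"] = "P2"
print(on_door_open_attempt(equipment))  # ('open', 'The door unlocks.')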
3.4 Usability Testing

In order to refine the scenario for improved usability, game balance, and performance, we conducted iterative usability testing, ultimately studying sixteen teams. One early observation was that participants from different age and computer-game-experience demographics used contrasting learning strategies and found different aspects of the game confusing. For example, participants who did not have experience with text-message applications found it frustrating that group conversations would move on before they had finished typing a long response. We focused the usability studies on participants who matched the target study demographic, but included small numbers of people who were older and had no game-playing experience in order to become aware of additional game-play and familiarization issues. Usability study participants were instructed on the general purpose of the research game, and asked to look for aspects of the game-play experience that were confusing or surprising. Through both written and verbal debriefings, we solicited comments and suggestions that were used to refine the pilot study scenario and the accompanying game tutorial.

One approach we used was to examine, during each spiral of scenario refinement, what the most confusing thing was for each player. If multiple participants were puzzled or frustrated by the same thing, that was a strong indication that at least one aspect of the scenario needed to be reworked. For example, early versions of the scenario allowed the team to vote on whether to deploy a strike team to unlock doors. However, it was common for at least one team member to miss or ignore the current call for a vote, which would prevent the authorization from being completed. Additionally, we found that calls to vote did not trigger significant team discussion about the merits of deploying the strike team to the suggested location. Instead, most team members always voted yes, with a minority of participants occasionally, and seemingly arbitrarily, voting no. Because the voting element was frustrating to many participants and was not eliciting interesting communications between players, we eliminated it from the scenario.

Usability testing was also helpful in identifying a number of unexpected behaviors from the participants. For example, when asked to describe the team's search plan, some participants simply entered an empty response. In other cases, players became bored or frustrated with their perceived lack of success at the mission and began attacking townspeople, or undressing their characters. In some cases, we tried to minimize the motivations for such unexpected behaviors, for instance lowering average frustration by making it easier to find caches. In other cases, we modified the software to enforce requirements, such as a minimum character length for the typed description of a plan, or to prevent behaviors like undressing which might interfere with the team earnestly pursuing their mission.

4. Experimental Pragmatics: Lessons Learned

After refining our experimental scenario and game tutorial we began using SABRE in a pilot experiment to study the effect of culture and personality factors on team behavior. In this section, we describe some lessons learned from using the testbed as a behavior research tool. Our pilot experiment involved eight American and eight Chinese teams in order to contrast these two cultures. Additionally, half the teams of each culture had no designated leader, while the others had a randomly selected individual designated as the group's leader. The entire experiment for each group started with a background demographic and computer/gaming-experience survey and standard personality and culture questionnaires. The participants then completed both a single-player and a multi-player familiarization session and game tutorial. After learning the game, the participants played through the mission planning and execution tasks and wrapped up with a debriefing questionnaire. Groups took from four to five hours to complete an experiment, with approximately equal amounts of time spent on the game tutorial and the experiment scenario.

4.1 Participant Selection and Control

In addition to the large performance and behavior variability we found between different demographic groups, we also encountered some challenges with participant selection and control. Selection criteria were particularly important because of the small number of teams studied.
Because the experimental scenario involves significant text reading and typing, we found it necessary to restrict study eligibility to participants who were in college or were college graduates, to prevent limitations in reading comprehension, reading speed, and keyboarding ability from significantly impacting team performance. To avoid gender effects on perceptions of leadership and team dynamics we used only male participants. We also restricted the participant age range to 18 to 35 to better match the population of interest. Because we did not prescreen participants with a personality instrument, we were not able to control the distribution of individuals scoring high or low on various personality factors.

We found a number of disparities between the American and Chinese groups. All participants were recruited from the Boston area. Participants were limited to individuals who had lived in their home country (US or mainland China) until at least age 18, with a maximum of six months spent abroad during that time. The Chinese participants were required to have English-language (TOEFL) scores or employment/study histories demonstrating a high level of English proficiency. On average, the Chinese participants were older and had more graduate school experience. They also had less game-playing experience, and were more likely to work or study full-time. A few of them had met each other prior to the experiment. Finally, they were more likely to have attended selective colleges or universities.

In addition to demographic differences between the Chinese and American groups, we noted that the Chinese groups were not necessarily representative of the Chinese average. All the Chinese participants had voluntarily relocated to the US. Although the individuals on each team did not know each other well, the participants from the Chinese teams may have identified more with each other as members of the same minority nationality. Thus, not all the observed differences in team performance and behavior during the game task can be attributed solely to cultural differences. These difficulties in population sampling are likely to affect any cultural-behavior experiments performed at one location. Since game-based experiments are easier to reproduce, one approach would be to replicate the experiment in multiple countries. Potentially the game scenario could also be translated and localized for other languages.

4.2 Game Familiarization

Because we wanted to be able to include participants with minimal game-playing experience, a built-in familiarization and training session was crucial to ensure that every study participant had adequate ability to play the game before going into the experimental scenario. Although Neverwinter Nights has a relatively easy game interface compared to many first-person shooter, strategy, or building games, and even though we simplified the interface for use in the experiment, we found it took at least an hour for the average individual to learn to play the game.

The game tutorial we developed covered the basic game interface, the use of the in-game tools and sensors needed for the mission, and multi-player communication and coordination. Each participant first completed a self-paced, single-player tutorial, then waited in an entertaining practice area until all the players were ready to start the multi-player exercises. The multi-player exercises were designed to encourage the participants to help train each other by including several "gates" where each player had to complete a task before the group could proceed. Often, the first players to understand how to perform a task then explained it to the other players.

The tutorial explicitly covered some common game assumptions and paradigms for the benefit of novice game players. For example, a concept introduced early on was the idea of using the mouse cursor as a mechanism for "looking" at objects in the game world. Another basic concept that was specifically addressed, based on usability group feedback, was the idea that the player was represented in the game world by a specific avatar on the screen.

Our general approach in designing the tutorial was to minimize the amount of assistance required from a human instructor. We supported both of the two basic learning styles observed during usability testing. To support participants who preferred to read step-by-step instructions, we included embedded tests to make sure they actually tried each action after it was explained, so that the experience of doing would help them retain the instructions. To support participants who instead wanted to try different actions until they achieved the desired effect, we stated the goal at the start of each instruction area and provided graphical and textual feedback in response to possible actions, including context-sensitive suggestions for improving performance. Additionally, in some cases, if a player tried too many incorrect actions before achieving the required goal, the tutorial gave them some extra review and an additional "test."

Although we tried to make the tutorial as self-contained as possible, we found that human back-up was necessary. For example, at several points participants are instructed to click on a particular icon or manipulate a specific part of the game interface. The tutorial includes textual descriptions of the region of the screen where the icon resides, and in some cases we were able to make the icon blink to draw the player's attention, yet occasionally someone had trouble finding the item and needed the experiment administrator to physically point to it on the screen. A single experiment administrator could easily monitor five participants who were visually isolated but in close physical proximity, but might find it challenging to adequately monitor a group larger than ten.

4.3 Data Analysis

SABRE's approach to data collection is to log as much data as possible, so that each experiment can be studied for a variety of purposes, even beyond those initially intended by the researchers conducting the study. SABRE includes data-filtering and data-formatting scripts that can extract a large number of behavior and performance statistics, and also supports raw data output in a plain format for further statistical analysis. In this section, we briefly discuss several different types of analyses we performed on the pilot experiment data.

The game task specified that the players should try to maximize the team score, represented by pieces of gold, by finding the most caches to gain rewards while incurring the fewest penalties and costs. We chose to structure the game with this single dimension of success, rather than having multiple orthogonal dimensions such as goodwill versus material success.
However, in addition to comparing each team's overall point score, we also compared some contributing factors, such as the number of caches recovered (a measure of mission-task effectiveness) and the number of empty crates opened (a measure of mistakes made). All these measures are examples of simple automated counting of specific categories of actions.

Another type of measure involves some context-dependent calculation. For example, an individual's tendency to initiate communication can be quantified by calculating how many of their typed messages occurred after a period of silence (in our case, 20 seconds). This measure was used to study whether designated leaders initiated more conversations, and whether this difference correlated with culture. It could also be used to determine whether individuals scoring higher on the extroversion personality trait initiated more conversations. Similarly, an individual's tendency to personally gather data before taking action was examined by considering each building the person entered and calculating the average number of sensor uses, within a certain radius of and directed towards that building, prior to entry. This was used to study whether teams from cultures with higher uncertainty tolerance gathered less data. It was also used to see whether individuals who scored higher on conscientiousness gathered more data. Also, the average division of a team into smaller subgroups was studied by calculating individual proximity at regular time intervals. This allowed us to determine the percentage of time each team spent divided into groups of 1:1:1:1, 1:1:2, 1:3, 2:2, or 4. This measure allowed us to see whether teams from a high-individualism culture spent more time walking around individually.

SABRE also supports manual scoring of behavior by producing an overall communication record for the team. This record was used to manually score the text communication between players for instances of requests for information, volunteered information, and directives. Additionally, the testbed has the capability for semi-automated communication analysis using Linguistic Inquiry and Word Count (LIWC) [12], which classifies utterances into categories such as positive, negative, referring to self, or referring to the team. We used these types of content analysis to study whether leaders from high power-distance cultures used more directives, and whether teams from high-individualism cultures used more negative utterances.
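To make the style of these context-dependent calculations concrete, the following sketch, with invented message data and field layout, implements the conversation-initiation measure described above: counting a player's messages that follow at least 20 seconds of team-wide silence.

# Sketch of the conversation-initiation measure. Message tuples are
# (seconds, sender); the data and 20-second threshold follow the text,
# but the representation is invented.
def initiations(messages, player, silence=20):
    # Count messages from `player` that begin a conversation: the first
    # message overall, or any message at least `silence` seconds after
    # the previous message from anyone.
    count, last_time = 0, None
    for t, sender in sorted(messages):
        if sender == player and (last_time is None or t - last_time >= silence):
            count += 1
        last_time = t
    return count

log = [(0, "P1"), (5, "P2"), (40, "P1"), (45, "P1")]
print(initiations(log, "P1"))  # -> 2 (the messages at t=0 and t=40)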
5. Future Directions

We are currently in the process of analyzing the results from the pilot experiment comparing behavioral differences between American and Chinese teams. We are also designing and developing an experiment for the NATO Supreme Allied Command Transformation HQ, Futures and Engagement, Concept Development and Experimentation project entitled Leader and Team Adaptability in Multinational Coalitions (LTAMC), to be run at U.S. and international sites. This experiment will focus on cultural and personality effects on information sharing, situation awareness, and division of responsibilities as related to the rapid formation of multinational Joint Task Forces. We are actively working in a number of areas to enhance the capabilities of the testbed.

While the breadth of experimental scenarios possible in Neverwinter Nights is wide ranging, experiment construction can require specialized development skills, such as programming, scripting, or advanced knowledge of the Neverwinter Nights production pipeline, in order to produce new types of experiments. We are expanding the existing authoring and data-analysis tools to further reduce the depth and breadth of specialized knowledge required to develop novel experimental scenarios.

We are in the process of integrating existing voice communication capabilities [13] into the experimental testbed, making it possible to capture and record voice communication between participants. These capabilities would make it possible to develop and evaluate more natural communication between human players, as well as facilitate the setup and control of experiments by improving the ability of an experimenter to converse with experiment participants within a distributed experimental environment.

We will begin work shortly to construct computer-controlled characters within the testbed that can exhibit a range of behaviors consistent with pre-determined cultural and personality factors. Character behaviors will be parameterized so that users such as experimenters or training-application developers can tune the characters to produce the type of cultural or personality response required for the chosen situation. Additional future work also includes the development of a Korean version of our scenario to support further experimentation.

6. References

[1] J. P. Holmquist, "Playing Games", Military Training Technology, vol. 9, issue 5, Oct. 27.
[2] T. Holt, "How Mods Are Really Built", Serious Games Summit, Oct 18-29, 2004.
[3] W. Ferguson, D.E. Diller, A.M. Leung, B. Benyo, and D. Foley, "Behavior Modeling in Commercial Games", BBN Interim Report, Feb. 13, 2003.
[4] D.E. Diller, W. Ferguson, A.M. Leung, B. Benyo, and D. Foley, "Behavior Modeling in Commercial Games", in Proceedings of the 13th Conference on Behavior Representation in Modeling and Simulation, Arlington, VA, May 2004.
[5] C.J. Bonk and V.P. Dennen, "Massive Multiplayer Online Gaming: A Research Framework for Military Training and Education", available from Advanced Distributed Learning at http://www.adlnet.org/downloads/189.cfm.
[6] A. McFarlane, A. Sparrowhawk, and Y. Heald, "Report on the educational use of games", published by TEEM and available online at http://www.teem.org.uk/publications/teem_gamesined_full.pdf.
[7] P. Gorniak and D. Roy, "Speaking with your Sidekick: Understanding Situated Speech in Computer Role Playing Games", in Proceedings of Artificial Intelligence and Interactive Digital Entertainment, 2005.
[8] M. Carbonaro, M. Cutumisu, M. McNaughton, C. Onuczko, T. Roy, J. Schaeffer, D. Szafron, S. Gillis, and S. Kratchmer, "Interactive Story Writing in the Classroom: Using Computer Games", in Proceedings of the 2005 International Digital Games Research Conference (DiGRA 2005), June 16-20, 2005, Vancouver, B.C., Canada.
[9] P. Spronck and J. van den Herik, "Game AI that Adapts to the Human Player", ERCIM News, No. 57, April 2004, pp. 15-16.
[10] The D20 Modern Mod website is http://d20mm.net/index.php and their custom content for Neverwinter Nights is available through the Neverwinter Nights Vault at http://nwvault.ign.com/.
[11] The NWNX2 software is available at http://www.nwnx.org/ or through the Neverwinter Nights Vault at http://nwvault.ign.com/.
[12] J.W. Pennebaker and L.A. King, "Linguistic styles: Language use as an individual difference", Journal of Personality and Social Psychology, 77, 1296-1312, 1999.
[13] D.E. Diller, B. Roberts, S. Blankenship, and D. Nielson, "DARWARS Ambush! – Authoring Lessons Learned in a Training Game", in Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2004, Orlando, FL, December 2004.

Author Biographies

ALICE M. LEUNG is a scientist in the Intelligent Distributed Computing Department of BBN Technologies, Cambridge, MA. Her current interest is in harnessing games for behavior research. Previous projects include software for military logistics planning using distributed agent technology, small-scale simulation of information-transfer economics, and infrastructure to support training through games.

DAVID E. DILLER is a Senior Scientist at BBN Technologies. He holds an M.S. in Computer Science and a joint Ph.D. in Cognitive Science and Cognitive Psychology from Indiana University. His current focus includes cognitive modeling, mixed-initiative agent-based systems, and simulation-based training applications. Recently, Dr. Diller has been involved in a number of projects utilizing commercial game technology for training applications.

WILLIAM FERGUSON is a Division Scientist at BBN Technologies. His background is in artificial intelligence, simulation, computer-based training, and commercial-game technology. He is currently a Co-Principal Investigator for the integration effort under the Defense Advanced Research Projects Agency's (DARPA's) DARWARS training program. He also serves as Co-PI of the Cultural Modeling Testbed, a joint Defense Modeling and Simulation Office (DMSO) and Air Force Research Laboratory (AFRL) project involved with using commercial games to study culture and personality. He worked for many years as the technical lead for the Analysis of Mobility Platform (AMP) project, USTRANSCOM's transportation programmatics modeling system.

Neverwinter Nights™ is a trademark of Bioware, Inc.