
SABRE: A Game-Based Testbed for Studying Team Behavior
Alice M. Leung
David E. Diller
William Ferguson
BBN Technologies
10 Moulton Street
Cambridge, MA 02138
617-873-5617, 617-873-3390, 617-873-2208
[email protected], [email protected], [email protected]
Keywords:
Game-based Experimentation, Teamwork, Neverwinter Nights
ABSTRACT: We describe a flexible, authorable game-based testbed that has been used to study team behavior, together with a sample scenario designed for meaningful game-play rather than realism. Although this scenario features the use of fictional resources in pursuit of a goal under non-realistic conditions, the game experience is immersive: it encourages a "suspension of disbelief" that allows the participants to demonstrate realistic behaviors. The scenario is designed to be used in experiments investigating general aspects of teamwork such as group decision-making, resource management, and information sharing, as well as more context-specific behaviors such as negotiating, accommodating mission-irrelevant requests for assistance, and handling insults from in-game characters.
This testbed has been developed primarily to study the effect of personality and culture on behavior and performance
in a cooperative team mission, with the objective of facilitating the development of human behavior models that
account for culture and personality. Beyond its use as a research tool, features of the testbed could also make it useful
for developing training for effective teams, such as the capability for capturing participant behaviors, data analysis
aids, and scenario authoring tools.
1. Introduction
A game can be defined as a competition with explicit
rules. Participants in a game strive to accomplish an
arbitrary goal, employing strategy and taking advantage
of luck. Although typically participants play a game for
fun, the military has a long history of using games as
training tools. For many of the same reasons that games
can be good tools for training people, they can also be
good tools for studying people. The converse point is true
as well: any general tool for studying human behavior
could also be used to understand how people learn, and
thus could help to improve training.
We describe a game-based testbed called SABRE
(Situation Authorable Behavior Research Environment),
designed for behavior research, and a specific SABRE
scenario used for a pilot experiment examining
communications and decision-making in teams. Like
SABRE, many training applications must strike a balance between making scenarios realistic enough to provide practice of the target skills, yet general enough to be accessible to a range of students who may have different levels of expertise and disparate learning needs. Aspects of the
testbed’s design and usage that have parallels for training
applications are highlighted in the paper.
1.1 Why Use Games to Study Behavior?
As has been often observed, one motivation for using
games in training is that since people will voluntarily
spend time and effort playing games, they will be
similarly motivated to engage in game-based training.
Caught up in such an immersive and compelling activity,
the participants will be excited to learn. From the point of
view of a researcher trying to understand human behavior,
it is not so important that participants enjoy being in a
study – although it does make recruiting participants
easier. However, the immersive and motivating aspects of
game-play are important, because participants are likely
to behave more naturally when they are caught up in what
they are doing. Further, while people under observation
may tend to respond in ways that they believe will please
the observer or will present themselves in a positive light,
a game environment might help reduce a participant’s
overriding awareness of being observed. And researchers
can use the game world to make their experimental
manipulations less obvious. The game’s narrative,
immersive environment may encourage participants to
focus on game-play, suspend disbelief, and be less
conscious that they are taking part in an experiment.
Game-based environments occupy an intermediate ground
between experiences in the real world and non-interactive
media such as printed text. For both training applications
and behavior research experiments, taking part in
simulated live-action scenarios offers superior realism but
significant expense, while pencil-and-paper exercises
offer the reverse. Adequately designed computer-mediated game environments are certainly much easier to
control, replicate, manipulate, and record than the real
world. These same capabilities that can satisfy the
behavior researcher’s need for reproducibility and direct
data capture can contribute to the training developer’s
efforts to increase methods of demonstrating and
measuring training effectiveness.
1.2 Games and Real World Behavior
When evaluating the use of a game as an environment for
training or experimentation, a critical question involves
whether the game has an appropriate level of fidelity. This
question must be answered for each application by
considering the behaviors and skills of interest. The more
abstract the behaviors, the less realistic detail is required
from the game. Very generalized behaviors, such as
handling the unexpected or managing information, are
widely applicable in many contexts and thus can be
observed (or practiced) in game settings that present a
simplified, less realistic world. There are situations,
however, where high-fidelity simulations are of critical
importance. Specialized training and mission rehearsal
environments that study the performance of specific skills
in a near-real setting should provide highly realistic
situations. However, less realistic, game-based systems do
offer a straightforward way to utilize rich environments in
which to study a range of behaviors. Furthermore, game-based simulations are increasingly providing levels of
fidelity matching all but the most expensive high-fidelity
simulators [1].
Yet a common concern when using games for
experimentation or training is that since games are not
“real,” participants may be encouraged to behave
significantly differently when playing a game than they
would in a real situation. Many observed differences
between in-game and real-world behaviors result from
games being a low-risk, exploratory environment.
Obviously, the consequences of one’s game character
being injured are much less significant than a real-world
injury, so participants are more likely to take chances.
Other differences result from common game conventions
and expected behaviors. For example, people may
approach and talk to strangers in a game world because
they have played previous games and have learned that
talking to strangers is a typical way to obtain necessary
information in games, whereas they might be reluctant to do
so in real life.
To minimize differences between in-game and real world
behaviors, the consequences of in-game behaviors can be
increased. In a training setting, trainers can impress upon
participants that the lessons of the training game are
important for their mission readiness, and caution them
that game performance will be graded. Similarly, a
behavior researcher might set up a game to have more
significant consequences by linking real-world rewards to
game performance. However, the strategy of adding real-world consequences is unlikely to be effective unless it is
accompanied by game or scenario design that adds
realistic in-game consequences. For example, an in-game
injury could result in a participant being unable to take
additional action for the remainder of the exercise, depriving
his team of manpower, perhaps making it impossible for
them to accomplish their goal. Setting up the game rules
realistically to reward or punish behaviors of interest will
reduce the differences between in-game and real-world
behaviors.
Even after taking steps to reduce participants’ tendencies
to behave unrealistically in games, researchers will need
to consider carefully how to interpret behavioral data.
Behaviors can be divided into three categories based on
how they should be interpreted. Possibly the smallest set
of behaviors is the type that can be translated literally. In
order for a behavior to qualify for literal interpretation,
the participant must have the same in-game reasons to
prefer a particular decision as they would in real life. In
other words, participants would make the same choices
and perform the same behaviors in the real world and in
the game environment when faced with identical
situations. However, it can be difficult to justify
interpreting an in-game behavior literally, given the
problem of understanding all the factors which contribute
to the motivation and reasoning involved in in-game
decisions.
Many more behaviors can be interpreted as demonstrating
relative tendencies among participants. Even if the game
situation causes participants to tend towards unrealistic
behavior, if this influence applies uniformly, in-game
differences between participants can reflect real-world
differences. Behaviors that demonstrate a person’s
tendency to take risks, tolerance for uncertainty,
aggressiveness, or sensitivity to personal status can all be
interpreted as showing an individual’s relative tendencies
compared to the general tested population. Thus, someone
who, in a game, consistently makes an important decision
based on less information than the average player might
be considered to have demonstrated impulsiveness. Of
course, interpreting a behavior relatively requires the
researcher to have comparable data from a number of
individuals.
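A minimal sketch of this kind of relative interpretation (the measure and the sample data below are hypothetical) standardizes one player's value against the tested population:

```python
from statistics import mean, stdev

def relative_tendency(player_value, population_values):
    """Express one player's measurement (e.g., number of clues
    consulted before a key decision) as a z-score relative to the
    tested population, so the behavior is interpreted relative to
    other participants rather than literally."""
    mu = mean(population_values)
    sigma = stdev(population_values)
    return (player_value - mu) / sigma

# Hypothetical data: clues each player consulted before deciding.
population = [5, 7, 6, 8, 4, 6, 7, 5]
# A player who decided after only 2 clues scores well below the
# population mean, suggesting relative impulsiveness.
print(relative_tendency(2, population))
```

The same comparison applies to any per-player tally the testbed records; the point is that only the player's position in the distribution, not the raw in-game number, is interpreted.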
Finally, some behaviors observed in a game can be
considered specific instances of general behaviors that are
likely to apply in many contexts. In contrast to behaviors
that must be interpreted relatively, such general behaviors
are not as dependent on a comparison set. In contrast to
behaviors that are interpreted literally, they are less
specific to a particular scenario. Some examples of this
type of behavior are a style of communication, a
preference for a certain team organization, and a tendency
toward making extreme or moderate choices. Thus, a
researcher might conclude that an individual whose
interactions with teammates included a larger proportion
of directives or unsolicited advice during a game scenario
involving an unrealistic mystery about a missing car was nonetheless demonstrating that player's general communication style.
1.3 Game Selection: Which Game to Use?
One commonly cited reason for using a game-based
system is that modifying an existing commercial game
can be a way to deliver a better application with less
development effort [2]. There are a wide range of gamebased environments on which to build applications, and it
makes sense to survey existing game environments to see
whether there are candidates that meet the envisioned
application requirements. For existing surveys of game
technology for experimentation and training see [3] and
[4]. A key criterion in game evaluation is the range of
possible actions and interactions supported by the game.
Large-scale differences among games include the level or
scale of decisions (for example, the strategic movement of
armies versus the tactics of clearing a room) and the
degree of support for teamwork and communication
among participants. The next major consideration is the
ease of game modification or authoring, since most
“serious” applications will require significant changes to
an entertainment product. For the majority of research or
training applications, games with better authorability will
facilitate faster scenario development and more flexible
use [5].
Games will also differ significantly in both the length of
time needed to learn to play them and the natural time
frame for game-play. Some games can be learned in
minutes and completed within an hour, while others may
take days just to learn. For experimentation, anticipated
access to participants and their expected availability will
put a maximum limit on these times. Training
applications must also fit within a curriculum’s time
budget. Further software considerations include game
availability and anticipated future availability, licensing
usage rights and costs, and the availability of support for
application use or modification. Additionally, some
commercial games have a short shelf life, making it likely
that any research or training tool using these games will
eventually be tied to an obsolete product.
2 SABRE: Situation Authorable Behavior
Research Environment
2.1 Motivation/Goals for the SABRE Testbed
Even though adopting an existing game can decrease
application-development effort, a significant amount of
effort remains to change a game into an experimentation
platform or training application. The SABRE project was
designed to provide a flexible, customizable
experimentation tool for behavior research while taking
advantage of an existing commercial game platform. We
chose a game that provided an immersive, interactive,
virtual world in order to greatly reduce the effort required
to conduct new behavioral research experiments. Gameplay did not need to be a high-fidelity reproduction of a
military situation, but did need to be able to reasonably
represent situations that elicited behaviors and
performance at generalized tasks with military relevance.
In addition to an environment for collecting human-behavior data, SABRE was also envisioned as an
environment to demonstrate and test synthetic entities
whose behavior would be modeled on human data. The
particular behavior domain of interest involves cultural
effects on team communication and decision-making.
A central technical issue in using a commercial game
rather than a custom application developed for
experiments was how to instrument the game for data
collection. Teachers who have incorporated games into
their curricula have observed that the non-linear nature of
some games and the games’ typical inability to support
detailed reconstruction of what happened during play can
make it difficult to determine what a student experienced
or learned [6]. Support for collecting detailed, meaningful
data about what a participant did during a game is even
more important for behavioral research if it is to be able
to correlate particular behaviors with overall game/task
performance. Additionally, the data recorded must be in a
format amenable to analysis at the level of behavioral
intent or performance, rather than at the low level of keystrokes or mouse-movements. For example, a researcher
might want to know how many times during an hour a
participant entered a building without informing his
teammates of his location, or what percentage of
containers a participant opened while checking for
boobytraps. Thus, the game used in a testbed must be
capable of supporting high-level data capture itself, or
must be modifiable to work within an external data-capture system.
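As an illustration of this behavior-level aggregation over a raw event log (the event schema and field names below are hypothetical, not SABRE's actual tables):

```python
# Hypothetical raw event rows, as a logging system might record them.
events = [
    {"actor": "p1", "action": "tell_location",  "t": 95},
    {"actor": "p1", "action": "enter_building", "t": 100},
    {"actor": "p1", "action": "enter_building", "t": 300},
    {"actor": "p2", "action": "enter_building", "t": 120},
]

def unannounced_entries(events, actor, window=60):
    """Count building entries not preceded by a location report
    from the same actor within `window` seconds -- a behavior-level
    metric, rather than a keystroke-level one."""
    count = 0
    for e in events:
        if e["actor"] != actor or e["action"] != "enter_building":
            continue
        announced = any(
            a["actor"] == actor and a["action"] == "tell_location"
            and 0 <= e["t"] - a["t"] <= window
            for a in events)
        if not announced:
            count += 1
    return count

print(unannounced_entries(events, "p1"))  # → 1 (second entry unannounced)
```

The raw rows could equally come from a database query; what matters is that analysis operates on intent-level tallies like this rather than on input events.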
Similarly, the game itself will often need to be modified
to satisfy other experimentation requirements, whether
through changing the game code or working with built-in
game-customizing facilities such as APIs. Ideally,
existing scenario authoring tools, guidelines, and script
libraries would make it easier for experimenters to tailor
the game experience. It is desirable for the game’s goals,
resources, events, and world maps to be customizable,
enabling researchers to focus their experiments on
particular behaviors or contexts of interest. Further, the
game should include or facilitate the implementation of a
tutorial or familiarization session.
Beyond generating and supporting data collection from
the game portion of an experiment, SABRE had to
include a number of other components needed by
researchers. For example, often researchers need to
administer pre- and post-experiment questionnaires and
be able to correlate a participant's responses with their in-game behavior and performance data. Researchers also
need to be able to do experiment administration, such as
matching participant or team numbers with experimental
conditions. Finally, SABRE needed to include capabilities
for aggregating raw data into a description of meaningful
behaviors, and for exporting data into standard formats.
All these capabilities could also be of use to trainers who
want to observe and analyze student performance.
2.2 Game Selection
When selecting a game on which to build SABRE we
wanted one that supported a wide variety of possible
game scenarios and thus the largest set of contexts for
examining behavior. We needed an environment that
allowed for extensive modifiability; thus the existence of
authoring tools was of great value. Additionally, we were
more interested in cognition and less in motor
coordination and reaction time, and we wanted a game for
which there could be numerous kinds of actions and
behaviors, including team communications. For those
reasons we narrowed our selection to role-playing games.
While role-playing games generally feature less realistic tactical actions than the commercially popular first-person shooter genre, they offer more scope for deliberation, and their scenarios can be slower paced, leaving more time for communication and decision-making prior to action.
The particular role-play game selected was Neverwinter
Nights™. This game satisfied the testbed requirements
since it can be used to simulate cooperative team tasks
and it facilitates scenario authoring and customization.
The built-in game-editing tools allow users to customize
the size and contents of the game world, including
synthetic character behavior and dialogs, and the creation
of customized items. Furthermore, there is an extensive
API and a scripting language for additional modifiability.
The game has a well-established user community, and is
old enough (released in 2002) that it does not require
cutting edge computer hardware yet is still actively
supported with periodic software updates. These
advantages have made the game a popular choice among
other researchers [7], [8], [9].
In Neverwinter Nights players can move around the
world, pick up and use items, go into buildings, read maps
and signs, and interact with other characters. They can
communicate with other players through free form typed
text, or engage synthetic characters through dialog menus.
Players can use a journal and map, as well as a variety of
items within the environment. They may also attack
targets, but do not have much detailed control over
fighting actions.
Because Bioware, Inc., the developer of Neverwinter
Nights, has encouraged and supported user authoring of
custom-game content, the SABRE testbed was able to
take advantage of publicly available custom graphics to
change from the game’s default medieval-fantasy setting
to a modern cityscape [10]. Third-party software was also
employed to integrate the game with a relational database
for the handling of data. Although the game engine code
itself is not accessible, the built-in scripting API made it
possible to log many types of player actions in the
database. Additionally, on request, Bioware, Inc. was able
to add time stamping and complete player-text-window
logging capabilities to a scheduled software update.
2.3 Testbed Architecture and Design
Our game choice influenced the overall testbed
architecture and design. Like many multi-player games,
Neverwinter Nights uses a client-server architecture where
the server is responsible for maintaining the game state
and the client primarily supports the user interface for
viewing and interacting with the game world. Thus, most
game information must be logged on the server side.
However, the game also generates some log files on the
client side, requiring the testbed to include capabilities for
merging this information into the central database. As
there is game information that is only relevant to a single
player and is therefore stored at the local client, the testbed
cannot collect complete data from just the server.
However, as shown in Figure 1, the majority of the
testbed components (such as the scenario modification
tools and analysis toolkit) are installed on the server. The
third party tool NWNX2 [11] is a wrapper for the game
server, and enables database operations from within the
game scripts. In addition to serving as a data repository
for recorded game behaviors and responses to out-of-
game questionnaires, the database is also used to store
the questionnaire contents and information used to direct
the game scenario.
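The client-side merge step can be sketched as follows. SABRE records into MySQL through NWNX2, but this illustration uses Python's built-in sqlite3 as a stand-in, and the log-file layout and column names are hypothetical:

```python
import csv
import glob
import sqlite3

def merge_client_logs(db_path, log_glob):
    """Fold rows from per-client log files into a central logging
    database, so client-only information joins the server-side record.
    Hypothetical CSV layout: client_id, timestamp, event."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS client_events
                   (client TEXT, t REAL, event TEXT)""")
    for path in glob.glob(log_glob):
        with open(path, newline="") as f:
            for client, t, event in csv.reader(f):
                con.execute("INSERT INTO client_events VALUES (?,?,?)",
                            (client, float(t), event))
    con.commit()
    return con
```

A daemon of this shape would run once per session (or periodically), after which all analysis queries can go against the single central database.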
Figure 1: Testbed Architecture. (The diagram shows the Scenario Modification Tools, Experiment Construction, and Analysis Toolkit components alongside the Game Server, which NWNX2 connects to the MySQL logging database; game clients exchange game events with the server, and a log daemon collects client-side log files.)
3. Scenario Design
In order to verify the testbed as a mechanism for experimentation and validate its utility as a means for conducting team research into culture and personality, we conducted a pilot experiment examining the impact of culture and personality on team decision-making, coordination, and performance. For this pilot experiment we developed a scenario for group planning and task execution, building in metrics for effectiveness and efficiency. Because the task was envisioned as a reasonable exercise of general skills and behaviors rather than a high-fidelity simulation of a realistic situation, we were able to balance elements of the scenario to better support the design of the experiment. This scenario was designed partly as a demonstration of how SABRE could elicit a wide range of possible team and individual behaviors that might be affected by culture and personality. This led us to examine a larger number of hypotheses than one might typically examine in an experiment.
In the basic experiment, team members are assigned roles (e.g., patrol leader, weapons specialist) depending on the experimental condition, and the team is given the high-level task of locating and acquiring caches of weapons hidden within a town. The team is provided with equipment to help with the task (sensors of varying capabilities designed to help locate weapons caches, and tools for opening doors and crates) and must decide how to allocate those resources. Additionally, team members have collaboration tools, allowing information to be shared between individuals and locations to be flagged or marked within the virtual environment. Performance is team-based, with participants able to increase the team score by completing tasks (which have rewards) while managing costs and penalties. The participants are guided through a series of planning tasks before beginning the search. The scenario also includes encounters with synthetic entities who populate the town, as shown in Figure 2. These characters interact with the participants, providing tips about cache locations, requesting assistance, behaving rudely, or negotiating to let the participants enter a building.
Figure 2: Screenshot of a player interacting with a townsperson
3.1 Eliciting and Measuring Behaviors with the
Scenario
A principal design goal was to provide motivation and
rewards for teamwork, but allow players significant
choice over the amount, timing, and type of interaction.
Thus, participants could demonstrate their preference
towards independent or interdependent activity in
response to team goals. We provided motivation for
teamwork by assigning each participant a distinct role
with different responsibilities, providing limited numbers
of cache sensors to be shared among the team, and
controlling the flow of information from the townspeople
so that no one participant received a complete set of tips.
The players were thus motivated to share information and
coordinate their actions in order to maximize team
performance. Because we were interested in examining
communication-style preferences among team members,
we limited team-wide broadcasting and instead required
point-to-point communication.
Another design guideline was the desire to take advantage
of a game environment’s capability for gathering multiple
data points from a single person. Some vignettes, such as
negotiating with building residents, were repeated with
minor variation a number of times within the scenario so
that a participant’s average response could be calculated.
Additionally, to ensure that all participants had to react to
certain encounters, such as requests for assistance, offers
of information, or insults, the synthetic entities were
programmed to seek out each player a minimum number
of times and force certain situations to occur.
The scenario was also set up to minimize certain types of
behaviors that were irrelevant to our cultural and
personality hypotheses. For example, because we were
not interested in studying aggression in general, the server
was configured so that players were unable to attack each
other, many game objects were implemented to be
immune from attacks, and most synthetic entities were
programmed to simply flee if attacked. However, we were
interested in aggression as a response to insults, so we
included starting a fight as a possible response in those
vignettes. Although the scenario was designed to reduce
motivation for outlier behaviors like attacking, we
included data collection for these actions.
3.2 Design of Goals, Rewards, and Penalties
An experimental scenario must balance game elements in
order to elicit behaviors of interest. The task should not be
so difficult that some participants completely fail, nor so
easy that some participants perform perfectly. Often it is
useful to manipulate a game’s scoring system so that the
players demonstrate how they weigh various costs and
benefits by their choice of game strategy. These costs and
benefits must be clearly laid out in the game and
understood by the participants so that they can make
decisions informed by the situation’s expected utilities.
In our pilot scenario the team is awarded points for
recovering caches and penalized for opening empty
containers (as this can be avoided by first using the
sensors to test the container) or setting off traps.
Participants must also pay a cost for entering buildings,
and may also spend money to bribe or assist townspeople.
Caches inside buildings are harder to find, and may be
trapped, but earn a higher reward. The team starts the
game with some number of points, and their ending total
is used as a measure of performance.
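A minimal sketch of this scoring rule (all point values below are hypothetical placeholders, not the tuned values from the study):

```python
# Hypothetical reward/penalty schedule for the cache-search task.
REWARD_EXTERIOR = 10   # cache recovered outside a building
REWARD_INTERIOR = 25   # interior caches are harder, so they pay more
PENALTY_EMPTY   = -5   # opening an untested, empty container
PENALTY_TRAP    = -15  # setting off a trap
COST_ENTRY      = -3   # entering a building

def team_score(start, tallies):
    """tallies: list of (event_kind, count) pairs for the session."""
    table = {"exterior_cache": REWARD_EXTERIOR,
             "interior_cache": REWARD_INTERIOR,
             "empty_container": PENALTY_EMPTY,
             "trap": PENALTY_TRAP,
             "building_entry": COST_ENTRY}
    return start + sum(table[kind] * n for kind, n in tallies)

print(team_score(100, [("interior_cache", 2), ("building_entry", 3),
                       ("trap", 1)]))  # → 126
```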
The reward, costs, and penalty levels and the number of
caches were adjusted to avoid ceiling and floor effects.
Additionally, because we wanted to study risk taking
behavior, the scenario was designed so that the expected
utility of a low-risk strategy (concentrating on exterior
caches) was the same as a high-risk strategy (focusing on
interior caches). Compared to commercial games, in
which rewards are dispensed to maximize entertainment,
a behavior-research game must ensure that its reward
structure does not unnecessarily bias player decisions.
One example of this is the way that traps were
implemented in the scenario. Because initial experiences
in a new environment heavily influence later decisions,
we wanted to avoid having players experience a negative
consequence early in the game. To ensure this, traps were
not pre-set to occur at fixed geographical locations, but
instead were generated to occur during an encounter
following the team’s successful recovery of several
interior caches.
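The expected-utility balancing described above can be checked with a back-of-the-envelope calculation; the probabilities and point values below are hypothetical, not the study's tuned numbers:

```python
# Sanity check that the low-risk (exterior) and high-risk (interior)
# strategies break even in expectation under a candidate tuning.
def eu_exterior(n_caches, reward):
    return n_caches * reward

def eu_interior(n_caches, reward, entry_cost, p_trap, trap_penalty):
    # Interior caches pay more but incur entry costs and trap risk.
    return n_caches * (reward - entry_cost - p_trap * trap_penalty)

low  = eu_exterior(6, 10)                  # 6 * 10 = 60
high = eu_interior(4, 25, 5, 0.5, 10)      # 4 * (25 - 5 - 5) = 60
print(low, high)  # equal: neither strategy dominates in expectation
```

A designer would iterate on the parameters until the two sides match, so that a team's choice of strategy reveals risk preference rather than score-maximizing logic.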
3.3 Managing Free Play
In scenarios for experimentation, there is a tension between constraining behavior in order to produce situations amenable to quantitative analysis and hypothesis testing, and providing more realistic, less artificial situations in which participants can play out behaviors. Scenario design must control free play without
making the game seem too linear, and without destroying
a participant’s sense that his actions and decisions affect
the game’s outcome. While this kind of control can be
achieved by using a researcher as a confederate inside the
game to actively guide or influence the participants, we
wanted to maximize reproducibility and ease of
experiment administration by including automated
controlling elements in the scenario.
Our pilot experiment scenario used several approaches to
direct participants and increase data quality. One strategy
was to structure the consequences of many decisions to
lead towards convergent outcomes. For example, whether
a participant treated a rude townsperson respectfully or
escalated tensions until a scuffle broke out, the
townsperson would eventually exit the scene and the
player would resume their mission. Similarly, if a
participant was unable to convince a resident to allow
them into a building, they could use the team’s lockpick
to eventually gain access. These convergent outcomes
made it easier to compare actions across participants and
across teams, since each group’s situation remained
comparable regardless of their decisions. Another
approach to controlling free play was to use built-in
“gates” to enforce mandatory decisions. For example,
during the planning phase of the game, the team had to
distribute a limited number of sensors and tools among its
members. They were allowed into the storeroom to pick
up the items only after they had indicated that they
understood this task. Then they were not allowed to leave
the area until each item was in someone’s possession.
Both these gates were enforced by “smart” doors in the
game that remained locked until the condition was met,
and would provide feedback messages about the
remaining task if the players tried to open the door early.
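This gate logic can be illustrated as follows; in Neverwinter Nights it would be implemented in NWScript attached to the door object, so the Python below only sketches the control flow, and the condition and message wording are hypothetical:

```python
# Sketch of a "smart door" gate: the door stays locked until the
# planning task is complete, and explains what remains otherwise.
def try_open_door(items_assigned, items_total):
    if items_assigned < items_total:
        remaining = items_total - items_assigned
        return (False,
                f"The door is locked: {remaining} item(s) still "
                f"need to be assigned to a team member.")
    return (True, "The door unlocks.")

print(try_open_door(3, 5)[1])  # feedback about the remaining task
```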
The scenario also used periodic status updates to remind
the participants of their overall goal and to motivate them
by showing the current score and time remaining.
3.4 Usability Testing
In order to refine the scenario for improved usability,
game balance, and performance, we conducted iterative
usability testing -- ultimately studying sixteen teams. One
early observation was that participants from different age
and computer-game-experience demographics used
contrasting learning strategies and found different aspects
of the game confusing. For example, participants who did
not have experience with text message applications found
it frustrating that group conversations would move on
before they had finished typing a long response. We
focused the usability studies on participants who matched
the target study demographic, but included small numbers
of people who were older and had no game playing
experience in order to become aware of additional game-play and familiarization issues.
Usability study participants were instructed on the general
purpose of the research game, and asked to look for
aspects of the game-play experience which were
confusing or surprising. Through both written and verbal
debriefings, we solicited comments and suggestions that
were used to refine the pilot study scenario and the
accompanying game tutorial. One approach we used was
to examine (during each spiral of scenario refinement)
what the most confusing thing was for each player. If
multiple participants were puzzled or frustrated with the
same thing, that was a strong indication that at least one
aspect of the scenario needed to be reworked. For
example, early versions of the scenario allowed the team
to vote on whether to deploy a strike team to unlock
doors. However, it was common for at least one team
member to miss or ignore the current call for a vote,
which would prevent the authorization from being
completed. Additionally, we found that calls to vote did
not trigger significant team discussion about the merits of
deploying the strike team to the suggested location.
Instead, most team members always voted yes, with a
minority of participants occasionally and seemingly
arbitrarily voting no. Because the voting element was
frustrating to many participants and was not eliciting
interesting communications between players, we
eliminated it from the scenario.
Usability testing was also helpful in identifying a number
of unexpected behaviors from the participants. For
example, when asked to describe the team’s search plan,
some participants simply entered an empty response. In
other cases, players became bored or frustrated with their
perceived lack of success at the mission and began
attacking townspeople, or undressing their characters. In
some cases, we tried to minimize the motivations for such
unexpected behaviors, such as lowering average
frustration by making it easier to find caches. In other
cases, we modified the software to enforce some
requirements, such as a minimum character length for the
typed description of a plan, or to prevent behaviors like
undressing which might interfere with the team earnestly
pursuing their mission.
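The minimum-length requirement can be enforced with a trivial input check. A minimal sketch, in which the function name and the 40-character threshold are illustrative, not SABRE's actual values:

```python
def validated_plan(raw, min_chars=40):
    """Return the typed plan text if it meets a minimum length,
    else None so the interface can re-prompt the player."""
    text = raw.strip()
    return text if len(text) >= min_chars else None

print(validated_plan("search north"))  # None (too short, re-prompt)
```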
4. Experimental Pragmatics: Lessons Learned
After refining our experimental scenario and game
tutorial we began using SABRE in a pilot experiment to
study the effect of culture and personality factors on team
behavior. In this section, we describe some lessons
learned from using the testbed as a behavior research tool.
Our pilot experiment involved eight American and eight
Chinese teams in order to contrast these two cultures.
Additionally, half the teams of each culture had no
designated leader while the others had a randomly
selected individual designated as the group’s leader. The
entire experiment for each group started with a background demographic and computer/gaming-experience survey and standard personality and culture
questionnaires. The participants then completed both a
single-player and multi-player familiarization session and
game tutorial. After learning the game, the participants
played through the mission planning and execution tasks
and wrapped up with a debriefing questionnaire. Groups
took from four to five hours to complete an experiment,
with approximately equal amounts of time spent with the
game tutorial and experiment scenario.
4.1 Participant Selection and Control
In addition to the large performance and behavior
variability we found between different demographic
groups, we also noticed some challenges with participant
selection and control. Selection criteria were particularly
important because of the small number of teams studied.
Because the experimental scenario involves significant
text reading and typing, we found that it was necessary to
restrict study eligibility to participants who were in
college or were college graduates to prevent limitations in
reading comprehension, reading speed, and keyboarding
ability from significantly impacting team performance. To
avoid gender effects on perceptions of leadership and
team dynamics we used only male participants. We also
restricted the participant age range to 18 to 35 to better
match the population of interest. Because we did not pre-screen participants with a personality instrument, we were
not able to control the distribution of individuals scoring
high or low on various personality factors.
We found a number of disparities between the American
and Chinese groups. All participants were recruited from
the Boston area. Participants were limited to individuals
who had lived in their home country (US or mainland
China) until at least age 18, with a maximum of six
months spent abroad during that time. The Chinese
participants were required to have English-language
(TOEFL) scores or employment/study histories
demonstrating a high level of English proficiency. On
average, the Chinese participants were older and had
more graduate school experience. They also had less
game-playing experience, and were more likely to work
or study full-time. A few of them had met each other prior
to the experiment. Finally, they were more likely to have
attended selective colleges or universities.
The game tutorial we developed covered the basic game
interface, the use of the in-game tools and sensors needed
for the mission, and multi-player communication and
coordination. Each participant first completed a self-paced, single-player tutorial, then waited in an
entertaining practice area until all the players were ready
to start the multi-player exercises. The multi-player
exercises were designed to encourage the participants to
help train each other by including several “gates” where
each player had to complete a task before the group could
proceed. Often, the first players to understand how to
perform the task then explained it to the other players.
In addition to demographic differences between the
Chinese and American groups, we noted that the Chinese
groups were not necessarily representative of the Chinese
average. All the Chinese participants had voluntarily
relocated to the US. Although the individuals on each
team did not know each other well, the participants from
the Chinese teams may have identified more with each
other as members of the same minority nationality. Thus,
not all the observed differences in team performance and
behavior during the game task can be attributed solely to
cultural differences. These difficulties in population
sampling are likely to affect any cultural-behavior
experiments performed at one location. Since game-based
experiments are easier to reproduce, one approach would
be to replicate the experiment in multiple countries.
Potentially the game scenario could also be translated and
localized for other languages.
4.2 Game Familiarization
Because we wanted to be able to include participants with
minimal game playing experience, a built-in
familiarization and training session was crucial to ensure
that every study participant had adequate ability to play
the game before going into the experimental scenario.
Although Neverwinter Nights has a relatively easy game
interface compared to many first-person shooter, strategy,
or building games, and even though we simplified the
interface for use in the experiment, we found it took at
least an hour for the average individual to learn to play
the game. The tutorial explicitly covered some common
game assumptions and paradigms for the benefit of novice
game players. For example, a concept introduced early on
was the idea of using the mouse cursor as a mechanism
for “looking” at objects in the game world. Another basic
concept that was specifically addressed based on usability
group feedback was the idea that the player was represented in the game world by a specific avatar on the screen.
Our general approach in designing the tutorial was to
minimize the amount of assistance required from a human
instructor. We supported the two basic learning
styles observed during usability testing. To support
participants who preferred to read step-by-step
instructions we included embedded tests to make sure
they actually tried taking each action after it was
explained so that the experience of doing would help
them retain the instructions. To support participants who
instead wanted to try different actions until they achieved
the desired effect, we stated the goal at the start of each
instruction area and provided graphical and textual
feedback in response to possible actions, including
context-sensitive suggestions for improving performance.
Additionally, in some cases if the player tried too many
incorrect actions before achieving the required goal, the
tutorial gave them some extra review and an additional
“test”.
Although we tried to make the tutorial as self-contained
as possible, we found that human back-up was necessary.
For example, at several points participants are instructed
to click on a particular icon or manipulate a specific part
of the game interface. The tutorial includes textual
descriptions of the region of the screen where the icon
resides, and in some cases we were able to make the icon
blink in order to draw the player’s attention, yet
occasionally someone had trouble finding the item and
would need the experiment administrator to physically
point to the item on the screen. A single experiment
administrator could easily monitor five participants who
were visually isolated but in close physical proximity, but
might find it challenging to adequately monitor a group
larger than ten.
4.3 Data Analysis
SABRE’s approach to data collection is to log as much
data as possible, so that each experiment can be studied
for a variety of purposes, even beyond those initially
intended by the researchers conducting the study. SABRE
includes data filtering and data-formatting scripts that can
extract a large number of behavior and performance
statistics, and also supports raw data output in a plain
format for further statistical analysis. In this section, we
briefly discuss several different types of analyses we
performed on the pilot experiment data.
The game task specified that the players should try to
maximize the team score, represented by pieces of gold,
by finding the most caches to gain rewards while
incurring the fewest penalties and costs. We chose to
structure the game with this single dimension of success,
rather than having multiple orthogonal dimensions such
as goodwill versus material success. However, in addition
to comparing each team’s overall point score, we also
compared some contributing factors, such as the number
of caches recovered (a measure of mission task
effectiveness) and the number of empty crates opened (a
measure of mistakes made). All these measures are
examples of simple automated counting of specific
categories of actions.
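Measures of this kind reduce to tallying event categories in the log. A minimal sketch, assuming a simplified (player, action) event format and illustrative reward and penalty values rather than SABRE's actual log schema:

```python
from collections import Counter

def score_team(events, reward=50, penalty=10):
    """Tally per-category action counts from (player, action) log
    events and compute a single team score from the tallies."""
    counts = Counter(action for _player, action in events)
    score = (counts["cache_recovered"] * reward
             - counts["empty_crate_opened"] * penalty)
    return counts, score

log = [("p1", "cache_recovered"), ("p2", "empty_crate_opened"),
       ("p3", "cache_recovered"), ("p1", "empty_crate_opened")]
counts, score = score_team(log)
print(counts["cache_recovered"], counts["empty_crate_opened"], score)  # 2 2 80
```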
Another type of measure involves some context-dependent calculation. For example, an individual’s
tendency to initiate communication can be quantified by
calculating how many of their typed messages occurred
after a period of silence (in our case, 20 seconds). This
measure was used to study whether designated leaders
initiated more conversations, and whether this difference
correlated with culture. It could also be used to determine
whether individuals scoring higher on the extroversion
personality trait initiated more conversations. Similarly,
an individual’s tendency to personally gather data before
taking action was examined by considering each building
the person entered and calculating their average number
of sensor usages within a certain radius and directed
towards a particular building prior to entry. This was used
to study whether teams from cultures with higher
uncertainty tolerance gathered less data. It was also used
to see whether individuals who scored higher on
conscientiousness gathered more data. Also, the average
division of a team into smaller subgroups was studied by
calculating individual proximity at regular time intervals.
This allowed us to determine the percentage of time each
team spent divided into groups of 1:1:1:1, 1:1:2, 1:3, 2:2,
or 4. This measure allowed us to see whether teams from
a high-individualism culture spent more time walking
around individually.
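Two of these context-dependent measures can be sketched as follows, assuming simplified (time, sender, text) chat records and 2-D player positions. The function names, data formats, and the 10-unit proximity radius are illustrative; the 20-second silence threshold is the one described above:

```python
from itertools import combinations

def initiation_counts(messages, silence=20.0):
    """Count, per sender, messages sent after at least `silence`
    seconds with no chat traffic (the first message counts too)."""
    counts, last_time = {}, None
    for t, sender, _text in sorted(messages):
        if last_time is None or t - last_time >= silence:
            counts[sender] = counts.get(sender, 0) + 1
        last_time = t
    return counts

def partition_signature(positions, radius=10.0):
    """Cluster players whose pairwise 2-D distance is within `radius`
    (transitively, via union-find) and return the sorted group
    sizes, e.g. (1, 1, 2) or (4,)."""
    parent = {p: p for p in positions}
    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p
    for a, b in combinations(positions, 2):
        (x1, y1), (x2, y2) = positions[a], positions[b]
        if ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 <= radius:
            parent[find(a)] = find(b)
    sizes = {}
    for p in positions:
        sizes[find(p)] = sizes.get(find(p), 0) + 1
    return tuple(sorted(sizes.values()))

chat = [(0.0, "leader", "any caches near the gate?"),
        (5.0, "p2", "checking"),
        (40.0, "leader", "regroup at the well")]
print(initiation_counts(chat))                               # {'leader': 2}
print(partition_signature({"p1": (0, 0), "p2": (3, 4),
                           "p3": (50, 50), "p4": (90, 0)}))  # (1, 1, 2)
```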
SABRE also supports manual scoring of behavior by
producing an overall communication record for the team.
This was used to manually score the text communication
between players for instances of requests for information,
volunteered information, and directives. Additionally, the
testbed has the capability for semi-automated
communication analysis using Linguistic Inquiry and
Word Count (LIWC) [12], which classifies utterances into
categories such as positive, negative, referring to self, or
referring to the team. We used these types of content
analysis to study whether leaders from high power-distance cultures used more directives, and whether teams
from high individualism cultures used more negative
utterances.
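A toy illustration of this style of dictionary-based category counting; the word lists below are stand-ins, not LIWC's actual categories or vocabulary:

```python
# Illustrative category dictionaries in the spirit of LIWC.
CATEGORIES = {
    "positive": {"good", "great", "nice", "thanks"},
    "negative": {"bad", "wrong", "lost", "stuck"},
    "self":     {"i", "me", "my"},
    "team":     {"we", "us", "our", "team"},
}

def categorize(utterance):
    """Return counts of category hits for one chat utterance."""
    words = utterance.lower().split()
    return {cat: sum(w.strip(".,!?") in vocab for w in words)
            for cat, vocab in CATEGORIES.items()}

print(categorize("We found a good cache, thanks!"))
# {'positive': 2, 'negative': 0, 'self': 0, 'team': 1}
```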
5. Future Directions
We are currently in the process of analyzing the results
from the pilot experiment comparing behavioral
differences between American and Chinese teams. We are
also designing and developing an experiment for the
NATO Supreme Allied Command Transformation HQ,
Futures and Engagement, Concept Development and
Experimentation project entitled Leader and Team
Adaptability in Multinational Coalitions (LTAMC), to be
run at U.S. and international sites. This experiment will
focus on cultural and personality effects on information
sharing, situation awareness, and division of
responsibilities as related to rapid formation of
multinational Joint Task Forces.
We are actively working in a number of areas to enhance
the capabilities of the testbed. While the breadth of
experimental scenarios possible in Neverwinter Nights is
wide ranging, experiment construction can require
specialized development skills such as programming,
scripting or advanced knowledge of the Neverwinter
Nights production pipeline in order to produce new types
of experiments. We are expanding the existing authoring
and data-analysis tools to further reduce the depth and
breadth of specialized knowledge required to develop
novel experimental scenarios.
We are in the process of integrating existing voice
communication capabilities [13] into the experimental
testbed, making it possible to capture and record voice
communication between participants. These capabilities
would make it possible to develop and evaluate more
natural communication capabilities between human
players, as well as facilitate the setup and control of
experiments by improving the ability of an experimenter
to converse with experiment participants within a
distributed experimental environment.
We will begin work shortly to construct computer-controlled characters within the testbed that can exhibit a
range of behaviors consistent with pre-determined cultural
and personality factors. Character behaviors will be
parameterized so that users such as experimenters or
training-application developers can tune the characters to
produce the type of cultural or personality response
required for the chosen situation. Additional future work
also includes the development of a Korean version of our
scenario to support further experimentation.
6. References
[1] J. P. Holmquist, "Playing Games", Military Training
Technology, vol. 9 issue 5, Oct. 27.
[2] T. Holt, "How Mods Are Really Built", Serious
Games Summit, Oct 18-29, 2004.
[3] W. Ferguson, D.E. Diller, A.M. Leung, B. Benyo,
and D. Foley, Behavior Modeling in Commercial
Games, BBN Interim Report Feb. 13, 2003.
[4] D.E. Diller, W. Ferguson, A.M. Leung, B. Benyo, &
D. Foley, “Behavior Modeling in Commercial
Games”, in Proceedings of the 13th Conference on
Behavior Representation in Modeling and
Simulation, Arlington, VA, May 2004.
[5] C.J. Bonk & V.P. Dennen, “Massive Multiplayer
Online Gaming: A Research Framework for Military
Training and Education”, available from Advanced
Distributed Learning at
http://www.adlnet.org/downloads/189.cfm.
[6] A. McFarlane, A. Sparrowhawk, Y. Heald, “Report
on the educational use of games”, published by
TEEM and available online at
http://www.teem.org.uk/publications/teem_gamesined_full.pdf.
[7] P. Gorniak and D. Roy, “Speaking with your
Sidekick: Understanding Situated Speech in
Computer Role Playing Games”, in Proceedings of
Artificial Intelligence and Digital Entertainment,
2005.
[8] M. Carbonaro, M. Cutumisu, M. McNaughton, C.
Onuczko, T. Roy, J. Schaeffer, D. Szafron, S. Gillis,
S. Kratchmer, “Interactive Story Writing in the
Classroom: Using Computer Games”, in Proceedings
of the 2005 International Digital Games Research
Conference (DiGRA 2005), June 16-20, 2005,
Vancouver, B.C., Canada.
[9] P. Spronck and J. van den Herik, “Game AI that
Adapts to the Human Player”, ERCIM News, No. 57,
April 2004, pp. 15-16.
[10] The D20 Modern Mod website is
http://d20mm.net./index.php and their custom content
for Neverwinter Nights is available through the
Neverwinter Nights Vault at http://nwvault.ign.com/.
[11] The NWNX2 software is available at
http://www.nwnx.org/ or through the Neverwinter
Nights Vault at http://nwvault.ign.com/.
[12] J.W. Pennebaker and L.A. King, “Linguistic styles:
Language use as an individual difference”, Journal of
Personality and Social Psychology, 77, 1296-1312,
1999.
[13] D.E. Diller, B. Roberts, S. Blankenship, D. Nielson,
“DARWARS Ambush! – Authoring Lessons Learned
in a Training Game”, in Proceedings of the
Interservice/Industry Training, Simulation, and
Education Conference (I/ITSEC) 2004, Orlando, FL,
December 2004.

Author Biographies

ALICE M. LEUNG is a scientist in the Intelligent
Distributed Computing Department of BBN
Technologies, Cambridge, MA. Her current interest is in
harnessing games for behavior research. Previous projects
include software for military logistics planning using
distributed agent technology, small-scale simulation of
information transfer economics, and infrastructure to
support training through games.
DAVID E. DILLER is a Senior Scientist at BBN
Technologies. He holds an M.S. in Computer Science and
a joint Ph.D. in Cognitive Science and Cognitive
Psychology from Indiana University. His current focus
includes cognitive modeling, mixed-initiative agent-based
systems, and simulation-based training applications.
Recently, Dr. Diller has been involved in a number of
projects utilizing commercial game technology for
training applications.
WILLIAM FERGUSON is a Division Scientist at BBN
Technologies. His background is in artificial intelligence,
simulation, computer-based training, and commercial-game technology. He is currently a Co-Principal
Investigator for the integration effort under the Defense
Advanced Research Projects Agency’s (DARPA's)
DARWARS training program. He also serves as Co-PI of
the Cultural Modeling Testbed, a joint Defense Modeling
and Simulation Office (DMSO) and Air Force Research
Laboratory (AFRL) project involved with using
commercial games to study culture and personality. He
worked for many years as the technical lead for the
Analysis of Mobility Platform (AMP) project,
USTRANSCOM's transportation programmatics
modeling system.
Neverwinter Nights™ is a trademark of BioWare, Inc.