Colyseus: A Distributed Architecture for Online Multiplayer Games

Ashwin Bharambe, Jeffrey Pang, and Srinivasan Seshan
Carnegie Mellon University
Online First-Person Shooters

Architecture for multiplayer first-person shooter games.
• low latency required
• weakly consistent state tolerable
• spatial contiguity allows pre-fetching

Traditionally this is done with a client-server
model
• leads to bottlenecks
• can only sustain user numbers in the dozens
Colyseus


Single-copy replication model
• maintains consistency by serializing updates
• mirrors existing server model
• cuts latency (reading from local copy), but sacrifices
consistency in game-wide state
DHT lookup for object queries
• both random and range-query DHTs implemented
• range-query DHT works well since object queries will
always be in contiguous spatial regions
• DHT query delay overcome by anticipating which objects
will be needed soon and pre-loading them
Replica Manager





Follows the Tunable Availability and Consistency Tradeoffs (TACT) model
• depending on specific game characteristics, developers can favor
either availability or consistency: stronger consistency lowers
availability (increases lag), while higher availability lowers consistency.
synchronizes replicas to primary
• changes are delta-encoded, sent to primary, then distributed
serially to all replicas
• when a node becomes interested in a replica (i.e. the player is
near that object), it registers with the primary and receives updates
directly (decoupled discovery/synchronization)
creates new replicas
deletes replicas that are no longer needed
fast moving objects (missiles) use a special case attachment so that
they are automatically sent to nodes that request the object they are
attached to (the person who fired the missile)
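The synchronization path described above (delta-encoded changes sent to the primary, then distributed serially to registered replicas) can be illustrated with a minimal sketch. This is not the Colyseus implementation: the class names `Primary` and `Replica`, the version counter, and the dictionary-based object state are all assumptions made for illustration.

```python
def delta_encode(old, new):
    """Return only the fields whose values changed (assumed encoding;
    deleted fields are not handled in this sketch)."""
    return {k: v for k, v in new.items() if old.get(k) != v}


class Replica:
    """Read-only copy of an object held on an interested node (hypothetical)."""

    def __init__(self):
        self.state = {}
        self.version = -1

    def receive(self, version, delta):
        # Apply the delta in the order chosen by the primary.
        self.state.update(delta)
        self.version = version


class Primary:
    """Single authoritative copy of one game object (hypothetical)."""

    def __init__(self, state):
        self.state = dict(state)
        self.version = 0
        self.replicas = []

    def register(self, replica):
        # A newly interested node starts from a full snapshot,
        # then receives updates directly from the primary.
        replica.state = dict(self.state)
        replica.version = self.version
        self.replicas.append(replica)

    def unregister(self, replica):
        # Called when a node no longer needs the replica.
        self.replicas.remove(replica)

    def apply(self, delta):
        # Updates are serialized here: one at a time, each with a new version.
        self.state.update(delta)
        self.version += 1
        for replica in self.replicas:
            replica.receive(self.version, delta)


primary = Primary({"x": 10, "y": 20, "health": 100})
replica = Replica()
primary.register(replica)
primary.apply(delta_encode(primary.state, {"x": 12, "y": 20, "health": 100}))
```

Because every write passes through the one primary, replicas on different nodes see the same update order, which is the property the single-copy model relies on.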
Range-Queriable DHT

both standard random DHT and rangeDHT implemented with Mercury

adjacent nodes responsible for adjacent keys
• player x,y coordinates used as key, then game can request other objects
that are near the player from adjacent nodes

predictions based on current player motion can be used to pre-load
upcoming objects
• with a known average DHT lookup time, the prediction can be tweaked so
that objects finish loading about when they are needed
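The two ideas on this slide, contiguous key ranges per node and motion-based pre-loading, can be sketched together. The sketch below is a simplification under stated assumptions: it uses a single coordinate as the key (the slides describe x,y coordinates), the class `RangeDHT` and the function `predicted_range` are hypothetical names, and nothing here reflects Mercury's actual interface.

```python
import bisect


class RangeDHT:
    """Toy range-partitioned store: node i owns keys below boundaries[i]
    and at or above boundaries[i - 1] (hypothetical layout)."""

    def __init__(self, boundaries):
        self.boundaries = boundaries            # sorted exclusive upper bounds
        self.store = [[] for _ in boundaries]   # one bucket per node

    def node_for(self, key):
        i = bisect.bisect_right(self.boundaries, key)
        return min(i, len(self.boundaries) - 1)  # clamp keys past the last bound

    def put(self, key, obj):
        self.store[self.node_for(key)].append((key, obj))

    def range_query(self, lo, hi):
        """Return matching objects and the number of adjacent nodes contacted."""
        first, last = self.node_for(lo), self.node_for(hi)
        hits = []
        for node in range(first, last + 1):
            hits.extend(obj for key, obj in self.store[node] if lo <= key <= hi)
        return hits, last - first + 1


def predicted_range(x, vx, radius, lookup_delay):
    """Widen the query to cover where the player will be once the lookup
    returns, assuming constant velocity vx for lookup_delay seconds."""
    future_x = x + vx * lookup_delay
    return min(x, future_x) - radius, max(x, future_x) + radius


dht = RangeDHT([100, 200, 300])
dht.put(95, "health_pack")
dht.put(105, "ammo_box")
dht.put(250, "armor")
nearby, nodes_contacted = dht.range_query(90, 110)
```

Because adjacent keys live on adjacent nodes, a query over a small spatial region touches only a few neighboring nodes, whereas a random (hashed) DHT would scatter the same keys across the whole ring.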
DHT comparison
Evaluation

Experimental Setup







Emulab used to simulate virtual servers
no link capacity constraints, but end-to-end latencies are tied to measured samples
time is artificially dilated to compensate for slow virtual servers
game model based on density statistics for a Quake III map (Zipf distribution)
Modified Quake II to work with Colyseus
• Mercury rangeDHT
• variable-size bounding box corresponding to the objects visible to a
character, used as the area of interest
• client/server messaging remains intact, so unmodified engines could
connect to any of the p2p nodes as if it were a server
Custom map w/ bots for workload
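The area-of-interest bounding box mentioned in the setup above can be illustrated with a small filter. This is only a sketch under assumptions: the `Box` type, the `area_of_interest` and `visible` helpers, and the fixed half-width and half-height parameters are hypothetical, and the slides do not say how the modified Quake II sizes its box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in map coordinates (hypothetical type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, x, y):
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


def area_of_interest(px, py, half_width, half_height):
    """Box centered on the player; the size can vary per query."""
    return Box(px - half_width, py - half_height,
               px + half_width, py + half_height)


def visible(objects, box):
    """Names of the objects whose position falls inside the box."""
    return sorted(name for name, (x, y) in objects.items() if box.contains(x, y))


objects = {"rocket": (12, 8), "medkit": (40, 40), "bot_7": (5, 15)}
box = area_of_interest(10, 10, half_width=10, half_height=10)
in_view = visible(objects, box)
```

A node would subscribe only to the objects returned by such a filter, which is what keeps per-node replica counts well below the total object count.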
Emulab testbed
Results
Discussion




Colyseus enables multiplayer first-person-shooter games to handle
hundreds of players, instead of dozens
since FPS games place very high demands on latency and
consistency, extending this architecture to less demanding game types,
such as role-playing games, should be feasible
adaptation of the commercial Quake II engine suggests that this is feasible
for other games in a production environment
this method opens up many possibilities for cheating (a node could be
modified to request objects that it shouldn’t ‘see’, for instance), but more
work could be done to address those threats.
Related Work


Real-time strategy (Command & Conquer)
• parallel simulation often used, since consistency is very important
• often limited to fewer than 10 players
Online role playing (World of Warcraft, Second Life)



cell based
centralized server or server cluster
Distributed Virtual Reality Environments


similar goals to Colyseus, but application-specific
not catering to common game applications
Issues




why not just evaluate actual gameplay? (custom map, all bots seems a
little suspect)
synchronization decoupling seems to introduce a bottleneck. It’s
still better than a server model since a primary replica might be the
only one on a node, but the node still has to handle all traffic for
that replica
how are node failures (player logs off) handled?
how do you handle updates on objects in a view that is already
inconsistent, especially since the node cannot know whether its view is
consistent?