17 From Taxi GPS Traces to Social and Community Dynamics: A

From Taxi GPS Traces to Social and Community Dynamics: A Survey
PABLO SAMUEL CASTRO and DAQING ZHANG, Institut Mines-TELECOM/TELECOM
SudParis; 9, rue Charles Fourier; 91011 Evry Cedex, France
CHAO CHEN, Institut Mines-TELECOM/TELECOM SudParis; 9, rue Charles Fourier; 91011 Evry
Cedex, France; and Universite Pierre et Marie Curie; 4, place Jussieu; 75005 Paris, France
SHIJIAN LI and GANG PAN, Zhejiang University; Hangzhou, 310027, P.R., China
Vehicles equipped with GPS localizers are an important sensory device for examining people’s movements
and activities. Taxis equipped with GPS localizers serve the transportation needs of a large number of people
driven by diverse needs; their traces can tell us where passengers were picked up and dropped off, which
route was taken, and what steps the driver took to find a new passenger. In this article, we provide an
exhaustive survey of the work on mining these traces. We first provide a formalization of the data sets, along
with an overview of different mechanisms for preprocessing the data. We then classify the existing work into
three main categories: social dynamics, traffic dynamics and operational dynamics. Social dynamics refers
to the study of the collective behaviour of a city’s population, based on their observed movements; Traffic
dynamics studies the resulting flow of the movement through the road network; Operational dynamics refers
to the study and analysis of taxi driver’s modus operandi. We discuss the different problems currently being
researched, the various approaches proposed, and suggest new avenues of research. Finally, we present a
historical overview of the research work in this field and discuss which areas hold most promise for future
research.
Categories and Subject Descriptors: A.1 [General Literature]: Survey
General Terms: Design, Experimentation, Human factors
Additional Key Words and Phrases: Taxi GPS, urban computing, smart cities
ACM Reference Format:
Castro, P. S., Zhang, D., Chen, C., Li, S., and Pan, G. 2013. From taxi GPS traces to social and community
dynamics: A survey. ACM Comput. Surv. 46, 2, Article 17 (November 2013), 34 pages.
DOI: http://dx.doi.org/10.1145/2543581.2543584
1. INTRODUCTION
The past decade has seen a dramatic increase in the number of personal devices such
as cell phones, portable computers, and GPS localizers. These devices leave digital
footprints of their user’s activities, which are a reflection of the economical and societal
interactions of a community. Using techniques from a wide range of fields, researchers
have sought to extract communal behaviours and intelligence from this data to obtain
a better understanding of the underlying dynamics of an individual, community, or
city [Martino et al. 2010; Zhang et al. 2011a]. This understanding can enable many
innovative applications in city planning, traffic management, disease containment, and
Authors’ address: P. S. Castro and D. Zhang, Institut Telecom SudParis, 9 rue Charles Fourier, EVRY Cedex,
91011, France.
This research was supported by the Institut TELECOM “Futur et Ruptures” Program, AQUEDUC–ParisRegion SYSTEM@TIC Smart City Program, and the High-Tech Program of China (863) (No. 2011AA010104).
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or [email protected].
c 2013 ACM 0360-0300/2013/11-ART17 $15.00
DOI: http://dx.doi.org/10.1145/2543581.2543584
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17
17:2
P. S. Castro et al.
green computing, to name a few. Each different source of data reveals different aspects
of the underlying community dynamics, depending on the type of digital footprint.
Vehicles equipped with GPS localizers provide an important type of footprint,
given that public and private vehicles are the main transportation means for a city’s
population. People use vehicles for commuting to and from work, for regular and “irregular” chores, and for leisure activities. By careful analysis of the observed movements
of a population, researchers strive to better understand the demographics of a city, the
distribution of services around a city, the effectiveness of the various transportation
networks, the dynamics of traffic conditions, and the different driving behaviours.
Public transportation vehicles equipped with GPS localizers offer rather predictable
data, as the vehicles in question follow fixed routes under a specified schedule. Similarly, private vehicles equipped with GPS localizers are usually restricted to one user
and follow fairly predictable routes (i.e., to and from work). On the other hand, taxis
equipped with GPS localizers serve the transportation needs of a large number of people driven by diverse needs and are not constrained to a prespecified route. The GPS
traces of a taxi can tell us, with a fair amount of precision, where passengers were
picked up, where they were dropped off, which route was taken, and what steps the
driver took to find a new passenger. By virtue of the diversity of passengers served, as
well as their continual operation, a taxi GPS traces offer a rich and detailed glimpse
into the motivations, behaviours, and resulting dynamics of a city’s mobile population.
In this article, we aim to survey existing research on mining taxi GPS traces as well
as suggest some interesting avenues for future research. Through thorough analysis,
we group the existing body of work into three categories: social dynamics, traffic dynamics, and operational dynamics. We define social dynamics as the work studying the
collective behaviour of a city’s population, as observed by their movement in the city.
Researchers in this field are interested in where people are going throughout the day,
the “hottest” spots around a city, the “functions” of these hotspots, how strongly connected are different areas of the city, and so forth. These movements are motivated by
diverse needs and influenced by external factors (e.g., weather and traffic). A deep understanding of social dynamics is essential for the management, design, maintenance,
and advancement of a city’s infrastructure, and is studied by many different fields.
Giannotti et al. [2011] list a number of questions of particular interest to mobility
agencies, which are a part of the discussion we present herein.
Governed by their underlying desires or needs, people will move around the city
mainly through the road network. Whilst social dynamics aims to understand people’s movementpatterns, traffic dynamics studies the resulting flow of the population
through the city’s road network. Most of this work is generally aimed at predicting
traffic conditions and can be useful for providing real-time traffic indicators and travel
time estimation for drivers.
Operational dynamics refers to the general study and analysis of taxi drivers’ modus
operandi. The aim is to be able to learn from taxi drivers’ excellent knowledge of the
city, as well as to detect abnormal behaviours. The last two categories used mainly the
end points of a taxi’s trajectory (i.e., the pickup and dropoff locations); in the study of
operational dynamics, researchers make use of full trajectories, as the routes taken by
drivers are of utmost importance. Researchers have mined these trajectories in order
to predict future trajectories, suggest strategies for quickly finding new passengers,
and suggesting navigational routes for reaching a destination quickly. Additionally,
new trajectories can be compared against a large collection of historical trajectories to
automatically detect abnormal behaviour.
It is worth stressing that although we have classified the various works into three
categories, they are by no means independent of each other. Most papers, despite focusing on one particular category, deal with subjects relating to the other categories.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:3
Table I. Fields for a GPS Entry with a Sample
Speed
Occupied
Taxi ID Longitude Latitude (km/h) Bearing
flag
Year Month Day Hour Minute Second
10429 120.214134 30.212818 70.38 240.00
1
2010
2
7
17
40
46
As an example, consider the work of Li et al. [2011b] on discovering passenger finding
strategies. Although passenger-finding strategies fall within the Operational Dynamics category, their method requires the extraction of ‘hotspots, which falls under the
Social Dynamics category; additionally, the information produced by their method gives
some insight into the traffic conditions around different parts of the city at different
times, which would fall under the Traffic Dynamics category. This dependence across
categories is something we encounter in most of the papers we surveyed. It is thus
difficult to provide an exhaustive survey of work on mining taxi GPS traces without
surveying all three categories in conjunction.
This article is organized as follows. In Section 2, we begin by providing a formal
definition of the dataset, along with a survey of existing work in preprocessing this
data. In Section 3, we review the work relating to social dynamics; in Section 4, we
analyze the work relating to traffic dynamics; and in Section 5, we survey the papers
focusing on navigational strategies. In Section 6, we present some deployed systems
making use of taxi GPS information. We provide a brief historical overview of the
development of this type of work in Section 7, and we conclude the paper in Section 8.
In surveying these papers, we might reference certain techniques or tools with which
some readers may not be familiar. As providing a description for each of these would
disrupt the flow of the presentation and drastically increase the size of the article,
we recommend that interested readers follow the references provided to seek more
detailed information.
2. DATA PREPARATION
In this section, we define the taxi GPS dataset that will be the basis of the research
surveyed. For concreteness, we present the dataset used by our research group, but
the differences amongst different datasets are of little significance (as seen in Wang
et al. [2011],where three different datasets from three cities in China are presented).
As previously mentioned, the raw data must be prepared in order for it to be suitable
for the work discussed in the sequel. We survey a number of papers addressing the data
preparation problem, including Edelkamp and Schrödl [2003], Schroedl et al. [2004],
Cao and Krumm [2009], Haklay and Weber [2008], Qi et al. [2011], Gonzalez et al.
[2007], Yuan and Zheng [2010], and Zheng et al. [2011].
2.1. Dataset Description
We consider situations where the GPS records of a large number of taxis in a city
are routinely saved to a log file over a number of months, resulting in a very large
dataset. Our dataset was obtained from around 7,000 taxis in Hangzhou, China, over
a period of 12 months, at a sampling rate of approximately once per minute (for each
taxi), resulting in more than a billion GPS entries. Table I lists the fields for each GPS
record, along with a sample entry.
Henceforth, we shall use the variable σ to represent a GPS entry, with fields (in
the same order as in Table I) {id, lon, lat, speed, bear, occ, year, month, day, hour, minute,
second}. Let be the set of all GPS entries.
Although the data for each taxi can be considered as a continuous stream throughout
the full year, we split it up into trajectories, based on whether the taxi is occupied or
unoccupied (the occupancy flag is consistent within each trajectory). Employing the
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:4
P. S. Castro et al.
taxi occupancy flag can yield an advantage over other methods for splitting trajectories,
such as used by Giannotti et al. [2011], where they split GPS trajectories of regular
vehicles by detecting more than 30-minute “pauses” in the transmission of logs.
Definition 2.1. A GPS trajectory τ of length n consists of a series of GPS entries
j
i
σ . . . σ t+n such that for all t ≤ i, j ≤ t + n, σocc
= σocc . A trajectory τ is an occupied
i
trajectory if for all i, σocc
= 1, and a vacant trajectory otherwise.
t
We can then see that the data for a single taxi consists of a series of alternating
occupied and vacant trajectories. The end points of the occupied trajectories yield some
information that will be useful for many of the papers presented.
Definition 2.2. Given an occupied trajectory τ = σ 1 . . . σ n, we define σ 1 as the
pickup point and σ n as the dropoff point.
Let χ ⊂ R2 be the set of distinct latitude/longitude positions in , and define the
boundary latitude and longitudes as:
lonmin = min σlon
latmin = min σlat
lonmax = max σlon
latmax = max σlat
σ ∈
σ ∈
σ ∈
σ ∈
2.2. City Decomposition
The “terrain” of a city is a continuous two-dimensional area (i.e., a subset of R2 ), which
is difficult to work with. It is more practical to decompose the city into separate (usually
disjoint) areas and work with this decomposition. Let D ⊆ R2 be the area covering all
of the GPS entries. Specifically,
D = [lonmin, lonmax ] × [latmin, latmax ]
Definition 2.3. A city decomposition is a pair (, ψ), where is a finite set of “areas”
and ψ : D → is a membership function, where D ⊆ R2 , mapping any location
(x, y) ∈ D to an area given by ψ(x, y) ∈ .
We will often make use of certain functions to characterize areas based on some
statistical criterion. We formally define these functions next and present instances of
them throughout the article.
Definition 2.4. A characterization function f : P(R2 ) → Rd maps any set of points
(specified by latitude/longitude) to a characteristic vector in Rd (given a set X, we denote
its power set by P(X)).
Given a decomposition (, ψ), we can apply a characterization function to an area
A by f (ψ −1 (A)). The usefulness of a characterization function highly depends on the
decomposition used, and we should require these functions to be consistent with the
decomposition, as defined next.
Definition 2.5. A characterization function f is consistent with a decomposition
(, ψ) if, given an area A ∈ split into two “subareas” B, C ⊆ D such that B ∪ C =
ψ −1 (A) and |B| ≈ |C|, then f (ψ −1 (A)) ≈ f (B) ≈ f (C).
The idea behind consistency is that the characterization function in question should
behave uniformly throughout each area, and there should be little loss of information
introduced by the decomposition. Note that the number of characterization functions
that are consistent with a decomposition is a good indication of the quality of the decomposition. In a sense, one can use characterization functions to measure the homogeneity
of the areas in a decomposition.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:5
c
Fig. 1. A 25 × 25 grid decomposition of Hangzhou. Each area is roughly 1km2 . Google
(2011).
The purpose of decomposition is to render the continuous space D manageable for
analysis. A very simple approach could be to use χ directly (i.e., one element for every
distinct lat/lon point), which suffers from no information loss. However, the latitude
and longitude readings obtained from the GPS devices are quite sensitive, so there may
be small variations even when the vehicle is not moving. Given that we have more than
a billion GPS entries, the size of χ would be prohibitively large. With this in mind, we
will review the two most popular approaches to city decomposition: splitting the city
into equally sized grids and using a digital map.
2.2.1. Grid Decomposition. A popular and simple approach to decomposing a city is to
split up the city’s areas into equal-sized grids. This method has been used in many
papers, including Girardin et al. [2008], Phithakkitnukoon et al. [2010], Huang et al.
[2010], Liu et al. [2010b], and Zhang et al. [2011b]. The advantage of this method is
that it is very simple and easy to implement, and one can directly determine the size
of the resulting decomposition by changing the size of the grids. Figure 1 presents an
example of this type of decomposition.
An (m1 × m2 )-grid decomposition consists of an m1 × m2 matrix of grid cells
min
{(i, j)}1≤i≤m1 ,1≤ j≤m2 , where the width of each grid cell is w = lonmaxm−lon
, and the
2
latmax −latmin
. The function ψ is defined as follows, for
height of each grid cell is h =
m1
any (x, y) ∈ D:
ψ(x, y) = arg min {x < latmin + i ∗ h} , arg min {y < lonmin + j ∗ w }
1≤i≤m1
1≤ j≤m2
This type of decomposition is usually consistent with characterization functions that
are based on averages, such as average speed, average number of pickup/dropoffs,
and expected average emissions. This is because these measurements are normalized,
effectively removing the effect of the size of the area. We stress that this usually holds
but is not always the case. Consider, for instance, a grid cell that has a section of a
river on its left half and a part of the road network on the right half. The left subarea
contains no streets, so the value assigned by any characterization function to the left
subarea will be drastically different to that assigned to the original area. This problem
can be somewhat mitigated by using smaller grids, but because of the static nature
of this decomposition, one cannot guarantee that all of the grids will be homogeneous
with respect to the desired characterization functions.
Nonnormalized characterization functions (e.g., max/min speed, total number of pickup/dropoffs) will, for the most part, not be consistent with this type of decomposition, as
the value of the characterization functions will depend greatly on the size and location
of each grid cell.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:6
P. S. Castro et al.
Fig. 2. A street split into various edges.
Fig. 3. Left: Traces from 20 taxis over 1 month; Right: Digital road network drawn over raw traces, with
c
some important areas highlighted. Google
(2011).
The advantages of this type of decomposition is that it is simple and easy to implement. It splits the city into disjoint areas that are easy to visually inspect, which is
of particular interest for qualitative results. As we will discuss later, one can perform
further processing on the grid cells (e.g., clustering and removal of “empty” areas),
yielding a decomposition that is more adapted to the dataset.
One major disadvantage of this decomposition method is that the number and size
of the grid cells is determined independently of the dataset. This results in busy areas
decomposed in the same manner as areas that have very little activity. For instance,
although 1km2 areas may be suitable for suburban or rural areas, it is most likely too
coarse for the downtown area. On the other hand, using very small grids will result
in a large and will be unnecessarily small for areas with little activity (e.g., the
mountainous area in the bottom left of Figure 1, where there are no streets). Another
disadvantage is that many road segments are inevitably grouped together in the same
cell. This can be problematic if one is interested in routing information (e.g., common
routes taken by commuters) or road congestion estimation.
2.2.2. Digital Map. If a digital map of the city is available, one can map each GPS entry
onto a point on the digital map, with the main elements defined next.
Definition 2.6. A digital map is a graph (V, E), where V is a set of vertices and E is
a set of edges. Each edge e ∈ E has the following fields: two endpoint vertices (whose
coordinates correspond to lat/lon values) ev1 and ev2 ; length elength and bearing ebrng .1
A street can consist of multiple edges, as shown in Figure 2. Note that it is split into
segments wherever there is an intersection and/or a change in bearing.
In many cases, a digital map of the city is not readily available. Given the long
time span and large number of taxis available in our dataset, the city’s road network
becomes apparent by simply plotting all of the GPS entries. In the left-hand panel of
Figure 3, we display a sample of the plot obtained from 20 taxis over 1 month, around
the downtown area of Hangzhou. We only plot the pickup and dropoff points, as we
found that including full trajectories clutters the plot. A simple (but arduous) option
is to “draw” the edges and construct a digital map manually. In the right-hand panel
1 Note
that the bearing is the direction from ev1 to ev2 ; simply substract this bearing from 360 to obtain the
bearing from ev2 to ev1 .
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:7
of Figure 3, we display the resulting digital map. It consists of 2,003 edges and 1,585
vertices.
Drawing in a digital map can be an arduous process, so many researchers have
investigated constructing digital maps automatically from digital traces. Edelkamp and
Schrödl [2003] construct a digital map representation (including lanes) by matching
GPS traces to an existing base map and/or inferring road segments by clustering
traces and applying further statistical refinements. Schroedl et al. [2004] infer road
segments and intersections by clustering together the trajectories. Worrall and Nebot
[2007] construct a compact digital map from a fleet of GPS-equipped mining vehicles by
clustering according to their location and heading, and then compressing these many
segments into a set of lines and arcs (according to curvature). Cao and Krumm [2009]
automatically build a digital map from raw GPS traces by “pulling” together traces
into paths, which are then clustered to create a graph of nodes and edges. Fathi and
Krumm [2010] use a localized shape descriptor to represent the distribution of GPS
traces around a point in order to automatically detect road intersections from the GPS
traces. Biagoni and Eriksson [2012] provide a comprehensive survey and comparison
of different approaches to map generation from GPS traces.
One disadvantage of using digital maps as we have described them here is that
they lack some crucial information, such as width of each road segment, the number
of lanes, and orientation of each lane. Gathering this type of information manually
would be quite expensive. There have been a number of papers that aim to use GPS
traces to automatically enhance an existing road network. After the initial digital map
is constructed, Schroedl et al. [2004] establish lane positions by clustering offsets from
a previously found segment centerline. Similarly, Rogers et al. [1999] model roads
via their centerline and lanes as fixed offsets from this centerline. Chen and Krumm
[2010] use a modified Gaussian mixture model for estimating the distribution of GPS
traces across multiple lanes. OpenStreetMap combines GPS traces, satellite images,
and hand-labelled information to produce a very rich digital map [Haklay and Weber
2008]. Alvares et al. [2007] propose finding “stops” and “moves” common to most taxis
to enrich existing GPS trajectories with “landmarks.”
Given a digital map (V, E), one can map each GPS entry to a point on one of the
edges from E. Simply mapping to the closest edge can suffer from different types of
errors, such as mapping to a road network that is perpendicular to the true road
network. There are a number of approaches to map matching that take contextual information into consideration, such as edit distance [Yin and Wolfson 2004] and Fréchet
distance [Alt et al. 2003; Brakatsoulas et al. 2005]. The previous approaches map trajectories globally, that is, only complete trajectories are considered. This renders these
approaches unsuitable for long trajectories, as the computational expense becomes too
large. Some papers have proposed matching trajectories on a local basis, using contextual information such as distance and orientation [Greenfeld 2002; Li et al. 2007],
spatial context, and speed information [Lou et al. 2009], or the influence of neighbouring
GPS points [Yuan et al. 2010]; Chawathe [2007] proposes assigning a confidence score
to different segments of the trajectory and then proceeds to sequentially match the
different segments, beginning with those with the highest confidence score. Liu et al.
[2012a] propose first pruning a set of trajectories by using speed and orientation, then
clustering the remaining segments using distance and orientation, and finally using
B-spline fitting [Schroedl et al. 2004] to fit the clustered traces onto road segments.
A GPS entry σ that has been mapped onto a digital map will result in a new entry σ̄ .
This new entry is identical to σ , except the latitude and longitude may be different (so
that the point “sits on” the digital map), and it is augmented with the fields edge and
dir. The field edge represents what edge e ∈ E it was mapped to, and dir represents
what street direction it is following (1 if going from ev1 to ev2 and −1 if going in the
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:8
P. S. Castro et al.
¯ be the set of all mapped GPS entries. Mapped GPS trajectories
other direction). Let ¯
τ̄ can be defined in the same way as in Definition 2.1, but using entries σ̄ ∈ .
Since trajectories will be used for routing purposes further on, we will be interested
in edge and/or vertex trajectories, as defined next.
Definition 2.7. An edge trajectory ε ∈ E∗ consists of a sequence of edges e1 , e2 , . . . , en
such that all adjacent edges are connected: for all 1 ≤ i < n, {evi 1 , evi 2 } ∩ {evi+1
, evi+1
} = ∅.
1
2
Definition 2.8. A vertex trajectory ω ∈ V ∗ consists of a sequence of vertices
v , v 2 , . . . , v n such that for all adjacent vertices v i and v i+1 , there exists an edge
e ∈ E such that ev1 = v i and ev2 = v i+1 , or vice versa.
1
We will refer to the ith edge in an edge trajectory ε via ε(i) and similarly for vertex
trajectories. Note that one can extract a valid vertex trajectory from a valid edge trajectory, and vice versa. However, due to low sampling rates, the mapped GPS trajectories
may not produce valid edge nor vertex trajectories, as there may be “gaps” (defined
next) between GPS entries, or the mapped edges may not have consistent directions.
Definition 2.9. A gap exists between two mapped GPS entries σ̄ 1 and σ̄ 2 if the two
1
2
edges do not meet: {ev11 ∪ ev12 } ∩ {ev21 ∪ ev22 } = ∅, where e1 = σ̄edge
and e2 = σ̄edge
.
The conditions for valid edge and vertex trajectories can be used as guides for choosing the edges to which the GPS entries should map. After computing a mapped GPS
trajectory, it is also necessary to complete the gaps to obtain valid edge and vertex
trajectories. The way to complete these gaps is unclear, but a simple approach is to
compute the shortest path between the two entries with a gap.
We can define digital map decompositions by the pair ( DM , ψ DM ), where DM = E
and the membership function ψ DM depends on the map-matching algorithm being
used.2
This type of decomposition is consistent with the same type of characterization functions that are consistent with the grid decomposition, and because of the (usually)
small length of road segments, it overcomes many of the problems present with grid
decomposition. The size of the road segments renders this type of decomposition for the
most part consistent with nonnormalized characterization functions such as max/min
speed, number of pickup/dropoffs, and so forth. Obviously, characterization functions
that depend on enhanced information such as lane orientation will not be consistent
with the decomposition if the decomposition does not contain that type of information.
Since taxis are restricted to navigating through the road network, a digital map is
nearly optimal in terms of overall consistency and size. As will be discussed next, this
type of decomposition can be further refined by clustering road segments or splitting
based on certain types of vertices or edges. This effectively combines the advantages
of both grid and digital map decomposition. The main disadvantage of this type of
decomposition is that it relies on a pre-existing digital map, which may not always
be available. Some methods were mentioned for their automatic construction, but the
quality or usefulness of the resulting digital map is not clear.
2.3. Clustering Areas
The areas resulting from the decompositions mentioned earlier can often benefit from
further clustering. This is especially true for the grid-based decomposition, as there
may be many “empty” areas, as well as certain adjacent areas that exhibit the same
type of behaviour.
2 We
may also have situations where DM = V or DM = V × E.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:9
2.3.1. TimeFeature Clustering. The different areas can be characterized by averaging
a set of characterization functions over a fixed time window T (such as 24 hours or
a number of days or months). This time window is split into d disjoint time “bins”
{b1 , b2 , . . . , bd}, where each b j corresponds to a time interval (e.g., 15 minutes, an hour,
12 hours, a day). Let BT be the set of bins corresponding to time period T. The time
units used to define the bins depends on the level of granularity specified.
Assume that we have a set of characterization functions {c1 , c2 , . . . , cn} (see
Definition 2.5) in which we are interested, where for any 1 ≤ i ≤ n, ci : × BT → R
maps an area and time bin to a real value (e.g., average number of pickup/dropoffs,
average speed). Then, for any area A ∈ , each characterization function ci gives rise
to a vector: xi = (x1i , x2i , . . . , xdi ), where xij = ci (A, b j ), resulting in a collection of vectors
{x1 , x2 , . . . , xn}. We denote this collection the TimeFeature of area A.
One can define a distance function that measures the similarity between two
TimeFeatures. This distance function can then be used with a clustering algorithm
(such as k-means clustering) to group together similar areas. We will examine two
works that use TimeFeature clustering and present some results on our dataset.
Froehlich et al. [2009] used this type of clustering to perform a spatiotemporal analysis of a city based on a shared bicycling system. They use a 24-hour window split into
288 bins, corresponding to 5-minute intervals. They average the values of different
characterization functions over the 24-hour window, such as the number of available
bicycles, a measure of the activity of the bicycle station, and number of checked out
bicycles. They use a metric based on Dynamic Time Warping [Sakoe and Chiba 1990]
to allow for temporal pattern shifts between feature vectors. Given these feature vectors and distance function, dendogram clustering [Duda et al. 2001] was used to group
together similar bicycle stations.
Qi et al. [2011] use TimeFeature clustering to uncover the different social functions
of different regions of a city, using varying windows of 1 week, 1 month, and 1 year split
into days (coarse grained) or “quarter” days (fine grained). They use the number of taxi
dropoffs at each area as the characterization function. The resulting TimeFeatures are
normalized, and the “cosine distance” function is used, which is one minus the cosine
of the included angle between two TimeFeatures (considered as vectors in Euclidean
space). Finally, a simple agglomerative clustering method [Xu and Wunsch 2005] is
used to group similar locations. Principal Component Analysis (PCA) [Jolliffe 1986]
was used to reduce the dimensionality of the feature vectors.
We have used this type of clustering with our dataset for Hangzhou. We split the map
into a 200 × 100 matrix of grid cells, each covering around 250m2 . We consideredonly
weekdays and used a 24-hour window split into 96 bins, corresponding to 15-minute intervals. We used the dropoff frequency as the characterization function and performed
k-means clustering with the cosine distance function, setting k = 5 clusters. Figure 4
displays the resulting clusters for grid cells whose dropoff frequency is above a minimum threshold. It is interesting to note that cluster 1 (blue cells) has grouped most of
the important transportation-related destinations together (e.g., the airport, bus and
train stations).
2.3.2. Splitting via Road Hierarchy. A natural way to cluster a digital map is by using
a hierarchical road network. In the United States, roads are classified into 4 levels,
whereas in the European Union, roads are classified into 13 levels. A popular approach
first divides the road network into groups enclosed by the highest-level roads. Each of
these areas is then further subdivided into groups enclosed by the second highest level
roads, and so on. The result is a hierarchical partition of the road network that uses
roads as group boundaries. This approach was used in Gonzalez et al. [2007], Yuan
and Zheng [2010], Zheng et al. [2011], Liu et al. [2011], and Yuan et al. [2012a, 2012b].
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:10
P. S. Castro et al.
Fig. 4. Clustering using dropoff frequency as TimeFeature (only cells with a frequency above threshold
c
plotted). Google
(2011).
In a similar vein, Li et al. [2009, 2011c] constructed a three-level road hierarchy by
classifying roads according to the frequency of their use. Sanders and Schultes [2005]
use graph-theoretic methods to construct a road hierarchy automatically. However,
their approach is designed for large-scale digital maps that include various cities, as
these cities will usually be far apart, facilitating the detection of highways between
them. As such, their method will most likely not be suitable for single-city digital
maps.
2.4. Dealing with Data Problems
In working with real GPS data, there are many problems that arise, which must be
properly addressed to guarantee reliable results. We review some of these problems
along with possible approaches to overcoming them. Although far from being an exhaustive list, we hope that it provides an insight into the type of problems of which
researchers must be aware in order to reduce “tainted” results.
2.4.1. Missing Data Entries. Occasionally, the GPS data is not received properly, with a
lapse of several minutes or even hours between entries; the reasons for this lapse could
be errors in the GPS device or a lack of GPS signal due to location (e.g., in an underground parking area). This results in a big physical jump with no information about
the car’s movement throughout this time. Depending on the problem being addressed,
this data can either be left as it is, the trajectory can be split into two (or more) parts,
or the trajectory can be truncated at the point before the jump occurred.
2.4.2. Erroneous Data Entries. Certain entries contain erroneous data, such as erroneous
lat/lon or time entries. Most of the time, these are isolated entries that can be easily
identified by using contextual information from the surrounding entries. A simple
way to overcome these erroneous entries is by either adopting one of the methods
mentioned in the last subsection or fixing the erroneous entry by extrapolating from
the surrounding entries.
2.4.3. Multiple Drivers. A taxi may be driven by more than one driver, but this information cannot be obtained from the raw GPS traces. As we will see next, we are often
interested in ranking the performance of drivers. In order to do so properly, we must
be able to detect which driver is currently active. This is a fairly complex problem, and
there are no existing solutions. A possible approach is to first detect the area where the
drivers switch by monitoring areas that are always visited at the same time. However,
since driver information is not available, we have no way of verifying the success rate
of such a method.
2.4.4. Occupied Flag Improperly Set/Detected. The state of the occupied flag may not be
properly set, either because the taxi driver does not set his indicator properly or
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:11
because there is a fault in the detection device. This may result in certain taxis being
continuously occupied or vacant. When attempting to extract information from occupied/vacant trips, this type of problem can negatively impact the obtained statistics.
One possibility for dealing with this problem is to compute the proportion of time the
taxi is occupied/vacant and exclude any taxis that have extreme values (e.g., being
occupied more than 95% of the time).
A related, but more difficult, problem is when a taxi picks up a new passenger
immediately after dropping off another. Because of the rate at which GPS entries are
received, we may not be able to determine when one trip ended and when the other
began. This can be observed in popular transport areas such as the airport: a taxi
dropping off a passenger at the airport may find a new passenger immediately.
2.4.5. “Sleeping” Taxis. Although having multiple drivers allows some taxis to operate
at all hours of the day, single-driver taxis must stop to sleep at certain points. It
is important to distinguish between a sleeping taxi and a taxi that is waiting for a
passenger. In a similar manner to what was proposed previously, one could begin by
detecting areas where the taxi is always parked at the same time.
3. SOCIAL DYNAMICS
Having finished surveying the work in preparing the taxi data, we can begin the survey
of existing data mining research. We begin with social dynamics, which usually refers
to the study of the collective behavior of a group of individuals. In the context of
this work, it refers to the study of the collective behavior of the taxi drivers and
passengers of a city, based on their observed movement in the city. This encompasses
problems such as identifying hotspots, mobility patterns, and connectivity between
regions. The problems discussed in this section are by no means an exhaustive list of
what social dynamics entails. Indeed, as we will discuss in the Conclusion and Further
Work section, one of the most important issues in this field is the proper definition of
new problems and potential solutions.
Taxis provide a unique window into the dynamics of human movement in a city. As
opposed to public transportation, which follows a predefined route, taxis take passengers between any two points. One can use the location and times where passengers
were picked up and dropped off, as well as the trajectories followed, to analyze the
activity levels throughout the city and the way people move around the city.
Before reviewing the work specific to taxi data, it is worth mentioning that other
type of digital footprints, such as mobile phone data, have been used to study the
human behaviors and social dynamics. This set of work uses a combination of various
signals such as location traces (e.g., GSM and GPS), proximity information (Bluetooth),
and communication logs (phone calls and SMS) to uncover human networks [Aharony
et al. 2011], predict daily routines [Altshuler et al. 2012], investigate the relationship
between human networks and personal decision making [Aharony et al. 2011], study
long-distance travel patterns [Bekhor et al. 2011], analyze crowd mobility during special events [Calabrese et al. 2010a], investigate the occurrence of anomalous events
[Candia et al. 2008], and investigate the predictability of human behaviour [González
et al. 2008; Song et al. 2010a, 2010b]. The Reality Mining project [Eagle and Pentland
2006] is one of the representative studies and has generated a dataset of 100 users’ locations and communication logs collected over a year, used by to discover daily routines
[Farrahi and Gatica-Perez 2008, 2011] and friendship networks [Eagle and Pentland
2009]. Other well-known studies are MIT SENSEable City Lab’s Mobile Landscapes
[Ratti et al. 2009] and Eigenplaces [Calabrese et al. 2010b], which are examples of
urban analysis for identifying hotspots and clustering similar areas in metropolitan
areas based on cellular data. The Mobile Millenium project (http://traffic.berkeley.edu)
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:12
P. S. Castro et al.
uses GPS data from cellular phones to provide real-time traffic information back to the
mobile phone users.
We begin by discussing methods to extract hotspots in a city and characterize them
based on the passenger pickup-dropoff activity.
3.1. Extracting Hotspots
The ability to detect the most frequented locations in a city can be useful for urban
planning, public transportation route design, tourism and agencies, security agencies,
amongst others. Taxis provide a precise indication of people’s desired destinations. In
this section, we discuss some methods for detecting a city’s hotspots.
There have been extensive studies on using GPS trajectories from personal devices
(e.g., cell phones) to detect significant places. Ashbrook and Starner [2003] define
significant places as areas with a faded GPS signal (e.g., as would occur when inside a
building) for a minimum amount of time. Alvares et al. [2007] define stops as polygons in
R2 where trajectories spend a minimum amount of time. The authors assume that a set
of candidate stops are given and present a simple algorithm for determining which are
true stops. Palma et al. [2008] extend this by clustering sections of trajectories based on
their speed, possibly suggesting more significant places than originally provided. Zheng
et al. [2009] define stay points as areas (bounded by a distance threshold) where a user
has stayed for a minimum amount of time. Aside from using mobile data, there is some
related work using traces from other geopositioned devices for extracting interesting
places. A representative study is the GeoLife project [Zheng et al. 2010], which collected
the raw GPS trajectories (over various transportation modes) of 167 users over 3 years.
This dataset was used by a number of projects for inferring transportation modes
[Zheng et al. 2008], and for location and travel route recommendations [Zheng et al.
2009]. Calabrese et al. [2010b] use eigendecomposition on the time series frequencies
of WiFi usage around MIT’s campus and cluster regions around the campus based
on their functionality. Girardin et al. [2008] discover the presence and movement of
tourists using georeferenced photos and mobile phone data. Another well-known work
is Dartmouth’s CenceMe project [Miluzzo et al. 2008], which aims to create intelligent
mobile sensor networks capable of sensing nearby friends and their current activity.
The works of greatest relevance to this article are those that use the traces from
GPS-equipped vehicles. Wang et al. [2009] use passenger pickup and dropoff points
to analyze the location and travel patterns to and from hotspots. Liu et al. [2010b]
use vehicular speed information to quantify the “crowdedness” of an area, and define
hotspots based on these crowdedness values. Yuan and Zheng [2010] define landmarks
as the road segments most frequently traversed by taxi drivers; the purpose of these
landmarks, however, is mainly to facilitate path planning. Chang et al. [2010] proceed
by first filtering trajectories using contextual information (weather, etc.), then clustering GPS points into areas and finally defining a hotness score for each area according
to the number of taxi requests divided by the size of the area. Li et al. [2011a] define
hotspots as areas where there is a high level of passenger pickups and propose a method
for predicting the amount of pickups at each hotspot by using a variant of the AutoRegressive Integrated Moving Average (ARIMA), a well-known prediction method in
time series analysis [Box et al. 2008].
With taxi datasets, we know with reasonable accuracy where passengers have been
dropped off. This information is very useful, as it can be used to directly detect places of
interest. By simply counting the number of dropoffs at different areas from , we can
directly compare the importance of different locations. We can add further contextual
information such as time of day, season of the year, and so forth, to uncover more
meaningful results. In Figure 5, we display the weekday hotspots using the digital
map decomposition. It is somewhat surprising that the airport (at the bottom right of
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:13
Fig. 5. Weekday vertex hotspots obtained by number of dropoffs in (or close to) each vertex. Circle sizes are
proportional with the number of dropoffs.
the map) does not have a higher rating. This is most likely because when passengers
are dropped off at the airport, taxi drivers immediately pick up a new one. Since there
may not be any unoccupied entries between these two passengers, it will appear that
the original passenger was never dropped off at the airport. This was discussed in
Section 2.4.
3.2. Urban Computing
Urban computing is concerned with understanding the mobility patterns of a city’s
population. This type of research can be done in many different ways, including characterizing locations based on their functions, measuring the frequency of different
trajectories between two points, and measuring the “linkage strength” between two
areas in a city. The results uncovered are useful for many purposes, including determining the effectiveness of the road network and public transportation systems, and
providing guidance for changes or additions that need to be made to the road network
and/or public transport system in order to accommodate the population’s movement
patterns.
In order to ensure that roads are capable of satisfying traffic demand, researchers
have sought to extract frequent trajectories used by drivers. Girardin et al. [2008]
uncover the movement and presence of tourists using georeferenced photos and mobile
phone data. Wang et al. [2009] use passenger pickup and dropoff points to analyze
the travel patterns to and from hotspots. Chen et al. [2010a] use taxi trajectories to
discover frequent trajectories between a given source and destination. Zhang et al.
[2012] find patterns in origin-destination flows in a large city in China; the authors
cluster together similar traces and find that traces in the same cluster have origins
and destinations with similar social functions. Liu et al. [2009b] use taxi traces and
smart cards from the bus and metro to spatially and temporally quantify, visualize,
and examine urban mobility patterns; for taxi data, they present visualizations of the
volume of taxi trips throughout the week, the strength of connections between different
areas of city, and the change in taxi usage between the weekend and weekday.
Zheng et al. [2011] discover inefficient connectivity between two regions by looking
at actual versus expected distance required to travel between these two regions, as
well as the expected speed and actual volume of traffic, to determine whether their
level of connectivity satisfies the demand of travel between them. They evaluate their
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:14
P. S. Castro et al.
results using a taxi dataset in Beijing and demonstrate that the flawed areas uncovered
by their algorithm agree with a new subway line added in the same area at a later
date. Veloso et al. [2011b] present a visualization of the spatiotemporal variation, main
pickup and dropoff areas, and busiest periods of taxi operation in Lisbon, Portugal; the
same group also argued that trip distance, duration, and income follow Gamma and
Exponential distributions [Veloso et al. 2011c]. Liu et al. [2012c] extract the temporal
variations in taxi pickup and dropoff patterns and demonstrate that they correlate well
with the “land use” in those areas (i.e., commercial, residential, recreational). Giannotti
et al. [2011] present a platform for mobility analysis of GPS traces (M-Atlas) that can
be used to query and plot the activity levels throughout the week, trip length and
travel time distributions, clustering of routes to/from a particular point, and linkage
between different areas. The authors also provide a method for detecting special events
(such as sport matches), along with the trips related to this event. As an extension to
using solely taxi GPS data, Hu et al. [2009] propose combining taxi traces and mobile
phone positioning systems to model population density and travel time, distance, and
frequency.
Taking this idea further, Yuan et al. [2012a] combine a hierarchical road network
decomposition, general human mobility (gathered from vehicle, mobile, or social network traces) and points of interest of each region (i.e., restaurants, shopping malls) to
uncover the functionality of different regions. Their approach uses a topic model-based
method to identify the different functions by treating a region as a document, a region’s
function as a topic, human mobility amongst regions as words, and a region’s points of
interests as metadata. The model used is a generative one, and they proceed to cluster
different areas according to their “topic” distribution and quantify the “intensity” of a
region’s function using Kernel Density Estimation. The authors evaluate their algorithm using a large dataset over 2 years in Beijing consisting of a points of interest
database and a set of trajectories generated by 12,000 taxis; in comparison with other
approaches, their proposed method produces very good results.
There has also been some work in characterizing the physical laws of human movement by means of taxi trajectories. This type of work has its roots in biology where the
movement of animals are studied. It has been observed that the movement of many
animals follow a Lévy flight model, which is a random walk that generalizes Brownian
motion. It can be detected by verifying whether the jump length follows a power law
behaviour. Chen et al. [2010b] studied the distribution of travel time and distance of
taxi trips and showed that they can be approximated by a power law distribution;
additionally, they also showed that most trips are short in both time and distance.
However, Jiang et al. [2009] had previously shown that using taxi data in order to provide evidence of human mobility as a Lévy flight is mainly due to the underlying street
network. Liu et al. [2012b] study this problem on a large 7-day database of taxi GPS
traces in Shanghai and argue that trip distances do follow the power law distribution,
but that the direction distribution is not uniform.
4. TRAFFIC DYNAMICS
The collective movement of vehicles throughout a city’s network results in congestion
levels that vary throughout different areas and time periods. These congestion levels
influence important factors for drivers such as the travel time between two points,
the expected speed, and potential adverse traffic events (e.g., accidents). Although we
discuss some of these measures in this section, they are by no means exhaustive but
are merely a reflection of the work that has been published so far. An understanding of
the dynamics of these congestion levels, or the traffic dynamics, can be useful for the
type of research discussed in the previous section, as well as for operational dynamics,
which we will discuss in Section 5. Additionally, they can be used to analyze certain side
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:15
effects of vehicle use, such as estimating pollution levels in a city [Gühnemann et al.
2004]. Given that taxi drivers are continuously driving around the city, the collected
GPS traces are a natural source for estimating the travel time between two points.
Zhang et al. [2007] demonstrated that using taxi GPS data for estimating travel time
and speed conditions is practical by performing an error analysis of taking simple
averages over historical data.
Monitoring and predicting traffic conditions can provide indications of the activity
levels in a city and can be useful for streamlining the flow of vehicles to reduce congestion levels. It has been observed that traffic generally follows a regular pattern
throughout the day, and many groups have studied these patterns to obtain a better understanding of these dynamics. Researchers have used a vast array of different
methods for estimating traffic conditions.
Wen et al. [2008] use GPS-equipped taxis to analyze traffic congestion changes
around the Olympic games in Beijing; note that this is an ex post facto analysis of
traffic conditions. Schäfer et al. [2002] used GPS-enabled vehicles to obtain real-time
traffic information. By considering congested roads as those where the velocity is below
10km/hr, the authors demonstrate that a visualization of traffic conditions around the
city can be used to detect congested and blocked road segments. Giannotti et al. [2011]
detect traffic jams by searching for groups of cars close together that are all moving
slowly. Gühnemann et al. [2004] use GPS data to construct travel time and speed estimates for each road segment, which are in turn used to estimate emission levels in
different parts of the city. Their estimates are obtained by simply averaging over the
most recent GPS entries. Šingliar and Hauskrecht [2007] studied two models for traffic
density estimation: conditional autoregressive models and mixture of Gaussian trees.
This work was designed to work with a set of traffic sensors placed around the city
and not with GPS-equipped vehicles. Peng et al. [2012] decompose a city’s traffic flow
into a linear combination of three types of trips: travel between residential and work
areas, travel between work areas, and leisure trips; they propose a method for finding
these three coefficients, thereby producing a rough estimate of traffic flow. Lippi et al.
[2010] use Markov logic networks to perform relational learning for traffic forecasting
on multiple simultaneous locations, and at different steps in the future. This work
is also designed for dealing with a set of traffic sensors around the city. Su and Yu
[2007] used a Genetic Algorithm to select the parameters of a SVM, trained to predict
short-term traffic conditions. Herring et al. [2010] use Coupled Hidden Markov Models
[Brand 1997] for estimating traffic conditions on arterial roads. They propose a sophisticated model based on traffic theory that yields good results. Yuan et al. [2011a] used
both historical patterns and real-time sensory information to predict traffic conditions.
However, the predictions they provide are between a set of “landmarks” that is smaller
than the size of the road network. Pang et al. [2011] propose using an adaptation of
likelihood ratio tests (a technique mainly used in epidemiological studies) to describe
traffic patterns and uncover unexpected events. Castro et al. [2012] propose a method
to construct a model of traffic density and automatically determine the capacity of
each road segment using a large database of taxi GPS traces; by pairing these two
pieces of information, one can obtain accurate predictions of future traffic conditions
and potential traffic jams.
Liu et al. [2011] aim to uncover causal relationships between regions considered
as traffic outliers. The authors first decompose a city into a set of linked regions,
where the links are obtained from taxi trajectories, and identify as outliers those links
that are most different from both their spatial neighbours (using standard Euclidean
distance) and their temporal neighbours (using Mahalanobis distance). Using this information, the authors present an algorithm for constructing outlier trees, thereby
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:16
P. S. Castro et al.
revealing causal relationships amongst outliers. They further propose another algorithm that generates the most frequent subtree from the outlier trees, which can
potentially reveal underlying problems in the existing road network. In their experiments, the authors’ proposed method was successfully able to detect two known events
that produced anomalous traffic patterns.
Accurate estimates of the travel time between two points in a city can be used for
many different purposes, such as fare estimation and path planning. Blandin et al.
[2009] use kernel methods [Scholkopf and Smola 2002] to obtain a nonlinear estimate
of travel times on “arterial” roads; the performance of this estimate is then improved
through kernel regression. Lou et al. [2009] use taxi trajectories to estimate turn
probabilities at different intersections and the average speed of road segments at
different times. Yuan and Zheng [2010] propose constructing a graph whose nodes are
landmarks. Landmarks are defined as road segments frequently traversed by taxis.
They propose a method to adaptively split a day into different time segments, based
on the variance and entropy of the travel times between landmarks. This results in an
estimate of the distributions of the travel times between landmarks. Balan et al. [2011]
use simple arithmetical means of historical trajectories to estimate the travel distance
between two points. They use both grid decomposition and a dynamic decomposition
method based on the k-nearest neighbour method to group together similar start/end
locations.
5. OPERATIONAL DYNAMICS
The information extracted from taxi GPS traces can also be used for the benefit of
drivers. In this section, we discuss some existing methods for providing useful information to drivers and suggest some future avenues of research.
5.1. Ranking Drivers
Given the large number of taxis available, it is inevitable that some will be “better”
than others. It follows that in order to extract useful information from these drivers,
we should first rank their performance. The ranking criteria used depends on the type
of behavior being analyzed. In Section 5.2, we discuss methods to uncover efficient
strategies for finding new passengers. There are a number of methods for ranking
drivers based on their efficiency of finding passengers, and we discuss a few next.
—One approach ranks drivers based on their average daily income, since one could
assume that drivers generating high levels of profit are efficient at finding highrevenue passengers quickly. This should, of course, be normalized with respect to
the operating time of each taxi, as many taxis are not operational 24 hours per day.
This approach was used in Liu et al. [2010a] and Yuan et al. [2011a, 2011b]. Liu
et al. [2009a] performed an analysis and comparison of top and ordinary drivers,
ranked according to daily income, with the hope of revealing the top drivers’ mobility
intelligence.
—Li et al. [2011b] rank drivers based on the occupied distance covered by each taxi, as
this is an indication that they know where to pick up “good” passengers that result
in longer trips.
—An alternate method ranks drivers based on the proportion of time they spent occupied; that is, if the taxi is occupied most of the time, this is a good indication that the
driver is effective at finding new passengers [Ge et al. 2010].
—One can also rank drivers based on the average time that it takes to find a new
passenger, as this is a direct indication of the driver’s passenger-finding efficiency.
In Section 5.3, we discuss methods to use the information in our dataset for path
planning between two points. As discovered by Liu et al. [2010a], top drivers usually
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:17
take their passengers along the fastest route to their destination in order to get a new
passenger quickly. Thus, ranking drivers based on their average daily income (as mentioned earlier) would be an appropriate ranking for this problem. An alternate method
of ranking would be to rank the drivers according to the average time taken between
one or more common source-destination pairs. This method has the disadvantage that
the reliability of the ranking depends on how frequently passengers are driven between
the chosen source-destination pairs.
5.2. Passenger/Taxi-Finding Strategies
Discovering good strategies for drivers to find new passengers or for passengers to find
a taxi is a problem that has previously been investigated by a number of groups. Most
papers have focused on finding demand hotspots to direct the navigation of unoccupied
drivers (or waiting passengers). Chang et al. [2008] find demand hotspots by extracting
the time and environmental contexts of a set of taxi requests, clustering these requests
using k-means and agglomerative hierarchical clustering, and finally, ranking these
clusters. Palma et al. [2008] use the speed of vehicles in a dataset of trajectories to
find “interesting places” by means of a density-based clustering algorithm. Yue et al.
[2009] use simple nearest-neighbor clustering to group taxi pickup and dropoff points
and discover attractive areas as well as the attractiveness amongst different areas.
Veloso et al. [2011a] explore the relationships between dropoff and pickup locations.
Phithakkitnukoon et al. [2010] use a grid decomposition and a naive Bayesian classifier
to predict vacant taxis in different areas. In addition to aiding passengers seeking taxis,
this can be used to help guide drivers to areas with high demand but few vacant taxis.
Ge et al. [2010] cluster the pickup points of the top drivers to use as recommended
pickup points for other drivers. Hu et al. [2012b] extend this idea by creating a pickup
tree with the pickup points with highest probability; the authors argue that this method
is more suitable for situations where there is a set of vacant taxis (as opposed to a single
one) in the same area. Zheng et al. [2012] model the probability of taxis leaving their
current road segment as a Nonhomogeneous Poisson Process [Ross 2006] and use this
model to estimate the waiting time for taxis at different locations and at different times;
these estimates are then used to provide a recommender system for people searching
for taxis.
Liu et al. [2010a] use k-means clustering to uncover the spatiotemporal preferences
of the top drivers in a city. Statistical analysis reveals that most top taxi drivers (ranked
according to profit) choose similar spatiotemporal areas. The authors discovered the
somewhat surprising facts that top drivers strived to drop off passengers as quickly as
possible in order to serve as many passengers as possible; additionally, they chose to
operate in areas other than the Central Business District.
Takayama et al. [2011] use source/destination data of occupied taxis over a 2-year period and a series of survey results from the drivers to propose promising “waiting/cruising” locations to taxi drivers. Their method is based on surveys given to drivers, which
is a very inefficient way of obtaining data, prone to human error, and difficult to continue indefinitely. Lee et al. [2008] use k-means clustering to split a road network into
different areas. They then perform a temporal analysis to create a time-dependent
pickup pattern within each area. Their analysis suggests that taxis should go to the
nearest area with demand to pick up new customers. The simple approach is able to
find clusters with highest demand. However, as Liu et al. [2010a] demonstrated, in order to maximize profit, a taxi driver may not necessarily want to base his choice solely
on demand. A balance between profit maximization and demand coverage is necessary.
Li et al. [2011b] categorize the observed passenger-finding strategies based on time,
location, whether they are “hunting” or “waiting”, and whether the driver remains in
a local area or travels a longer distance to find a new passenger. The authors then use
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:18
P. S. Castro et al.
a form of Support Vector Machine (SVM), L1-Norm SVM [Bi et al. 2003], to determine,
based on the current time and location, whether the driver should hunt, wait, stay
locally, or travel a longer distance.
Yuan et al. [2011a, 2011b] automatically extract “waiting areas” for taxis based on
the distance between consecutive GPS points. The authors then compute the probability of picking up a passenger based on the current time and the road segment or
waiting area. This information is used to provide a recommendation system for drivers
and passengers. Similarly, Li et al. [2011a] use extracted hotspots and distance to
hotspots to help drivers find new passengers. Qian et al. [2012] extract pickup points
and formulate the taxi-routing problem as a Markov Decision Process [Puterman 1994]
between pickup points; however, this approach ignores the destination desired by each
new potential customer and is evaluated only on success probabilities extracted from
prior data.
As mentioned previously, most of these works focus on finding areas where taxis
can wait for new passengers. However, as shown in Takayama et al. [2011] and Li
et al. [2011b], a driver may often want to hunt or cruise. Powell et al. [2011] construct
a spatiotemporal profitability map based on historical data to guide taxis on a local
basis. Yamamoto et al. [2010] provide routing strategies for multiple taxis using fuzzy
clustering mechanisms. Hu et al. [2012a] formulate the taxi driver’s task of hunting
for new passengers as a decision problem at each intersection and propose solving it
using probabilistic dynamic programming. Nevertheless, it is not clear whether a clear
“microstrategy” for finding passengers can be extracted by mining past taxi trajectories:
the strategies employed by top drivers would have to be fairly consistent or predictable.
Veloso et al. [2011b] perform a predictability analysis of the next pickup area given
dropoff features. Their results show there is only a 54% predictability rate, suggesting
that hunting/cruising trips are largely random.
A new research direction is taxi ridesharing, which has been explored by a number
of groups. Tao [2007] give an overview of one such service, algorithms for rideshare
matching, and an empirical evaluation of a field trial in Taipei. Chen [2010] propose
a dynamic ride-sharing system for Vehicular Ad-hoc Networks (VANETs) and conduct
a simulation to estimate the fuel savings resulting from such a system. d’Orey [2012]
also perform an empirical evaluation of ridesharing using simulations. Lin et al. [2012]
propose an algorithm for optimizing routing of a rideshare service that aims to minimize operating costs while maximizing customer satisfaction. Finally, Ma et al. [2013]
analyze the potential passenger coverage increase and travel mileage decrease that
could result from offering a taxi-ridesharing service. They further propose a dynamic
ridesharing service that achieves its goal by fast taxi searching and lazy shortest path
algorithms.
5.3. Route Planning
The travel time and traffic estimates discussed in Section 4 can be used directly to
perform path planning from a source to a destination location. The generalized routing
problem has been studied extensively for (at least) four decades. Popular techniques
that have often been used are dynamic programming [Cooke and Halsey 1966], variations of Djikstra’s algorithm [Ding et al. 2008], and variations of the A∗ algorithm
[Kanoulas et al. 2006].
Edelkamp and Schrödl [2003] construct a map and use existing methods (such as
Djikstra’s algorithm) to perform planning. Šingliar and Hauskrecht [2008] propose
several approximation strategies based on Monte Carlo sampling that can be used
to solve route-planning problems in stochastic transportation networks, formalized as
semi-Markov Decision Processes [Puterman 1994]. Li et al. [2009a, 2011c] construct a
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:19
hierarchy of roads based on frequency of use and perform planning from a source to
a target by trying to travel through the highest hierarchy roads. Yuan et al. [2011a]
combine historical traffic patterns, real-time traffic information, and driver behaviours
to compute shortest-time driving routes.
There has been some work on finding routes within a prespecified time frame.
Kanoulas et al. [2006] extends A∗ for finding shortest paths within a specified interval (rather than a fixed time). Using stochastic modeling of travel delay on road
networks, Lim et al. 2010 present a stochastic motion-planning algorithm that can be
used to find paths that maximize the probability of reaching a destination within a
particular travel deadline.
Since top drivers usually know the city’s road network very well, by observing their
behavior we may be able to uncover good path-planning strategies that go beyond the
techniques mentioned previously. Gonzalez et al. [2007] find the fastest path between
two points by partitioning the map into a road hierarchy, extracting frequently traveled
road segments to use as “hints” and precomputing high-benefit paths for each area.
The data is mined from road speed sensors. Yuan and Zheng [2010] and Yuan et al.
[2013] construct a time-dependent landmark graph based on a large dataset of taxi
trajectories. The routing algorithm first finds a rough route on the landmark graph,
then this is refined to a route on the underlying road network. By estimating travel
time distributions, the authors allow travel times to behave stochastically, which may
yield more accurate representations.
Bastani et al. [2011] propose defining new transportation routes by mining through
and combining multiple taxi trajectories. The authors suggest that these new routes
could be used by a mini-shuttle transportation system that lies somewhere between
taxis and buses.
Chen et al. [2013] leverage taxi GPS traces to suggest nightly bus routes. Their
approach works by first clustering dense passenger pickup/dropoff areas into hotspots
and then using these hotspots as candidate bus stops. The authors propose two heuristic
algorithms to generate candidate bus routes and evaluate their effectiveness on a large
taxi GPS dataset.
5.4. Anomaly Detection
Given a fixed source and destination, there are undoubtedly certain trajectories
favoured by most taxi drivers. By collecting the trajectories from many taxis over
an extended period of time, we may be able to automatically recognize not only these
“normal” trajectories but also “anomalous” trajectories. An anomalous trajectory can
be caused by external factors such as accidents or the closure of a main road, but it
may also be caused by fraudulent drivers trying to charge more money from passengers. The ability to automatically detect anomalous trajectories can thus enable one
to automatically detect adverse traffic events, as well as preventing drivers to take
advantage of passengers unfamiliar with the city.
Li et al. [2009b] identify outlier road segments by detecting drastic changes between
current data and historical trends. Liao et al. [2010] use conditional random fields to
label anomalous taxis, coupled with an active learning scenario, where human interaction can help guide the learning. Balan et al. [2011] label trajectories as anomalous by
a simple mechanism: any trajectory with a distance twice as long as the straight line
distance between the start and end positions, or any trajectory with an average speed
lower than 20 km/hr or higher than 100 km/hr.
Chawla et al. [2012] propose a two-step method for inferring the root cause of traffic
anomalies. First, the city is divided into disjoint regions, and anomalous links between
different regions are identified by detecting deviations from historical norms. In the
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:20
P. S. Castro et al.
second step, the authors combine a generative model of the traffic flow in the city and
link adjacency matrices to uncover the probable sources of the observed anomalies.
Zhang et al. [2011b] propose iBAT, a method based on isolation trees and a grid
decomposition, to solve this problem. The authors maintain a set of historical trajectories and determine whether new trajectories are isolated from this set by randomly
selecting grid cells from the new trajectory and determining how many of the historical trajectories also contain this grid cell. Since the method is based on sampling,
the process must be repeated a number of times for each trajectory in order to obtain
an anomaly score that indicates the degree of anomalousness of the new trajectory.
Through the use of a testing set of manually labelled trajectories, the authors verified
the accuracy of their proposed method.
Despite its accuracy, iBAT has a number of shortcomings. The most important one
is that it only works with completed trajectories, disqualifying it from being used for
real-time fraud detection. Additionally, because it is selecting grid cells independently,
it does not maintain contextual information and is thereby unable to detect looping
trajectories—that is, trajectories that go back on its steps and return along the same
route. To address these issues, Chen et al. [2011] propose iBOAT, an algorithm that
can detect anomalous trajectories in real time and avoid the shortcomings of iBAT.
The main idea behind iBOAT is to compare subsequences of the new trajectory against
subsequences of the historical trajectories. If there is enough support, they increase
the size of the subtrajectory from the trajectory we are testing; otherwise, the point is
labelled as anomalous and the process is repeated from the next point. By doing so,
contextual information is preserved without sacrificing accuracy, as confirmed by the
empirical evaluation. Sun et al. [2012] extend this work by proposing the use of an
inverted indexing mechanism to guarantee real-time monitoring of anomalous trajectories; the authors further perform an analysis of the types of anomalous trajectories
observed in the dataset.
More recently, Ge et al. [2011] proposed a similar method for detecting taxi fraud.
Their method uses a grid decomposition and complete trajectories in a similar way
as done in Zhang et al. [2011b] and Chen et al. [2011]. They compute two pieces of
evidence for detecting anomalous trajectories. The first involves computing the independent components (using Independent Component Analysis) of a set of trajectories
and then computing the coding cost (which is essentially the entropy) of a trajectory’s
independent components. The second is a method for determining the expected distance
for the most common routes and computing how much a trajectory’s distance differs
from the norm. These two pieces of evidence are combined using Dempster-Schafer theory. Although their experimental results fail to convince the reader that their method
provides an advantage over standard density-based methods, they provide mechanisms
for differentiating between malicious detours and detours due to traffic interruptions
or poor knowledge of the area.
5.5. Route Prediction
GPS devices are mainly used for providing driving directions after specifying a desired
destination. There has been some recent work on using GPS trajectories to predict
a user’s route and/or destination based on historical information. The ability to predict routes can be useful for a number of problems, including automatically providing
driving directions without user input and providing warnings about possible hazards
along the predicted path. Lou et al. [2009] compute a simple distribution over possible
destinations given the taxi’s source. Patterson et al. [2003] learn a Bayesian model of a
traveller’s current mode of transportation and most likely route. Krumm and Horvitz
[2006] also use Bayesian inference to predict the drivers destination as the trip progresses based on both the drivers history of destinations and the trips for a group
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:21
of drivers. Liao et al. [2007] use a hierarchical Markov model to predict a user’s daily
movements, as well as automatically detect significant locations. Froehlich and Krumm
[2008] exploit the regularity of regular drivers to predict a driver’s end-to-end route,
using the driver’s history of trajectories. Ziebart et al. [2008] use inverse Reinforcement Learning [Ng and Russell 2000] to predict turns, routes, and the destinations
of a driver. Monreale et al. [2009] construct a decision tree from trajectory patterns
[Giannotti et al. 2007] to predict the next location of moving objects; however, they use
the information from all previous trajectories through a particular area to form their
predictions.
More recently, Xue et al. [2013] proposed an approach for overcoming data sparsity
issues when predicting destinations. Their approach decomposes historical trajectories
into subtrajectories and then connects these to produce “synthesized” trajectories, allowing them to give predictions for an exponentially larger number of trajectories than
possible when using only complete historical trajectories.
6. DEPLOYED SYSTEMS
Many of the ideas discussed so far can be exploited in the development of a commercial
product. However, this has not been greatly pursued. In this section, we wish to list a
few of the existing deployed systems that are used to manage GPS-enabled vehicles.
Schäfer et al. [2002] use the real-time traffic information obtained from GPS-enabled
vehicles to provide a realistic travel time and optimal route suggestion for individual
and commercial users in Berlin, Vienna, and Nuremberg. Liao [2003] describes a
taxi-dispatching system used in Singapore that uses the GPS location of taxis and the
locations of nearby passengers. Balan et al. [2011] deployed their system to predict
fares and trip durations in Singapore. CabSense (http://www.cabsense.com) is a smartphone app that uses the real-time information from GPS-equipped taxis to aid the
passenger in finding a taxi. Cabspotting (http://stamen.com/clients/cabspotting) traces
taxis in the San Francisco Bay area, which has been used by artists and researchers
to visualize the underlying social dynamics of the city. TaxiTrackTM is a real-time taxi
tracking software that can facilitate taxi dispatching. Waze (http://www.waze.com) is
a community-based GPS traffic and navigation app.
7. HISTORICAL PERSPECTIVE
We grouped the existing work into three main categories, and we would like to complement this with a historical overview of these works. Although GPS devices have
been around since around the 1970s, for the first 30 years or so they were used almost
exclusively for military purposes, since the signal for civilian use was purposefully
degraded (known as Selective Availability). This degradation was removed in 2000,
enabling their widespread use in civil sectors. It should be noted that the limitations
imposed by Selective Availability had been overcome a few years earlier by the use of
Differential GPS; however, their widespread commercial success followed the removal
of Selective Availability. Since then, the use of GPS devices has risen dramatically, and
they are present in a significant number of vehicles and mobile phones. In this section,
we will only discuss works related to GPS-equipped taxis, as this is the focus of this
article.
Although research papers using GPS-equipped vehicles began appearing around
2000, papers using GPS-equipped taxis only began appearing in 2003–2004, where
they were used for a real-time dispatching system in Singapore [Liao 2003] and for
real-time monitoring of traffic emissions [Gühnemann et al. 2004]. In 2007, Zhang
et al. demonstrated that GPS-equipped taxis can be effectively used for estimating
speed and travel time. In 2008, there were a couple of papers on uncovering hotspots for
finding passengers [Chang et al. 2008; Lee et al. 2008], predicting the turns, routes, and
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:22
P. S. Castro et al.
destinations of a particular user [Ziebart et al. 2008], and analyzing traffic conditions
during the Olympic games in Beijing [Wen et al. 2008].
In 2009, there was a big surge in the number of papers using GPS-equipped taxis
for a number of purposes, in particular for urban computing [Hu et al. 2009; Jiang
et al. 2009; Liu et al. 2009b; Wang et al. 2009]. In addition, there was some work on
detecting outlier road segments [Li et al. 2009] and automatic map construction [Lou
et al. 2009].
In 2010, Microsoft Research Asia addressed the problems of finding hotspots
and estimating travel time in order to provide driving directions [Yuan and Zheng
2010]; Chang et al. [2010] and Liu et al. [2010b] also proposed mechanisms for
finding hotspots, whereas Herring et al. [2010] came out with a method for modeling
traffic conditions on arterial roads. Researchers from MIT performed an analysis of
passenger-finding strategies [Liu et al. 2010a] and methods for predicting vacant taxis
in a city [Phithakkitnukoon et al. 2010]; Yamamoto et al. [2010] also provided routing
strategies for taxis searching for new passengers. Finally, researchers from Zhejiang
University produced a couple of papers on urban computing [Chen et al. 2010a, 2010b].
In 2011, there was an increased number of papers focusing on this area of research,
especially in regards to passenger-finding strategies [Li et al. 2011a, 2011b; Powell et al.
2011; Takayama et al. 2011; Veloso et al. 2011a, 2011b; Yuan et al. 2011b; Yuan
et al. 2011c] and urban computing [Qi et al. 2011; Veloso et al. 2011b, 2011c; Zheng
et al. 2011b]. Additionally, there was some work on route planning [Yuan et al. 2011;
Bastani et al. 2011], traffic monitoring [Liu et al. 2011; Pang et al. 2011], anomaly
detection [Zhang et al. 2011; Ge et al. 2011; Chen et al. 2011], and traffic dynamics
[Balan et al. 2011].
We have seen a continuation of a large body of work throughout 2012 and 2013, with
papers related to map construction [Yuan et al. 2012b; Biagioni and Eriksson 2012; Liu
et al. 2012a], urban computing [Yuan et al. 2012a; Zhang et al. 2012; Liu et al. 2012; Liu
et al. 2012], traffic dynamics [Peng et al. 2012; Castro et al. 2012], passenger-finding
strategies [Hu et al. 2012a, 2012b; Qian et al. 2012; Zheng et al. 2012; Ma et al. 2013],
route planning/prediction [Xue et al. 2013; Chen et al. 2013; Yuan et al. 2013], and
anomaly detection [Chawla et al. 2012].
For ease of reference, we summarize the previous research work according to the
time line in Tables II and III. In addition, we group them by categories in Tables IV,
V, VI, VII, and VIII.
8. CONCLUSION AND FUTURE WORK
GPS-equipped taxis can be viewed as pervasive sensors, and the large-scale GPS traces
produced allow us to reveal many valuable facts about the social and community dynamics. In this article, we performed an exhaustive survey of existing work and grouped
them into three categories: social dynamics, traffic dynamics, and operational dynamics. We first provided a formal specification of the datasets and presented an overview
of different mechanisms for preprocessing the data to decompose the city.
Grid decompositions are a popular decomposition used by many research groups, but
there can be many improvements on this simple decomposition that can better adapt to
the road network structure and dynamics. As a simple example, a more adaptive grid
decomposition can be achieved if one uses the “shape” of the GPS entries to guide the
size and placement of the various grids, providing more accuracy to areas with higher
GPS entry frequency. This effect can be achieved by using data structures such as Quad
Trees [Finkel and Bentley 1974], Binary Space Partitioning (BSP) Trees [Fuchs et al.
1980], or R-Trees [Guttman 1984] (which allows overlapping regions).
Social dynamics refers to the study of the collective behaviour of a city’s population,
based on their observed movement in the city. The research along this line includes
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
Table II. Timeline of Papers Discussed
Year
2003
2004
2007
2008
2009
2010
Category
Deployed systems
Traffic dynamics
Traffic dynamics
Map construction
Passenger/taxi-finding
strategies
Passenger/taxi-finding
strategies
Traffic dynamics
Route prediction
Map construction
Urban computing
Traffic dynamics
Passenger/taxi-finding
strategies
Anomaly detection
Route planning
Route prediction
Map construction
Urban computing
Finding hotspots
Traffic dynamics
Passenger/taxi-finding
strategies
2011
Route planning
Anomaly detection
Urban computing
Traffic dynamics
Passenger/taxi-finding
strategies
Route planning
Anomaly detection
Reference
[Liao 2003]
[Gühnemann et al. 2004]
[Zhang et al. 2007]
[Li et al. 2007]
[Tao 2007]
[Chang et al. 2008]
[Lee et al. 2008]
[Wen et al. 2008]
[Ziebart et al. 2008]
[Lou et al. 2009]
[Hu et al. 2009]
[Jiang et al. 2009]
[Liu et al. 2009b]
[Wang et al. 2009]
[Lou et al. 2009]
[Yue et al. 2009]
[Li et al. 2009b]
[Li et al. 2009a]
[Lou et al. 2009]
[Yuan et al. 2010]
[Chen et al. 2010a]
[Chen et al. 2010b]
[Liu et al. 2010b]
[Herring et al. 2010]
[Liu et al. 2010a]
[Phithakkitnukoon et al. 2010]
[Yamamoto et al. 2010]
[Ge et al. 2010]
[Chen 2010]
[Yuan and Zheng 2010]
[Liao et al. 2010]
[Qi et al. 2011]
[Veloso et al. 2011c]
[Veloso et al. 2011b]
[Zheng et al. 2011]
[Balan et al. 2011]
[Liu et al. 2011]
[Pang et al. 2011]
[Li et al. 2011b]
[Li et al. 2011a]
[Powell et al. 2011]
[Takayama et al. 2011]
[Veloso et al. 2011a]
[Yuan et al. 2011a]
[Li et al. 2011c]
[Bastani et al. 2011]
[Zhang et al. 2011b]
[Ge et al. 2011]
[Chen et al. 2011]
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:23
17:24
P. S. Castro et al.
Table III. Timeline of Papers Discussed
Year
2012
Category
Map construction
Urban computing
Traffic dynamics
Passenger/taxi-finding
strategies
Anomaly detection
2013
Passenger/taxi-finding
strategies
Route prediction
Route planning
Reference
[Yuan et al. 2012b]
[Biagioni and Eriksson 2012]
[Liu et al. 2012a]
[Yuan et al. 2012a]
[Zhang et al. 2012]
[Liu et al. 2012c]
[Liu et al. 2012b]
[Peng et al. 2012]
[Castro et al. 2012]
[Hu et al. 2012a]
[Hu et al. 2012b]
[Zheng et al. 2012]
[d’Orey 2012]
[Lin et al. 2012]
[Chawla et al. 2012]
[Sun et al. 2012]
[Ma et al. 2013]
[Xue et al. 2013]
[Yuan et al. 2013]
[Chen et al. 2013]
Table IV. Data Preparation Papers
Data preparation
[Alvares et al. 2007]
[Wang et al. 2011]
Map construction
[Rogers et al. 1999]
[Greenfeld 2002]
[Edelkamp and Schrödl 2003]
[Yin and Wolfson 2004]
[Schroedl et al. 2004]
[Brakatsoulas et al. 2005]
[Worrall and Nebot 2007]
[Chawathe 2007]
[Li et al. 2007]
[Cao and Krumm 2009]
[Lou et al. 2009]
[Fathi and Krumm 2010]
[Chen and Krumm 2010]
[Yuan et al. 2010]
[Wang et al. 2011]
[Yuan et al. 2012b]
[Biagioni and Eriksson 2012]
[Liu et al. 2012a]
Map decomposition [Gonzalez et al. 2007]
[Li et al. 2009a]
[Yuan and Zheng 2010]
[Li et al. 2011c]
[Zheng et al. 2011]
Data processing
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
Table V. Traffic Dynamics Papers
Traffic dynamics
[Schäfer et al. 2002]
[Gühnemann et al. 2004]
[Šingliar and Hauskrecht 2007]
[Su and Yu 2007]
[Zhang et al. 2007]
[Wen et al. 2008]
[Blandin et al. 2009]
[Lou et al. 2009]
[Herring et al. 2010]
[Lippi et al. 2010]
[Yuan and Zheng 2010]
[Yuan et al. 2011a]
[Balan et al. 2011]
[Giannotti et al. 2011]
[Liu et al. 2011]
[Pang et al. 2011]
[Peng et al. 2012]
[Castro et al. 2012]
Table VI. Social Dynamics Papers
Social dynamics
Extracting hotspots [Ashbrook and Starner 2003]
[Alvares et al. 2007]
[Girardin et al. 2008]
[Palma et al. 2008]
[Calabrese et al. 2010b]
[Zheng et al. 2009]
[Wang et al. 2009]
[Liu et al. 2010b]
[Yuan and Zheng 2010]
[Chang et al. 2010]
[Li et al. 2011a]
Urban computing
[Girardin et al. 2008]
[Jiang et al. 2009]
[Froehlich et al. 2009]
[Liu et al. 2009b]
[Hu et al. 2009]
[Wang et al. 2009]
[Chen et al. 2010b]
[Chen et al. 2010a]
[Zheng et al. 2011]
[Qi et al. 2011]
[Veloso et al. 2011b]
[Veloso et al. 2011c]
[Giannotti et al. 2011]
[Yuan et al. 2012a]
[Zhang et al. 2012]
[Liu et al. 2012c]
[Liu et al. 2012b]
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:25
17:26
P. S. Castro et al.
Table VII. Operational Dynamics Papers
Operational dynamics
Passenger/taxi-finding [Tao 2007]
strategies
[Chang et al. 2008]
[Lee et al. 2008]
[Yue et al. 2009]
[Liu et al. 2010a]
[Yamamoto et al. 2010]
[Phithakkitnukoon et al. 2010]
[Ge et al. 2010]
[Chen 2010]
[Li et al. 2011b]
[Takayama et al. 2011]
[Yuan et al. 2011b]
[Yuan et al. 2011c]
[Powell et al. 2011]
[Li et al. 2011a]
[Veloso et al. 2011b]
[Veloso et al. 2011a]
[Hu et al. 2012a]
[Hu et al. 2012b]
[Zheng et al. 2012]
[d’Orey 2012]
[Lin et al. 2012]
[Ma et al. 2013]
Route planning
[Edelkamp and Schrödl 2003]
[Kanoulas et al. 2006]
[Gonzalez et al. 2007]
[Ziebart et al. 2008]
[Li et al. 2009a]
[Lim et al. 2010]
[Yuan and Zheng 2010]
[Li et al. 2011c]
[Yuan et al. 2011a]
[Bastani et al. 2011]
[Yuan et al. 2013]
[Chen et al. 2013]
Anomaly detection
[Li et al. 2009b]
[Liao et al. 2010]
[Zhang et al. 2011b]
[Ge et al. 2011]
[Chen et al. 2011]
[Chawla et al. 2012]
[Sun et al. 2012]
Route prediction
[Patterson et al. 2003]
[Krumm and Horvitz 2006]
[Liao et al. 2007]
[Giannotti et al. 2007]
[Froehlich and Krumm 2008]
[Ziebart et al. 2008]
[Monreale et al. 2009]
[Lou et al. 2009]
[Xue et al. 2013]
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:27
Table VIII. Deployed Systems Papers and Web Sites
Deployed systems
[Schäfer et al. 2002]
[Liao 2003]
[Balan et al. 2011]
http://www.cabsense.com
http://stamen.com/clients/cabspotting
http://www.waze.com
hotspot extraction and urban computing. The approach chosen for extracting hotspots
depends on their ultimate use (i.e., urban computing, route planning) and can be obtained by simply counting the pickup and dropoff events in an area or using more
sophisticated clustering techniques. There are few papers using taxi-GPS data for urban computing. This is a very promising area of research that has many open avenues
for exploration. As opposed to other problems where sophisticated algorithms are usually necessary, fairly simple algorithms are usually sufficient for urban computing. The
challenge comes in properly defining questions and problems that one can answer by
mining these trajectories. The methodologies used to answer these questions might be
specific to this type of data, but the questions raised can be valuable on their own and
may motivate research using other types of data.
Traffic dynamics studies the resulting flow of the vehicle movement through the
road network. Monitoring and predicting traffic conditions can provide indications of
the activity levels in a city. Accurate estimates of the congestion level in each road
segment and the travel time between two points in a city are still open but valuable
research topics.
Operational dynamics refers to the general study and analysis of taxi drivers’ modus
operandi. This new line of research focuses mainly on providing optimal route planning
and passenger-finding strategies to taxi drivers by analyzing the tactics of good drivers.
The performance of many of the algorithms presented in Section 5 directly depends
on the quality of the ranking function. Of the rankings presented, the advantages and
disadvantages of each are not clear. The algorithms we presented for anomaly detection
are fairly simple, yet they yield fairly good results. We believe more sophisticated
algorithms that take more factors into consideration (e.g., speed, weather conditions)
can yield even superior performance.
Additionally, as mentioned previously, the ability to detect multiple drivers per taxi
is an open problem that has not yet been addressed. We presented a simple approach
for this problem in Section 2.4.3, but there are certainly more sophisticated methods
to be discovered. The major difficulty with this problem is determining how to verify
the accuracy of any algorithm proposed.
Taxi ridesharing [Tao 2007; Chen 2010; d’Orey 2012; Lin et al. 2012; Ma et al. 2013] is
a promising new research direction that can have significant social and environmental
impact. Given the rapid growth of cities and use of motorized vehicles, taxi ridesharing
provides a promising mechanism to mitigate increasing road congestion.
Given that smartphones have become an essential element in the lives of most people
in developed countries, it is clear that the usefulness of the research surveyed here will
need to be integrated with the research in mobile data. Social networking services such
as Facebook and Twitter have also become an essential part of people’s lives that both
reveals and influences people’s behaviour. As such, an important avenue for future
work will be integrating the research surveyed here with the research in social media.
In particular, the work in social dynamics presented here has a strong relationship to
some research on social networks such as Twitter [Demirbas et al. 2010; Culotta 2010;
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:28
P. S. Castro et al.
Bollen et al. 2011], Facebook [Backstrom et al. 2010], Flickr [Crandall et al. 2010], and
Brightkites [Li and Chen 2009].
The common underlying goal shared amongst the different works discussed is to
obtain deep insights and useful applications to meet the real-world needs of end
users, such as urban planners, taxi operators, taxi drivers, passengers, and regular
city dwellers. Taxi GPS traces, despite being a very specialized type of digital trace,
have already provided us with a rich dataset to uncover many hidden facts about the
city, including social dynamics, traffic dynamics, and operational dynamics. Coupling
this data source with other complementary data sources such as mobile phone logs,
private and public transportation usage records, road network sensors, social media,
and Internet applications will allow us to uncover many more interesting facets about
the the city and its inhabitants. The ability to obtain these results in real time along
with appropriate visualization interfaces can offer city planners and dwellers many
insights and services that bring them closer to the vision of a smart city.
REFERENCES
AHARONY, N., PAN, W., IP, C., KHAYAL, I., AND PENTLAND, A. 2011. Social fMRI: Investigating and shaping social
mechanisms in the real world. Perv. Mobile Comput. 7, 6, 643–659.
ALT, H., EFRAT, A., ROTE, G., AND WENK, C. 2003. Matching planar maps. J. Algorithms 49, 262–283.
ALTSHULER, Y., AHARONY, N., PENTLAND (“SANDY”), A., FIRE, M., AND ELOVICI, Y. 2012. Incremental learning with
accuracy predictions of social and individual properties from mobile-phone data. In Proceedings of the
1st International Workshop on Wide Spectrum Social Signal Processing.
ALVARES, L. O., BOGORNY, V., KUIJPERS, B., DE MACEDO, J. A. F., MOELANS, B., AND VAISMAN, A. 2007. A model for
enriching trajectories with semantic geographical information. In Proceedings of the 15th Annual ACM
International Symposium on Advances in Geographical Information Systems (GIS’07). 22:1–22:8.
ASHBROOK, D. AND STARNER, T. 2003. Using GPS to learn significant locations and predict movement across
multiple users. Pers. Ubiq. Comput. 7, 5, 275–286.
BACKSTROM, L., SUN, E., AND MARLOW, C. 2010. Find me if you can: Improving geographical prediction with
social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web
(WWW’10). 61–70.
BALAN, R. K., KHOA, N. X., AND JIANG, L. 2011. Real-time trip information service for a large taxi fleet. In Proceedings of the 9th International Conference on Mobile Systems, Applications, and Services (MobiSys’11).
99–112.
BASTANI, F., HUAN, Y., XIE, X., AND POWELL, J. 2011. A greener transportation mode: Flexible route discovery
from GPS trajectory data. In Proceedings of the 19th ACM SIGSPATIAL International Conference on
Advances in Geographic Information Systems (GIS’11). 405–408.
BEKHOR, S., COHEN, Y., AND SOLOMON, C. 2011. Evaluating long-distance travel patterns in Israel by tracking
cellular phone positions. J. Adv. Transport. 47, 4, 435–446.
BI, J., BENNETT, K., EMBRECHTS, M., BRENEMAN, C., AND SONG, M. 2003. Dimensionality reduction via sparse
support vector machines. J. Mach. Learn. Res. 3, 1229–1243.
BIAGIONI, J. AND ERIKSSON, J. 2012. Inferring road maps from GPS traces: Survey and comparative evaluation.
Transport. Res. Rec. 2291, 61–71.
BLANDIN, S., GHAOUI, L. E., AND BAYEN, A. 2009. Kernel regression for travel time estimation via convex
optimization. In Proceedings of the 48th IEEE Conference on Decision and Control. 4360–4365.
BOLLEN, J., MAO, H., AND PEPE, A. 2011. Modeling public mood and emotion: Twitter sentiment and socioeconomic phenomena. In Proceedings of the 5th International AAAI Conference on Weblogs and Social
Media. 450–453.
BOX, G. E. P., JENKINS, G. M., AND REINSEL, G. C. 2008. Time Series Analysis. Wiley.
BRAKATSOULAS, S., PFOSER, D., SALAS, R., AND WENK, C. 2005. On map-matching vehicle tracking data. In
Proceedings of the 31st International Conference on Very Large Data Bases (VLDB’05). 853–864.
BRAND, M. 1997. Coupled hidden Markov models for modeling interacting processes. Techrep. The Media
Lab, Massachusetts Institute of Technology.
CALABRESE, F., PEREIRA, F. C., LORENZO, G. D., LIU, L., AND RATTI, C. 2010a. The geography of taste: Analyzing
cell-phone mobility and social events. In Proceedings of the 8th International Conference on Pervasive
Computing (Pervasive’10). 22–37.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:29
CALABRESE, F., READES, J., AND RATTI, C. 2010b. Eigenplaces: Segmenting space through digital signatures.
Pervasive Comput. 9, 1, 78–84.
CANDIA, J., GONZÁLEZ, M. C., WANG, P., SCHOENHARL, T., MADEY, G., AND BARABÁSI, A.-L. 2008. Uncovering
individual and collective human dynamics from mobile phone records. J. Phys. A-Math. Theor. 41, 22.
CAO, L. AND KRUMM, J. 2009. From GPS traces to a routable road map. In Proceedings of the 17th ACM
SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS’09). 3–12.
CASTRO, P. S., ZHANG, D., AND LI, S. 2012. Urban traffic modelling and prediction using large scale taxi GPS
traces. In Proceedings of the 10th International Conference on Pervasive Computing (Pervasive’12). 57–72.
CHANG, H.-W., TAI Y.-C., AND HSU, J. Y. J. 2010. Context-aware taxi demand hotspots prediction. Int. J. Bus.
Intell. Data Min. 5, 1, 3–18.
CHANG, H.-W., TAI Y.-C., CHEN, H. W., AND HSU, J. Y.-J. 2008. iTaxi: Context-aware taxi demand hotspots
prediction using ontology and data mining approaches. In Proceedings of the 13th Conference on Artificial
Intelligence and Applications (TAAI’08).
CHAWATHE, S. S. 2007. Segment-based Map Matching. In Proceedings of the IEEE Intelligent Vehicles Symposium. 1190–1197.
CHAWLA, S., ZHENG, Y., AND HU, J. 2012. Inferring the root cause in road traffic anomalies. In Proceedings of
the IEEE 12th International Conference on Data Mining (ICDM’12). 141–150.
CHEN, C., ZHANG, D., CASTRO, P. S., LI, N., SUN, L., AND LI, S. 2011. Real-time detection of anomalous taxi trajectories from GPS traces. In Proceedings of the International ICST Conference on Mobile and Ubiquitous
Systems. 63–74.
CHEN, C., ZHANG, D., ZHOU, Z.-H., LI, N., ATMACA, T., AND LI, S. 2013. B-Planner: Night bus route planning
using large-scale taxi GPS traces. In Proceedings of the IEEE International Conference on Pervasive
Computing and Communications (PerCom’13).
CHEN, G., CHEN, B., AND YU, Y. 2010a. Mining frequent trajectory patterns from GPS tracks. In Proceedings of
the International Conference on Computational Intelligence and Software Engineering (CiSE’10). 1–6.
CHEN, G., JIN, X., AND YANG, J. 2010b. Study on spatial and temporal mobility pattern of urban taxi services. In Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering
(ISKE’10). 422–425.
CHEN, P.-Y. 2010. A fuel-saving and pollution-reducing dynamic taxi-sharing protocol in VANETs. In Proceedings of the IEEE 72nd Vehicular Technology Conference Fall (VTC 2010-Fall). 1–5.
CHEN, Y. AND KRUMM, J. 2010. Probabilistic modeling of traffic lanes from GPS traces. In Proceedings of
the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.
81–88.
COOKE, K. L. AND HALSEY, E. 1966. The shortest route through a network with time-dependent internodal
transit times. J. Math. Anal. Appl. 14, 493–498.
CRANDALL, D. J., BACKSTROM, L., COSLEY, D., SURI, S., HUTTENLOCHER, D., AND KLEINBERG, J. 2010. Inferring social
ties from geographic coincidences. PNAS 107, 52, 22436–22441.
CULOTTA, A. 2010. Towards detecting influenza epidemics by analyzing Twitter messages. In Proceedings of
the Workshop on Social Media Analytics. 115–122.
DEMIRBAS, M., BAYIR, M. A., AKCORA, C. G., YILMAZ, Y. S., AND FERHATOSMANOGLU, H. 2010. Crowd-sourced
sensing and collaboration using Twitter. In Proceedings of the IEEE International Symposium on a
World of Wireless, Mobile, and Multimedia Networks. 1–9.
DING, B., YU, J. X., AND QIN, L. 2008. Finding time-dependent shortest paths over large graphs. In Proceedings
of the International Conference on Extending Database Technology: Advances in Database Technology.
205–216.
D’OREY, P. M. 2012. Empirical evaluation of a dynamic and distributed taxi-sharing system. In Proceedings
of the 15th International IEEE Conference on Intelligent Transportation Systems (ITSC’12). 140–146.
DUDA, R. O., HART, P. E., AND STORK, D. G. 2001. Pattern Classification. John Wily & Sons, New York.
EAGLE, N. AND PENTLAND, A. 2006. Reality mining: Sensing complex social systems. Pers. Ubiquit. Comput. 10,
4, 255–268.
EAGLE, N., PENTLAND, A., AND LAZIS, D. 2009. Inferring friendship network structure by using mobile phone
data. PNAS 106, 36, 15274–15278.
EDELKAMP, S. AND SCHRÖDL, S. 2003. Route Planning and Map Inference with Global Positioning Traces.
Springer-Verlag, New York, NY, 128–151.
FARRAHI, K. AND GATICA-PEREZ, D. 2008. What did you do today?: Discovering daily routines from large-scale
mobile data. In Proceedings of the ACM International Conference on Multimedia. 849–852.
FARRAHI, K. AND GATICA-PEREZ, D. 2011. Discovering routines from large-scale human locations using probabilistic topic models. ACM TIST 2, 1, 3:1–3:27.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:30
P. S. Castro et al.
FATHI, A. AND KRUMM, J. 2010. Detecting road intersections from GPS traces. In Proceedings of the International Conference on Geographic Information Science. 56–69.
FINKEL, R. AND BENTLEY, J. L. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4,
1, 1–9.
FROEHLICH, J. AND KRUMM, J. 2008. Route prediction from trip observations. In Proceedings of the Intelligent
Vehicle Initiative (IVI) Technology Advanced Controls and Navigation Systems, SAE World Congress and
Exhibition.
FROEHLICH, J., NEUMANN, J., AND OLIVER, N. 2009. Sensing and predicting the pulse of the city through shared
bicycling. In Proceedings of the International Joint Conference on Artificial Intelligence. 1420–1426.
FUCHS, H., KEDEM, Z. M., AND NAYLOR, B. F. 1980. On visible surface generation by a priori tree structures. In
Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques. 124–133.
GE, Y., XIONG, H., LIU, C., AND ZHOU, Z.-H. 2011. A taxi driving fraud detection system. In Proceedings of the
IEEE International Conference on Data Mining. 181–190.
GE, Y., XIONG, H., TUZHILIN, A., XIAO, K., GRUTESER, M., AND PAZZANI, M. J. 2010. An energy-efficient mobile
recommender system. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. 899–908.
GIANNOTTI, F., NANNI, M., PEDRESCHI, D., AND PINELLI, F. 2007. Trajectory pattern mining. In Proceedings of the
ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 330–339.
GIANNOTTI, F., NANNI, M., PEDRESCHI, D., PINELLI, F., RENSO, C., RINZIVILLO, S., AND TRASARTI, R. 2011. Unveiling
the complexity of human mobility by querying and mining massive trajectory data. VLDB J. 20, 5,
695–719.
GIRARDIN, F., CALABRESE, F., FIORRE, F. D., BIDERMAN, A., RATTI, C., AND BLAT, J. 2008. Uncovering the presence
and movements of tourists from user-generated content. In Proceedings of the International Forum on
Tourism Statistics.
GONZALEZ, H., HAN, J., LI, X., MYSLINSKA, M., AND SONDAG, J. P. 2007. Adaptive fastest path computation on a
road network: A traffic mining approach. In Proceedings of the International Conference on Very Large
Data Bases. 794–805.
GONZÁLEZ, M. C., HIDALGO, C. A., AND BARABÁSI, A.-L. 2008. Understanding individual human mobility patterns.
Nature 453, 779–782.
GREENFELD, J. 2002. Matching GPS observations to locations on a digital map. In Proceedings of the 81st
Annual Meeting of the Transportation Research Board.
GÜHNEMANN, A., SCHÄFER, R.-P., THIESSENHUSEN, K.-U., AND WAGNER, P. 2004. Monitoring traffic and emissions
by floating car data. Working paper ITS-WP-04-07. Institute of Transportation Studies, University of
Sydney, Australia.
GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM
SIGMOD International Conference on Management of Data. 47–57.
HAKLAY (MUKI), M. AND WEBER, P. 2008. OpenStreetMap: User-generated street maps. IEEE Pervasive Comput.
7, 4, 12–18.
HERRING, R., HOFLEITNER, A., ABBEEL, P., AND BAYEN, A. 2010. Estimating arterial traffic conditions using sparse
probe data. In Proceedings of the International IEEE Conference on Intelligent Transportation Systems.
929–936.
HU, H., WU, Z., MAO, B., ZHUANG, Y., CAO, J., AND PAN, J. 2012b. Pick-up tree based route recommendation. In
Proceedings of the International Conference on Web-Age Information Management. 471–483.
HU, J., CAO, W., LUO, J., AND YU, X. 2009. Dynamic modeling of urban population travel behavior based on
data fusion of mobile phone positioning Ddata and FCD. In Proceedings of the International Conference
on Geoinformatics. 1–5.
HU, X., GAO, S., CHIU, Y.-C., AND LIN, D.-Y. 2012a. Modeling routing behavior for vacant taxi cabs in urban
traffic networks. Transport. Res. Rec. 2284, 81–88.
HUANG, H., ZHU, Y., LI, X., LI, M., AND WU, M.-Y. 2010. META: A mobility model of MEtropolitan TAxis
extracted from GPS traces. In Proceedings of the Wireless Communications and Networking Conference.
1–6.
JIANG, B., YIN, J., AND ZHAO, S. 2009. Characterizing the human mobility pattern in a large street network.
Phys. Rev. E 80, 021136.
JOLLIFFE, I. T. 1986. Principal Component Analysis. Springer-Verlag.
KANOULAS, E., DU, Y., XIA, T., AND ZHANG, D. 2006. Finding fastest paths on a road network with speed patterns.
In Proceedings of the International Conference on Data Engineering. 10.
KRUMM, J. AND HORVITZ, E. 2006. Predestination: Inferring destinations from partial trajectories. In Proceedings of the International Conference on Ubiquitous Computing. 243–260.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:31
LEE, J., SHIN, I., AND PARK, G.-L. 2008. Analysis of the passenger pick-up pattern for taxi location recommendation. In Proceedings of the International Conference on Networked Computing and Advanced Information
Management. 199–204.
LI, B., ZHANG, D., SUN, L., CHEN, C., LI, S., QI, G., AND YANG, Q. 2011b. Hunting or waiting? Discovering
passenger-finding strategies from a large-scale real-world taxi dataset. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops).
63–68.
LI, N. AND CHEN, G. 2009. Multi-layered friendship modeling for location-based mobile social networks. In
Proceedings of the International Conference on Mobile and Ubiquitous Systems: Computing Networking
and Services. 1–10.
LI, Q., ZHENG, Z., YANG, B., AND ZHANG, T. 2009a. Hierarchical route planning based on taxi GPS-trajectories.
In Proceedings of the International Conference on Geoinformatics. 1–5.
LI, Q., ZHENG, Z., ZHANG, T., LI, J., AND WU, Z. 2011c. Path-finding through flexible hierarchical road networks:
An experiential approach using taxi trajectory data. Int. J. Appl. Earth Obs. 13, 1, 110–119.
LI, X., LI, M., SHU, W., AND WU, M. 2007. A practical map-matching algorithm for GPS-based vehicular
networks in Shanghai urban area. In Proceedings of the IET Conference on Wireless, Mobile, and Sensor
Networks. 454–457.
LI, X., LI, Z., HAN, J., AND LEE, J.-G. 2009b. Temporal outlier detection in vehicle traffic data. In Proceedings
of the International Conference on Data Engineering. 1319–1322.
LI, X., PAN, G., QI, G., AND LI, S. 2011a. Predicting urban human mobility using large-scale taxi traces. In
Proceedings of the First Workshop on Pervasive Urban Applications.
LIAO, L., FOX, D., AND KAUTZ, H. 2007. Learning and inferring transportation routines. Artif. Intell. 171, 5–6,
311–331.
LIAO, Z. 2003. Real-time taxi dispatching using global positioning systems. Commun. ACM 46, 5, 81–83.
LIAO, Z., YU, Y., AND CHEN, B. 2010. Anomaly detection in GPS data based on visual analytics. In Proceedings
of the IEEE Symposium on Visual Analytics Science and Technology. 51–58.
LIM, S., BALAKRISHNAN, H., GIFFORD, D., MADDEN, S., AND RUS, D. 2010. Stochastic motion planning and application to traffic. Int. J. Robot. Res. 30, 699–712.
LIN, Y., LI, W., QIU, F., AND XU, H. 2012. Research on optimization of vehicle routing problem for ridesharing taxi. In Proceedings of the 8th International Conference on Traffic and Transportation Studies
(ICTTS’12).
LIPPI, M., BERTINI, M., AND FRASCONI, P. 2010. Collective traffic forecasting. In Proceedings of the European
Conference on Machine Learning and Knowledge Discovery in Databases: Part II. 259–273.
LIU, L., ANDRIS, C., BIDERMAN, A., AND RATTI, C. 2009a. Uncovering taxi driver’s mobility intelligence through
his trace. IEEE Pervasive Comput.
LIU, L., ANDRIS, C., AND RATTI, C. 2010a. Uncovering cabdrivers’ behavior patterns from their digital traces.
Comput. Environ. Urban Syst. 34, 6, 541–548.
LIU, L., BIDERMAN, A., AND RATTI, C. 2009b. Urban mobility landscape: Real time monitoring of urban mobility
patterns. In Proceedings of the International Conference on Computers in Urban Planning and Urban
Management.
LIU, S., LIU, Y., NI, L. M., FAN, J., AND LI, M. 2010b. Towards mobility-based clustering. In Proceedings of the
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 919–928.
LIU, W., ZHENG, Y., CHAWLA, S., YUAN, J., AND XIE, X. 2011. Discovering spatio-temporal causal interactions in
traffic data streams. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. 1010–1018.
LIU, X., ZHU, Y., WANG, Y., FORMAN, G., NI, L. M., FANG, Y., AND LI, M. 2012a. Road recognition using coarsegrained vehicular traces. Tech. rep. HPL-2012-26. HP Laboratories.
LIU, Y., KANG, C., GAO, S., AND XIAO, Y. 2012b. Understanding intra-urban trip patterns from taxi trajectory
data. J. Geogr. Syst. 14, 4, 463–483.
LIU, Y., WANG, F., XIAO, Y., AND GAO, S. 2012c. Urban land uses and traffic ‘source-sink areas’: Evidence from
GPS-enabled taxi data in Shanghai. Landscape Urban Plan. 106, 1, 73–87.
LOU, Y., ZHANG, C., ZHENG, Y., XIE, X., WANG, W., AND HUANG, Y. 2009. Map-matching for low-sampling-rate
GPS trajectories. In Proceedings of the ACM SIGSPATIAL International Conference on Advances in
Geographic Information Systems. 352–361.
MA, S., ZHENG, Y., AND WOLFSON, O. 2013. T-Share: A large-scale dynamic taxi ridesharing service. In Proceedings of the IEEE Conference on Data Engineering (ICDE’13).
MARTINO, M., BRITTER, R., OUTRAM, C., ZACHARIAS, C., BIDERMAN, A., AND RATTI, C. 2010. Senseable City. In
Digital Urban Modelling and Simulation.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:32
P. S. Castro et al.
MILUZZO, E., LANE, N. D., FODOR, K., PETERSON, R., LU, H., MUSOLESI, M., EISENMAN, S. B., ZHENG, X., AND
CAMPBELL, A. T. 2008. Sensing meets mobile social networks: The design, implementation and evaluation
of the CenceMe application. In Proceedings of the ACM Conference on Embedded Networked Sensor
Systems. 337–350.
MONREALE, A., PINELLI, F., TRASARTI, R., AND GIANNOTTI, F. 2009. WhereNext: A location predictor on trajectory
pattern mining. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. 637–646.
NG, A. Y. AND RUSSELL, S. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning. 663–670.
PALMA, A. T., BOGORNY, V., KUIJPERS, B., AND ALVARES, L. O. 2008. A clustering-based approach for discovering
interesting places in trajectories. In Proceedings of the ACM Symposium on Applied Computing. 863–
868.
PANG, L. X., CHAWLA, S., LIU, W., AND ZHENG, Y. 2011. On mining anomalous patterns in road traffic streams. In
Proceedings of the International Conference on Advanced Data Mining and Applications: Volume Part II.
237–251.
PATTERSON, D. J., LIAO, L., FOX, D., AND KAUTZ, H. 2003. Inferring high-level behavior from low-level sensors.
In Proceedings of the International Conference on Ubiquitous Computing. 73–89.
PENG, C., JIN, X., WONG, K.-C., SHI, M., AND LIÒ, P. 2012. Collective human mobility patter from taxi trips in
urban area. PLoS ONE 7, 4, e34487.
PHITHAKKITNUKOON, S., VELOSO, M., BIDERMAN, A., BENTO, C., AND RATTI, C. 2010. Taxi-aware map: Identifying
and predicting vacant taxis in the city. In Proceedings of the International Conference on Ambient
Intelligence. 86–95.
POWELL, J. W., HUANG, Y., BASTANI, F., AND JI, M. 2011. Towards reducing taxicab cruising time using spatiotemporal profitability maps. In Proceedings of the International Conference on Advances in Spatial and
Temporal Databases. 242–260.
PUTERMAN, M. L. 1994. Markov Decision Processes. John Wily & Sons, New York, NY.
QI, G., LI, X., LI, S., PAN, G., AND WANG, Z. 2011. Measuring social functions of city regions from largescale taxi behaviors. In Proceedings of the IEEE International Conference on Pervasive Computing and
Communications Workshops (PERCOM Workshops). 384–388.
QIAN, S., ZHU, Y., AND LI, M. 2012. Smart recommendation by mining large-scale GPS traces. In Proceedings
of the Wireless Communications and Networking Conference. 3267–3272.
RATTI, C., PULSELLI, R. M., WILLIAMS, S., AND FRENCHMAN, D. 2009. Mobile landscapes: Using location data from
cell phones for urban analysis. Environ. Plann. B 33, 5, 727–748.
ROGERS, S., LANGLEY, P., AND WILSON, C. 1999. Mining GPS data to augment road models. In Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 104–113.
ROSS, S. 2006. Simulation. Academic Press.
SAKOE, H. AND CHIBA, S. 1990. Dynamic programming algorithm optimization for spoken word recognition. In
Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds., Morgan Kaufmann Publishers, 159–165.
SANDERS, P. AND SCHULTES, D. 2005. Highway hierarchies hasten exact shortest path queries. In Proceedings
of the European Conference on Algorithms. 568–579.
SCHÄFER, R.-P., THIESSENHUSEN, K.-U., AND WAGNER, P. 2002. A traffic information system by means of real-time
floating-car data. In Proceedings of the World Congress on Intelligent Transport Systems.
SCHOLKOPF, B. AND SMOLA, A. J. 2002. Learning with Kernels. MIT Press.
SCHROEDL, S., WAGSTAFF, K., ROGERS, S., LANGLEY, P., AND WILSON, C. 2004. Mining GPS traces for map refinement. Data Min. Knowl. Disc. 9, 59–87.
SONG, C., KOREN, T., WANG, P., AND BARABÁSI, A.-L. 2010a. Modelling the scaling properties of human mobility.
Nature Phys. 6, 818–823.
SONG, C., QU, Z., BLUMM, N., AND BARABÁSI, A.-L. 2010b. Limits of predictability in human mobility. Science
327, 5968, 1018–1021.
SU, H. AND YU, S. 2007. Hybrid GA based online support vector machine model for short-term traffic flow
forecasting. In Proceedings of the International Conference on Advanced Parallel Processing Technologies.
743–752.
SUN, L., ZHANG, D., CHEN, C., CASTRO, P. S., LI, S., AND WANG, Z. 2012. Real time anomalous trajectory detection
and analysis. In Mobile Netw. Appl. 18, 3, 341–356.
TAKAYAMA, T., MATSUMOTO, K., KUMAGAI, A., SATO, N., AND MURATA, Y. 2011. Waiting/cruising location recommendation for efficient taxi business. Int. J. Syst. Appl. Eng. Dev. 5, 2, 224–236.
TAO, C.-C. 2007. Dynamic taxi-sharing service using intelligent transportation system technologies. In Proceedings of the International Conference on Wireless Communications, Networking, and Mobile Computing (WiCom’07).
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
From Taxi GPS Traces to Social and Community Dynamics
17:33
VELOSO, M., PHITHAKKITNUKOON, S., AND BENTO, C. 2011a. Sensing urban mobility with taxi flow. In Proceedings
of the ACM SIGSPATIAL International Workshop on Location-Based Social Networks. 41–44.
VELOSO, M., PHITHAKKITNUKOON, S., AND BENTO, C. 2011b. Urban mobility study using taxi traces. In Proceedings
of the International Workshop on Trajectory Data Mining and Analysis. 23–30.
VELOSO, M., PHITHAKKITNUKOON, S., BENTO, C., FONSECA, N., AND OLIVIER, P. 2011c. Exploratory study of urban
flow using taxi traces. In Proceedings of the 1st Workshop on Pervasive Urban Applications.
ŠINGLIAR, T. AND HAUSKRECHT, M. 2007. Modeling highway traffic volumes. In Proceedings of the European
Conference on Machine Learning. 732–739.
ŠINGLIAR, T. AND HAUSKRECHT, M. 2008. Approximation strategies for routing in dynamic stochastic networks.
In Proceedings of the International Symposium on Artificial Intelligence and Mathematics.
WANG, H., ZOU, H., YUE, Y., AND LI, Q. 2009. Visualizing hot spot analysis result based on Mashup. In
Proceedings of the International Workshop on Location Based Social Networks. 45–48.
WANG, Y., ZHU, Y., HE, Z., YUE, Y., AND LI, Q. 2011. Challenges and opportunities in exploiting large-scale GPS
probe data. Tech. rep. HPL-2011-109, HP Laboratories.
WEN, H., HU, Z., GUO, J., ZHU, L., AND SUN, J. 2008. Operational analysis on Beijing Road network during the
Olympic Games. J. Trans. Sys. Eng. Info. Tech. 8, 6, 32–37.
WORRALL, S. AND NEBOT, E. 2007. Automated process for generating digitised maps through GPS data compression. In Proceedings of the Australasian Conference on Robotics and Automation.
XU, R. AND WUNSCH, D. 2005. Survey of clustering algorithms. IEEE Trans. Neural Netw. 16, 645–678.
XUE, A. Y., ZHANG, R., ZHENG, Y., XIE, X., HUANG, J., AND XU, Z. 2013. Destination prediction by sub-trajectory
synthesis and privacy protection against such prediction. In Proceedings of the IEEE International
Conference on Data Engineering (ICDE’13).
YAMAMOTO, K., UESUGI, K., AND WATANABE, T. 2010. Adaptive routing of cruising taxis by mutual exchange of
pathways. In Proceedings of the International Conference on Knowledge-Based Intelligent Information
and Engineering Systems: Part II. 559–566.
YIN, H. AND WOLFSON, O. 2004. A weight-based map matching method in moving objects databases. In Proceedings of the International Conference on Scientific and Statistical Database Management. 437–438.
YUAN, J. AND ZHENG, Y. 2010. T-Drive: driving directions based on taxi trajectories. In Proceedings of the
SIGSPATIAL International Conference on Advances in Geographic Information Systems. 99–108.
YUAN, J., ZHENG, Y., XIE, X., AND SUN, G. 2011a. Driving with knowledge from the physical world. In Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 316–324.
YUAN, J., ZHENG, Y., XIE, X., AND SUN, G. 2013. T-Drive: Enhancing driving directions with taxi drivers’
intelligence. IEEE Trans. Knowl. Data Eng. 25, 1, 220–232.
YUAN, J., ZHENG, Y., ZHANG, C., XIE, X., AND SUN, G. 2010. An interactive voting-based map matching algorithm.
In Proceedings of the International Conference on Mobile Data Management. 43–52.
YUAN, J., ZHENG, Y., ZHANG, L., XIE, X., AND SUN, G. 2011b. Where to find my next passenger? In Proceedings
of the International Conference on Ubiquitous Computing. 109–118.
YUAN, J., ZHENG, Y., AND XIE, X. 2012a. Discovering regions of different functions in a city using human mobility
and POIs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. 186–194.
YUAN, N. J., ZHENG, Y., AND XIE, X. 2012b. Segmentation of urban areas using road networks. Tech. rep.
MSR-TR-2012-65.
YUAN, N. J., ZHENG, Y., ZHANG, L., AND XIE, X. 2011c. T-Finder: A recommender system for finding passengers
and vacant taxis. IEEE Trans. Knowl. Data Eng.
YUE, Y., ZHUANG, Y., LI, Q., AND MAO, Q. 2009. Mining time-dependent attractive areas and movement patterns
from taxi trajectory data. In Proceedings of the International Conference on Geoinformatics. 1–6.
ZHANG, D., GUO, B., AND YU, Z. 2011a. The emergence of social and community intelligence. Computer 44, 7,
21–28.
ZHANG, D., LI, N., ZHOU, Z.-H., CHEN, C., SUN, L., AND LI, S. 2011b. iBAT: Detecting anomalous taxi trajectories
from GPS traces. In Proceedings of the International Conference on Ubiquitous Computing. 99–108.
ZHANG, W., LI, S., AND PAN, G. 2012. Mining the semantics of origin-destination flows using taxi traces. In
Proceedings of the Workshop of Ubiquitous Computing. 943–949.
ZHANG, W., XU, J., AND WANG, H. 2007. Urban traffic situation calculation methods based on probe vehicle
data. J. Transport. Syst. Eng. Inform. Technol. 7, 1, 43–49.
ZHENG, X., LIANG, X., AND XU, K. 2012. Where to wait for a taxi? In Proceedings of the ACM SIGKDD International Workshop on Urban Computing. 149–156.
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.
17:34
P. S. Castro et al.
ZHENG, Y., LIU, L., WANG, L., AND XIE, X. 2008. Learning transportation mode from raw GPS data for geographic
applications on the Web. In Proceedings of the International Conference on World Wide Web. 247–256.
ZHENG, Y., LIU, Y., YUAN, J., AND XIE, X. 2011. Urban computing with taxicabs. In Proceedings of the International Conference on Ubiquitous Computing. 89–98.
ZHENG, Y., XIE, X., AND MA, W.-Y. 2010. GeoLife: A collaborative social networking service among user, location
and trajectory. IEEE Data Eng. Bull. 33, 2, 32–39.
ZHENG, Y., ZHANG, L., XIE, X., AND MA, W.-Y. 2009. Mining interesting locations and travel sequences from
GPS trajectories. In Proceedings of the International Conference on World Wide Web. 791–800.
ZIEBART, B. D., MAAS, A. L., DEY, A. K., AND BAGNELL, J. A. 2008. Navigate like a cabbie: Probabilistic reasoning
from observed context-aware behavior. In Proceedings of the International Conference on Ubiquitous
Computing. 322–331.
Received April 2012; revised November 2012; accepted February 2013
ACM Computing Surveys, Vol. 46, No. 2, Article 17, Publication date: November 2013.