Low-Complexity Detection of POI Boundaries Using
Geo-Tagged Tweets: A Geographic Proximity Based
Approach
Dung D. Vu
Dankook University
Yongin 448-701, Republic of Korea
[email protected]

Won-Yong Shin
Dankook University
Yongin 448-701, Republic of Korea
[email protected]
ABSTRACT
Users tend to check in and post their statuses in location-based social networks (LBSNs) to indicate that their interests are related to a point-of-interest (POI). Since the relevance of the data to the POI varies with the geographic distance between the POI and the locations where the data are generated, it is important to characterize an area-of-interest (AOI) so that the location information can be utilized in a variety of businesses, services, and place advertisements. While previous studies on discovering AOIs were based mostly on density-based clustering methods applied to collections of geo-tagged photos from LBSNs, we focus on detecting a POI boundary, which corresponds to only one cluster containing its POI center. Using geo-tagged tweets recorded from Twitter users, this paper introduces a low-complexity two-phase strategy to detect a POI boundary by finding a suitable radius reachable from the POI center. We detect a polygon-type boundary of the POI as the convex hull (i.e., the outermost region) of the geo-tags selected through our two-phase approach, where each phase proceeds with a different radius increment, thus yielding a more precise boundary. It is shown that our approach outperforms the conventional density-based clustering method in terms of runtime complexity.

Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms
Algorithms, Human Factors, Measurement

Keywords
Area-of-Interest (AOI), Geographic Distance, Geo-Tagged Tweet, Point-of-Interest (POI) Boundary, Two-Phase Approach, Twitter

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
LBSN'15, November 03, 2015, Bellevue, WA, USA
© 2015 ACM. ISBN 978-1-4503-3975-9/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2830657.2830663.

1. INTRODUCTION
Location-based social networks (LBSNs) such as Foursquare and Flickr have grown rapidly in recent years. They provide a platform for millions of users to share location-tagged media content such as photos, videos, music, and text. Owing to the location information from geo-tags, there has been a steady push to study a variety of point-of-interest (POI) issues [1–3] through LBSNs. In general, when users visit a POI, they are likely to check in online and post their statuses to indicate that their interests are related to the POI. Two types of studies in the literature reveal and utilize the characteristics of POIs for various applications: POI recommendation and area-of-interest (AOI) discovery.
Since the relevance of the data to the POI varies with the geographic distance between the POI and the positions where the data are generated, it is of fundamental importance to detect AOIs [4–11]. Previous studies on discovering AOIs were based mostly on density-based clustering methods applied to collections of geo-tagged photos from LBSNs. Density-based spatial clustering of applications with noise (DBSCAN) [5–7, 11, 12] is the most commonly used density-based clustering algorithm, even though it was not originally designed for AOI discovery. DBSCAN can find multiple arbitrarily shaped clusters with an overall average runtime complexity of $O(n_{\mathrm{in}} \log n_{\mathrm{in}})$ [12] (and a worst-case complexity of $O(n_{\mathrm{in}}^2)$), where $n_{\mathrm{in}}$ denotes the number of input records.
In our work, by reflecting the geographic proximity for POIs, we focus on characterizing a “POI boundary”, which is defined as a single cluster with a convex hull shape that contains the corresponding POI center. That is, the POI boundary represents one high-density cluster within the discovered AOI (possibly the cluster with the highest density) on a much smaller scale. We aim at detecting such a boundary, corresponding to the most attractive cluster to which users pay attention for the POI, at the cost of runtime complexity that scales only linearly in $n_{\mathrm{in}}$. By detecting POI boundaries, one can think of a variety of applications including, but not limited to:
• Location advertisements: As a marketing strategy for companies targeting a POI (e.g., a shopping mall), online leaflets/brochures can be disseminated only to the people who visit the place. Company managers thus become aware of the explicit marketing zone and can also reduce the marketing cost.
• Traffic control: When there is a festival at a POI, traffic congestion can be reduced by recommending to festival participants the best route based on the POI boundary.
Instead of geo-tagged photos collected from LBSNs, we utilize geo-tagged tweets on Twitter [13–15], which has substantially more user accounts and records than LBSNs. More specifically, to determine how POI boundaries are geographically formed with a convex hull shape, we analyze a single-source dataset that contains a large number of geo-tagged tweets from users in both the United Kingdom (UK) and the United States (US). These two location sets were selected as demographically comparable, leading adopters of Twitter with sufficient data to enable meaningful comparative analysis for our intentionally exploratory study.
This paper then introduces a new two-phase strategy to detect a POI boundary, whose computational complexity scales linearly with the number of input records (i.e., $O(n_{\mathrm{in}})$) and is thus much lower than that in [5–7, 11]. The proposed detection algorithm is composed of the following steps:
• We first describe the POI with Wikipedia concepts.
• Rather than exploring topical similarities with check-in data in LBSNs, and given that tweets are limited to 140 characters, we collect through query processing all the relevant geo-tagged tweets whose text contains the POI name (e.g., the full name or an abbreviated name).
• Thereafter, we compute the distance between the POI center and the location where each tweet is posted.
• After finding a suitable radius reachable from the POI center, we detect a polygon-type boundary of the POI as the convex hull (i.e., the outermost region) of the geo-tags selected through the two-phase approach. To provide a more precise result, each phase proceeds with a different radius increment.
Owing to the large amount of geo-tagged Twitter data, our analysis provides a much more fine-grained POI boundary with linearly scaling runtime complexity.

2. DATASET
We use a dataset collected via the Twitter Streaming API. The dataset consists of a large number of geo-tagged tweets recorded from Twitter users from July 29, 2015 to August 29, 2015 (about one month) in two countries: the US and the UK. Note that this short-term (one-month) dataset is sufficient to detect boundaries for widely known POIs. The UK dataset consists of 18,682,819 geo-tagged tweets from 629,881 different users, while the US dataset consists of 58,118,361 geo-tagged tweets from 2,139,483 different users. Each tweet contains a number of entities that are distinguished by their attribute field names. For data analysis, we adopted the following four essential fields from the metadata of tweets:
• user_id_str: string representation of the unique identifier for a certain user
• text: actual UTF-8 text of the status update containing a POI name
• lat: latitude of the tweet's location
• lon: longitude of the tweet's location
We first represent each POI with Wikipedia concepts. Through query processing, we then obtain the filtered geo-tagged tweets whose text field is associated with the POI names. Two POIs in London and two in Los Angeles are used for our analysis. POIs are not restricted to those considered here; they can be any points potentially interesting to tourists and places where people tend to check in. We selected these POIs because their boundary sizes are expected to differ from one another.

Table 1: Four POIs
POI name | (latitude, longitude) | Number of geo-tagged tweets | Initial radius $\Delta r_1$ (m)
London Eye | (51.503300°, −0.119700°) | 2,178 | 80
Victoria and Albert Museum (V&A) | (51.496667°, −0.171944°) | 1,098 | 120
Dodger Stadium | (34.072686°, −118.240603°) | 3,666 | 120
Los Angeles International Airport (LAX) | (34.053718°, −118.242642°) | 4,100 | 2,000

Representative attributes of the four POIs are summarized in Table 1. The second and third columns of Table 1 give the POI center's coordinates and the total number of geo-tagged tweets whose text contains the POI names, respectively. As a rule of thumb, an approximate minimum distance from the POI center covering the geographic area of the POI is obtained from Google Maps and is referred to as the initial radius $\Delta r_1$, which is shown in the last column of Table 1. As shown in the table, since we selected different types of POIs, the number of relevant geo-tagged tweets and the initial radius differ significantly across the POIs.
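The query-processing step described in this section (keeping only the geo-tagged tweets whose text contains a POI name) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function name `filter_tweets_by_poi`, the case-insensitive substring match, and the three sample records are all assumptions made for the example. Only the four field names come from the dataset description.

```python
from typing import Dict, Iterable, List


def filter_tweets_by_poi(tweets: Iterable[Dict], poi_names: Iterable[str]) -> List[Dict]:
    """Keep tweets whose text mentions any of the given POI names.

    Assumption: a case-insensitive substring match stands in for the
    paper's (unspecified) query processing.
    """
    needles = [name.lower() for name in poi_names]
    return [t for t in tweets if any(n in t["text"].lower() for n in needles)]


# Made-up records using the four fields adopted in the dataset section.
sample_tweets = [
    {"user_id_str": "1", "text": "Sunset from the London Eye", "lat": 51.5033, "lon": -0.1197},
    {"user_id_str": "2", "text": "Coffee near the office", "lat": 51.5100, "lon": -0.1300},
    {"user_id_str": "3", "text": "Queueing at the V&A today", "lat": 51.4967, "lon": -0.1719},
]

# D_in for one POI: full name only here; abbreviated names can be added to the list.
d_in = filter_tweets_by_poi(sample_tweets, ["London Eye"])
```

Passing several names at once, for instance `["Victoria and Albert Museum", "V&A"]`, covers the full and abbreviated forms mentioned in the text.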
3. METHODOLOGY
We start by introducing the following definition of “POI
boundary”.
Definition 1. A POI boundary is a single high-density cluster with a convex hull shape that contains the corresponding POI center. Within the boundary, every annulus created with a predetermined radius increment from the POI center should include at least one geo-tag.
Note that our definition differs from the conventional definition of AOI [6, 7, 11], which may consist of more than two
high-density clusters on a larger scale. This POI boundary
commonly reveals the highest density among all the clusters.
Now, we describe our two-phase detection of a precise POI boundary. As mentioned earlier, we are interested in finding a suitable radius for each POI. Let $(t_u, l_u)$ and $c$ denote the geo-tagged textual data of user $u$ and the coordinate of a POI center, respectively, where $t_u$ and $l_u$ are the text and the coordinate, respectively, of user $u$. Let $D_{\mathrm{in}}$ denote the set of all geo-tagged tweets whose text includes a POI name, and let $D_{\mathrm{out}}$ denote the set of geo-tagged tweets within the detected POI boundary. We also denote the set of geo-tagged data within a circle having center $c$ and radius $r$ by $D_{(c,r)}$. The geographic distance between the POI center $c$ and the location of user $u$, denoted by $d(c, l_u)$, can be computed using the spherical law of cosines, which gives a well-conditioned estimate of the distance down to distances as small as 1 meter.
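As an illustration of the distance $d(c, l_u)$, the spherical law of cosines can be written in a few lines. This is a sketch under stated assumptions: the mean Earth radius of 6,371 km, the function name `distance_m`, and the clamping of the cosine to $[-1, 1]$ (a guard against floating-point round-off) are choices made for this example and are not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters via the spherical law of cosines.

    Inputs are latitudes and longitudes in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Round-off can push the value slightly outside [-1, 1]; clamp before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_M * math.acos(cos_angle)


# Distance between two POI centers listed in Table 1 (London Eye and V&A).
d_example = distance_m(51.503300, -0.119700, 51.496667, -0.171944)
```

In the detection procedure the first point would be the POI center $c$ and the second the tweet location $l_u$.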
For the first phase, given a POI, we start with a circle centered at $c$ with radius $r_i = \Delta r_1 > 0$, where $r_i$ is a discretely increasing variable. The radius $r_i$ is increased by $\Delta r_1$ at each step if a certain condition is satisfied, and the update history is archived. When $d(c, l_u)$ is smaller than $r_i$, it follows that $(t_u, l_u) \in D_{(c,r_i)}$. Now, we describe the update condition for the first phase. Let us define $dn_1 \triangleq |D_{(c,r_i)} \setminus D_{(c,r_{i-1})}|$, which is the number of geo-tagged tweets in the corresponding annulus. If $dn_1$ is greater than a given threshold $\Delta\eta$, then the three variables $r_i$, $D_{(c,r_i)}$, and $dn_1$ for the POI are updated iteratively. Otherwise, this process is terminated. Here, the threshold $\Delta\eta > 0$ can be determined adaptively according to the radius increment $\Delta r_1$, which will be specified later.
Let us turn to the second phase, which yields a more fine-grained result than the single-phase method. We start with a circle centered at $c$ with radius $r_{i,j} = r_{i-1} > 0$, i.e., the last radius accepted in the first phase. We denote the radius increment for this phase by $\Delta r_2$, which is set to a value greater than the distance due to the GPS error. More precisely, we set $\Delta r_2 = \Delta r_1 / Q$, where $Q > 1$ is the parameter representing the interval granularity. For example, with $\Delta r_1 = 2$ km and $Q = 10$, as used for LAX in Section 4, we have $\Delta r_2 = 200$ m. Likewise, $r_{i,j}$ is increased by $\Delta r_2$ at each step under another condition, and the update history is also archived. In this phase, we define $dn_2 \triangleq |D_{(c,r_{i,j})} \setminus D_{(c,r_{i,j-1})}|$, which is the number of geo-tagged tweets in the corresponding annulus. If $dn_2 \ge 1$, then $r_{i,j}$, $D_{(c,r_{i,j})}$, and $dn_2$ are updated iteratively. Otherwise, this process is terminated, and the set $D_{\mathrm{out}} = D_{(c,r_{i,j})}$ is finally obtained. The overall procedure is summarized in Algorithm 1.
Thereafter, by solving the convex-hull problem (e.g., with quickhull), we find the smallest convex polygon that contains the given points in $D_{\mathrm{out}}$ as well as the POI center; this polygon corresponds to the POI boundary.
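The convex-hull step can be sketched as below. Note the substitution: the text mentions quickhull as an example, whereas this sketch uses Andrew's monotone chain, a different and shorter algorithm that returns the same hull. It also treats (longitude, latitude) pairs as planar coordinates, which is an assumption that is reasonable only over the small extent of a single POI. The names `convex_hull` and `poi_boundary` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (lon, lat) treated as planar


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def poi_boundary(center: Point, selected: Sequence[Point]) -> List[Point]:
    """Hull of the selected geo-tags together with the POI center."""
    return convex_hull(list(selected) + [center])


# Toy input: four corner points, with the center lying inside the square.
hull = poi_boundary((1.0, 1.0), [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])
```

Adding the center to the point set before computing the hull guarantees that the resulting polygon contains it, as the definition of a POI boundary requires.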
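Algorithm 1 Two-phase detection algorithm.
Input: $D_{\mathrm{in}}$, $c$, $\Delta r_1$, $\Delta r_2$, and $\Delta\eta$
Output: $D_{\mathrm{out}}$
Initialization: $i \leftarrow 1$; $j \leftarrow 0$; $r_i \leftarrow \Delta r_1$; $r_{i,j} \leftarrow 0$; $dn_1 \leftarrow 0$; $dn_2 \leftarrow 0$; $D_{(c,r_i)} \leftarrow \{(t_u, l_u) \mid d(c, l_u) < \Delta r_1,\ (t_u, l_u) \in D_{\mathrm{in}}\}$
do
    $i \leftarrow i + 1$
    $r_i \leftarrow r_{i-1} + \Delta r_1$
    $D_{(c,r_i)} \leftarrow \{(t_u, l_u) \mid d(c, l_u) < r_i,\ (t_u, l_u) \in D_{\mathrm{in}}\}$
    $dn_1 \leftarrow |D_{(c,r_i)} \setminus D_{(c,r_{i-1})}|$
while $dn_1 > \Delta\eta$
$i \leftarrow i - 1$
$r_{i,j} \leftarrow r_i$
do
    $j \leftarrow j + 1$
    $r_{i,j} \leftarrow r_{i,j-1} + \Delta r_2$
    $D_{(c,r_{i,j})} \leftarrow \{(t_u, l_u) \mid d(c, l_u) < r_{i,j},\ (t_u, l_u) \in D_{\mathrm{in}}\}$
    $dn_2 \leftarrow |D_{(c,r_{i,j})} \setminus D_{(c,r_{i,j-1})}|$
while $dn_2 \ge 1$
$D_{\mathrm{out}} \leftarrow D_{(c,r_{i,j})}$
return $D_{\mathrm{out}}$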
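For readers who prefer executable code, the two phases of Algorithm 1 might be sketched as follows. This is one interpretation, not the authors' implementation. It assumes the distances $d(c, l_u)$ have already been computed, it starts the first phase from $r = \Delta r_1$ as the text describes, and it returns the last radius whose annulus was non-empty (the algorithm's final radius is one empty annulus larger but selects the same set). The function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def two_phase_detect(distances: Sequence[float], dr1: float, q: float,
                     d_eta: int) -> Tuple[float, List[int]]:
    """Return (radius, indices of selected tweets) for one POI.

    distances: d(c, l_u) in meters for every tweet in D_in.
    dr1: phase-1 radius increment; dr2 = dr1 / q is the phase-2 increment.
    d_eta: phase-1 threshold on the number of tweets per annulus.
    """
    dr2 = dr1 / q

    def annulus_count(inner: float, outer: float) -> int:
        return sum(1 for d in distances if inner <= d < outer)

    # Phase 1: grow by dr1 while the next annulus holds more than d_eta tweets.
    k = 1
    while annulus_count(k * dr1, (k + 1) * dr1) > d_eta:
        k += 1
    r1 = k * dr1

    # Phase 2: grow by dr2 while the next annulus holds at least one tweet.
    m = 0
    while annulus_count(r1 + m * dr2, r1 + (m + 1) * dr2) >= 1:
        m += 1
    radius = r1 + m * dr2

    selected = [idx for idx, d in enumerate(distances) if d < radius]
    return radius, selected


# Made-up distances (meters) for illustration only.
toy = [10.0, 20.0, 30.0, 110.0, 120.0, 130.0, 205.0, 215.0, 400.0]
radius, selected = two_phase_detect(toy, dr1=100.0, q=10, d_eta=2)
```

With the toy data, the first phase stops at 200 m because the annulus from 200 m to 300 m holds only two tweets, and the second phase then extends the radius in 10 m steps to 220 m. Each annulus count rescans the whole list, so the cost is the number of radius steps times the number of tweets.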
Figure 1: Detection of POI boundaries. Panels: (a) London Eye; (b) Victoria and Albert Museum (V&A); (c) Dodger Stadium; (d) Los Angeles International Airport (LAX).

4. ANALYSIS RESULTS
We first present experimental results obtained with the detection algorithm proposed in Section 3. In our work, we simply assume $Q = 10$. Then, a suitable $\Delta\eta$ can be set to 100 according to the relationship between $\Delta r_1$ and $\Delta r_2$. The detected POI boundaries on Google Maps are illustrated in Figure 1. The blue circle and the red pins indicate the POI center and the selected geo-tagged records in the set $D_{\mathrm{out}}$, respectively (note that some pins nearly overlap).
Next, we compare the state-of-the-art DBSCAN algorithm with the proposed two-phase algorithm in terms of computational complexity. We evaluated the overall average runtime, measured as the CPU time charged for executing the instructions of the calling process, in detecting the LAX boundary using the dataset in Section 2. The DBSCAN algorithm returns 37 clusters, where the radius and the minimum number of neighbors, which are the two parameters of DBSCAN, are set to 2 km and 5, respectively. To obtain the POI boundary, the one high-density cluster out of the 37 that includes the POI center is then chosen (that is, the rest of the clusters are filtered out). In this case, the maximum distance from the POI center within the boundary is 3.782 km. On the other hand, the proposed algorithm immediately returns only one cluster. In our scheme, by setting $\Delta r_1 = 2$ km, $Q = 10$, and $\Delta\eta = 100$, the maximum distance reachable from the POI center is 3.872 km, which is quite similar to the DBSCAN case. In Figure 2, the horizontal and vertical axes represent the number of geo-tagged tweets in the set $D_{\mathrm{in}}$, $n_{\mathrm{in}} \triangleq |D_{\mathrm{in}}|$, and the execution time in seconds, respectively. As long as $|D_{\mathrm{out}}|$ scales more slowly than $n_{\mathrm{in}}$ (i.e., $|D_{\mathrm{out}}| = o(n_{\mathrm{in}})$), the complexity of the proposed algorithm is $O(n_{\mathrm{in}})$. On the other hand, the complexity of the DBSCAN algorithm is known to scale as $n_{\mathrm{in}} \log n_{\mathrm{in}}$. For large $n_{\mathrm{in}}$, the performance gap between these two methods increases significantly. Asymptotic curves are also shown in Figure 2; they show trends consistent with our experimental results.
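One way to see how runtime linear in $n_{\mathrm{in}}$ can arise is to bin every distance into its annulus in a single pass and then walk the bins outward. The sketch below illustrates that idea only; the paper does not state how its implementation counts tweets per annulus, so the binning strategy and the names `annulus_counts` and `two_phase_radius_binned` are assumptions. The cost is one pass over the distances per phase plus one step per visited annulus.

```python
from collections import Counter
from typing import Iterable, Sequence


def annulus_counts(distances: Iterable[float], step: float) -> Counter:
    """Count points per annulus [k*step, (k+1)*step) in a single pass."""
    counts: Counter = Counter()
    for d in distances:
        counts[int(d // step)] += 1
    return counts


def two_phase_radius_binned(distances: Sequence[float], dr1: float, q: float,
                            d_eta: int) -> float:
    """Final radius of the two-phase procedure using precomputed bins."""
    dr2 = dr1 / q

    # Phase 1: one pass to bin by dr1, then walk outward from the initial radius.
    coarse = annulus_counts(distances, dr1)
    k = 1
    while coarse[k] > d_eta:  # a Counter returns 0 for annuli with no points
        k += 1
    r1 = k * dr1

    # Phase 2: one pass to bin the points beyond r1 by dr2, then walk outward.
    fine = annulus_counts((d - r1 for d in distances if d >= r1), dr2)
    m = 0
    while fine[m] >= 1:
        m += 1
    return r1 + m * dr2


# Same made-up distances (meters) as a small illustration.
toy_distances = [10.0, 20.0, 30.0, 110.0, 120.0, 130.0, 205.0, 215.0, 400.0]
radius_binned = two_phase_radius_binned(toy_distances, dr1=100.0, q=10, d_eta=2)
```

Floor division on floating-point distances can place a point that lies exactly on an annulus edge differently from a direct comparison, so a production version would need to fix a boundary convention.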
Figure 2: Runtime complexity
5. ACKNOWLEDGMENT
This research was supported by the Basic Science Research Program through the National Research Foundation
of Korea (NRF) funded by the Ministry of Education (2014R1A1A2054577).
6. REFERENCES
[1] Q. Yuan, G. Cong, Z. Ma, A. Sun, and N. Magnenat-Thalmann. Time-aware point-of-interest recommendation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'13), pages 363–372, July 2013.
[2] M. Ye, P. Yin, W.-C. Lee, and D.-L. Lee. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'11), pages 325–334, July 2011.
[3] J.-W. Son, A.-Y. Kim, and S.-B. Park. A location-based news article recommendation with explicit localized semantic analysis. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'13), pages 293–302, July 2013.
[4] M. Berg, W. Meulemans, and B. Speckmann. Delineating imprecise regions via shortest-path graphs. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2011), pages 271–280, November 2011.
[5] J.-K. Parker and J.-A. Downs. Footprint generation using fuzzy-neighborhood clustering. Geoinformation, 17(2): 285–299, April 2013.
[6] J. Liu, Z. Huang, L. Chen, H.-T. Shen, and Z. Yan. Discovering areas of interest with geo-tagged images and check-ins. In Proceedings of the 20th ACM International Conference on Multimedia (MM'12), pages 589–598, October 2012.
[7] D. Laptev, A. Tikhonov, P. Serdyukov, and G. Gusev. Parameter-free discovery and recommendation of areas-of-interest. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2014), pages 113–122, November 2014.
[8] P. Brindley, J. Goulding, and M.-L. Wilson. A data driven approach to mapping urban neighbourhoods. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL2014), pages 437–440, November 2014.
[9] E. Cunha and B. Martins. Using one-class classifiers and multiple kernel learning for defining imprecise geographic regions. Geographical Information Science, 28(11): 2220–2241, November 2014.
[10] C. Grothe and J. Schaab. Automated footprint generation from geotags with Kernel Density Estimation and Support Vector Machines. Spatial Cognition & Computation: An Interdisciplinary, 9(3): 195–211, August 2009.
[11] S. Kisilevich, F. Mansmann, and D. Keim. P-DBSCAN: A density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In Proceedings of the 1st International Conference and Exhibition of Computing for Geospatial Research & Application (COM.Geo2010), June 2010.
[12] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Data Mining and Knowledge Discovery, 96(34): 226–231, 1996.
[13] Y. Takhteyev, A. Gruzd, and B. Wellman. Geography of Twitter networks. Social Networks, 34(1): 73–81, January 2012.
[14] J. Kulshrestha, F. Kooti, A. Nikravesh, and K. P. Gummadi. Geographic dissection of the Twitter network. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM-12), pages 202–209, June 2012.
[15] W.-Y. Shin, B. C. Singh, J. Cho, and A. M. Everett. A new understanding of friendships in space: Complex networks meet Twitter. Journal of Information Science, to appear.