FY2011 (Heisei 23)
Program for Organized Overseas Dispatch of Young Researchers and Others
Return Report (Research Stay)
Date of submission: September 30, 2011
Name: WATANABE Chiemi (渡辺 知恵美)
Affiliation: Graduate School of Humanities and Sciences, Lecturer
Host institution (country): Georgia Institute of Technology (USA)
Dispatch period: March 24, 2011 (departure from Japan) to September 16, 2011 (arrival in Japan)
①
From March 24 to September 15, a period of about five and a half months, I was hosted by Professor Ling Liu's research group in the School of Computer Science at the Georgia Institute of Technology, where we conducted joint research. Because the stay was short, we discussed the research topic in advance to some extent, and during the stay we held a weekly meeting to discuss the content of the research. The results of the stay are scheduled to be presented at the international conference CollaborateCom 2011 in Orlando, Florida on October 16.
Research activities during the stay
During the stay I belonged to the laboratory of my host, Professor Ling Liu, and carried out research centered on discussions with her. Although I was assigned a visiting-faculty office, it was located away from the students' rooms, and since I wanted to converse and debate with the Ph.D. students, I borrowed a vacant desk in a student room for my desk work. Until late May the laboratory held a weekly group meeting at which members with upcoming presentations gave talks followed by discussion. Many members were fourth- or fifth-year Ph.D. students close to completing their degrees, and I was able to sit in on their Ph.D. proposals and defenses, as well as the qualifying examinations taken by second- and third-year students, which gave me a concrete sense of how the American Ph.D. system differs from Japan's. I also joined their practice talks, discussed the content and delivery, and offered my own advice. Through conversations and discussions with them, I heard many interesting accounts of job hunting, the realities of earning a Ph.D., and their worries and hopes. From June to August, during the university's summer break, I surveyed related work, worked out the details of the proposed method, and began writing the paper. During this period Ling was often away for several weeks at a time to meet overseas collaborators, so we mostly communicated by e-mail. It was extremely valuable to receive direct feedback and concrete suggestions on the paper's overall structure and specific ideas from a professor active at the forefront of the field. In late August I attended the database conference VLDB 2011 in Seattle, USA, and broadened my knowledge of the latest research. Before the conference I also invited several Japanese researchers attending it to Georgia Tech and organized a research exchange meeting with Ling's laboratory and other database-related groups. At the meeting I presented the research conducted during my stay, and the discussion with the laboratory members and the Japanese researchers yielded good advice for the work going forward. Because five months is a short period, the stay ended with the problem statement against existing work and its modeling completed, but the joint research on this topic will continue after my return, and we were able to discuss how to proceed, along with plans for obtaining research funding and writing papers.
Daily life
Because going out alone at night in Atlanta is dangerous, I always returned home before nightfall at around 8:30 p.m., which helped me keep a well-regulated life. Ling Liu runs her laboratory jointly with her husband, Professor Calton Pu, and their example offered many hints about balancing housework and child-rearing with research, conducting research as a couple, and maintaining a healthy rhythm of work and rest. I also paid attention to diet and exercise; in particular, I ran once a week with a local jogging club, experiencing the city of Atlanta on foot and communicating with local residents.
②
Name: WATANABE Chiemi
Privacy Risks and Countermeasures in Publishing and Mining Social Network Data
ABSTRACT
Social network analysis is attracting growing attention as a tool for creating innovative marketing strategies, developing new social computing applications, and carrying out sociological research and field studies for historians and genealogists. With the continued revolution in social computing technologies, many social network providers, enterprises, and government organizations are interested in privacy-preserving publishing and mining of social network data. However, sharing and mining social network data should not intrude on the personal privacy of individuals. Thus, data anonymization techniques are considered essential for safe and secure publishing and mining of social network data. Many researchers have proposed anonymization techniques for sanitizing social network data before releasing it to third-party mining services. It is widely recognized that the primary privacy risks in publishing social network data center on inference attacks. However, anonymization is meaningless if the utility of the anonymized data is close to zero or completely lost. Thus, privacy-preserving publishing of social network data should aim at preserving the privacy required by users (social network entities) while maintaining the maximum utility of the released data.
We discuss privacy risks in publishing social network data and design principles for developing countermeasures. First, we make a first attempt to define the utility of released data in terms of exposure levels and query types, assuming that queries are the most fundamental operations in social network analysis. Second, we identify two types of background-knowledge-based inference attacks that can break some of the most representative graph-permutation-based anonymization techniques by violating their anonymity guarantees. Third but not least, we describe some design considerations for developing countermeasures in privacy-preserving social network data publishing.
SOCIAL NETWORK REFERENCE MODEL
Conceptually, a social network can be represented as a graph G = (V,E), where V is a set of nodes and E is a set of edges, each representing a type of relationship between a pair of nodes in G. When we model a social network of people with two types of nodes, member nodes and activity-based group nodes, and let the edges represent the engagement or participation of a member in a specific activity or group, G = (V,E) becomes an activity-group-based social network. We refer to this graph as a user-group link graph. Formally, we model a user-group link graph as a bipartite graph G = (V,W,E), where V is a set of user-nodes, W is a set of group-nodes, and E is a set of user-group links, each establishing a connection between a user-node v∈V and a group-node w∈W. Furthermore, social network data typically includes information about users, such as a user's age, gender, address, hobbies, education, and professional experience. We refer to such user-specific information as a user profile, which is linked to the corresponding user-node.
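The user-group link graph and attached profiles can be sketched as a small data structure; the ids and profile fields below are illustrative assumptions, not data from this report:

```python
# A minimal sketch of the user-group link graph G = (V, W, E) with user
# profiles attached to user-nodes. Ids and profile fields are hypothetical.

class UserGroupGraph:
    def __init__(self):
        self.users = {}      # V: user-node id -> profile dict
        self.groups = set()  # W: group-node ids
        self.edges = set()   # E: (user, group) links

    def add_user(self, uid, profile):
        self.users[uid] = profile

    def add_group(self, gid):
        self.groups.add(gid)

    def add_link(self, uid, gid):
        assert uid in self.users and gid in self.groups
        self.edges.add((uid, gid))

    def groups_of(self, uid):
        # all activity groups a user participates in
        return {g for (u, g) in self.edges if u == uid}

g = UserGroupGraph()
g.add_user("u1", {"age": 27, "occupation": "Ph.D. student"})
g.add_user("u4", {"age": 35, "occupation": "lawyer"})
g.add_group("g1")  # e.g. a database lab's research meeting
g.add_link("u1", "g1")
```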
SOCIAL NETWORK UTILITY MODEL
In general, utility measures how usable the data is for social network analysis. If a piece of data can be used to answer many types of queries with high accuracy, then its utility is high. We define the utility of social network data based on exposure levels, characterized in terms of graph structure and node profiles, as well as on query types.
Definition 1 (Exposure Level):
Level 1: Exposure of graph structure only. All profile data are deleted from every node prior to publishing the SN data.
Level 2: Exposure of node profiles only. Only the profile data of nodes are exposed in SN data publishing.
Level 3: Exposure of both graph structure and node profiles.
Definition 2 (Query Types):
Type 0: Queries using graph structure only.
Type 1: Queries using only node profiles.
Type 2: Queries with property-wise aggregation over a specific graph structure.
Type 3: Queries with graph-wise aggregation over specific condition matched profiles.
Many common and basic SN analyses, such as community detection and node classification, require queries of type 2 and/or type 3. To satisfy such requirements, a dataset published at exposure level 2 can answer all types of queries and thus has high utility.
SOCIAL NETWORK DATA PUBLISHING MODELS
We can classify existing anonymization techniques into two broad categories: perturbation-based approaches and permutation-based approaches. Figure 1 shows these two categories of anonymization techniques and how the sensitive data is protected. This report describes permutation-based approaches. Permutation techniques prevent the disclosure of sensitive values by breaking the links between a node in the graph and its corresponding profile and structure. Cormode et al. (2009) [i] proposed a permutation approach for graph data. Figure 2 shows social network data permuted by (k, l)-grouping (k = 2, l = 2). The (k, l)-grouping of a bipartite graph G(V,W,E) uses an edge-augmentation-based approach to partition V (W) into overlapping subsets of size k (l); the published edge set E' is isomorphic to E, and the mapping from E to E' is anonymized based on the augmented partitions of V and W, such that spurious semantic associations between user-nodes and group-nodes are added to satisfy the (k, l)-grouping. By (k, l)-partitioning V and W, all user-group links are preserved, but the user-node and group-node in each original user-group link are now permuted within a set of at least k user-nodes and a set of at least l group-nodes.
Fig.1 Categories of Publishing Approaches
Fig.2 Applying permutation technique for user-group affiliation network
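To make the class-level publication concrete, here is a rough sketch of (k, l)-grouping under simplifying assumptions (fixed-size, non-overlapping classes; the safety condition of the actual algorithm of Cormode et al. is omitted); all ids are hypothetical:

```python
# Simplified (k, l)-grouping: user ids are partitioned into classes of
# size k and group ids into classes of size l; each original link is then
# published only at class granularity, so an observer learns only that
# "some user in this class is linked to some group in that class".

def partition(ids, size):
    ids = sorted(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def publish(users, groups, edges, k, l):
    u_class = {u: tuple(c) for c in partition(users, k) for u in c}
    g_class = {g: tuple(c) for c in partition(groups, l) for g in c}
    return {(u_class[u], g_class[g]) for (u, g) in edges}

pub = publish(users=["u1", "u3", "u4", "u6"], groups=["g1", "g3"],
              edges=[("u1", "g1"), ("u3", "g3")], k=2, l=1)
# each published link now names a 2-user class instead of a single user
```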
VULNERABILITIES AND RISK ANALYSIS
Here we point out vulnerabilities found in existing social network data anonymization mechanisms by means of an example. For presentation convenience, we use the (k, l)-grouping permutation approach as a representative technique. If the exposure level is 1, no attacker can infer from the (k, l)-anonymized SN graph the existence of an association of u with g with probability higher than 1/max(k, l). However, the utility of the anonymized SN data is then very limited, and only queries of type 0 can be served. On the other hand, when the exposure level is 2 or 3, we gain much higher utility from the (k, l)-anonymized SN graph and can enjoy all four types of queries with high accuracy.
However, we can no longer enjoy the safety guarantee. Concretely, in Figure 2, user profiles u1, u4, and u6 are attached to v1 by the (k, l)-anonymization algorithm. By utilizing common-sense background knowledge, one can dramatically reduce the search space when matching possible worlds. For instance, suppose g1 refers to the research meeting of a database laboratory in a university, g3 refers to a swim meet of Atlanta teens, u1 is a Ph.D. student, u4 is a lawyer, and u6 is an Italian chef. The attacker can combine such profile data with common-sense knowledge to infer with high confidence that g3 has close to zero probability of being the true group with which u1, u4, or u6 is associated.
ATTACK MODEL AND ANALYSIS
Based on this observation, we investigate an attack model for social network data and define two types of background knowledge attacks: the user-group constraint attack and the skewed distribution attack. Both types of attacks utilize background knowledge about user-nodes and group-nodes in the published SN graph to eliminate possible worlds that are clearly irrelevant. We describe the two types of attacks using the running example shown in Figure 3. This graph is anonymized using (k, l)-grouping permutation with k = 2 and l = 1: v1 is connected to w1, v1 maps to the list of user ids {u1, u4, u6}, and w1 maps to group id g1, where g1 denotes the event of sharing e-mail between v1 and v2 at 18:23 on 2011/06/23.
Figure 3. An Example of Permuted Social Network Data
We assume that an attacker wants to know who attended the meeting at 14:00 EDT on 2011/05/06, that is, which user-nodes have a true association with g3.
Possible Worlds
Let G = (V,W,E) denote a social network graph and let G' = (V',W',E') denote a (k, l)-anonymized graph of G as defined above. Let PW(G,G') denote the set of possible worlds of G'. Given a possible world pwi of the anonymized graph G' = (V',W',E'), where V' = {v3, v4, v5}, W' = {w3}, and E' = {(v3, w3), (v4, w3), (v5, w3)}, a mapping of this possible world to the real world in G, denoted by Mi, is defined as Mi = {(v3, u3), (v4, u1), (v5, u2), (w3, g3)}. For presentation convenience, we write the possible world with this mapping as pw(u3, u1, u2, g3) when no confusion arises. In this example, the attacker can find the following 12 possible worlds in the sub-graph with three user-nodes v3, v4, v5 and one group-node w3:
PW(G,G′) = {pw(u3, u1, u2, g3), pw(u3, u1, u5, g3), pw(u3, u4, u2, g3), pw(u3, u4, u5, g3), pw(u3, u6, u2, g3),
pw(u3, u6, u5, g3), pw(u7, u1, u2, g3), pw(u7, u1, u5, g3),pw(u7, u4, u2, g3), pw(u7, u4, u5, g3),
pw(u7, u6, u2, g3),pw(u7, u6, u5, g3)}
User-group constraint violation attack
An adversary makes use of background knowledge to define a set of constraints between user-nodes and group-nodes, and between user-nodes that participate in the same group activity. In the running example, event g3 refers to a meeting that started at 14:00 EDT. Given the time difference between Atlanta and Japan, 14:00 EDT corresponds to 3:00 JST. Thus the adversary can introduce a time-difference constraint between users and groups: for any activity group associated with a short time window, no user whose time zone differs by roughly half a day can plausibly be associated with that group. Using this constraint, we can easily detect that (u1, g3) and (u2, g3) violate the constraint, since u1 and u2 list Japan as their current residence in their profiles, and thus it is very difficult, if not impossible, for u1 and u2 to attend this meeting. After removing the inappropriate possible worlds, 4 possible worlds remain:
PW(G,G') = {pw(u3, u4, u5, g3), pw(u3, u6, u5, g3), pw(u7, u4, u5, g3), pw(u7, u6, u5, g3)}
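The pruning step can be sketched as a filter over the twelve possible worlds. The report states only that u1 and u2 reside in Japan; the other residence values below are assumptions added for illustration:

```python
# Sketch of the user-group constraint attack: discard every possible
# world containing a user whose profile contradicts the group's time
# constraint (a short meeting at 14:00 EDT cannot plausibly include
# Japan residents, for whom it is around 3:00 JST).

residence = {"u1": "Japan", "u2": "Japan", "u3": "USA", "u4": "USA",
             "u5": "USA", "u6": "USA", "u7": "USA"}  # u3..u7 assumed

worlds = [
    ("u3", "u1", "u2"), ("u3", "u1", "u5"), ("u3", "u4", "u2"),
    ("u3", "u4", "u5"), ("u3", "u6", "u2"), ("u3", "u6", "u5"),
    ("u7", "u1", "u2"), ("u7", "u1", "u5"), ("u7", "u4", "u2"),
    ("u7", "u4", "u5"), ("u7", "u6", "u2"), ("u7", "u6", "u5"),
]

def violates_time_constraint(world):
    return any(residence[u] == "Japan" for u in world)

surviving = [w for w in worlds if not violates_time_constraint(w)]
print(len(surviving))  # 4 worlds remain
```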
Skewed probability distribution attack
If an adversary uncovers a skewed probability distribution over the set of possible worlds for an anonymized SN graph, the adversary may leverage it to launch a successful inference attack. For example, an adversary may define a scoring function f(u, g) between a possible user-node u and a possible activity-group node g based on background knowledge. This function estimates the probability that the association is true in the original SN graph. For example, g3 in Figure 3 refers to a meeting, so an attacker may use background knowledge to assume that the participants in the meeting have the same or similar professional profiles. Based on this assumption, the attacker defines a score function so that possible worlds whose members closely and mutually match have a higher probability of being mapped to the true world. We define the score function f(pw(G,G')) for each possible world as the sum of the values over all attributes; f(pw(u3, u6, u5, g3)) = 8, and the scores of the other possible worlds are as follows:
f(pw(u3, u4, u5, g3)) = 1 + 2 + 2 + 3 + 3 = 11
f(pw(u7, u4, u5, g3)) = 2 + 3 + 1 + 2 + 2 = 10
f(pw(u7, u6, u5, g3)) = 2 + 2 + 1 + 2 + 2 = 9
Based on the scoring function and the results, the attacker identifies the possible world with the highest similarity score as the most probable match to the true world. In the above example, pw(u3, u4, u5, g3) has the highest normalized similarity score, 11/(11 + 10 + 9 + 8) = 11/38, and is thus identified by the attacker as the most likely true world.
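The attacker's final step can be sketched by normalizing the four scores into a probability distribution (the score of pw(u3, u6, u5, g3) is taken as 8, consistent with the normalizing sum 11 + 10 + 9 + 8):

```python
# Sketch of the skewed-distribution inference: normalize the similarity
# scores over the surviving possible worlds into probabilities, then
# pick the most likely world.

scores = {
    ("u3", "u4", "u5"): 11,
    ("u7", "u4", "u5"): 10,
    ("u7", "u6", "u5"): 9,
    ("u3", "u6", "u5"): 8,
}

total = sum(scores.values())                       # 38
probs = {w: s / total for w, s in scores.items()}  # e.g. 11/38 for the top
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))  # ('u3', 'u4', 'u5') 0.289
```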
DISCUSSION AND COUNTERMEASURES
We have described the privacy risks in publishing anonymized social network data and two types of background-knowledge-based attacks: the constraint violation attack and the probability skew attack. One of the fundamental vulnerabilities in the design of graph permutation techniques is the lack of consideration of background knowledge and of the risks of combining background knowledge with published profile data and graph structure data. Concretely, taking the (k, l)-grouping permutation approach as an example, the algorithm for composing user groups of size k and activity groups of size l from the input social network G = (V,W,E) focuses on meeting the safety condition that nodes in the same group of V have no common neighbors in W. This condition serves higher utility, but it does not guarantee resilience against background knowledge attacks. A straightforward countermeasure is to revise the (k, l)-grouping algorithm so that it adds uncertainty by inserting spurious user-group links.
[i] Graham Cormode, Divesh Srivastava, Smriti Bhagat, Balachander Krishnamurthy: Class-based graph anonymization for social network data. PVLDB 2(1): 766-777 (2009)