
Ontology and HowNet
Dong Zhendong
Dong Qiang
[email protected]
[email protected]
www.keenage.com
Research Centre of Computer & Language Engineering
Chinese Academy of Sciences
Harbin
2003.08
Outline

Ontology

HowNet vs SUMO/WordNet/VerbNet
Ontology

What is an ontology?

Ontology and IT/NLP
What is an ontology?
Ontology as a discipline
Ontology as a resource

Ontology as a discipline

Ontology in philosophy
Ontology in AI/KR
Ontology in mathematics
Ontology in software engineering
Ontology in linguistics
Ontology in IT
Issues in defining ontology

Intrinsic meaning
External representation
Chinese translation of the term
Ontology and IT/NLP

similar to a dictionary or glossary, but with greater
detail and structure that enables computers to
process its content. An ontology consists of a set of
concepts, axioms, and relationships that describe a
domain of interest. An upper ontology is limited to
concepts that are meta, generic, abstract and
philosophical …
-- Standard Upper Ontology (SUO) Working Group

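A common-sense knowledge base that takes the concepts represented by
Chinese and English words as its objects of description, and whose basic
content is to reveal the relations between concepts and between the
attributes of concepts.
-- HowNet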
Typical ontologies
Cyc: http://www.cyc.com
IFF: The IFF Foundation Ontology
WordNet: http://www.cogsci.princeton.edu
EuroWordNet: http://www.hum.uva.nl/ewn/
HowNet: http://www.keenage.com
SUMO: http://ontology.teknowledge.com
EDR: http://www.iijnet.or.jp
VerbNet: http://www.cis.upenn.edu/verbnet/
Prototype (Sinica): http://ckip.iis.sinica.edu.tw/CKIP/ontology/
HowNet vs SUMO/WordNet/VerbNet

SUMO – Suggested Upper Merged Ontology

Mapping WordNet to SUMO
SUMO – Suggested Upper Merged Ontology

SUMO Sources

SUMO Subclass Hierarchy Tree
making
  constructing
  manufacture
  publication
  cooking
searching
  pursuing
  investigating
    diagnostic process
social interaction
change of possession
  giving
    unilateral giving
    lending
  getting
    unilateral getting
    borrowing
Motivation for Mapping

How can a formal ontology be used effectively by
those who lack extensive training in logic and
mathematics?

How can an ontology be used automatically by
applications?

How can we know when an ontology is complete?
Architecture of HowNet
D-relation Trigger
(Application Tools)
S-relation Trigger
(Browser)
Basic Data
(Concept Definitions / Taxonomies)
Basic Data – Sememes
Sememes                              2219
  Entity                              154
    thing (physical, mental, fact)
    component (part, fitting)
    time
    space (direction, location)
  Event (relation, state, action)     818
  Attribute                           248
  Value                               892
  Secondary feature                   107
Basic Data – Concept Definition
NO.=020957
W_C=大学生
G_C=N
E_C=
W_E=college student
G_E=N
E_E=
DEF={human|人:{study|学习:agent={~},
location={InstitutePlace|场所:domain={education|教育},
modifier={HighRank|高等},{study|学习:location={~}},
{teach|教:location={~}}}}}
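The record above is a list of KEY=value fields, so it can be read into a dictionary with very little code. The following is a minimal sketch, not HowNet's own tooling: `parse_record` and `braces_balanced` are hypothetical helper names, and the only assumption about the format is that each field sits on one line and is split at its first "=".

```python
# Sketch: read one HowNet record (KEY=value lines) into a dict.
# parse_record and braces_balanced are illustrative helpers, not a HowNet API.

RECORD = "\n".join([
    "NO.=020957",
    "W_C=大学生",
    "G_C=N",
    "E_C=",
    "W_E=college student",
    "G_E=N",
    "E_E=",
    "DEF={human|人:{study|学习:agent={~},"
    "location={InstitutePlace|场所:domain={education|教育},"
    "modifier={HighRank|高等},{study|学习:location={~}},"
    "{teach|教:location={~}}}}}",
])


def parse_record(text):
    """Split every non-empty line at its first '=' (DEF values contain '=')."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        record[key.strip()] = value.strip()
    return record


def braces_balanced(definition):
    """Cheap sanity check that every '{' in a DEF has a matching '}'."""
    depth = 0
    for ch in definition:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


record = parse_record(RECORD)
print(record["W_E"], record["G_E"])
print(braces_balanced(record["DEF"]))
```

Splitting at the first "=" only matters for the DEF field, whose value contains further "=" signs for its role fillers.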
Basic Data – Taxonomies
- {thing|万物} {entity|实体:{ExistAppear|存现:existent={~}}}
- {physical|物质} {thing|万物:HostOf={Appearance|外观},
{perception|感知:content={~}}}
- {animate|生物} {physical|物质:HostOf={Age|年龄},
{alive|活着:experiencer={~}},{die|死: experiencer={~}},
{metabolize|代谢: experiencer={~}},
{reproduce|生殖:agent={~},PatientProduct={~}}}
- {AnimalHuman|动物} {animate|生物:HostOf={Sex|性别},
{AlterLocation|变空间位置:agent={~}},
{StateMental|精神状态:experiencer={~}}}
- {human|人} {AnimalHuman|动物:HostOf={Name|姓名}
{Wisdom|智慧}{Ability|能力},
{think|思考:agent={~}},{speak|说:agent={~}}}
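Each taxonomy entry names a hypernym and then lists the features added at that level. Assuming a sememe inherits the features of its hypernyms (which the layout suggests, but the slide does not state), the full description of {human|人} can be collected by walking up the chain. The dictionary encoding and the function names below are illustrative only.

```python
# Sketch: collect features along the hypernym chain.
# The entries are transcribed from the five taxonomy lines above;
# the dict layout and function names are assumptions for illustration.

TAXONOMY = {
    "thing": ("entity", ["{ExistAppear|存现:existent={~}}"]),
    "physical": ("thing", ["HostOf={Appearance|外观}",
                           "{perception|感知:content={~}}"]),
    "animate": ("physical", ["HostOf={Age|年龄}",
                             "{alive|活着:experiencer={~}}",
                             "{die|死:experiencer={~}}",
                             "{metabolize|代谢:experiencer={~}}",
                             "{reproduce|生殖:agent={~},PatientProduct={~}}"]),
    "AnimalHuman": ("animate", ["HostOf={Sex|性别}",
                                "{AlterLocation|变空间位置:agent={~}}",
                                "{StateMental|精神状态:experiencer={~}}"]),
    "human": ("AnimalHuman", ["HostOf={Name|姓名}{Wisdom|智慧}{Ability|能力}",
                              "{think|思考:agent={~}}",
                              "{speak|说:agent={~}}"]),
}


def hypernym_chain(sememe):
    """Return the sememe followed by its hypernyms, most specific first."""
    chain = [sememe]
    while sememe in TAXONOMY:
        sememe = TAXONOMY[sememe][0]
        chain.append(sememe)
    return chain


def inherited_features(sememe):
    """List features from the most general level down to the sememe itself."""
    features = []
    for node in reversed(hypernym_chain(sememe)):
        if node in TAXONOMY:
            features.extend(TAXONOMY[node][1])
    return features


print(" > ".join(hypernym_chain("human")))
for feature in inherited_features("human"):
    print(feature)
```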
S-relation Trigger -- Browser
D-relation Trigger -- Application Tools
Relevant Concept Field Builder
Cf. “seed list”, Bonnie Dorr & Tiejun Zhao: “化学” (chemistry) / “射击” (shooting)

Sense Similarity Calculator
“毛衣” (sweater) vs. “手套” (gloves) / “醋” (vinegar)

Chinese Chunk Extractor
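The slide does not say how the similarity calculator works. One common way to score similarity over a sememe hierarchy is to turn path distance into a value between 0 and 1, for example sim = alpha / (d + alpha). The sketch below uses that formula on a made-up three-level hierarchy; the hierarchy, the value of alpha, and all function names are assumptions and do not reproduce HowNet's taxonomy or its tool.

```python
# Sketch: distance-based similarity on a toy hypernym tree.
# PARENT is a hypothetical hierarchy, NOT HowNet's taxonomy;
# ALPHA = 1.6 is an assumed smoothing constant.

ALPHA = 1.6

PARENT = {
    "clothing": "artifact",
    "edible": "artifact",
    "sweater": "clothing",
    "gloves": "clothing",
    "vinegar": "edible",
}


def ancestors(node):
    """Return the node followed by all of its hypernyms up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def path_distance(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    chain_a = ancestors(a)
    chain_b = ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a + chain_b.index(node)
    raise ValueError("no common ancestor")


def similarity(a, b):
    """Map distance d to a score in (0, 1]; identical nodes score 1.0."""
    return ALPHA / (path_distance(a, b) + ALPHA)


print(round(similarity("sweater", "gloves"), 3))
print(round(similarity("sweater", "vinegar"), 3))
```

With this toy tree the sweater/gloves pair scores higher than sweater/vinegar, which is the kind of contrast the example words on the slide are meant to show.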
Applications of HowNet in China and abroad (1)

Semantic Web
ontology annotation
 thesaurus

Chen Wenzhi (陈文鋕): Semantic Processing && Semantic Web Service
(Institute for Information Industry, Taiwan)

Named Entity Recognition
Tianfang Yao, Wei Ding, Gregor Erbach: CHINERS: A
Chinese Named Entity Recognition System for the
Sports Domain
Applications of HowNet in China and abroad (2)

Word Sense Disambiguation
Chi-Yung Wang: Knowledge-based Sense Pruning using the
HowNet: an Alternative to Word Sense
Disambiguation
Wong Ping Wai: A Maximum Entropy Approach to HowNet-Based Chinese Word Sense Disambiguation

Word Similarity Computing
Liu Qun, Li Sujian: Word Similarity Computing Based on
HowNet
Applications of HowNet in China and abroad (3)

Sense Annotation

Dependency Relation Annotation
Li Mingqin, Li Juanzi: Building A Large Chinese Corpus
Annotated with Semantic Dependency

Cross-language Developing
Licensed to the Institute of Information Science, Academia Sinica, Taiwan,
for joint development of the HowNet Big5+ edition
National Digital Archives Program (NDAP)
http://ndap.org.tw/NewsLetter/content.html?subuid=559&uid=26
Thank you
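Current research trends

Theoretical or philosophical exploration
Mapping, linking, and merging
Research carried out through applications
Building common-sense or domain-specific knowledge systems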
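Some views on building knowledge systems

Theory vs. engineering: put engineering first
Research vs. application: keep the application in view
Tell real alignment with international practice (接轨, jiegui) from “connecting with ghosts” (接鬼, a pun on the same sound)
Five years ago someone suggested that we turn HowNet into WordNet
Recently someone suggested that we revise HowNet's sememes to follow SUMO
Recutting HowNet, a qipao, into a two-piece Western skirt suit: that is “connecting with ghosts”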

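Chinese WordNet or English HowNet?
For Chinese there is already a resource similar to a wordnet, called HowNet
(http://www.keenage.com). Mr. Dong Zhendong of mainland China began work on it
on his own in 1995. It is a bilingual Chinese-English/English-Chinese lexical
network. Early versions were open and free of charge. Since 2002, when the new
version came under the management of the Institute of Software, Chinese Academy
of Sciences, a fee has been required.
HowNet's approach is distinctive: it does not adopt the architecture of the
English WordNet but uses its own. It first defines an ontology of world
knowledge and then draws distinctions within that definition. This top-down
method differs from the bottom-up method of the English and European wordnets,
and it certainly has its merits. Unfortunately, because of the limited resources
and information available at the time, Professor Dong Zhendong and his son Dong
Qiang completed HowNet essentially on conviction and enthusiasm, with very
little outside support and without connecting to related research worldwide.
The two of them spent some seven or eight years on it. However, it has
essentially no architectural basis for linking to the wordnets of other
languages, and its upper-level knowledge classification rests on the two
authors' own discretion. It cannot be called wrong, but it lacks a theoretical
foundation and faces interoperability problems with other systems.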
Records in WordNet / HowNet
Record in WordNet
03592879 06 n 02 watch 0 ticker 1 012 @ 03506835 n 0000 ~
02187181 n 0000 %p 02529205 n 0000 ~ 02570752 n 0000 %p
02659936 n 0000 ~ 02841320 n 0000 %p 03021820 n 0000 ~
03104263 n 0000 ~ 03150171 n 0000 ~ 03410656 n 0000 %p
03593482 n 0000 ~ 03636122 n 0000 | a small portable timepiece
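The WordNet record is positional rather than labelled. The sketch below reads it following the layout of WordNet's noun data files (offset, lexicographer file number, part of speech, hexadecimal word count, word/lex_id pairs, decimal pointer count, four-token pointers, then the gloss after "|"). `parse_wordnet_line` is a hypothetical helper written for this record and ignores verb frames, which noun entries do not have.

```python
# Sketch: parse the WordNet data-file line shown above.
# parse_wordnet_line is an illustrative helper; verb frames are not handled.

LINE = (
    "03592879 06 n 02 watch 0 ticker 1 012 @ 03506835 n 0000 ~ "
    "02187181 n 0000 %p 02529205 n 0000 ~ 02570752 n 0000 %p "
    "02659936 n 0000 ~ 02841320 n 0000 %p 03021820 n 0000 ~ "
    "03104263 n 0000 ~ 03150171 n 0000 ~ 03410656 n 0000 %p "
    "03593482 n 0000 ~ 03636122 n 0000 | a small portable timepiece"
)


def parse_wordnet_line(line):
    data, _, gloss = line.partition("|")
    tokens = data.split()
    offset, lex_filenum, ss_type = tokens[0:3]
    word_count = int(tokens[3], 16)      # w_cnt is written in hexadecimal
    pos = 4
    words = []
    for _ in range(word_count):
        words.append(tokens[pos])        # tokens[pos + 1] is the lex_id
        pos += 2
    pointer_count = int(tokens[pos])     # p_cnt is written in decimal
    pos += 1
    pointers = []
    for _ in range(pointer_count):
        symbol, target, target_pos, _source_target = tokens[pos:pos + 4]
        pointers.append((symbol, target, target_pos))
        pos += 4
    return {
        "offset": offset,
        "lex_filenum": lex_filenum,
        "pos": ss_type,
        "words": words,
        "pointers": pointers,            # '@' hypernym, '~' hyponym, '%p' part
        "gloss": gloss.strip(),
    }


synset = parse_wordnet_line(LINE)
print(synset["words"], synset["gloss"])
print(len(synset["pointers"]), "pointers")
```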
Record in HowNet
NO.=007738
W_C=表
G_C=N
E_C=手~,怀~,钟~,电子~,机械~,带钻石的~,这块~不防水
W_E=watch
G_E=N
E_E=
DEF={tool|用具:{tell|告诉:content={time|时间},instrument={~}}}
Axiom in SUMO / HowNet (1)
See SUMO_buy.doc
Cf. HowNet Event Relation & Role shifting
{buy|买} <----> {obtain|得到} [consequence];
agent OF {buy|买}=possessor OF {obtain|得到};
possession OF {buy|买}=possession OF {obtain|得到}.
{buy|买} (X) <----> {sell|卖} (Y) [mutual implication];
agent OF {buy|买}=target OF {sell|卖};
source OF {buy|买}=agent OF {sell|卖};
possession OF {buy|买}=possession OF {sell|卖};
cost OF {buy|买}=cost OF {sell|卖}.
Axiom in SUMO / HowNet (2)
{buy|买} [entailment] <----> {choose|选择};
agent OF {buy|买}=agent OF {choose|选择};
possession OF {buy|买}=content OF {choose|选择};
source OF {buy|买}=location OF {choose|选择}.
{buy|买} [entailment] <----> {pay|付};
agent OF {buy|买}=agent OF {pay|付};
cost OF {buy|买}=possession OF {pay|付};
source OF {buy|买}=target OF {pay|付}.
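The role equations above can be read as a renaming table from the roles of {buy|买} to the roles of the related event. A minimal sketch of that reading follows; the mapping tables are transcribed from the equations, while `shift_roles` and the sample fillers (Mary, bookshop, and so on) are invented for illustration.

```python
# Sketch: role shifting as a lookup from buy-roles to roles of a related event.
# The tables follow the equations above; the fillers are made-up examples.

BUY_TO_OBTAIN = {"agent": "possessor", "possession": "possession"}
BUY_TO_SELL = {"agent": "target", "source": "agent",
               "possession": "possession", "cost": "cost"}
BUY_TO_CHOOSE = {"agent": "agent", "possession": "content",
                 "source": "location"}
BUY_TO_PAY = {"agent": "agent", "cost": "possession", "source": "target"}


def shift_roles(frame, mapping):
    """Rename the roles of a buy-frame; roles without a counterpart are dropped."""
    return {mapping[role]: filler
            for role, filler in frame.items() if role in mapping}


buy = {"agent": "Mary", "source": "bookshop",
       "possession": "dictionary", "cost": "20 yuan"}

print("sell:  ", shift_roles(buy, BUY_TO_SELL))
print("obtain:", shift_roles(buy, BUY_TO_OBTAIN))
print("pay:   ", shift_roles(buy, BUY_TO_PAY))
```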
Thematic Roles in VerbNet / HowNet
See VerbNet_buy.doc
Thematic Roles
Agent[+animate OR +organization]
Asset[+currency]
Beneficiary[+animate OR +organization]
Source[+concrete]
Theme[]
Cf. HowNet Event Role with Typical Actors
│ ├ {buy|买} {take|取:agent={human|人}{group|群体->},
possession={artifact|人工物->},source={human|人}
{InstitutePlace|场所},cost={money|货币},
beneficiary={human|人}{group|群体->},
domain={economy|经济}}
Components of HowNet





Taxonomy (sememe hierarchy specification)
Roles and Features (specification)
Specifications of KDML (the knowledge description language)
Knowledge Database
Event Relations & Role Shifting
Maintenance Tools
APIs (application interfaces)
Nature of HowNet
An online knowledge-base which reveals
the relationship among concepts, and the
relationship among attributes of
concepts
-- Dong Zhendong, "Knowledge Description: What,
How and who?", Proceedings of International Symposium
on Electronic Dictionary, Tokyo, 1988, p.18
Theory of HowNet

Knowledge is a system of relationships among
concepts and among attributes of concepts

Everything is constantly changing in a specific
time and space, and converts from one state to
another. The conversion embodies the change of
its attributes
Guidelines of Design






Computer-oriented
Relationship is the key; to reveal the
relationship is the main objective of HowNet
Based on sememes
Use of KDML
Defining concepts in a static and isolated way
Relationship is activated in a dynamic way
Concept Definitions in HowNet (1)
医生 (doctor): DEF={human|人:domain={medical|医},
HostOf={Occupation|职位},{doctor|医治:
agent={~}}}
患者 (patient): DEF={human|人:domain={medical|医},
{SufferFrom|罹患:experiencer={~}},
{doctor|医治:patient={~}}}
医院 (hospital): DEF={InstitutePlace|场所:{doctor|医治:
location={~},content={disease|疾病}},
domain={medical|医}}
Concept Definitions in HowNet (2)
病历 (medical record): DEF={document|文书:{record|记录:
content={disease|疾病},LocationFin={~}},
domain={medical|医}}
健康 (health): DEF={Health|健康:
host={AnimalHuman|动物}}
多病 (sickly): DEF={unhealthy|不健}
│ │ ├ {HealthValue|健康值}
│ │ │ ├ {healthy|康健}
│ │ │ └ {unhealthy|不健}
Concept Definitions in HowNet (3)
病 (disease): {disease|疾病} {phenomena|现象:
{doctor|医治:content={~}},
{SufferFrom|罹患:content={~}},RelateTo={medicine|药物}
{Health|健康}{HealthValue|健康值},
domain={medical|医}}
药 (medicine): {medicine|药物} {artifact|人工物:
{doctor|医治:instrument={~}},RelateTo={disease|疾病},
domain={medical|医}{chemistry|化学}}
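The definitions above never link doctor, patient, hospital and medicine to one another directly. The link appears only when the DEFs are compared: each fills a different role of the same event sememe {doctor|医治}. The sketch below makes that visible with a regular expression. It is a deliberately simplified reading of the notation (it only finds {~} when it fills the first role listed after an event sememe), and the names `ROLE_SLOT` and `roles_by_event` are hypothetical.

```python
# Sketch: index concepts by the event role they fill, to show how
# relations are "activated" from isolated definitions.
# Simplification: only a {~} in the FIRST role slot of an event is detected.
import re
from collections import defaultdict

DEFS = {
    "医生": "{human|人:domain={medical|医},HostOf={Occupation|职位},"
            "{doctor|医治:agent={~}}}",
    "患者": "{human|人:domain={medical|医},"
            "{SufferFrom|罹患:experiencer={~}},{doctor|医治:patient={~}}}",
    "医院": "{InstitutePlace|场所:{doctor|医治:location={~},"
            "content={disease|疾病}},domain={medical|医}}",
    "药": "{artifact|人工物:{doctor|医治:instrument={~}},"
          "RelateTo={disease|疾病},domain={medical|医}{chemistry|化学}}",
}

# {event|中文:role={~}  ->  captures (event, role)
ROLE_SLOT = re.compile(r"\{(\w+)\|[^:{}]+:(\w+)=\{~\}")


def roles_by_event(defs):
    """Map each event sememe to {role: word} for the words that fill it."""
    index = defaultdict(dict)
    for word, definition in defs.items():
        for event, role in ROLE_SLOT.findall(definition):
            index[event][role] = word
    return index


index = roles_by_event(DEFS)
for role, word in index["doctor"].items():
    print(f"{role:10s} of {{doctor|医治}}: {word}")
```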
Identity of description in different
language structures (1)
W_C=劫
G_C=V
E_C=
W_E=rob
G_E=V
E_E=
DEF={rob|抢}
W_C=飞机
G_C=N
E_C=
W_E=plane
G_E=N
E_E=
DEF={aircraft|飞行器}
Identity of description in different
language structures (2)
W_C=劫机
G_C=V
E_C=
W_E=hijack a plane
G_E=V
E_E=
DEF={rob|抢:possession={aircraft|飞行器}}
Identity of description in different
language structures (3)
W_C=劫机犯
G_C=N
E_C=
W_E=hijacker
G_E=N
E_E=
DEF={human|人:{rob|抢:agent={~},
possession={aircraft|飞行器}}}
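The three slides so far show a word, a verb phrase and a derived noun built from the same sememes. A quick way to check this is to pull the sememe labels out of each DEF and compare the sets. The sketch below does only that; it ignores roles and nesting, and `sememes` is a hypothetical helper.

```python
# Sketch: compare the sememe inventories of the DEFs shown so far.
# Only the English sememe labels are extracted; roles and nesting are ignored.
import re

SEMEME = re.compile(r"\{(\w+)\|")

DEFS = {
    "rob": "{rob|抢}",
    "plane": "{aircraft|飞行器}",
    "hijack a plane": "{rob|抢:possession={aircraft|飞行器}}",
    "hijacker": "{human|人:{rob|抢:agent={~},"
                "possession={aircraft|飞行器}}}",
}


def sememes(definition):
    """Return the set of sememe labels that occur in a DEF string."""
    return set(SEMEME.findall(definition))


for phrase, definition in DEFS.items():
    print(f"{phrase:15s} {sorted(sememes(definition))}")

# The phrase reuses exactly the sememes of its parts.
print(sememes(DEFS["hijack a plane"])
      == sememes(DEFS["rob"]) | sememes(DEFS["plane"]))
```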
Identity of description in different
language structures (4)
W_C=抓获劫机犯
G_C=V
E_C=
W_E=catch a hijacker
G_E=V
E_E=
DEF={catch|捉住:patient={human|人:
{rob|抢:agent={~},
possession={aircraft|飞行器}}}}
Identity of description in different
language structures (5)
W_C=机敏地抓获女劫机犯
G_C=V
E_C=
W_E=catch a woman hijacker cleverly
G_E=V
E_E=
DEF={catch|捉住:manner={clever|灵},
patient={human|人:{rob|抢:agent={~},
possession={aircraft|飞行器}},
modifier={female|女}}}
Applications of HowNet
1. Semantic tagging
2. WSD, Sense Pruning
3. Sensitive information detection
4. Information filtering
5. Similarity of words
6. Semantic Web
7. Match of WordNet
Future work

Construction of resources
 English


HowNet
Chinese message structure bank
Increase of languages
Developing more APIs and tools
 Administration


Membership
Appendix: definitions of ontology (1)
a specification of a conceptualization
 the theory of objects and their ties
 similar to a dictionary or glossary, but with greater
detail and structure that enables computers to
process its content. An ontology consists of a set of
concepts, axioms, and relationships that describe a
domain of interest. An upper ontology is limited to
concepts that are meta, generic, abstract and
philosophical …

Appendix: definitions of ontology (2)
the study of what there is, an inventory of what
exists … What we may call ontology is the attempt to
say what entities exist. Metaphysics, by contrast, is the
attempt to say, of those entities, what they are.
 the study of the categories of things that exist or may
exist in some domain
 The word ontology comes from the Greek ontos for
being and logos for word.

Cost for French in EuroWordNet
For the development of the French part there were two partners:
Avignon (AVI) and Memodata (MEM). The following was requested:

                             AVI      MEM
Personnel                  72000    85000
Equipment                   3000        0
Travel & assistance         5000     1500
Consumables & computing     3000      300
Overheads                  16600    17100
Total                      99600   104400

Since Memodata was a private company, only 50% of its request could be funded by
the EC. So the total of the request was:

          AVI      MEM
Total   99600    52200

Notes: 1) Validation is not included in this table. It was done by Xerox and
Bertin globally for several languages.
2) These amounts constituted a provisional budget corresponding to some
20,000 synsets.
Demo of Tools
(1) Relevant Concept Field
(2) Similarity of Words
(3) Chinese Chunk Extractor
(4) Smart Word Finder
Overview of HowNet
Components of HowNet
 Nature of HowNet
 Theory of HowNet
 Guidelines of Design
 Sememes and Relations

Backup files needed
HowNet Browser (desktop)
Relevant concept field (desktop) – “行”
Similarity computing (desktop) – National Digital Archives Program (directory
“ontology”)
Prof. Huang’s comment on HowNet (desktop)
Under U32: Taxonomy Event Relation & Role Shifting
Taxonomy Typical Actors
Papers (Applications about HowNet)