STATISTICAL ANALYSIS OF JAPANESE CHARACTERS

by Takushi Tanaka
The National Language Research Institute
3-9-14 Nishigaoka, Kita-ku, Tokyo

Summary

The purpose of this study is to analyze the statistical properties of Japanese characters for computer processing. Sentences in high school textbooks and newspapers were investigated. This paper covers the following points: the number of different words written with each character, the position of characters in a word, the relation between word boundaries and character strings, the relation between parts of speech and patterns of character strings, and the relation between parts of speech and each character. The results of these investigations can be applied to the practical processing of written Japanese.

The following Japanese character strings, (A) to (D), are the same sentence written using KANJI to different degrees. (D) is quoted from a high school textbook (world history), while (A), (B) and (C) were transliterated from (D) by computer [1, 2].

[Example of Japanese sentence: character strings (A) to (D)]

1. Introduction

English and Japanese differ in several respects in the information processing of natural language. The first concerns the number of characters: more than 2,000 characters are used to write Japanese. The second concerns the way of writing: a Japanese sentence consists of a continuous character string without any space between words. The third concerns word order and other syntactic features. Among these aspects, the second and third are closely related to the characters.

Japanese characters are of three kinds. KANJI (Chinese characters) are used to write nouns and the principal part of a predicate, and express the concepts contained in the sentence. HIRAGANA (traditional Japanese characters) are used to write conjunctions, adverbs, JODOSHI (which mainly express the modalities of a predicate) and JOSHI (postpositions, which mainly express case relations). KATAKANA (traditional Japanese characters) are used mainly as phonetic signs to write foreign words. Accordingly, Japanese characters are elements of words and, at the same time, they characterize the syntactic or semantic classes of words and mark word boundaries in a character string.

(A) is written in KATAKANA (for a single word) and HIRAGANA (the rest) without using KANJI. (B) is written in HIRAGANA, KATAKANA and 200 KANJI of high frequency in Japanese writing. (C) is written in HIRAGANA, KATAKANA and the so-called educational KANJI (996 characters). Pupils in the lower grades of elementary school tend to write sentences like (A). The older they get, the more KANJI they learn, and in high school they begin to write sentences like (D).

Sentences like (A) are very difficult to read, because we cannot easily find the word boundaries. On the other hand, (B), (C) and (D) are progressively less difficult, because the KANJI in a character string let us find word boundaries easily. Boundaries between a HIRAGANA part and a KANJI part indicate word boundaries in many cases. We can also grasp the main concepts of a sentence by focusing our attention on its KANJI parts.

Therefore, it is very important to use HIRAGANA and KANJI appropriately in a character string. It is, however, hard to say that rules for the appropriate use of HIRAGANA and KANJI have been established. We therefore need to study the actual use of Japanese characters further, because an explicit account of the rules for their appropriate use is a prerequisite for information processing of commonly written Japanese.

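The remark that a boundary between a HIRAGANA part and a KANJI part often indicates a word boundary can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names `kind` and `split_at_hk`, the Unicode ranges used to recognize each kind of character, and the sample sentence are all assumptions made for illustration.

```python
def kind(ch: str) -> str:
    """Classify one character: 'H' HIRAGANA, 'I' KATAKANA, 'K' KANJI, 'O' other.

    The Unicode ranges are an assumption of this sketch.
    """
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "H"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:  # KATAKANA and the long-vowel mark
        return "I"
    if 0x4E00 <= code <= 0x9FFF:
        return "K"
    return "O"


def split_at_hk(text: str) -> list[str]:
    """Cut the string before every KANJI that directly follows a HIRAGANA."""
    pieces, start = [], 0
    for i in range(1, len(text)):
        if kind(text[i - 1]) == "H" and kind(text[i]) == "K":
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


# The same hypothetical sentence, with and without KANJI.
print(split_at_hk("私は東京の学校へ行く"))
print(split_at_hk("わたしはとうきょうのがっこうへいく"))
```

With KANJI the string is cut into phrase-like pieces; written only in HIRAGANA it yields no change point at all, which is the reading difficulty described for sentence (A).
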
2. Outline of Japanese characters

Fig.1 illustrates the rate of total characters contained in the high school textbooks (9 subjects × 1/20 sampling). The data contain 48,096 characters in total [3]. HIRAGANA takes first place, accounting for 47.1%. According to the results of Nakano's study, which is presented in these proceedings, KANJI takes first place in newspapers, because newspapers carry TV programs and small advertisements, both of which are written mainly in KANJI [4].

[Fig.1 Rate of total characters (100% = 48,096). HIRAGANA 47.1%, KANJI 36.3%; the remainder is shared by KATAKANA, alphabet, numerals ('0-9') and symbols.]

Fig.2 illustrates the rate of different characters in the textbook data. The data contain 1,525 different characters. KATAKANA and HIRAGANA each consist of 47 basic characters; however, the data also contain variations such as small letters and letters with special symbols, so that both kinds of KANA exceed 70. Most HIRAGANA and KATAKANA appeared in the textbook data. The data contain 1,312 different KANJI. The more data is investigated, the more KANJI appear, and the rate of KANJI in the graph increases.

[Fig.2 Rate of different characters (100% = 1,525). KANJI 86.0%; the remainder is shared by HIRAGANA, KATAKANA, alphabet, numerals ('0-9') and symbols.]

According to the investigation of Nomura, 3,213 KANJI were found in newspapers [5]. The largest Japanese KANJI dictionary (edited by Morohashi) contains about 50,000 characters [6].

Fig.3 shows the relation between frequency and order of frequency for each kind of character. From Fig.3 we see that a few HIRAGANA have high frequency. They play an important role in writing grammatical elements in a sentence, such as JOSHI and JODOSHI.

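The two rates of Fig.1 and Fig.2 differ only in what is counted: every occurrence of a character (total characters) or each character once (different characters). The following is a minimal sketch of that distinction, not the program used for the study. The function names, the Unicode ranges and the sample text are assumptions made for illustration.

```python
from collections import Counter


def kind(ch: str) -> str:
    """Rough class of one character, by Unicode range (an assumption of this sketch)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "HIRAGANA"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:
        return "KATAKANA"
    if 0x4E00 <= code <= 0x9FFF:
        return "KANJI"
    if ch.isdigit():
        return "numeral"
    if (ch.isascii() and ch.isalpha()) or 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return "alphabet"
    return "symbol"


def script_rates(text: str):
    """Return (rate of total characters, rate of different characters) in percent."""
    chars = [ch for ch in text if not ch.isspace()]
    total = Counter(kind(ch) for ch in chars)           # every occurrence counts
    different = Counter(kind(ch) for ch in set(chars))  # each character counts once

    def percent(counts):
        n = sum(counts.values())
        return {k: 100 * v / n for k, v in counts.items()} if n else {}

    return percent(total), percent(different)


# Hypothetical sample text, far smaller than the 48,096 characters of the study.
total_rate, different_rate = script_rates("私は東京の学校へ行く。私はテレビを見る。")
print(total_rate)
print(different_rate)
```

Because a handful of HIRAGANA recur constantly while most KANJI occur rarely, the share of HIRAGANA is larger among total characters and the share of KANJI is larger among different characters, as in the two figures.
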
[Fig.3 Frequency and their order. X: order; Y: frequency; curves for HIRAGANA, KANJI, KATAKANA, numerals and alphabet.]

Fig.4 shows the relation between the order of frequency and the total number of characters up to that order. In this graph, we see that about twelve different HIRAGANA account for 50% of total HIRAGANA, and about 120 different KANJI account for 50% of total KANJI.

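The reading of Fig.4 (how many of the most frequent characters are needed to reach half of all occurrences) can be sketched as a cumulative count over characters sorted by frequency. The function name `chars_needed`, its `share` parameter and the toy input are assumptions for illustration only.

```python
from collections import Counter


def chars_needed(chars, share=0.5):
    """Smallest number of most frequent characters whose occurrences reach `share` of the total."""
    counts = Counter(chars)
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return 0  # empty input


# Toy HIRAGANA sequence: 'の' occurs 4 times, 'に' twice, 'は' once.
sample = "ののののにには"
print(chars_needed(sample, 0.5))
print(chars_needed(sample, 0.8))
```

Applied separately to the HIRAGANA and to the KANJI of a corpus, the same function would give the two figures quoted above (about twelve and about 120 in the textbook data).
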
3. Number of different words written with each character

As we have more than 50,000 characters, it is necessary to decide their degree of importance. In order to decide the degree, two criteria are assumed here. One is the frequency of the character. The other is the number of different words in which the same character is used. A similar concept has been proposed by A. Tanaka [7].

In Fig.5, axis X represents the frequency of the character as the first criterion. Axis Y represents the number of different words in which the same character is used. The graph shows the distribution of the characters in the textbooks except KANJI. Each character on Y=1 is used for only one word. For instance, HIRAGANA 'を'(o) on Y=1 is used exclusively for one word (one of the case-JOSHI, indicating the accusative case). Each character on Y=X is used for a new word in every occurrence of the character.

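The two criteria can be computed together from a text that has already been divided into words. The sketch below assumes such a word list as input; the function name `char_word_stats` and the sample words are hypothetical and are not taken from the data of this study.

```python
from collections import Counter, defaultdict


def char_word_stats(words):
    """Map each character to (X, Y): X = frequency, Y = number of different words using it."""
    frequency = Counter()
    different_words = defaultdict(set)
    for word in words:
        for ch in word:
            frequency[ch] += 1
            different_words[ch].add(word)
    return {ch: (frequency[ch], len(different_words[ch])) for ch in frequency}


# Hypothetical segmented text (one list item per word occurrence).
words = ["本", "を", "読む", "水", "を", "飲む", "逮捕", "逮捕"]
stats = char_word_stats(words)
print(stats["を"])  # occurs twice, always as the same word: lies on Y=1
print(stats["む"])  # occurs twice, each time in a different word: lies on Y=X
```

A character with Y=1 belongs to a single word however often it occurs, while a character with Y=X appears in a different word at every occurrence.
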
[Fig.4 Order and total number up to the order. X: order; Y: total number; curves for HIRAGANA, KANJI and KATAKANA.]

[Fig.5 Distribution of characters except KANJI. X: frequency; Y: number of different words; symbols distinguish HIRAGANA, KATAKANA, alphabet, and numeral or symbol.]

[Fig.6 Distribution of KANJI for daily use. X: frequency; Y: number of different words.]

[Fig.7 Distribution of KANJI not for daily use. X: frequency; Y: number of different words.]

KATAKANA appear near Y=X, because KATAKANA are mainly used for writing proper nouns of foreign words, and the same word of this category does not appear frequently. HIRAGANA 'る'(ru), 'い'(i), 'し'(shi), 'つ'(tsu), 'か'(ka) and 'く'(ku) are localized on the upper right side. These are often used for writing some parts of the inflectional forms of verbs. 'い'(i), 'か'(ka) and 'く'(ku) are also often used for writing some parts of the inflectional forms of adjectives. 'の'(no), 'に'(ni), 'を'(o), 'は'(wa), 'と'(to), 'が'(ga) and 'で'(de) on the right side are frequently used for JOSHI (postpositions, expressing case relations or other grammatical relations). 'た'(ta) on the upper right side is often used for the JODOSHI of the past tense. 'な'(na) on the upper right side is often used for the initial syllable of the JODOSHI of negation.

Fig.6 and Fig.7 show the same investigation into the KANJI of newspapers (the original work was carried out by Nomura) [5]. Fig.6 shows the distribution of the so-called "TOYOKANJI" selected by the Japanese government for daily use in 1946. The upper right area of the graph is occupied by the so-called educational KANJI. Each KANJI on Y=1 is used for only one word (e.g. '逮'(tai) for '逮捕' (taiho: arrest), '貿'(bou) for '貿易' (boueki: trade), '械'(kai) for '機械' (kikai: machine)). As in Fig.5, characters used for persons' names are localized near Y=X. Fig.7 shows the distribution of KANJI other than TOYOKANJI. Most of the characters in the upper right part of the graph are used for persons' names or for place names (e.g. '藤' and '崎' for '藤崎' (Fujisaki: person), '岡' for '福岡' (Fukuoka: place)).

4. Position of characters in a word

For the information processing of Japanese sentences, it is important first to find word boundaries in a continuous character string. If some characters always come at the initial position or the final position of a word, these characters can be used to find the boundaries.

Fig.8 shows the position of characters in words. In the textbook data, there are 399 characters which are used for more than 6 different words. The characters on X=100 always come at the initial position of a word. The characters on X=0 are never used at the initial position. The characters on Y=100 always come at the final position of a word. The characters on Y=0 are never used at the final position.

KANJI, represented with dots, spread over the area of Y ≥ -X+100. Namely, the value of X+Y is always greater than or equal to 100%. In other words, the rate of the initial position plus the rate of the final position is always greater than or equal to 100%. It means that all KANJI have a tendency to be used at the initial position, the final position, or both positions (as a word of one character) of a word (short unit *). Most KANJI on Y = -X+100 form only words of two KANJI: every occurrence in a two-character word is either initial or final, so the two rates sum to exactly 100%. This tendency originates in the composition of words written in KANJI. This matter will be observed in section 6.

The group of HIRAGANA in the upper right area has a tendency to be used for JOSHI. KATAKANA, plotted with a separate symbol, appear around the lower left area of the graph. Words written in KATAKANA are relatively long (see section 6). Therefore, the rates of the initial position and the final position are relatively low.

[Fig.8 Position of character in a word. X: rate of initial position (%); Y: rate of final position (%).]

* word (long unit): 国立国語研究所 (National-language-research-institute)
  word (short unit): 国立, 国語, 研究, 所 (National, Language, Research, Institute)

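The two rates of Fig.8 can be sketched as follows, assuming a text already divided into words (short units). The function name `position_rates` and the sample words are hypothetical and serve only to show how X and Y are obtained for each character.

```python
from collections import Counter


def position_rates(words):
    """Map each character to (X, Y): percent of its occurrences that are word-initial / word-final."""
    total, initial, final = Counter(), Counter(), Counter()
    for word in words:
        for i, ch in enumerate(word):
            total[ch] += 1
            if i == 0:
                initial[ch] += 1
            if i == len(word) - 1:
                final[ch] += 1  # a one-character word counts as both initial and final
    return {ch: (100 * initial[ch] / total[ch], 100 * final[ch] / total[ch]) for ch in total}


# Hypothetical word list.
rates = position_rates(["学校", "大学", "学", "を", "テレビ"])
print(rates["校"])  # only word-final
print(rates["学"])  # initial, final, and a one-character word: X+Y exceeds 100
print(rates["レ"])  # middle of a KATAKANA word: both rates are 0
```

In words of one or two characters every occurrence is initial, final or both, which is why X+Y cannot fall below 100 for such characters; only characters that occur inside longer words, like the KATAKANA in the example, can fall below that line.
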
5. Relation between word boundaries and character strings

A Japanese sentence fundamentally belongs to pattern (1).

(Simple Japanese grammar)
    N1 J1 N2 J2 ... V        (1)
    Ni : Noun
    Ji : Case-JOSHI
    V  : Verb

Many nouns (Ni) tend to be written in KANJI (see next section). All the case-JOSHI are written in HIRAGANA. Stems of verbs are often written in KANJI and their inflectional parts in HIRAGANA. So both a phrase Ni Ji and V have a pattern in which the initial position is occupied by a KANJI and the final position by a HIRAGANA. Therefore, the changing point from HIRAGANA to KANJI in a character string is always regarded as a word boundary. On the other hand, a word boundary is not always a changing point from HIRAGANA to KANJI. One of the exceptions is Japanese nouns (long unit) which are composed of a concatenation of nouns (short unit) (see the footnote * in section 4).

Fig.9 shows one of the relations between word boundaries and character strings. The graph contains 902 KANJI (total: 1,546) in the textbooks. The axis X represents the rate at which the changing points from HIRAGANA to KANJI correspond to word boundaries. Each KANJI on X=100 is considered the initial character of a word if it is preceded by a HIRAGANA. The axis Y represents the rate at which the word boundaries correspond to changing points from HIRAGANA to KANJI. Each KANJI is plotted as a symbol whose diagonal length is proportional to the frequency of the KANJI. In the graph, a diagonal as long as 10% of an axis corresponds to a frequency of 100.

6. Parts of speech and patterns of character strings

In the investigation of newspapers, 20 parts of speech were assumed [8]. Each part of speech has a particular pattern of character strings. In computer processing of Japanese sentences, it is possible to decide the part of speech of a word based on knowledge of such patterns. In Fig.10, 'K' in the pattern column represents a KANJI, 'H' represents a HIRAGANA, and 'I' represents a KATAKANA. The left side of the bar chart shows the rate of total words. The right side shows the rate of different words.

Fig.10-(1) shows the patterns of common nouns. The left side of the bar chart shows that the KK-pattern accounts for 68.0% of total common nouns in the newspapers. The right side shows that the KK-pattern accounts for 68.5% of different common nouns in the newspapers.

Fig.10-(2) shows the patterns of proper nouns. Most proper nouns also have KANJI strings. The rest have KATAKANA strings expressing foreign words.

Fig.10-(3) shows the patterns of verbal nouns, which change to verbs with the succeeding characters 'せ'(se), 'さ'(sa), 'し'(shi), 'す'(su), 'する'(suru), 'すれ'(sure), 'せよ'(seyo). The KK-pattern accounts for 97.1% of total verbal nouns. If a KK-pattern followed by 'せ'(se), 'さ'(sa), 'し'(shi) ... is found, such a character string can be treated as a form of this kind.

Fig.10-(4) shows the patterns of verbs. The verb of the H-pattern is often used with a preceding verbal noun. Most different verbs have the KH-pattern.

Fig.10-(5) shows the patterns of adjectives. Most adjectives are written with the KH-pattern or the KHH-pattern.

Fig.10-(6) shows the patterns of adverbs. Most adverbs are written with the HHH-pattern or the HHHH-pattern; namely, they are written in HIRAGANA.

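The patterns of section 6 can be derived mechanically from a word, and the two sides of each bar chart in Fig.10 follow from counting word occurrences or different words. The sketch below illustrates this under stated assumptions: the function names, the Unicode ranges, the class 'O' for other characters and the sample words are all hypothetical, and `is_verbal_noun_form` is only one possible reading of the rule given for verbal nouns.

```python
from collections import Counter

# Characters that turn a verbal noun into a verb, as listed for Fig.10-(3).
SURU_FORMS = ("せ", "さ", "し", "す", "する", "すれ", "せよ")


def kind(ch: str) -> str:
    """'K' KANJI, 'H' HIRAGANA, 'I' KATAKANA, 'O' other (ranges are an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "H"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:
        return "I"
    if 0x4E00 <= code <= 0x9FFF:
        return "K"
    return "O"


def pattern(word: str) -> str:
    """Pattern of character string, e.g. 'KK', 'KHH', 'III'."""
    return "".join(kind(ch) for ch in word)


def pattern_rates(words):
    """Return (rate among total words, rate among different words) in percent."""
    total = Counter(pattern(w) for w in words)
    different = Counter(pattern(w) for w in set(words))

    def percent(counts):
        n = sum(counts.values())
        return {k: 100 * v / n for k, v in counts.items()} if n else {}

    return percent(total), percent(different)


def is_verbal_noun_form(string: str) -> bool:
    """True if the string is a KK-pattern followed by one of SURU_FORMS."""
    return pattern(string[:2]) == "KK" and string[2:] in SURU_FORMS


# Hypothetical occurrences of common nouns.
total_rate, different_rate = pattern_rates(["世界", "世界", "言語", "駅", "テレビ"])
print(pattern("美しい"), pattern("ニュース"))
print(total_rate, different_rate)
print(is_verbal_noun_form("勉強する"), is_verbal_noun_form("勉強"))
```

A pattern alone does not fix the part of speech, since the same pattern can occur in several parts of speech; the sketch only reproduces the counting behind the figure.
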
[Fig.9 Character string and boundary. X: rate of word boundary (%); Y: rate of H-K boundary (%).]

[Fig.10 Pattern of character string of word. For each part of speech, the bar chart gives the rate of total words (left) and the rate of different words (right) for each pattern. Patterns and example glosses:
  (1) Common noun: KK (language, world) ranks first; the other patterns shown include K (station, person), III (television, hotel) and IIII (news, speed)
  (2) Proper noun: 1 KK (Tokyo, Nippon); 2 KKK (Chiyoda, Akihabara); 3 K (U.S.A., England); 4 IIII (France, Moscow); 5 III (Deutsch, TOYOTA); OTHERS (New York)
  (3) Verbal noun: 1 KK (study, success); 2 HHHH (amaze, greeting); 3 III (lead, plus); 4 OTHERS (shelving)
  (4) Verb: 1 H ('si', 'sa', 'su'); 2 KH (open, write); 3 HH (do, say); 4 HHH (make, understand); 5 KHH (continue, give); 6 OTHERS (prepare)
  (5) Adjective: 1 KH (many, strong); 2 KHH (beautiful, big); 3 HHH (cruel, hard); 4 HHHH (merry, tasty); 5 HHHHH (difficult); 6 OTHERS (funny)
  (6) Adverb: 1 HHH (fairly, already); 2 HHHH (each, almost); 3 HH (yet, now); 4 K (about); 5 KH (again); 6 KHH (first, immediately); 7 OTHERS (simultaneously)]

7. Relation between each character and part of speech

We have assumed patterns of character strings, and the patterns are basically available for classifying parts of speech in actual data. However, the patterns do not provide sufficient criteria for the classification. For example, the same pattern was found among different parts of speech. In order to obtain more accurate results, we analyzed the relations between each character and each part of speech in the newspaper data (restriction: word frequency ≥ 3). In Fig.11 the axis Y represents the total number of occurrences of the character as the last KANJI in a word.

The axis X shows the rate at which the KANJI is used for verbs. For a KANJI on X=100, every occurrence as the last KANJI in a word is in a verb. The reliability of the value on axis X increases with the value on axis Y. In the lower area of the graph, the values on axis X appear discrete because the data are scarce.

[Fig.11 KANJI for verbs. X: rate of verb (%); Y: frequency. Inset: the last KANJI in a word, illustrated on the pattern KKHH.]

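The quantities plotted in Fig.11 can be sketched from a list of words tagged with their part of speech. This is an illustration under assumptions, not the procedure of the study: the function names, the tag string "verb", the Unicode range for KANJI and the sample data are hypothetical, and the restriction on word frequency is not applied.

```python
from collections import Counter


def last_kanji(word: str):
    """Return the last KANJI of a word, or None if the word contains no KANJI."""
    for ch in reversed(word):
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # assumed range for KANJI
            return ch
    return None


def verb_rates(tagged_words):
    """Map each KANJI to (X, Y): X = percent of its occurrences in verbs, Y = frequency.

    Only occurrences as the last KANJI in a word are counted.
    """
    total, verb = Counter(), Counter()
    for word, part_of_speech in tagged_words:
        ch = last_kanji(word)
        if ch is None:
            continue
        total[ch] += 1
        if part_of_speech == "verb":
            verb[ch] += 1
    return {ch: (100 * verb[ch] / total[ch], total[ch]) for ch in total}


# Hypothetical tagged words.
data = [("書く", "verb"), ("書いた", "verb"), ("読書", "noun"), ("行く", "verb")]
rates = verb_rates(data)
print(rates["書"])
print(rates["行"])  # X=100 from a single occurrence: a low Y makes X unreliable
```

With few occurrences X can only take a few distinct values (0, 50, 100 and so on), which corresponds to the discrete appearance of the lower area of the graph.
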
8. Conclusion

These analyses are preliminary work toward making a character dictionary that contains statistical data. We plan to use the dictionary for computer processing of various kinds of written Japanese.

[Fig.12 KANJI for adjectives. X: rate of adjective (%); Y: frequency.]

References

[1] T. Tanaka, "A simulation system for transliteration of writing form of Japanese", Mathematical Linguistics, Vol.11, No.15, 1978
[2] T. Tanaka, "Transliteration of Japanese writing", bit, Vol.10, No.15, 1978
[3] T. Tanaka, "Statistics of Japanese characters", Studies in computational linguistics, Vol. X (National Language Research Inst. Report-67), 1980
[4] H. Nakano et al., "An automatic processing of the natural language in the word count system" (in these proceedings)
[5] M. Nomura et al., "A study of Chinese characters in modern newspapers", N.L.R. Inst. Report-56, 1976
[6] T. Morohashi, "DAIKANWA dictionary", Taishu-kan Book Co., 1971
[7] A. Tanaka, "A statistical measurement on survey of KANJI", Studies in computational linguistics, Vol. VIII (N.L.R. Inst. Report-59), 1976
[8] T. Ishiwata, A. Tanaka, H. Nakano et al., "Studies on the vocabulary of modern newspapers", Vol.1, Vol.2, N.L.R. Inst. Report-37, 38, 1970, 1971