Computer assisted writing system
Chien-Liang Liu, Chia-Hoang Lee, Ssu-Han Yu, Chih-Wei Chen
Department of Computer Science, 1001 University Road, Hsinchu 300, Taiwan, ROC
Keywords: Computer assisted writing; Natural language processing
Abstract
In this paper, we design and implement a computer assisted writing system whose application domain is the love letter. The system includes a text generation module, a synonym substitution module and a simile expression module. The text generation model is built on a keyword generation model and a sentence generation model. The keyword generation model extracts important keywords from the corpus, and these keywords become the backbone of the template. The sentences between keywords make up the content of the template, and candidate sentences are retrieved from the corpus based on statistical analysis. Synonym substitution and simile expression are two modules that enrich the content of the text. Synonyms are retrieved from the Internet, and a simile expression discovery mechanism is proposed to collect related simile expressions. The prototype system works well in the love letter application domain, and the concept of this research could be extended to other domains with minor modification.
1. Introduction
Writing ability plays an important role in language learning. It not only improves users’ writing skills but also helps them develop the ability to communicate. In recent years, essays have become a major part of formal education, and many exams require students to write. Although writing is important, composing an article from scratch is difficult for many users. Moreover, writing matters not only in school but also in daily life. For example, when people want to write a love letter, writing ability helps them compose a good one.
In general, reading and writing are closely related, since users who read many articles have more material with which to compose their own. Learning from examples also helps users take the work of a master writer and reuse its structure and patterns in their own articles. With its growing popularity, the Internet has become a new knowledge source. The concept of “Web 2.0”, which is the business revolution in the computer industry caused by the move to the Internet as a platform, facilitates communication, information sharing, interoperability and collaboration on the World Wide Web. Many people are willing to share their ideas and works with others through new services such as social-networking sites, video-sharing sites, Wikipedia and blogs. Therefore, the Internet can be regarded as a huge database from which many literary works can be obtained. Meanwhile, computers have become essential equipment, so it is appropriate to use interactive computer assisted writing tools integrated with Internet data to assist users in essay writing.
In practice, enhanced word processors, spelling checkers and grammar checkers can assist users in the process of writing. Over the last few decades, much research has been done on spelling and grammar checkers, and these checkers have been integrated into many word processors. These tools can correct users’ writing errors, but they cannot assist users in writing an article from scratch. For example, if a user would like to compose a love letter, the biggest problem is not spelling or grammar errors but how to organize the content and how to choose appropriate sentences to express his/her feelings. In addition, people tend to use sentences or terms that have appeared in other articles. Moreover, templates can provide a framework and reduce the physical effort spent on writing so that people can pay attention to organization and content. These observations motivate us to construct a computer assisted writing system that helps users compose a love letter. The concept of this research could also be extended to other domains with minor modification.
The template content is generated by a text generation model based on statistical analysis. In essence, producing high-quality, fluent text requires understanding the meaning of the text. However, it is still infeasible to apply a natural language understanding approach to text generation, because natural language understanding requires extensive knowledge about the outside world and the ability to manipulate it. In theory, the task of a text generation system can be characterized as a mapping from some input data to an output text. Meanwhile, the job of machine translation is to render in one language the meaning expressed by a passage of text in another language. Therefore, the input data of a text generation system is similar to the source language in machine translation, and the text generation process is similar to the translation process. Many problems within natural language processing apply to both generation and understanding, and statistical natural language processing (Manning & Schuetze, 1999) uses stochastic, probabilistic and statistical methods to resolve some of the difficulties. A statistical approach can be applied to any specific pair of languages without linguistic rules. On the other hand, rule-based or grammar-based translation systems require the manual development of linguistic or grammatical rules, so these approaches are costly and cannot be applied to other languages. Therefore, the success of the statistical approach in machine translation motivates us to adopt a statistical approach to text generation.
Based on the statistical approach and the observations above, the computer assisted writing system proposed in this paper is based on text generation. The text generation model is based on a keyword generation model, and a keyword extraction algorithm is proposed to discover special keywords in the text corpus. In addition, a keyword expansion model is proposed to expand the core keywords. The expanded keywords act as the backbone of the text, and a statistical mechanism is adopted to select appropriate sentences from the text corpus to fill in the content between keywords. Moreover, synonyms and simile expressions can enrich the content of the article and enhance its variety. Hence, a synonym substitution module and a simile expression module are used to decorate the text in the system. In the synonym substitution module, the synonyms of a term are retrieved from the Internet. In the simile expression module, we propose a simile expression discovery mechanism, which adopts 14 simile terms as seeds to collect related simile expressions. The experiment shows that this approach yields interesting simile expressions. Our experience with a first system shows that the computer assisted text generation system works well and could help students develop the ability of essay writing by learning from examples.
The rest of the paper is structured as follows. Section 2 surveys related research on spelling and grammar checkers and on natural language generation. Section 3 describes the text generation model, which can be decomposed into a keyword generation model and a sentence generation model. Section 4 presents the system architecture and design. Section 5 describes the experiment and evaluation result. Finally, Section 6 contains the conclusion.
2. Related work
2.1. Spelling checker and grammar checker
Over the last few decades, much research has been done on spelling and grammar checkers, and these checkers have been integrated into many word processors. Genthial and Courtin (1992) proposed an architecture of a computer assisted writing system which includes morphological parsing and generation, lexical correction techniques, a syntactic parser, and document editing and exporting. Kukich (1992) focused on non-word error correction, isolated-word error correction and context-dependent word correction to correct words in text. Bustamante and Leon (1996) presented a grammar and style checker for Spanish and Greek native writers. Paggio (2000) developed a spelling and grammar corrector for Danish and addressed in particular the issue of how a form of shallow parsing is combined with error detection and correction for the treatment of context-dependent spelling errors. Although spelling and grammar checkers can help users correct spelling and grammar errors, these tools cannot help users organize and compose an article from scratch.
2.2. Text generation
In practice, a text generation system, which investigates how computer programs can be made to produce high-quality natural language text, can provide users with a template of the article, which helps them finish it. Text generation techniques have been applied to many application domains. Goldberg, Driedger, and Kittredge (1994) proposed to generate textual weather forecasts from representations of graphical weather maps. Meanwhile, Reiter, Mellish, and Levine (1995) proposed to use natural language generation (NLG) techniques to automatically produce technical documentation from a domain knowledge base and linguistic and contextual models. Buchanan et al. (1995) built an intelligent explanation module that produces an interactive information sheet containing explanations in everyday language that are tailored to individual patients, and responds intelligently to follow-up questions about topics covered in the information sheet. Williams and Reiter (2008) proposed SkillSum, an NLG system that generates personalized feedback reports on basic skills for low-skilled readers. In the following sections, we give an overview of text generation systems that adopt different approaches.
2.2.1. Corpus-based
Langkilde and Knight (1998) introduced Nitrogen, a system that implements a new style of generation in which corpus-based n-gram statistics are used in place of deep, extensive symbolic knowledge to provide very large scale generation. However, the quality of the output is limited by the use of only bigram word statistical information, which cannot handle long-distance agreement or distinguish likely collocation from unlikely grammatical structure. The experiments with Nitrogen showed that corpus-based knowledge greatly reduced the need for deep, hand-crafted knowledge. This knowledge, in the form of n-gram (word-pair) frequencies, could be applied to a set of semantically related sentences to help sort good ones from bad ones.
2.2.2. Keyword-based
Uchimoto, Isahara, and Sekine (2002) proposed to generate sentences from “keywords” or “headwords”. This model considers not only n-gram information, but also dependency information between words. The construction part generates text sentences in the form of dependency trees by using complementary information to replace information that is missing to generate natural text sentences based on a particular monolingual corpus. The evaluation part consists of a model to generate an appropriate text when given keywords.
2.2.3. Template-based
In practice, some simple approaches such as canned text systems and template systems could be used to generate high-quality
text. For example, canned text systems could be used to produce
error messages, warnings, letters, etc. Meanwhile, template systems could be used in the circumstances where a message must
be produced several times with slight alterations. Template systems are often adopted in form letters, in which a few open fields
are filled in specified, constrained ways. The template approach is
used mainly for multi-sentence generation, particularly in applications whose texts are fairly regular in structure. Templates only
work in very controlled or limited situations. They cannot provide
the expressiveness, flexibility or scalability that many real domains
need (Langkilde & Knight, 1998).
3. Text generation model
The job of machine translation is to render in one language the
meaning expressed by a passage of text in another language. Statistical machine translation is a machine translation paradigm where
translations are generated based on the statistical models whose
parameters are derived from the analysis of bilingual text corpora.
Meanwhile, the task of a natural language generation system can
be characterized as mapping from some input data to an output
text. Therefore, input data in text generation is similar to the
source language in machine translation and text generation process is similar to translation process.
In machine translation, the statistical approach has been widely used and has shown its capability. Essentially, a statistical approach can be applied to any specific pair of languages without linguistic rules. On the other hand, rule-based or grammar-based translation systems require the manual development of linguistic or grammatical rules, so these approaches are costly and cannot be applied to other languages. The ideas behind statistical machine translation come out of information theory. Given a French string f, the job of the translation system is to find the string e that the native speaker had in mind when he produced f. In other words, the translation process can be characterized using Bayes’ theorem, as shown in Eq. (1) (Brown, Pietra, Pietra, & Mercer, 1993).
$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)} \tag{1}$$
Since the denominator here is independent of e, finding ê is the same as finding e so as to make the product P(e)P(f | e) as large as possible, and Eq. (2) shows the fundamental equation of machine translation.
$$\hat{e} = \arg\max_{e} P(e)\,P(f \mid e) \tag{2}$$
According to our observation, when a student would like to write an essay about a specific subject, he/she tends to start from the concepts related to this subject. In addition to these key concepts, people also tend to use sentences or terms that have appeared in other articles. Motivated by these observations and by statistical machine translation, text generation in this paper adopts the model shown in Eq. (3), where T and K represent text and keywords, respectively. P(T | K) represents a text generation model, which indicates that the text T will be generated when given a set of keywords K. Similarly, this model can be characterized as Eq. (4), where P(K | T) is a keyword production model and P(T) is a language model. The keyword production model outputs the main keywords when given the text, and the sentence generation model is used as the language model in this paper.
$$\hat{T} = \arg\max_{T} P(T \mid K) = \arg\max_{T} \frac{P(T)\,P(K \mid T)}{P(K)} \tag{3}$$
$$\hat{T} = \arg\max_{T} P(T)\,P(K \mid T) \tag{4}$$
In this paper, we propose a text generation model and develop a
text generation system that uses the model. Based on the above
equations, the text generation model includes two parts: keyword
generation model and sentence generation model. Keyword generation model and sentence generation model will be described in
the following sections.
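To make Eq. (4) concrete, the following is a minimal toy sketch, not the system's implementation. It assumes a smoothed unigram language model for P(T) and a simple presence-based stand-in for P(K | T); both models, and the function names `train_unigram`, `log_keyword_given_text` and `best_text`, are illustrative assumptions that do not come from the paper.

```python
import math
from collections import Counter

def train_unigram(corpus_sentences):
    """Build a toy add-one smoothed unigram model and return a log P(T) scorer."""
    counts = Counter(w for s in corpus_sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def log_p_text(sentence):
        return sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in sentence.split())

    return log_p_text

def log_keyword_given_text(keywords, sentence, eps=1e-6):
    """Toy stand-in for log P(K | T): a keyword is likely only if T contains it."""
    words = set(sentence.split())
    return sum(math.log(1 - eps if k in words else eps) for k in keywords)

def best_text(candidates, keywords, log_p_text):
    """Eq. (4): pick the candidate T maximizing P(T) * P(K | T), in log space."""
    return max(candidates,
               key=lambda t: log_p_text(t) + log_keyword_given_text(keywords, t))

corpus = ["i miss you", "i love you", "you are my heart"]
log_p = train_unigram(corpus)
print(best_text(["the weather is nice", "i miss you"], ["miss"], log_p))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. A unigram model favors short sentences, so a practical system would need a richer language model.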
4. System architecture
Fig. 1 shows the text generation system flow, which includes a keyword extraction module, a keyword expansion module, a candidate sentence selection module and a text generation module. The keyword extraction module extracts candidate keywords and core keywords. Core keywords come from the special keywords and become the input of the system. In this paper, a modified strict phrase likelihood ratio (SPLR) algorithm is proposed to retrieve the special keywords of the text. These special keywords provide an overview of the template, and users can choose an appropriate template based on them. Meanwhile, candidate keywords are used to construct the skeleton of the text in the keyword expansion module, and appropriate sentences are selected to complete the content of the text. Moreover, synonyms and simile expressions are adopted to enrich the content. The process and structure of each module are described in the following sections.
Fig. 1. System flow.
4.1. Keyword extraction
Fig. 2 shows the keyword extraction flow, which includes term segmentation, term extraction and a special keyword scoring mechanism. Unlike English, Chinese does not use spaces as boundaries between words, so Chinese word segmentation is required at this stage. In this paper, the maximum matching method, which extracts the longest meaningful substring, is used to perform Chinese word segmentation. In practice, the Noun and Verb terms in an article can roughly represent the meaning of the article, so Noun and Verb terms are selected as the candidate keyword list in the term extraction step.
Fig. 2. Keyword extraction flow.
Meanwhile, the computer assisted writing system in this paper is based on keywords, and users determine the template from the keyword list. However, the Noun and Verb terms extracted from articles in the same domain will be similar. Therefore, when users would like to determine the template from keywords, it is difficult for them to differentiate the template content from similar keyword lists. Thus, in addition to the keyword list extracted from Noun and Verb terms, special terms are retrieved from the articles, and these terms become the input of the system. In other words, users determine the template content from these special terms.
Chang and Lee (2003) presented and developed a strict phrase likelihood ratio (SPLR) approach to extract Chinese unknown words more efficiently and precisely. In practice, these unknown words can be used to differentiate the articles due to their rareness. A modified SPLR approach is proposed in this paper to calculate term scores, and the terms with higher scores are selected as candidates for the core keyword list. Eq. (5) shows the computation.
$$\mathrm{SPLR}(KW_i) = \frac{tf(KW_i)}{\max\bigl(tf(KW_{i\_L}),\; tf(KW_{i\_R})\bigr)}, \qquad KW_{i\_len} > 1 \tag{5}$$
In Eq. (5), tf(KW_i), tf(KW_i_L) and tf(KW_i_R) represent the term frequency of the keyword, the term frequency of the left-hand side of the keyword and the term frequency of the right-hand side of the keyword, respectively. Meanwhile, since a single Chinese character does not easily express an important concept, only terms with word length larger than 1 are taken into account. As shown in Fig. 3, the SPLR score of the keyword “the president of student association” can be obtained from the frequency information.
Fig. 3. SPLR scoring example.
In the modified SPLR score computation model, a normalization step is required, and the final scores are distributed from 0 to 1. The keywords with the five highest scores become core keywords. If the number of keywords is less than five, candidate keywords are selected as core keywords. The benefit of core keywords is that users can get a rough idea of the final content from them.
4.2. Keyword expansion
As described above, Noun and Verb terms are selected as the candidate keyword list, and the modified SPLR computational model is used to choose the core keywords of the system. In practice, the core keywords alone are not enough to generate an article. Therefore, a keyword expansion process is adopted to expand the core keywords, and Fig. 4 shows the keyword expansion model. In Fig. 4, W1, W2, ..., W10 represent the candidate keywords, and W3 and W8 represent core keywords. The expansion model starts from the core keywords and includes other keywords as text generation keywords. Practically, if all the candidate keywords are included in the text generation keyword set, the variation of the final text will be limited. On the other hand, if there are too few keywords, it is not easy to find appropriate sentences to fill the gaps between the keywords. In the expansion model, the two previous keywords and the two next keywords of each core keyword are selected as generation keywords. Thus, as shown in Fig. 4, W1, W2, W4 and W5 are expanded from W3, while W6, W7, W9 and W10 are expanded from W8. If expanded keywords overlap, an intersection approach is adopted and each overlapped keyword is selected only once. Hence, as shown in Fig. 5, keyword W5 is an overlapped keyword and the final generation keywords are W1, W2, W3, W4, W5, W6, W7, W8 and W9. Based on the core keyword generation model and the keyword expansion model, users can determine the text to be generated from the core keywords, and the system can generate text from the expanded keywords.
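The core keyword selection of Section 4.1 and the expansion rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the frequencies are passed in directly rather than counted from a corpus, min-max scaling is assumed for the normalization step (the text does not name the method), and all function names are hypothetical.

```python
def splr_score(tf_kw, tf_left, tf_right):
    """Eq. (5): keyword frequency over the larger of its left/right frequencies."""
    return tf_kw / max(tf_left, tf_right)

def normalize(scores):
    """Scale scores to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def core_keywords(freqs, candidates, top_n=5):
    """freqs maps keyword -> (tf, tf_left, tf_right).

    Only keywords longer than one character are scored. If fewer than top_n
    special keywords exist, the list is filled from the candidate keywords.
    """
    raw = {k: splr_score(*f) for k, f in freqs.items() if len(k) > 1}
    scores = normalize(raw) if raw else {}
    core = sorted(scores, key=scores.get, reverse=True)[:top_n]
    for kw in candidates:
        if len(core) >= top_n:
            break
        if kw not in core:
            core.append(kw)
    return core

def expand_keywords(candidates, core, window=2):
    """Keep each core keyword plus its two previous and two next candidates.

    Overlapping windows are merged, so an overlapped keyword appears once.
    """
    keep = set()
    for i, kw in enumerate(candidates):
        if kw in core:
            start = max(0, i - window)
            stop = min(len(candidates), i + window + 1)
            keep.update(range(start, stop))
    return [candidates[i] for i in sorted(keep)]

candidates = ["W%d" % i for i in range(1, 11)]
print(expand_keywords(candidates, {"W3", "W7"}))
```

With core keywords W3 and W7 (an assumed configuration in which W5 is shared by both windows), the sketch returns W1 through W9, matching the overlap case described for Fig. 5.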
4.3. Candidate sentence selection and text generation
In an article, keywords are like the backbone of the article, and the sentences between keywords make up the content of the text. Once the expanded keywords have been extracted by the above process, appropriate sentences between each pair of keywords must be determined as well. Eq. (6) shows how to select candidate sentences between two keywords.
$$C = \left\{\, u \in \mathrm{Unit}(i,j) \;\middle|\; \begin{array}{l} \lvert \mathrm{Pre\_KW} \cap \mathrm{Word}(i,j) \cap \mathrm{Next\_KW} \rvert > 0 \\ \mathrm{Word}(i,j)\_len \le \mathrm{Threshold} \end{array} \right\} \tag{6}$$
where i is the index of the article, j is the position index within article i, Pre_KW represents the previous keyword, Next_KW represents the next keyword, Unit(i, j) represents the sentences that appear at the jth position of the ith article, Word(i, j) represents the sentences that appear between Pre_KW and Next_KW in the corpus, and Word(i, j)_len represents the length of Word(i, j). According to Eq. (6), the sentences that appear between two keywords and whose lengths do not exceed the threshold value are selected as candidate sentences. Fig. 6 shows that the sentences appearing between “sensitive” and “spotlight” are the candidate sentences.
Fig. 4. Keyword expansion model.
Fig. 5. Keyword expansion model with overlapped keywords.
Fig. 6. Sentence selection from corpus based on keywords.
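The selection rule of Eq. (6) can be sketched as below. This is a simplified reading, not the system's code: each article is treated as a plain string, the text between the nearest occurrences of the two keywords stands in for Word(i, j), and length is measured in characters. The function name `candidate_sentences` and the sample corpus are hypothetical.

```python
import re

def candidate_sentences(corpus, prev_kw, next_kw, threshold):
    """Collect text found between prev_kw and next_kw, at most `threshold` long."""
    # Non-greedy match: the shortest span from prev_kw to the following next_kw.
    pattern = re.compile(
        re.escape(prev_kw) + r"(.*?)" + re.escape(next_kw), re.DOTALL
    )
    found = []
    for article in corpus:
        for match in pattern.finditer(article):
            between = match.group(1).strip()
            if between and len(between) <= threshold:
                found.append(between)
    return found

corpus = [
    "I am sensitive to the bright spotlight.",
    "sensitive people avoid every long and winding road near the spotlight",
    "no keywords here",
]
print(candidate_sentences(corpus, "sensitive", "spotlight", 20))
```

The second article contains both keywords, but the text between them exceeds the threshold, so it is rejected.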
Based on the above process, a candidate sentence list can be obtained, and these candidate sentences can be combined with the keywords to generate different texts. The main goal of the system is to provide a draft version of the text for users. In addition, users can improve their writing skills by learning from examples. Fig. 7 shows the text generated by the system, and a rough English translation of the content is presented in the Appendix. Practically, the content generated by the system provides a reference template, and users can modify this love letter to meet their requirements.
4.4. Synonym substitution and simile expression
In essence, synonyms play an important role in essay writing. Different terms with similar meanings add variety to the content, so it is better to use such terms within one article. Thus, the system proposed in this paper provides a synonym substitution mechanism to enrich the content. Fig. 8 shows the synonym extraction flow. Taking love letter generation as an example, the data sources are the love letter corpus and the Sinica corpus. As described above, Chinese terms need to be segmented first. In this paper, the Noun and Verb terms are collected as the term set, and these terms are sent to the Sinica “Image Reflection Lake” system to obtain their synonyms. Duplicate terms are filtered out first, and the total number of Noun and Verb terms is 80,972. The terms in the synonym sets can be used to replace terms with similar meanings. Fig. 9 shows a screen shot of the synonym substitution scenario, where different terms can be used to represent “deeply”.
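The substitution step can be sketched as a table lookup. This is a minimal illustration only: the synonym table below is a hypothetical stand-in for the synonym sets retrieved from the external system, English tokens stand in for segmented Chinese terms, and the function names are assumptions.

```python
# Hypothetical synonym table; in the paper the sets come from an external source.
SYNONYMS = {
    "deeply": ["profoundly", "intensely"],
    "miss": ["long for"],
}

def synonym_options(tokens, synonyms):
    """Map each token position that has known synonyms to its alternatives."""
    return {i: synonyms[t] for i, t in enumerate(tokens) if t in synonyms}

def substitute(tokens, position, replacement):
    """Return a copy of the token list with one term replaced by the user's choice."""
    out = list(tokens)
    out[position] = replacement
    return out

tokens = ["I", "love", "you", "deeply"]
options = synonym_options(tokens, SYNONYMS)
print(options)
print(substitute(tokens, 3, options[3][0]))
```

The system only offers the alternatives; the user picks the replacement, which is why `substitute` takes the chosen term as an argument instead of choosing one itself.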
Fig. 7. Love letter generation result.
Generally speaking, a simile is an expression that describes something by comparing it with something else. With the help of similes, the content of the text becomes more vivid. For example, “as white as snow” describes the color white by using snow. In English, a simile is a figure of speech comparing two different things, often introduced with the word “like” or “as”. Therefore, it is very important to find the terms that appear in simile expressions. In this paper, we conducted related surveys, and Fig. 10 shows the Chinese terms that typically appear in simile expressions. All these terms are used as seeds, and the sentences appearing after these terms are collected as simile expressions. As shown in Fig. 11, the sentences after the seeds can be used as simile expressions. In this way, the simile expressions related to “smile”, “will” and “freedom” are obtained, and they can be used in the text to enrich the content. Fig. 12 shows the simile expression scenario, where “brave” could be replaced by “as brave as a warrior”.
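The seed-based discovery mechanism can be sketched as follows. This is an illustration under assumptions: the seed list is a hypothetical English placeholder for the 14 Chinese seed terms of Fig. 10, the text after a seed is cut at the first punctuation mark, and the function name `collect_similes` is not from the paper.

```python
import re

# Punctuation that ends the collected expression (ASCII and full-width forms).
CLAUSE_END = r"[,.;!?，。；！？]"

def collect_similes(sentences, seeds):
    """For each seed term, collect the text that follows it in the corpus."""
    found = {}
    for sentence in sentences:
        for seed in seeds:
            index = sentence.find(seed)
            if index == -1:
                continue
            tail = sentence[index + len(seed):]
            # Keep only the clause right after the seed.
            expression = re.split(CLAUSE_END, tail, maxsplit=1)[0].strip()
            if expression:
                found.setdefault(seed, []).append(expression)
    return found

sentences = [
    "Her smile is like the morning sun, warm and bright.",
    "No comparison in this sentence.",
]
print(collect_similes(sentences, ["is like"]))
```

Only the first occurrence of each seed in a sentence is used here; handling repeated seeds, or indexing the results by the term being described, would be straightforward extensions.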
5. Experiment and result
5.1. Data set
The prototype system developed in this paper is applied to the love letter application domain. The corpus consists of 446 love letters collected from the Internet. In addition to the application domain corpus, the Sinica corpus is adopted to increase the content of the corpus. Based on this corpus, we propose and develop a computer assisted writing system.
Fig. 8. Synonym extraction flow.
Fig. 9. Synonym substitution screen shot.
Fig. 11. Simile expression examples.
5.2. System design
As shown in Fig. 13, the input of the text generation system is the keyword list. Each keyword list contains at most five keywords, and these keywords give users a rough idea of the final content. Moreover, the system allows users to generate text with short or long content. For a love letter, users can provide the receiver’s name and the sender’s name, and the system uses these names in the text. Fig. 14 shows the text generated by the system. As described above, the goal of the system is to provide a computer assisted writing tool, so users can adjust the content according to their specific requirements.
Moreover, synonym substitution and simile expression are two important modules that make the article more interesting and enrich its content. In the synonym substitution module, the system focuses on the Noun and Verb terms, and terms with similar meanings are stored in the database for reference. As shown in Fig. 15, in the sentence “I want to be with you bravely”, “bravely” is similar to “courageously”, so these two terms can be interchanged. The system provides the terms with similar meanings to the users, and users can determine the best one.
Moreover, the simile expression module provides simile expressions to enrich the content. For example, the term “courageous” could be enriched by “as courageous as a brave warrior”.
Fig. 16 shows the final text after performing the synonym substitution and simile expression process. Meanwhile, the system allows users to define their own synonyms and simile expressions, so the system can obtain more feedback data from users, which can enrich the content of the corpus.
Fig. 10. Simile expression seeds.
Fig. 12. Simile expression screen shot.
Fig. 13. Keyword list selection.
Fig. 14. Love letter generated by the system.
Fig. 15. Synonym replacement and simile expressions functions.
Fig. 16. System screen shot.
5.3. Evaluation
The evaluation of this system covers readability, relevance and rhetoric. Ten people, nine males and one female, were invited to try the system and asked to give a score for each item. The score ranges from 1 to 5, with 5 being the highest. Table 1 shows the result, which includes the average score and the proportion of scores higher than 3. Each user was asked to review 20 love letters generated by the system and to evaluate the readability, relevance and rhetorical expression of the articles. The average score of the experiment is higher than 3, which suggests that the computer assisted writing system can provide some help for users writing a love letter.
Table 1. Evaluation result: average score and proportion of scores higher than 3 (P3, %) for readability, relevance and rhetoric.
In essence, producing high-quality, fluent text requires understanding the meaning of the text. However, it is still infeasible to apply a natural language understanding approach to text generation, because natural language understanding requires extensive knowledge about the outside world and the ability to manipulate it. In this paper, special keywords are used to provide a clue about the template, and a statistical approach is adopted to choose appropriate sentences between keywords. Meanwhile, synonyms and simile expressions can enrich the content, and the system provides enough flexibility for users to alter the content to meet their requirements.
6. Conclusion
A computer assisted writing system, which includes text generation, synonym substitution and simile expression suggestion, is presented in this paper. The text generation model is built on a keyword generation model and a sentence generation model. The keyword generation model extracts important keywords from the corpus, and these keywords become the backbone of the template. The sentences between keywords make up the content of the template, and candidate sentences are retrieved from the corpus based on statistical analysis. The benefit of this approach is that the template provides a framework and reduces the physical effort spent on writing so that people can pay attention to organization and content. In addition, people tend to use sentences or terms that have appeared in other articles, so the candidate sentences drawn from the corpus provide more material for users to compose a love letter. Moreover, the synonym substitution module and the simile expression module both enrich the content of the articles and enhance their variety. A simile expression discovery scheme is proposed in this paper to obtain simile expressions, and it works well according to the experiment. The prototype system works well in the love letter application domain, and the concept of this research could be extended to other domains with minor modification.
Acknowledgements
This work was supported in part by the National Science Council under the Grants NSC-97-2221-E-009-135 and NSC-97-2811-E009-019.
Appendix. Rough translation of the love letter in Fig. 7.
I am so scared, because I have never written a love letter before and I do not know how to let you know my heart. Somehow it feels like losing you before completely having you. I truly feel blessed that you have become a part of my life, and I cannot wait for the day that we can join our lives together. I have found someone with whom I can share my feelings. I cry with her and perceive her nervousness. She is only a substitute for you and I wish you were her. I really like you and how could I let you know I miss you so much. I am suffering from the pain you left. Even though we have some quarrels, happiness still exists in my heart and I still miss the time with you.
