2442 JOURNAL OF COMPUTERS, VOL. 8, NO. 9, SEPTEMBER 2013

Structural Analysis on Social Network Constructed from Characters in Literature Texts

Gyeong-Mi Park
BK21 Center for U-Port IT Research and Education, Pusan National University, Korea
[email protected]

Sung-Hwan Kim and Hwan-Gue Cho
Dept. of Computer Engineering, Pusan National University, Korea
Email: {sunghwan, hgcho}@pusan.ac.kr

Abstract: Social network analysis, which focuses on social entities, has recently been applied in the social sciences, web science, and the behavioral sciences, as well as in economics and marketing. In this paper we present a method for constructing a social network from literary fiction by simple lexical analysis and for analyzing its structure. We show that these literary social graphs exhibit a power-law distribution in some features, which is a typical characteristic of complex systems. We propose the concept of the kernel of a literary social network, by which we can classify the abstraction level of the protagonists appearing in fiction. We also study the connectivity of the social network based on the statement distance distribution of characters. Our study shows that the metric distance among characters in the linear text is very similar to the intrinsic, semantic relationships described by fiction writers, which implies that the social network extracted from a work of fiction can serve as another representation of that work. Other scientific and quantitative approaches can therefore be applied by analyzing the concrete social graph model extracted from textual data.

Index Terms: Social Network, Complex System, Literary Fiction, Character Graph, Power Law.

I. INTRODUCTION

Extracting useful information from a large textual repository is becoming essential in the data mining field. One difficulty in this work is how to deal with the variety of natural languages. Most work has been based on English texts, but recently work on other languages has shown interesting results [9], [11].
Some previous work proposed open problems on the dynamic properties of social networks extracted from text. Within this framework, we have tried to investigate the temporal structure of the social network, since a work of literary fiction may span many years or more than one human generation, as in "War and Peace" by Tolstoy. Our basic idea is that we can regard a complicated text (e.g. long literary fiction) as a typical complex system, such as human society or the huge ecosystem on Earth. In a novel, many characters interact in the text space, which consists of a sequence of words. So if we compute the "distance" between two characters over the linear text, a new kind of social relationship can be extracted, and we can construct the social network from the distance matrix of the characters appearing in the work. We therefore need to characterize the structure of literary fiction in terms of complex systems.

By analyzing the social network extracted from text, we can determine the importance of fictional characters through mechanical text analysis, without any complicated semantic analysis. This will help us understand the deep, common structure of literature regardless of the language in which it is written. Mining the topological structure of the social graph from text will also help us find other important features through simple graph analysis. It can summarize the whole story, or measure the complexity of a work of fiction by measuring the degree of interaction among protagonists or between protagonists and boundary persons.

Manuscript received October 30, 2012; revised November 22, 2012; accepted November 25, 2012. This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2010-371B00008). Correspondence should be addressed to Hwan-Gue Cho ([email protected]).
© 2013 ACADEMY PUBLISHER doi:10.4304/jcp.8.9.2442-2447
This implies that the proposed approach will greatly help to reveal the semantic structure of fiction through the construction of a concrete social network and topological analysis of the graph.

In the next section, we survey related work on these topics. Section 3 gives definitions and preliminary concepts. Section 4 presents the concrete algorithm for constructing the social network from text. We also describe how we prepared the test data (four third-person novels) and how we preprocessed them for the experiments. Experimental results are shown in Section 5, where we confirm that these social networks are quite similar to general complex systems. Finally, we summarize our conclusions and present open problems for future work.

II. RELATED WORK

A. Computational Analysis on Literature

Computer-based literary textual analysis has typically been performed at the word level [1]. This work has focused on discovering authorial style and the lexical patterns of word use. Such computational linguistic work has contributed to validating or refuting the basic literary theory that the novel is a literary form which tries to produce an accurate representation of the social world. That theory holds that there is a relation between novelistic form (the workings of plot, characters, and dialogue, to take the most basic categories) and changes to real-world social space.

Recently researchers have started to study the properties of co-occurring words in natural language [7], [8], [10], [12], [13]. All words in a text are closely related even though they are arranged in a linear sequence, so the co-occurrence pattern of a pair of words may reveal important features hidden in the text [17]. One interesting application of co-occurrence analysis is the identification of spam or malicious reply messages [8], [10].
Another application of word appearance patterns is biomedical text mining, whose goal is to find biologically and medically meaningful knowledge solely by analyzing huge amounts of textual data (mainly academic papers) [2], [16]. For example, hidden functions of genes can be estimated or predicted by biomedical text mining tools, which helps biologists develop new drugs quickly and efficiently.

B. Extracting Social Networks from Text

Typical social network studies based on textual data focus on structured data, including email, Tweet messages, and phone SMS texts, for social network construction [6], [8], [14]. The main goal of this work is to find the most influential person, and the propagation patterns of texts (messages) are its key subject. Purely combinatorial studies have also been undertaken to discover the frequency of designated words [1], [12]. A typical result along this line is Zipf's law, which states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. This implies that the most frequent word appears approximately twice as often as the second most frequent word and three times as often as the third most frequent word.

Until now, little work has been done on extracting social relations between individuals from text. One notable line of work constructs conversational networks from literature [3], [10], [12], [16]. Elson et al. proposed a method for extracting social networks from nineteenth-century British novels [5]. Two characters are linked if they are in conversation in the text. In this way they built a conversational network that indicates the closeness of protagonists, and they used it to test several hypotheses from literary theory. In other words, they used the conversational network in place of the social network to evaluate literary theories.
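As a brief aside on the Zipf's law statement earlier in this subsection, the rank-frequency relation can be checked with a few lines of Python. This is only an illustrative sketch: the function name `rank_frequency` and the tiny synthetic corpus are assumptions made for the example and are not taken from the paper or its data.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Synthetic corpus (an assumption for illustration) built so that
# frequency is exactly proportional to 1 / rank.
corpus = "the " * 60 + "of " * 30 + "and " * 20 + "to " * 15
table = rank_frequency(corpus)

for rank, word, freq in table:
    # Under an ideal Zipf distribution, rank * frequency is constant.
    print(rank, word, freq, rank * freq)
```

On real text the product of rank and frequency is only roughly constant, so a log-log plot of frequency against rank is the usual way to inspect the fit.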
In their graph model, vertices represent named characters, and edges are added to indicate dialogue interaction among characters. The edge weight is set proportional to the amount of interaction. The conversational network has two main constraints:

1. The characters should be in the same place at the same time.
2. The characters should speak in turn.

Their conclusion is twofold. First, there is an inverse correlation between the amount of dialogue in a novel and the number of characters in that novel. Second, a significant difference in social interaction in the 19th-century English novel is basically geographical. One difference between Elson's approach and ours is that we are not interested in validating any literary theory.

Another interesting work should be noted. Hassan et al. studied positive and negative social networks by analyzing sentiment words in text [1]. They used a large amount of online discussion posts. Their experiments showed that their method constructs networks from text with high accuracy, which connects their analysis to social psychology theories of signed networks, namely structural balance theory.

Our purpose is to provide an engineering, computational method for revealing the hidden structure of literary fiction. We will also show, through various statistics gathered in word analysis, that the social network extracted from fiction is a complex system in which the rich get richer and the poor get poorer. Previous work focused on the static structure of the social network, such as degree distribution and connection form. This paper also studies how the structure of the social network varies from chapter to chapter.

III. PRELIMINARY

Since every social network is characterized by entities (nodes) and a distance function, it is important to define a valid metric function.
In our social network model from literary text, a node (entity) is a textual word denoting a person described in the text. We then need to connect the nodes by a proper metric (distance function), as explained below.

Let $T$ denote a literary fiction text. $T$ consists of a sequence of consecutive statements $s_i$. Each statement $s_i$ can be decomposed into a sequence of words $w_{i,j}$, where $w_{i,1}$ is the first word of statement $s_i$. Finally, each word $w_{i,j}$ consists of a sequence of letters. $|s_i|$ denotes the number of words in statement $s_i$. In a similar way, $|T|$ denotes the number of statements in the text $T$. We define the statement rank of word $w_{i,j}$ as $staterank(w_{i,j}) = i$, and the word rank $wordrank(w_{i,j})$ as follows:

$$wordrank(w_{i,j}) = \sum_{k=1}^{i-1} |s_k| + j, \quad i \ge 1,\ j \ge 1$$

In the preprocessing step, we have already prepared the name lists of all characters appearing in the fiction. In the following, $C$ denotes the set of all characters in $T$. Since a character in $C$ may appear in the text in different forms, we need to locate its $wordrank()$.

Now we define the distance metric on statements and words. $dist_X(p, q)$ denotes the distance between two entities $p$ and $q$ over the metric space $X$, where $X$ is either the statement space or the word space. Let us give a few examples of this distance function. In the following, let $dist_s(w_{i,\cdot}, w_{j,\cdot})$ denote the number of statements that lie between two different statements $s_i$ and $s_j$, that is, $j - i - 1$ if $i < j$. If two words are located in the same statement, the statement distance between them is zero. We also define the word distance

$$dist_w(X, Y) = |wordrank(Y) - wordrank(X)|$$

As expected, if two characters $X$ and $Y$ are closely related in the story, $dist(X, Y)$ will be short compared with that of a pair of loosely related characters.
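The rank and distance definitions above can be sketched in Python using the five toy statements of Table I. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names mirror the paper's notation, statements are split on whitespace only, and `dist_s` is made symmetric in its two statement indices, which the text does not spell out.

```python
# Toy text of Table I; each statement is a list of words (whitespace split).
statements = [
    "Harry walked across Hedwig's cage".split(),
    "Hedwig and Harry had been absent for two nights".split(),
    "Dursley worried about Hedwig and Lily".split(),
    "Dursley had pretended for ten years".split(),
    "Because Lily and James Potter had not died".split(),
]


def staterank(i, j):
    """Statement rank of word w_{i,j}: simply the statement index i (1-based)."""
    return i


def wordrank(i, j):
    """Word rank of w_{i,j}: words in all earlier statements plus j (1-based)."""
    return sum(len(s) for s in statements[: i - 1]) + j


def dist_s(i, j):
    """Statements lying strictly between s_i and s_j; 0 within one statement.

    Symmetry in i and j is an assumption made for this sketch.
    """
    return max(abs(j - i) - 1, 0)


def dist_w(p, q):
    """Word distance between positions p = (i, j) and q = (i', j')."""
    return abs(wordrank(*q) - wordrank(*p))


print(wordrank(1, 4))          # Hedwig in statement 1
print(wordrank(2, 3))          # Harry in statement 2
print(dist_s(1, 1))            # Harry and Hedwig share statement 1
print(dist_w((1, 1), (3, 4)))  # Harry (statement 1) to Hedwig (statement 3)
```

With this tokenization the word ranks of $w_{1,4}$ and $w_{2,3}$ are 4 and 8, and the word distance between $w_{1,1}$ and $w_{3,4}$ is 17, which corresponds to 16 words lying strictly between them.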
So the relatedness of two characters is expected to depend on the average distance $dist(X, Y)$, and also on the frequency with which $dist(X, Y)$ falls within a threshold constant that determines whether the person is in the working narrative space.

Let us give one example to illustrate the definitions above. Here we have a tiny text with 5 simple statements. Harry appears twice, in statements 1 and 2, denoted $w_{1,1}$ and $w_{2,3}$. Hedwig appears three times, starting from the first statement, denoted $w_{1,4}$, $w_{2,1}$, and $w_{3,4}$. The statement distance between $w_{1,1}$ and $w_{1,4}$ is 0, since both words lie in statement 1, and there are 16 words between $w_{1,1}$ and $w_{3,4}$ over the whole text space. It is also easy to see that $wordrank(w_{1,4}) = 4$ and $wordrank(w_{2,3}) = 8$.

TABLE I. STATEMENT DISTANCE SAMPLES

| # | Word sequence of each statement |
|---|---|
| 1 | Harry walked across Hedwig's cage. |
| 2 | Hedwig and Harry had been absent for two nights. |
| 3 | Dursley worried about Hedwig and Lily. |
| 4 | Dursley had pretended for ten years. |
| 5 | Because Lily and James Potter had not died. |

IV. CONSTRUCTING SOCIAL NETWORK FROM TEXT

A. Constructing Social Network from Literary Text

Now we explain how to extract the social network from literary fiction. Let $T$ be a text to be analyzed and $G_T(V, E)$ be the corresponding graph constructed from $T$. All words representing characters are mapped to the vertices $V$ of $G$. Next we decide how to connect the characters (vertices) by assigning edge relations. We add the edge $(v_i, v_j)$ if the weight between $v_i$ and $v_j$, $weight(v_i, v_j)$, is at least a threshold constant $c_0$, which is given by the user to suit their own purpose. Since there are two distance measures, based on statement or word entities, we applied the $dist_s$ measure for the weight function in this paper, a choice made after comparative work against $dist_w$.

A more formal computation is given below. Since two characters $X$ and $Y$ may appear adjacently more than once, we find all adjacent pairs of $X$ and $Y$ in the text. The edge weight for two characters $X$ and $Y$ is given as follows.
$$weight(X, Y) = \sum_{k} \alpha^{\,dist_s(X_k, Y_k)}$$

where $(X_k, Y_k)$ is the $k$-th adjacent pair of $X$ and $Y$, and $0 < \alpha \le 1$ is a control constant. In this experiment we set $\alpha = 0.7$. If two characters are in the same statement, the statement distance is zero, so that pair contributes $\alpha^0 = 1$, the largest possible value. If the statement distance between two words (characters) is 5, the weight is $0.7^5 = 0.16807$. If we set the control variable to $\alpha = 0.9$, it is $0.9^5 = 0.59049$. We get the extremal case with $\alpha = 1$, which disregards the effect of word distance in the weight calculation. Next, if $weight(X, Y) \ge c_0$, we add the connecting edge between $X$ and $Y$. In the implementation, we do not consider any pair of characters separated by more than 10 statements.

To summarize, we can construct the social network graph by adjusting two control variables, $\alpha$ and $c_0$. If we increase $\alpha$, we get a denser social network structure, and we get a sparser network if we raise $c_0$. So in general let $G_T(V, E, \alpha, c_0)$ represent the social network extracted from the literature $T$ with the two control parameters $\alpha$ and $c_0$.

Figure 1. Three social networks extracted from fiction. (a) Social network extracted from Crime and Punishment with the threshold constant c_a = 23.56. (b) Another social network from Crime and Punishment with a looser threshold constant c_b = 23.56 compared to case (a). (c) We can get the nearly complete social network of the fiction with a threshold c_b = 23.5.

B. Kernel Construction for Literary Social Network

Since all fiction consists of a few protagonists and other boundary persons, it is interesting to observe the topological properties of each character according to the importance of its role in the story. So we propose the concept of the network kernel, which reflects role importance in the literary text space. In the following we show one example social network $G_T(V, E, \alpha, c_0)$ in Figure 2.
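The edge-weight rule and threshold of Section IV.A can be sketched as follows. This is a hedged illustration rather than the authors' code, and several points are assumptions: the function name `build_social_graph`, the input format (each character mapped to the statement indices where it appears), treating every pair of occurrences at most `max_gap` statements apart as an "adjacent pair", and connecting two characters when their weight is at least `c0`.

```python
from collections import defaultdict
from itertools import combinations


def statement_distance(i, j):
    """Statements strictly between s_i and s_j; 0 when both are the same statement."""
    return max(abs(i - j) - 1, 0)


def build_social_graph(occurrences, alpha=0.7, c0=1.0, max_gap=10):
    """Return (weights, edges) for a dict: character -> list of statement indices.

    Assumption: every occurrence pair whose statement distance is at most
    max_gap counts as one adjacent pair and contributes alpha ** distance.
    """
    weights = defaultdict(float)
    for x, y in combinations(sorted(occurrences), 2):
        for i in occurrences[x]:
            for j in occurrences[y]:
                d = statement_distance(i, j)
                if d <= max_gap:
                    weights[(x, y)] += alpha ** d
    # Assumed rule: keep an edge only when the accumulated weight reaches c0.
    edges = {pair for pair, w in weights.items() if w >= c0}
    return dict(weights), edges


# Character positions read off the five statements of Table I.
occurrences = {
    "Harry": [1, 2],
    "Hedwig": [1, 2, 3],
    "Dursley": [3, 4],
    "Lily": [3, 5],
}
weights, edges = build_social_graph(occurrences, alpha=0.7, c0=3.0)
for pair in sorted(edges):
    print(pair, round(weights[pair], 3))
```

Raising `c0` removes the weaker pairs first, while raising `alpha` makes distant co-occurrences count for more, which is the trade-off between the two control parameters described above.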
Figure 2 shows $G_T(V, E, \alpha, c_0)$, the original social network graph with 21 characters, that is, $Kernel_0(G, t)$.

We call a vertex whose graph degree is 1 a boundary node. There are 9 such nodes in this example. They can be considered boundary persons in the fiction, since their interaction is quite low. Now we explain how to construct the kernels $Kernel_k(G, t)$ of a social network constructively and successively. We set $Kernel_0(G, t) = G_T(V, E, \alpha, c_0)$ as the base condition. We denote the $(k+1)$-th kernel as $Kernel_{k+1}(G, t)$, which is derived from $Kernel_k(G, t)$. The procedure is simple: we delete all vertices of $Kernel_k(G, t)$ whose graph degree is less than or equal to $t$. So $Kernel_{k+1}(G, t)$ is never larger than $Kernel_k(G, t)$. When we get $Kernel_{k+1}(G, t) = Kernel_k(G, t)$, we say the kernel is stabilized. In the following figure, we show $Kernel_1(G, 1)$.

Figure 3. $Kernel_1(G, 1)$, showing that all degree 1 nodes (boundary nodes) are removed from $Kernel_0(G, t)$. The shaded nodes {i, n, d} will be removed if we construct $Kernel_2(G, 1)$.

V. EXPERIMENTS

A. Data Preparation

Now we explain how to locate the word positions of fictional characters. First of all, we obtained the lists of characters appearing in the test fictions from Wikipedia and other internet sources. It should be noted that there is an auxiliary booklet explaining all characters appearing in the Korean novel "The Earth", in which more than 500 persons are listed with brief explanations of their roles and their relationships to other characters and protagonists.

TABLE II. TEST DATA NOVEL TEXT

| Title | Language | Words | Statements | Characters |
|---|---|---|---|---|
| Three Kingdoms | Korean | 642,222 | 121,779 | 1,285 |
| The Earth | Korean | 1,446,879 | 176,387 | 515 |
| War and Peace | English | 584,618 | 30,912 | 139 |
| Harry Potter | English | 1,279,375 | 126,112 | 655 |

B. Connectedness of Social Graph according to Influence Region

Suppose we have a sequence of statements from a text. As explained, if a character appears in statement $s_i$, we check the other characters that appear in the next $d$ statements. The number of statements looked ahead determines the graph connection. If we look ahead over fewer statements, the social connection will be sparse, and the graph may even be disconnected. Let the number of statements looked ahead, $d$, be the influence region for a character of $s_i$.

In this paper we examine the effect on connectedness of controlling the influence region. In the previously depicted social graph we checked three more statements forward, which means the influence region is 3. Now we try to find the connectedness by varying the influence region from 0 to 10. Ideally speaking, all characters referred to in a fiction should be connected unless the fiction consists of independent chapters, such as an omnibus essay. It is very interesting to find the minimum influence region that guarantees the connectedness of the social graph extracted from a text.

In the following table, we show the number of components with respect to the influence region $d$ = 0, 1, 2, 3, 4, 5, where $d = 0$ means we only considered characters co-occurring in a single statement. The experiment shows that the minimal region depends on the language, the writing style, and the genre. The variation of the influence region required for graph connectedness, and its implications, would be an interesting topic.

TABLE III. THE NUMBER OF COMPONENTS BY STATEMENT DISTANCE

| Title | d=0 | d=1 | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|---|---|
| War and Peace | 22 | 9 | 5 | 4 | 2 | 1 |
| Harry Potter | 74 | 20 | 4 | 4 | 2 | 1 |
| The Earth | 67 | 33 | 23 | 21 | 18 | 14 |
| Three Kingdoms | 92 | 21 | 7 | 5 | 2 | 1 |

C. Connected Component Analysis on Edge Weight

We are interested in the connectedness of the whole social graph with respect to the edge weight. If we only consider
If we only considerr the “strong" edgge, then wee would get the lots off mponent sincee some strong g local groupp disconnected com uld be conneccted without aany link edgees connectingg wou two different soccial groups. S So by controllling the edgee ght we can gett the locally reelated group. Also we havee weig to focus fo on the number n of coonnected com mponents withh respect to the edge weight conttrol. According A too our experriment, the number off com mponents are inncreasing fastt initially with h lower valuee of cut c weight , which enablles us to estim mate the “thee link characters” who w connects two differentt local group.. We think this prooperty would bbe one importtant feature too distiinguish the ploot style of anyy writers. Figure 4. Thhe number of com mponents by edgee weight 2446 JOURNAL OF COMPUTERS, VOL. 8, NO. 9, SEPTEMBER 2013 D. Variationn on Maximum m Component Size and Its implicatioon connect c the different local ggroups automaatically. Figure 5. thee number of nodee maximum compponent node accorrding edge weight We definee the maxim mum componeent, the conneected component whose num mber of noddes are maxximal compared to other componnents. Generaally we believve the mponent woulld contain the main protagoonists maximal com of the fictiion. We tried to identiify the maxximal component of o intermediatee social netwoork for a giveen cut value. We found fo that thee maximal coomponent of some s fictions was not reduced with respect to the cut weeight. mplies that thee core groupss of characterrs are That fact im highly relateed compared to the bounddary persons along a whole storyy. This variation of the size of maxximal component would w show thhe interestingg characteristics of novel plot. VI. B. Future F work and a Open Probblems Now N we are trying t to reveeal the dynam mic of sociall netw work in transsition. 
For exxample thoug gh the fictionn "Thee Romance off Three Kingddoms" has mo ore than 10000 charracters referred, the numberr of working characters c aree boun nded for a fiix size text w window, whicch means thee social network is so dynamic. IIf we measure the degree off hat would bee dynaamic developpment of ficttion story, th interresting to classsify fictions. Or we hope to o characterizee the topological structure s of tyypical detectiive story andd me story or fanntasy novel. crim C CONCLUSIO ON AND FUT TURE WORK A. Contribution Summaryy Experiment Summary S In this papper we considdered how to extract the social s network from m literary texxt. We can conclude thaat the simple lexiccal structure of literary such charaacters appearance clearly c isomorphic to the semantic struucture of the fictionn. Another intteresting fact we revealed is i the topological structure s of soocial network preserves p the most important annd crucial common c feattures of com mplex system such as the generral social (hum man) networkk and p and viruuses. the huge biollogical networrks including proteins • Structuraal analysis onn Social grapph extracted from literary fiiction gives innteresting factts which help us to characterrize the semaantic structuree of fictions.. For example the variatioon on the maximal m size of component would cleaarly shows thhe strength off core o whole stoory. group of protagonists over work shows quite • The extrracted literaryy social netw similar structure s in terms t of lexxical level too the narrative structure of fictions fi intrinssically. s • Our expeeriment clearlyy showed that this literary social network is i one kind off complex systtem where moost of properties are under poower law. • Also thee number off connected components with respect too the cut valuue for edge weight w impliess that the sociaal relationshipp among mainn characters is so higher thhan those off boundary persons. 
p Andd the connectedd component will help uss to find the local social grooup and the inntermediate linnking personss who © 2013 ACADEMY PUBLISHER REFEREN NCES [1] A. Hassan, A. A Abu-Jbara and D. Radeev. “Extractingg Signed Sociall Networks Frrom Text,” in n Proc. of thee TextGraphs Workshop W at ACL L, 2012, pp.4-12. [2] S.-Y. Bong annd K.-B. Hwanng. “Keyphrasee extraction inn biomedical puublications usinng mesh and intraphrase wordd co-occurrence information,” iin Proc. of the ACM A DTMBIO O, 2011, pp.63-666. “ myth aas a complex network,” Thee [3] Y.-M. Choi. “Greek Korean Physiccal Society, vol.. 49, no. 3 pp. 298-302, 2 2004. L and P. C. Fernando, “Siimilarity-Basedd [4] I. Dagan, L. Lee Models of Word W Cooccurreence Probabilitties,” Machinee Learning, vol. 34 no.1-3, pp. 43-69, 1999. [5] D. K. Elsonn, D. Nicholaas, and K. R. R McKeown,, “Extracting Soocial Networks from Literarry Fiction,” inn Proc. of the Association A for Computation nal Linguistics,, 2012, pp. 141--147. [6] C. L. Freemaan, The Deveelopment of Social S Networkk Analysis: A Sttudy in the Socciology of Scieence, Empiricall Press, 2004. Y andd T. Ito, “Filtering harmfull [7] Y. Fujii, T. Yoshimura sentences baseed on three-worrd co-occurrencce,” in Proc. off Collaboration,, Electronic meessaging Anti-Abuse and Spam m, 2011, pp. 64-772. Kahng and D.Kiim, “Modellingg [8] D.-H. Kim, G.. Rodgers, B.K hierarchical annd modular com mplex network ks: division andd independence,”” Physica A: SStatistical Mecchanics and itss Applications, vol. v 351, no.2-44, pp. 671-679, 2005. [9] S.-R. Kim, “Coomplex networrk analysis in litterature: Togi,”” The Korean Physical P Societyy, vol. 50, no. 4, 4 pp. 267-271,, 2004. nce of tags andd [10] R. Krestel andd L. Chen, “Ussing co-occuren resources to identify i spamm mers,” in Procc. of Machinee Learning andd Principles aand Practice of Knowledgee Discovery, 20008, pp. 38-45. [11] Y.-K. Lee, J.. Ku and H. Kim. 
"Analysis of network dynamics from the romance of the three kingdoms," The Journal of Korea Contents Association, vol. 9, no. 4, pp. 364-371, 2009.
[12] X. Luo and A. Zincir-Heywood, "Combining word based and word co-occurrence based sequence analysis for text categorization," in Proc. of the Machine Learning and Cybernetics, vol. 3, 2004, pp. 1580-1585.
[13] C. Monojit, C. Diptesh and M. Animesh, "Global topology of word co-occurrence networks: beyond the two-regime power-law," in Proc. of Int. Conf. on Computational Linguistics, 2010, pp. 162-170.
[14] R. Lambiotte, M. Ausloos and M. Thelwall, "Word statistics in Blogs and RSS feeds: Towards empirical universal evidence," Journal of Informetrics, vol. 1, no. 4, pp. 277-286, 2007.
[15] J. Rydberg-Cox, "Social Networks and the Language of Greek Tragedy," Journal of the Chicago Colloquium on Digital Humanities and Computer Science, vol. 1, no. 3, pp. 1-11, 2011.
[16] J. Stiller, D. Nettle and R. I. Dunbar, "The small world of shakespeare's plays," Human Nature, vol. 14, no. 4, pp. 397-408, 2003.
[17] Y. Fujii, T. Yoshimura and T. Ito, "Filtering harmful sentences based on three-word co-occurrence," in Proc. of the 8th Annual Collaboration Electronic messaging Anti-Abuse and Spam Conference, 2011, pp. 64-72.

Gyeong-Mi Park is at BK21 Center for U-Port IT Research and Education. She received the B.S. degree from the Korean National Open University, Korea, and the M.S. and Ph.D. degrees from Pukyoung National University, Korea. Her research interests are computer vision and information retrieval.

Sung-Hwan Kim is an M.S. student in Pusan National University. He received the B.S. degree from Pusan National University.
U Hiss research interestss are informatio on retrieval andd seequence processsing Hwan-Gue Ch H ho is a Professor in Pusann N National Univerrsity. He receeived the B.S.. deegree from Seooul National Un niversity, Koreaa, annd the M.S. annd Ph.D. degreees from Koreaa A Advanced Insstitute of Science andd Technology, Korrea. His researcch interests aree coomputer algorrithms and bioinformatics..