The Lord of the Rings: The Three Networks

This post is inspired by the Star Wars social network.

I created the interaction networks of Lord of the Rings characters for all three movies, based on scripts I found online. The networks capture the story lines of the movies surprisingly well and might be a nice gimmick for Lord of the Rings enthusiasts. Code and network files can be found on GitHub.

I also created interactive versions of the networks where you can drag, click and hover and generally play around a bit. The links to those versions are below the respective plots.

For those who are interested in technical details of the data extraction and analysis, head down to the Making of section. But let’s start with some visualizations of the networks for the three movies.

The Fellowship of the Ring

The network below shows the interactions of characters in “The Fellowship of the Ring”.


The Fellowship of the Ring

open interactive

A link between two characters is present if they both spoke in the same scene, which should indicate that they somehow interacted with each other. The wider a link is, the more scenes the two characters interacted in. The colors represent the “race” of the character (Human, Hobbit, Elf, Wizard, Orc, Dwarf and other). The interactive version also includes good, evil and neutral as a node attribute. The size of each node corresponds to the total number of interactions the character appears in (read: weighted degree).
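To make the node size concrete, here is a minimal sketch of how a weighted degree can be computed with the strength function of igraph. The edge list and its weights are made up for illustration and are not taken from the actual networks.

```r
library(igraph)

# Made-up edge list: weight = number of scenes in which both characters spoke
edges <- data.frame(
  from   = c("FRODO", "FRODO", "SAM"),
  to     = c("SAM", "GANDALF", "GANDALF"),
  weight = c(5, 2, 1)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# strength() sums the weights of all edges attached to a node
# (it uses the "weight" edge attribute by default)
node.size <- strength(g)
print(node.size)
```

In this toy graph FRODO ends up with the largest value, because his two links carry the most shared scenes.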

The network shows the general plot of the movie quite well. The Fellowship (Aragorn, Legolas, Gimli, Frodo, Sam, Merry, Pippin, Gandalf and Boromir) forms the core of the network, since the movie is centered around their adventure to bring the ring to Mordor. Characters like Elrond and Arwen do not play that much of a role and are thus found in the periphery of the network.

The Two Towers

The network of “The Two Towers” is shown below.

The Two Towers

open interactive

The network has an interesting structure and yet again reflects the plot quite well. It consists of three independent components, each corresponding to a different story line of the movie. There is the “Frodo-Sam-Gollum” component, whose members are now on their own on the way to Mordor, and the “Merry-Pippin” component, built around the two hobbits who were captured by Orcs who thought they were the ringbearers. The biggest component consists of the rest of the Fellowship, who are busy rescuing Merry and Pippin (obligatory: They are taking the hobbits to Isengard) and defending Helm’s Deep.
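Independent components like these can be found programmatically. The following is a minimal sketch using the components function of igraph on a made-up toy graph with the same kind of structure; the edges are invented for illustration and are not the real network data.

```r
library(igraph)

# Made-up edges forming three separate groups of characters
el <- matrix(c("FRODO",   "SAM",
               "SAM",     "GOLLUM",
               "MERRY",   "PIPPIN",
               "ARAGORN", "LEGOLAS",
               "LEGOLAS", "GIMLI"),
             ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = FALSE)

# components() returns the number of components and a membership id per node
comp <- components(g)
print(comp$no)

# List the characters in each component
print(split(V(g)$name, comp$membership))
```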

In this network you can see that I made a debatable decision: Smeagol and Gollum are represented as two separate characters. I initially thought that the distinction is important to make since they embody two different personalities, yet I noticed (too late) that the naming scheme in the scripts I used is rather ambiguous.

The Return of the King

The “Return of the King” brings all story lines together again, as seen below.

The Return of the King

open interactive

The interesting part of this network is the position of Gandalf. He seems to occupy a very central position, tying everything together. Individual small story lines are also visible due to wider links. Examples are Merry, as a guard of Theoden, fighting together with Eowyn, and once again the Frodo, Sam and Gollum triangle.


After finishing this post I found two works that did a similar analysis, but for the books. One is a blog post and the other is actually a scientific paper.

The Making of

As usual, everything was done with R and networks were visualized with visone. As data source, I used scripts of the movies from The Internet Movie Script Database (IMSDb). The script of the second movie came in a slightly different format there, but luckily I found another script in a format similar to the IMSDb scripts. I will only cover the IMSDb versions here, but the R code for the second movie can be found in the R script on GitHub.

Parsing the scripts for scene starts

The text of each script is contained within a <pre></pre> element in the HTML code. The text is extracted with the XML package.

library(XML)
url <- ",-The.html"
doc <- htmlParse(url)
# select the <pre> node(s) that hold the script text
text<- getNodeSet(doc,"//pre")

The variable text now contains the script, but still as an XML node including all HTML tags. We can use the xmlValue function to get the plain text. Then we split the text into a vector of strings, one per line, by splitting at newline characters. As a last step we remove leading and trailing whitespace.

library(stringr)
text.char <- sapply(text, xmlValue)
text.clean<- unlist(strsplit(text.char,"\n"))
text.clean<- str_trim(text.clean)

Now we have the text in a format that can be worked with. To find the start of a scene we look for INT. and EXT. statements in the text, since they mark the start of an interior or exterior scene, respectively. This is done using a regular expression and grep.

# the dots are escaped so that they match a literal period, not any character
scene.start<- grep("(INT\\.|EXT\\.)",text.clean)

The above code is wrapped in a function called get.scene.starts in the code on GitHub.
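A small sketch on made-up lines shows what such a pattern picks up, and why it matters whether the dots are escaped. The lines below are invented in the style of a movie script and do not come from the real scripts.

```r
# Made-up lines in the style of a movie script
text.clean <- c("INT. BAG END - DAY", "FRODO", "You're late.",
                "EXT. THE SHIRE - NIGHT", "SAM", "INTO THE WEST")

# Unescaped dots match any character,
# so "INTO THE WEST" is also reported as a scene start
loose <- grep("(INT.|EXT.)", text.clean)

# Escaped dots only match a literal period
strict <- grep("(INT\\.|EXT\\.)", text.clean)

print(loose)
print(strict)
```

Both calls return line indices, which can later be used to decide which scene a given line belongs to.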

Parsing the scripts for characters

Parsing the characters turned out to be a little more complicated and I could not directly use the plain text. When a character talks, their name is written in all capitals, bold and centered. To capture the “bold” formatting, we need the HTML tags to find the characters. The code below seems a bit “hacky” but it does what it is supposed to do: it converts the parsed XML node back to plain text with the HTML tags intact.

doc <- htmlParse(url)
text<- getNodeSet(doc,"//pre")
fl  <- saveXML(text[[1]], tempfile())
text.char <-readLines(fl)

Now comes the tricky regex part. As I said, the names of talking characters are written in all capitals, bold and centered. The regular expression in the first line below looks for exactly these properties. To extract the character names in a clean way we then have to delete all HTML tags and extra whitespace.

m <- regexpr("<b>[ ]{2,}[A-Z]{3,}",text.char)

raw.characters <- (regmatches(text.char,m))
raw.characters <- str_trim(gsub("<b>[ ]*","",raw.characters))

Together with the extracted scene starts, we can now create lists of characters that appear in the same scene. The above code is wrapped in a function called get.scene.char in the R script on GitHub.
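The grouping step itself is not shown here, so the following is only a sketch of one possible way to do it, not necessarily how get.scene.char works. It assumes that scene headings and character names sit in the same vector of raw lines, and it uses made-up lines that imitate the layout described above. It uses trimws from base R instead of str_trim to stay self-contained.

```r
# Made-up raw lines imitating the described layout (bold, centered names)
text.char <- c("<b>EXT. THE SHIRE - DAY", "</b>",
               "<b>                    FRODO", "</b>  You're late.",
               "<b>                    GANDALF", "</b>  A wizard is never late.",
               "<b>INT. BAG END - NIGHT", "</b>",
               "<b>                    BILBO", "</b>  I'm not at home!")

# Speaking characters: bold tag, at least two spaces, at least three capitals
m <- regexpr("<b>[ ]{2,}[A-Z]{3,}", text.char)
raw.characters <- trimws(gsub("<b>[ ]*", "", regmatches(text.char, m)))

# Line numbers of the character matches and of the scene starts
char.line   <- which(m > 0)
scene.start <- grep("<b>(INT\\.|EXT\\.)", text.char)

# findInterval() gives, for each character line, the index of the last
# scene start at or before it (0 would mean "before the first scene")
scene.id    <- findInterval(char.line, scene.start)
scene.chars <- lapply(split(raw.characters, scene.id), unique)
print(scene.chars)
```

Each list element then holds the distinct characters that spoke in one scene, which is the input needed for building the links.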

Getting the Interactions

Now that we have a list of characters appearing in a scene, we can easily create the links for the networks shown in the post. The below function does the trick.

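# the function name is a placeholder, the name used on github may differ
get.interactions <- function(char.list){
  # fewer than two characters in a scene means there is no interaction
  if(length(char.list)<2) return(NULL)
  A <- outer(char.list,char.list,function(x,y) paste(x,y,sep=" "))
  A <- A[lower.tri(A)]
  matrix(unlist(strsplit(A," ")),ncol=2,nrow=length(A),byrow=T)
}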

I feel like the code is overly complicated and maybe not immediately understandable. Just trust me that it converts a list of names into a two-column matrix in which each row can be interpreted as one interaction.
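To see what the pairing step produces, here is a small worked example on a made-up scene with three characters. It also shows an alternative based on combn from base R, which is not the code used for this post but yields the same pairs.

```r
# Made-up scene with three speaking characters
char.list <- c("FRODO", "SAM", "GOLLUM")

# Pairing via outer(): paste all pairs, keep the lower triangle, split again
A  <- outer(char.list, char.list, function(x, y) paste(x, y, sep = " "))
A  <- A[lower.tri(A)]
el <- matrix(unlist(strsplit(A, " ")), ncol = 2, nrow = length(A), byrow = TRUE)

# Alternative sketch: combn() enumerates all unordered pairs directly
el.combn <- t(combn(char.list, 2))

# Order-independent labels to compare both results
pair.key <- function(m) sort(apply(m, 1, function(r) paste(sort(r), collapse = "-")))

print(el)
print(identical(pair.key(el), pair.key(el.combn)))
```

Note that the paste and strsplit round trip relies on names without spaces, which holds for names matched by a pattern such as [A-Z]{3,}.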

Looping over all scenes, we can create an edgelist of all interactions in a movie and create a network with the igraph package at the end, where multiple edges between the same two characters are merged into a single weighted edge.

library(igraph)
g <- graph_from_edgelist(el,directed=F)
# give every edge weight 1, then merge parallel edges by summing their weights
E(g)$weight <- 1
g <- simplify(g,edge.attr.comb = "sum")
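The loop over scenes is not spelled out above, so here is a minimal end-to-end sketch under stated assumptions: the scenes are made up, and the helper pairs.in.scene is a hypothetical stand-in based on combn rather than the function from the R script.

```r
library(igraph)

# Made-up scenes, each a vector of the characters who spoke in it
scenes <- list(c("FRODO", "SAM"),
               c("FRODO", "SAM", "GOLLUM"),
               c("GANDALF"))

# Hypothetical helper: all unordered pairs of distinct characters in a scene
pairs.in.scene <- function(chars) {
  chars <- unique(chars)
  if (length(chars) < 2) return(NULL)
  t(combn(chars, 2))
}

# Stack the pairs of all scenes into one edge list (NULLs are dropped by rbind)
el <- do.call(rbind, lapply(scenes, pairs.in.scene))

# Build the graph and merge repeated edges into weighted ones
g <- graph_from_edgelist(el, directed = FALSE)
E(g)$weight <- 1
g <- simplify(g, edge.attr.comb = "sum")

edges <- igraph::as_data_frame(g, what = "edges")
print(edges)
```

FRODO and SAM share two scenes, so their link is the only one with a weight above one. GANDALF speaks alone and therefore does not appear in the edge list at all.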

The complete code with additional tweaks can be found on GitHub. Additionally, you can get the GraphML files of the networks to create your own visualization.

Posted in: Network Analysis, R

Written by Dmathlete