Soccer Analytics Part 2: Why Al Kuwait SC was (probably not) the Best Team in 2013

In the first part of my Soccer Analytics series, I talked about general statistics of my dataset for the 2013 season. As a reminder, I scraped all results from 177 domestic leagues and intracontinental cups worldwide. This entry will deal with the question “Who was the best team in 2013?” according to network analysis and why the result is (most likely) wrong.


Not Interesting Networks: The Human Body

“Networks are everywhere” is THE catch phrase of social network analysis. As I said in a previous post, if you are a network scientist everything looks like a network to you. That’s why I decided to start a series about “everywhere” networks. But not just any. No! In particular those that are just not interesting at all. Although networks might be ubiquitous, that does not always make them worth studying.

With this special kind of network, I will just do some fancy visualizations and maybe some boring statistical analysis. To emphasize the non-scientific relevance, the visualizations and statistics will be presented with the dearly beloved Comic Sans.

Body Part Relationships Network

I got the data from here. The description of the data is as follows:

“[…] contributors classified if certain body parts were part of other parts. Questions were phrased like so: “[Part 1] is a part of [part 2],” or, by way of example, “Nose is a part of spine” or “Ear is a part of head.””

Well if that’s not exciting…

Here is a visualization of the network.

A cluster analysis revealed to which class of body parts (upper in light blue, lower in light red) each body part belongs. The algorithm was not so sure about the belly though.

The following table shows the most important body parts according to some centrality measures and what these measures could stand for.

So the head is the most important body part and the face is full of other things. Good to know.

Putting the Sex in SNA: Hook up Networks

Imagine you are a network scientist meeting new people in a bar. At one point, there is always the question: “So what is your research topic?” Shocked and horrified, you are looking for an answer that does not scare away your new acquaintance. “Algorithmic graph theory”? Sounds too nerdy. “Social network analysis”? Sounds better, but you are tired of explaining that this is not the same as browsing Facebook all day. Maybe an example from your work! “Protein interaction networks”? Oh god, I had too many beers to explain that.

Hook up Network of Grey’s Anatomy

For the described situation, it is always good to have an example ready that is relevant in real life.
And what could be more relevant than talking about “hook up networks”? Yes, that’s a thing and you CAN do serious research with them. Or use them as a pick up line.

The first time I stumbled upon these kinds of networks was when I was looking for an interesting example to show some basic network modeling approaches in a class. What I found was the “who slept with whom” network of Grey’s Anatomy [1,2]. Everybody who is familiar with that show will agree that besides hospital stuff, there is a lot of hanky-panky going on.

Below you find a visualization of the network, where the node size is proportional to the number of hook ups.

Comparing Grey’s Anatomy with the “real world”

Reddit user /u/kreekkrew had a fascinating project during his time in college. He recorded every hook up within his group of friends and published this matrix with the following explanation:
“This shows who has made out with who in my group of friends over the course of my time in college.
Criteria for data collection:
1. To be on the chart, a person needs to have been to at least two of our parties since my freshman year (2010).
2. For a make-out session to count, it needs to:
  • be with someone else on the chart, at any point in time,
  • last more than 10 seconds, and
  • have some level of seriousness behind it. For example, you can’t make out with someone for 5 seconds just to get on the chart and increase your count for the night. (Yes, that has happened.)
Population is composed of 26 people (61% male/39% female), aged between 18 and 24, all at an engineering school (believe it or not) in the US. Obviously, names have been initialized for the sake of privacy.”

Of course, I turned his matrix into a network.
Now we can compare how Grey’s Anatomy relates to the real world (as real as engineering school gets…). The table shows some simple statistics for both networks.
Most of the statistics should be self-explanatory, except maybe the 4-cycle statistic.

A study on sexual networks [3] found a prohibition against coupling with a former partner’s former partner’s former partner (which would close a 4-cycle) due to status implications. That is, counting 4-cycles allows us to quantify a level of potential drama, since awkward status implications arising from others’ sexual relationships make for compelling entertainment.
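For readers who want to count 4-cycles themselves, here is a minimal Python sketch. It is not the code behind the table; the function name and the toy edge list are made up for illustration. It relies on the fact that every 4-cycle has two "diagonals", i.e. two pairs of opposite nodes that share two neighbors on the cycle, so summing "shared neighbors choose 2" over all node pairs counts each cycle exactly twice.

```python
from itertools import combinations


def count_four_cycles(edges):
    """Count cycles of length 4 in an undirected simple graph (no self-loops)."""
    # Build a neighbor set for every node.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # For each pair of nodes, every pair of shared neighbors closes a 4-cycle.
    # Each cycle is found once per diagonal, hence twice in total.
    twice_cycles = 0
    for u, v in combinations(neighbors, 2):
        shared = len(neighbors[u] & neighbors[v])
        twice_cycles += shared * (shared - 1) // 2
    return twice_cycles // 2


# Hypothetical toy network: four people hooking up "in a square".
square = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(count_four_cycles(square))
```

Looping over all node pairs is quadratic in the number of people, which is harmless for networks with a few dozen nodes like the two compared here.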

What do we learn?

Engineering school is far more dramatic than Grey’s Anatomy (compare the 4-cycles). To make Grey’s Anatomy a bit more realistic, they should maybe implement some more same-sex coupling. I guess “McDreamy” hooking up with another male doctor would create quite some confusion among the female viewership.
And poor dudes at engineering school. Females are already quite rare there (which is itself quite sad!), and the few that go there decide to rather make out with each other…

Friends and Hypergraphs: The One With All The Networks

Undoubtedly, Friends is one of my favorite TV series. I guess I am not the only one there, since the hosting of all episodes on Netflix created quite a stir in the online community. The story of the show is quite easy to follow: Ross loves Rachel, Ross dates Rachel, Rachel breaks up with Ross, Rachel loves Ross, Ross marries Rachel, Ross divorces Rachel, Ross loves Rachel, Ross and Rachel have a happy end. Oh yeah, between all the Rachel Ross dilemmas, Monica marries Chandler, Phoebe sings Smelly Cat and Joey does all kinds of shenanigans. So the question is, is the “Ross and Rachel” story the most central element of the show?
To answer this question, I am gonna look at a dataset of shared plotlines throughout the whole show. That is, which subset of the six characters appeared together in a plot during an episode. These plots can range from simply hanging out together in Central Perk to some hanky-panky in the bedroom. We can see a shared plotline as some form of interaction and therefore analyse the show from a network perspective. Great! That is my area of expertise! However, what renders the analysis a bit more complicated is the fact that plotlines can consist of more than two characters, creating a link with more than two endpoints. So we are not just dealing with a regular network, but with a hypergraph.

Network Visualizations of all Episodes

I spent a lot of time trying to come up with a visualization that shows hyperedges in a pretty way. But I failed horribly (that’s why I am doing network analysis and not network drawing, I guess). So I decided to split the hyperedges into regular edges. That means, if there was a plotline consisting of Monica, Chandler and Joey, I created the edges (Monica, Chandler), (Monica, Joey) and (Chandler, Joey). I did that with all the plotlines of each episode, counted the number of times each pair of characters occurred, and aggregated these counts for each season. So in the end, I got 10 different networks. The edge width indicates how often two characters shared a plotline during the respective season.
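The splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the figures; the function name and the three example plotlines are made up.

```python
from collections import Counter
from itertools import combinations


def pairwise_counts(plotlines):
    """Split every plotline (hyperedge) into character pairs and count them."""
    counts = Counter()
    for characters in plotlines:
        # Sort the names so (Monica, Chandler) and (Chandler, Monica)
        # end up under the same key.
        for pair in combinations(sorted(set(characters)), 2):
            counts[pair] += 1
    return counts


# Hypothetical plotlines of one season (not taken from the real dataset).
season = [
    ("Monica", "Chandler", "Joey"),
    ("Monica", "Chandler"),
    ("Ross", "Rachel"),
]

edge_weights = pairwise_counts(season)
for (a, b), weight in sorted(edge_weights.items()):
    print(a, b, weight)
```

The resulting counts are exactly the edge weights that would be drawn as edge widths in a season network.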

Clicking through these figures, I always start thinking about all the funny scenes of the respective seasons. I think it is time to rewatch it for the 10th time!

Who is the most central character?

Visualizations are fun and stuff, but they do not really help us to determine the most central character of the show. Since I deal with network centrality day in, day out, I have a lot of methods up my sleeve to deal with this problem. However, most of them are not really applicable to hypergraphs. The only measure that can be used in a fairly straightforward way is eigenvector centrality. So let’s do some mild math:

A simple network is usually represented by an adjacency matrix $A$ where $A_{ij}=1$ if there is a link between actor $i$ and actor $j$ and $A_{ij}=0$ otherwise. Since edges have no direction in our case, $A_{ij}=A_{ji}$ and therefore $A$ is a symmetric matrix. Thanks to Perron-Frobenius, we know that (for a connected network) there is a real eigenvalue $\lambda$ which is bigger than every other eigenvalue of $A$. For this eigenvalue, the following equation holds:
$$Av=\lambda v$$
The entries of the vector $v$ are then used to rank the actors of the network. But how can we interpret $v$? The short and simple (and slightly wrong) explanation is that actors are considered important if they are connected to other important actors. So it is not just the number of connections, but also the quality of these connections.
When we deal with hypergraphs, we are faced with the problem that we can no longer represent our network with an adjacency matrix, since links can have more than two endpoints. Instead, I will use the so-called incidence matrix $E$. The incidence matrix has as many rows as the network has links and as many columns as there are actors. So $E_{ij}=1$ if actor $j$ takes part in edge $i$, and $E_{ij}=0$ otherwise.

In order to use the eigenvector centrality concept on $E$, it first has to be projected to a square matrix in the actor space. This is done by multiplying $E$ from the left with its transpose $E^T$, i.e. we have the equation
$$E^TEv=\lambda v.$$
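To make the recipe concrete, here is a small pure-Python sketch that builds $E$, projects it to $E^TE$ and approximates the leading eigenvector by power iteration. Power iteration is my choice for this illustration and not necessarily how the numbers in this post were computed; the function names and the three toy plotlines are made up, too.

```python
def incidence_matrix(hyperedges, actors):
    """Rows are hyperedges, columns are actors; entry is 1 if the actor takes part."""
    return [[1 if actor in edge else 0 for actor in actors] for edge in hyperedges]


def project_to_actors(E):
    """Compute the square actor-by-actor matrix E^T E."""
    n = len(E[0])
    return [[sum(row[i] * row[j] for row in E) for j in range(n)] for i in range(n)]


def leading_eigenvector(M, iterations=1000, tol=1e-12):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    n = len(M)
    v = [1.0 / n] * n  # start with equal importance for everybody
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        if total == 0:
            raise ValueError("matrix has no nonzero entries")
        w = [x / total for x in w]  # rescale so that the entries sum to 1
        change = max(abs(w[i] - v[i]) for i in range(n))
        v = w
        # Because v summed to 1, the sum of M v converges to the eigenvalue.
        eigenvalue = total
        if change < tol:
            break
    return eigenvalue, v


# Hypothetical toy data (not the real Friends dataset).
actors = ["Monica", "Chandler", "Joey", "Ross"]
plotlines = [{"Monica", "Chandler", "Joey"}, {"Monica", "Chandler"}, {"Joey", "Ross"}]

E = incidence_matrix(plotlines, actors)
M = project_to_actors(E)
lam, scores = leading_eigenvector(M)
for name, score in sorted(zip(actors, scores), key=lambda pair: -pair[1]):
    print(name, round(score, 3))
```

In this toy example Monica and Chandler share the top spot, since they appear in exactly the same plotlines, while Ross ends up last because his only connection is through Joey.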

The interpretation of $v$ is the same as before and so it should reflect the importance of the characters. But before looking at the show as a whole, I will show the importance rankings of the characters in each season. Or in other words: Who were the most central characters in seasons 1 to 10? Let’s take a look at the season-wise entries of the vector $v$ and the ranking they induce.

Original size can be found here

I think the values and rankings reflect the storyline of the seasons quite well. For example, season 1 mostly deals with Rachel becoming more independent and the whole Ross and Rachel thing. Seasons 4 to 6 mainly deal with the relationship of Monica and Chandler; therefore, they should be the most central characters during this period. Notable is also the position of Phoebe. During the whole show the story never really focuses on her, such that her position within each season ranking is always quite low.

Now let’s consider all interactions of all episodes at once. That is, we want to know who is the most central character in the show. And it is…

…CHANDLER! That was kind of surprising to me! But even more surprising is the low overall rank of Ross. Shouldn’t he be at least as central as Rachel, since the whole show is about the relationship of Ross and Rachel?

Of course one could question my relatively simple approach to finding the central characters, and of course one could question the dataset. But then again, this is a blog about mildly scientific topics, so…yeah… take the results as they are but do not overinterpret them. Also, because I am going to show in an upcoming post why the results are as they are.


A big thank you goes to Alex Albright who not only provided the dataset but also some valuable discussions which actually motivated me to write this blog entry. Please check out her blog too!

Soccer Analytics Part 1: Mildly Interesting Statistics of 2013

In the summer term 2014, I participated in a seminar about Soccer Analytics. Imagine the fun of combining your research with your hobby! Anyways, I tried to use the seminar to improve my skills in scraping data from the web. Although my R code is the worst case of spaghetti code, I managed to get every match played in first divisions in 177 countries and in every continental competition from and The data set comprises results of 33884 games among 2398 teams worldwide. Enough to go wild on analyses! This first part mostly deals with semi-interesting overall and country-specific statistics. Future posts will deal with such exciting questions as “Who was the best team in the world?” and “If A beats B and B beats C, does that mean A beats C?”

Most Common Scores and Goals per Game

A first question I am going to answer is “What are the most frequent results in soccer?” I did some quick googling and all I could find were common results in specific leagues. I am sure someone else has done a worldwide analysis before, but I couldn’t find anything. So I did some counting in my dataset. The figure below shows the 20 most common results in the season 2012/2013 in 177 countries.

Original size can be found here
So 1:0 and 1:1 are the most common results. Not very exciting, but I guess that was to be expected. The two results add up to around 23%. Together with the next group of scores (0:0, 2:1, 0:1, 2:0, 1:2) they already cover 61% of all results. Interestingly, but also to be expected, the home win results are more frequent than their away win counterparts (e.g. 2:1 and 1:2). I will deal with the home field advantage a bit later. The next figure shows the goals per game distribution.
Original size can be found here
Around 73% of all games end with 3 or fewer goals. The expected value of goals per game is 2.62. But let us focus a bit more on the extreme values of this distribution. 0.2% of all games end with more than 10 goals. There are even two games where 20 or more goals were scored:

  • March 23: Popua-Lotoha’apai United 0:20 [Tonga]
  • June 12: Police-Friend Development 19:3 [Laos]
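The counting behind these numbers is simple enough to sketch in Python. My actual analysis was done in R, so take this as a rough illustration only; the function name and the handful of toy results are made up.

```python
from collections import Counter


def score_summary(results):
    """Summarize a list of (home_goals, away_goals) tuples."""
    n = len(results)
    goals_per_game = [home + away for home, away in results]
    return {
        # Scores ordered from most to least frequent.
        "most_common": Counter(results).most_common(),
        # Average number of goals per game.
        "mean_goals": sum(goals_per_game) / n,
        # Share of games that end with 3 or fewer goals.
        "share_at_most_3": sum(g <= 3 for g in goals_per_game) / n,
    }


# Hypothetical toy results, not taken from the real dataset.
toy_results = [(1, 0), (1, 1), (1, 0), (2, 1), (0, 0), (4, 3)]
summary = score_summary(toy_results)
print(summary["most_common"][0])
print(summary["mean_goals"], summary["share_at_most_3"])
```

Keeping scores as (home, away) tuples means that 2:1 and 1:2 are counted separately, which is what makes the home/away asymmetry visible in the first place.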

The zero-zero coefficient

Yes, it is by far the most annoying result in soccer. Especially when you witness it live in the stadium. Years ago, a friend and I came up with the idea that there should be something like a “zero-zero coefficient” for each team. So if two teams with a high zero-zero coefficient play against each other, you better not be in the stadium! Although my dataset is large, it is not big enough to derive a meaningful zero-zero coefficient for each team. However, I can at least state which countries have the highest and the lowest share of goalless games, i.e. a high and low zero-zero coefficient respectively.
Top 3 countries with most 0:0
  1. Kenya [240 games 21.3% goalless]
  2. Benin [182 games 19.8% goalless]
  3. Guinea [132 games 19.7% goalless]
Interestingly, all three are African countries. More strikingly, 8 of the top 10 countries are African.
Top 3 countries with least 0:0
  1. French Polynesia [75 games 0% goalless]
  2. American Samoa [45 games 0% goalless]
  3. Cook Islands [40 games 0% goalless]
Again we see a strong “continental trend”. All three countries are in Oceania. They are followed by two others from the same region, the Solomon Islands and New Zealand.
The continental trend observed in both lists will be explored a bit more thoroughly in the next section.
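For the curious, a country-level zero-zero coefficient boils down to a grouped fraction. Below is a minimal Python sketch under the assumption that each match is stored as a (country, home goals, away goals) tuple; the function name, the data layout and the toy matches are all made up for illustration.

```python
from collections import defaultdict


def zero_zero_coefficient(matches):
    """Return the fraction of goalless games per country."""
    games = defaultdict(int)
    goalless = defaultdict(int)
    for country, home_goals, away_goals in matches:
        games[country] += 1
        if home_goals == 0 and away_goals == 0:
            goalless[country] += 1
    return {country: goalless[country] / games[country] for country in games}


# Hypothetical toy matches, not taken from the real dataset.
toy_matches = [
    ("Country A", 0, 0), ("Country A", 1, 0),
    ("Country B", 2, 2), ("Country B", 0, 0),
    ("Country B", 3, 1), ("Country B", 1, 2),
]

coefficients = zero_zero_coefficient(toy_matches)
# Rank the countries from most to fewest goalless games.
ranking = sorted(coefficients, key=coefficients.get, reverse=True)
print(ranking, coefficients)
```

The same function would work per team by passing team names instead of countries, but as said above, a single season gives far too few games per team for that to be meaningful.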

Mean Goals per Game in Different Countries

We saw in the last section that there seems to be a continental trend when it comes to goalless games. The question is whether this holds in general, that is, can we find continental trends in goals per game? Looking at the following map, the answer is yes.
Original size can be found here
One can clearly see that the mean number of goals per game is generally much lower in Africa compared to the rest of the world. A bit surprising, at least for me, are the low values in South America. An explanation might be that on both continents the focus lies more on a good defense, or as a German would say: “Die Null muss stehen!” is their motto. In Asia and North America, scoring goals has a higher priority. That is, a 4:3 is always better than a 1:0, although goalkeepers might argue against that reasoning. As in the last section, let’s look at the extreme ends.
Top 3 countries with highest mean goals per game
  1. Tonga 6.625
  2. Cook Islands 6.275
  3. Laos 5.41
Top 3 countries with lowest mean goals per game
  1. Haiti 1.66
  2. Benin 1.67
  3. Burkina Faso 1.71

Home/Away Wins

Home field advantage plays a big role in soccer. There is a lot of research on that topic which can be found by simply googling for “home field advantage soccer”. So I will refrain from philosophizing about it in depth. I was just curious about the differences in the fraction of home/away wins around the world. The first plot shows the home win fraction vs. the away win fraction. I was interested in whether there are countries where more away wins occur than home wins. And indeed there are quite a few!

Original size can be found here

In total there are 19 countries where more away than home wins occurred:
Liberia, Swaziland, Bahrain, Latvia, Liechtenstein, Lithuania, Malta, San Marino, Syria, Cambodia, Cook Islands, Fiji, French Guiana, Saint Helena, Sao Tome and Principe, Solomon Islands, Somalia, Tonga, Trinidad and Tobago. So what do these countries have in common that could explain this phenomenon? (Besides being small countries maybe)
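Finding such countries is a matter of tallying outcomes per country and comparing two fractions. Here is a minimal Python sketch, again assuming matches stored as (country, home goals, away goals) tuples; the function names and the toy matches are made up and this is not the R code I used.

```python
def outcome_fractions(matches):
    """Return the home win, draw and away win fractions per country."""
    tallies = {}
    for country, home_goals, away_goals in matches:
        tally = tallies.setdefault(country, {"home": 0, "draw": 0, "away": 0})
        if home_goals > away_goals:
            tally["home"] += 1
        elif home_goals < away_goals:
            tally["away"] += 1
        else:
            tally["draw"] += 1
    return {
        country: {key: count / sum(tally.values()) for key, count in tally.items()}
        for country, tally in tallies.items()
    }


def more_away_than_home(fractions):
    """List the countries where away wins are more frequent than home wins."""
    return sorted(c for c, f in fractions.items() if f["away"] > f["home"])


# Hypothetical toy matches, not taken from the real dataset.
toy_matches = [
    ("Country A", 2, 1), ("Country A", 0, 0), ("Country A", 3, 0), ("Country A", 0, 1),
    ("Country B", 0, 2), ("Country B", 1, 3), ("Country B", 1, 0), ("Country B", 1, 1),
]

fractions = outcome_fractions(toy_matches)
print(more_away_than_home(fractions))
```

Since draws are tallied as well, the same dictionary also gives the fraction of drawn games per country.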


One thing that distinguishes soccer from many other sports is the draw. Draws can be a pain in the ass if you try to come up with sophisticated ranking methods in soccer (more on that in a future post). The following map shows the fraction of draws around the world.
Original size can be found here
The top three in this category are Morocco, Kenya and Egypt, where around 40% of the games end in a draw.

What do we learn?

There are some significant differences at the continental level when it comes to soccer results. Especially the soccer played in Africa seems to deviate strongly from the rest of the world. It has a high frequency of 0:0, a low average number of goals per game and in general a high fraction of games that end in a draw. So Africa seems to be a paradise for defense enthusiasts!

As I said, these are just some mildly interesting statistics of the dataset. If anyone has further suggestions on what to add here, feel free to tell me in the comments.

Oh yeah, and I guess it is quite obvious that I like xkcd.