
Soccer World Cup PageRank analysis

Today I have something unusual (at least for me) to share with you: an analysis of the World Cup matches.

Of course, I have no idea about soccer, but I do know how to transform data, so I will show you something I created by scraping data off FIFA's website. I explain the whole process in this blog post.

Data Source

For this analysis, I wanted to take into account all matches played in the "preliminary" rounds, so I created a Scrapy (Python) scraper that extracts all match results from the following pages:

http://es.fifa.com/worldcup/matches/preliminaries/africa/index.html

http://es.fifa.com/worldcup/matches/preliminaries/asia/index.html

http://es.fifa.com/worldcup/matches/preliminaries/europe/index.html

http://es.fifa.com/worldcup/matches/preliminaries/nccamerica/index.html

http://es.fifa.com/worldcup/matches/preliminaries/oceania/index.html

http://es.fifa.com/worldcup/matches/preliminaries/southamerica/index.html

http://es.fifa.com/worldcup/matches

The data I extract is the two teams playing each match and the final score of each team. The final format of the CSV is:

home,away,area,scoreHome,scoreAway
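As a rough illustration of that layout, here is a small Python sketch that parses rows in this format. The sample rows (team names, area labels and scores) are made up, and the helper name load_matches is hypothetical; only the column names come from the header above.

```python
import csv
import io

# Hypothetical sample in the home,away,area,scoreHome,scoreAway layout.
# Teams, area labels and scores are invented for illustration only.
SAMPLE_CSV = """home,away,area,scoreHome,scoreAway
Team A,Team B,europe,2,1
Team C,Team A,europe,0,0
Team B,Team C,finals,1,3
"""


def load_matches(text):
    """Parse CSV text into a list of dicts, converting scores to integers."""
    matches = []
    for row in csv.DictReader(io.StringIO(text)):
        matches.append({
            "home": row["home"],
            "away": row["away"],
            "area": row["area"],
            "scoreHome": int(row["scoreHome"]),
            "scoreAway": int(row["scoreAway"]),
        })
    return matches


matches = load_matches(SAMPLE_CSV)
for m in matches:
    print(m["home"], m["scoreHome"], "-", m["scoreAway"], m["away"])
```

To read the real scraper output instead, the same function could be fed the contents of the generated CSV file.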

Data transformation

Here we leave the Python world to enter the R world :) (I know it can also be done with Python, but I haven't explored that part of Python yet.)

I created an R script that starts by loading the CSV into a data frame and then groups and summarizes the values so that the matches can be expressed as a graph, in which each losing team "points" to the winning team. This is because I later want to apply the PageRank algorithm, the one that Google uses to rank websites.

If you don't know the PageRank algorithm, it basically says "If you are linked to by important websites, then your website is important" (oversimplified). In this graph, that translates to: a team ranks high if it has beaten teams that rank high.
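To make the idea concrete, here is a minimal sketch of weighted PageRank by power iteration in plain Python. It is not the code used for this analysis (that part is done in R); the function name pagerank, the damping factor of 0.85, the iteration count and the toy teams are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edges: list of (source, target, weight) tuples with positive weights.
    Returns a dict mapping each node to a score; the scores sum to 1.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    count = len(nodes)

    # Total outgoing weight per node, used to normalise each edge.
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Nodes with no outgoing edges spread their rank evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for s, t, w in edges:
            if out_weight[s] > 0:
                new_rank[t] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank


# Toy graph: an edge points from the loser to the winner.
toy_edges = [("B", "A", 1.0), ("C", "A", 1.0), ("C", "B", 1.0)]
toy_rank = pagerank(toy_edges)
for team in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(team, round(toy_rank[team], 3))
```

In the toy graph, A beats both other teams and B beats C, so the ordering comes out as A, then B, then C.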

The format of the graph edge list is:

from,to,weight

Where weight is a value that I derive from the relative importance of the match: it is based on the number of matches played in the same area, or a fixed value if it's not a preliminary match.
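The transformation from match results to a from,to,weight edge list could look roughly like the Python sketch below. Several things here are assumptions and not taken from the R script: the area labels, the formula of one divided by the number of matches in the area for preliminaries, the fixed weight of 1.0 for other matches, skipping draws, and summing weights for repeated pairings. The real rule lives in analyze.R.

```python
from collections import Counter

# Assumed values, not taken from the original R script.
FIXED_WEIGHT = 1.0
PRELIMINARY_AREAS = {
    "africa", "asia", "europe", "nccamerica", "oceania", "southamerica",
}


def build_edges(matches):
    """Turn (home, away, area, score_home, score_away) tuples into edges.

    Each edge is (loser, winner, weight). Weights of repeated pairings
    are summed. Draws are skipped (an assumption of this sketch).
    """
    per_area = Counter(area for _, _, area, _, _ in matches)
    totals = Counter()
    for home, away, area, score_home, score_away in matches:
        if score_home == score_away:
            continue  # assumption: a draw adds no edge
        if score_home > score_away:
            loser, winner = away, home
        else:
            loser, winner = home, away
        if area in PRELIMINARY_AREAS:
            # Assumed formula: more matches in an area means less weight each.
            weight = 1.0 / per_area[area]
        else:
            weight = FIXED_WEIGHT
        totals[(loser, winner)] += weight
    return [(loser, winner, w) for (loser, winner), w in sorted(totals.items())]


# Made-up matches for illustration only.
sample_matches = [
    ("Team A", "Team B", "europe", 2, 1),
    ("Team B", "Team A", "europe", 0, 1),
    ("Team A", "Team C", "finals", 0, 2),
    ("Team B", "Team C", "finals", 1, 1),
]
edges = build_edges(sample_matches)
print("from,to,weight")
for loser, winner, weight in edges:
    print(f"{loser},{winner},{weight}")
```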

Graph analysis

With the data transformed, I used the igraph package to build the graph and plotted it with tkplot to see what I had produced. You can see the plot with only the qualified teams here:

Graph

Results

So, what's the result of applying PageRank? Counting games played up to June 27th, the results are the following (more points is better):

Position  Team                  PageRank
1         Costa Rica            350.86
2         Francia               345.10
3         Alemania              341.07
4         México                323.63
5         Uruguay               315.64
6         Argentina             303.62
7         EEUU                  280.18
8         Países Bajos          257.48
9         Suiza                 226.53
10        Chile                 218.52
11        Honduras              215.53
12        Colombia              199.86
13        España                191.62
14        Bélgica               188.58
15        Ecuador               164.52
16        Nigeria               122.12
17        Argelia               100.79
18        Costa de Marfil       88.69
19        Portugal              87.81
20        Ghana                 86.88
21        República de Corea    86.59
22        Camerún               86.50
23        Irán                  80.28
24        Australia             78.92
25        Bosnia y Herzegovina  76.98
26        Grecia                75.04
27        Italia                74.07
28        Inglaterra            64.73
29        Brasil                57.44
30        Croacia               55.54
31        Japón                 54.43
32        Rusia                 37.49

Show me the mon- I mean the code

The GitHub repository for this analysis is sicarul/worldcup_scraper.

The steps to run it are:

  1. Clone the repo (git clone https://github.com/sicarul/worldcup_scraper)
  2. Create a virtualenv in the directory (cd worldcup_scraper && virtualenv .)
  3. Activate the environment (. bin/activate)
  4. Install dependencies (pip install -r requirements.txt)
  5. Go into the scraper (cd worldcup)
  6. Run the scraper (scrapy crawl preliminary -o ../preliminary.csv -t csv)
  7. Go back into the root of the project (cd ..)
  8. Run the R script (Rscript analyze.R) <- You will need igraph and plyr installed

Some notes

  • This is not a prediction, as it only reflects past results and is not based on a statistical model.

  • As Brazil didn't have to play preliminary matches, its score is irrelevant, at least for now.

  • Sorry about the team names being in Spanish; I had already done everything by the time I realized that.