Scraping data from the central bank of Argentina

Today I'm going to show you an example of data scraping using the BCRA, the central bank of Argentina.

The BCRA website has a section called "Estadisticas e indicadores" (Statistics and indicators), which has a subsection "Principales variables" (Main variables). This subsection lists a set of KPIs that can be looked up for any date, so I decided to fetch all available variables and chart them. The result is already available at http://bluelytics.com.ar/#/analysis.

So, how are we going to download all this information? I decided to use a tool I was already familiar with: Scrapy, a Python framework built for data scraping.

So, how do we start? Make sure you have Python 2 and pip installed and you will be able to follow these steps. (You can probably do it with Python 3, but the instructions may differ.)

First create a directory (and a git repository in it, if you use git), like I did:

mkdir bcra_scraper
cd bcra_scraper
virtualenv .
. bin/activate
pip install Scrapy

The Scrapy installation will take some time. Once it's done, you can create your project by executing:

scrapy startproject bcra

This will create a project structure in the folder "bcra", which includes all you need to start working.

bcra
├── scrapy.cfg
└── bcra
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Now that we have a project, we need to generate our first spider to crawl all that precious BCRA data. Enter the project folder and generate the spider:

cd bcra
scrapy genspider bcra bcra.gov.ar

This creates a new bcra.py file in the spiders folder, which should contain the following:

# -*- coding: utf-8 -*-
import scrapy


class BcraSpider(scrapy.Spider):
    name = "bcra"
    allowed_domains = ["bcra.gov.ar"]
    start_urls = (
        'http://www.bcra.gov.ar/',
    )

    def parse(self, response):
        pass

This is a basic spider which does nothing, but we have a basic project working! You can test that it works by stepping into the base project folder and executing:

scrapy crawl bcra

Now, let's analyze the website. If you just look at the plain website, you might think it's doing something mysterious, such as an AJAX call that changes the data without changing the URL. In this case, as in many others, the website is horribly structured, but that structure serves our scraping purposes very well.

If you inspect the code carefully, you will see the website is built around an iframe named "mainFrame". If you open the network tab in Chrome and change the date, you will notice a URL of this kind:

http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=02/02/2015

What does this mean? We don't even have to crawl the website! We just have to define the date range we need, request the page for each date, and extract the data from it.

For this, I looked for an easy way to get date ranges in Python and found this nice generator function:

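from datetime import datetime, timedelta

def daterange(start_date, end_date):
    # Yields every date from start_date up to, but not including, end_date
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)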

So we define two variables:

end_date = datetime.now().date()
start_date = end_date - timedelta(days=30)

And we define the start_urls like this:

start_urls = ['http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=%s' % x.strftime("%d/%m/%Y") for x in daterange(start_date, end_date)]
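
To see what this list looks like without running a crawl, here is a minimal sketch with a fixed three-day window. The two dates are arbitrary examples, and the snippet is written for Python 3. Note that the range stops before end_date, so that last day is not requested.

```python
from datetime import date, timedelta

BASE_URL = "http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=%s"


def daterange(start_date, end_date):
    # Yields every date from start_date up to, but not including, end_date
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


# Arbitrary example window: 1 to 4 February 2015 (the 4th is excluded)
example_start = date(2015, 2, 1)
example_end = date(2015, 2, 4)

urls = [BASE_URL % d.strftime("%d/%m/%Y")
        for d in daterange(example_start, example_end)]

for url in urls:
    print(url)
# http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=01/02/2015
# http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=02/02/2015
# http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=03/02/2015
```

If you also want the end date itself, one option is to pass end_date + timedelta(days=1) to daterange.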

So now, how do we extract each value? My approach is to inspect the HTML structure with Google Chrome and then write the corresponding XPath query.

The data table starts like this:

<table border="1" width="100%" id="table7" class="Tabla_Borde" cellpadding="0" cellspacing="3">
    <tbody>
        <tr>
            <td class="Celda_Borde_Titulo">
                <span style="text-transform: uppercase">Descripción (en millones de $)</span>
            </td>
            <td class="Celda_Borde_Titulo">
                <span style="text-transform: uppercase">Valor</span>
            </td>
        </tr>
        <tr>
            <td class="Celda_Borde">
                <a href="es010100.asp?descri=1&amp;fecha=Fecha_Serie&amp;campo=Res_Int_BCRA" target="_self">Reservas Internacionales del B.C.R.A. excluidas asignaciones DEGs 2009 (en millones de dólares - cifras provisorias sujetas a cambio de valuación)</a>
            </td>
            <td class="Celda_Borde_Centro">31316</td>
        </tr>
[...]

So here we can identify some useful patterns:

  • The table which contains all data has a class of Tabla_Borde
  • Each TR after the header row is a different variable
  • Each data row has cells with a class of Celda_Borde for the description and Celda_Borde_Centro for the value

I decided to get all values at once with one XPath query and then save them to a Scrapy item.

To test this, you can use the scrapy shell:

scrapy shell "http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=02/02/2015"

In the shell you can try different patterns until you get the data you need. In this case, this query worked:

response.xpath('//td[@class="Celda_Borde_Centro"]/text()').extract()

The XPath here says:
Look for all td elements with a class of Celda_Borde_Centro, and extract their inner text. The extract() function converts the result into a list of strings.
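
If you want to play with the idea without Scrapy or a network connection, the sketch below runs an equivalent query with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is not what the spider uses. The markup is a simplified, well-formed stand-in for the real page, and the second data row is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the BCRA table; the second data row is invented
SAMPLE = """
<table class="Tabla_Borde">
  <tr>
    <td class="Celda_Borde_Titulo">Descripcion</td>
    <td class="Celda_Borde_Titulo">Valor</td>
  </tr>
  <tr>
    <td class="Celda_Borde">Reservas Internacionales</td>
    <td class="Celda_Borde_Centro">31316</td>
  </tr>
  <tr>
    <td class="Celda_Borde">Made-up example variable</td>
    <td class="Celda_Borde_Centro">8,6305</td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# Same selection as the Scrapy query: every td whose class is Celda_Borde_Centro
values = [td.text for td in root.findall(".//td[@class='Celda_Borde_Centro']")]
print(values)  # ['31316', '8,6305']
```

The header cells are skipped because their class is Celda_Borde_Titulo, so only the value column comes back, in document order.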

Finally, we need to populate the item to return. As most values are numbers with decimals, I defined a function that converts them to floats and fails gracefully. The final code is this:

# -*- coding: utf-8 -*-
import scrapy
from datetime import datetime, date, timedelta

from bcra.items import BcraItem

end_date = datetime.now().date()
start_date = end_date - timedelta(days=30)
#start_date = date(2010,1,1)

def daterange(start_date, end_date):
    # Yields every date from start_date up to, but not including, end_date
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

def getFloatVal(value):
    # Values use a decimal comma; return None when the conversion fails
    try:
        return float(value.replace(',', '.'))
    except Exception:
        return None

class BcraSpider(scrapy.Spider):
    # This name is the one used with "scrapy crawl"
    name = "bcra"
    allowed_domains = ["bcra.gov.ar"]
    start_urls = ['http://www.bcra.gov.ar/estadis/es010000.asp?FechaCons=%s' % x.strftime("%d/%m/%Y") for x in daterange(start_date, end_date)]

    def parse(self, response):
        values = response.xpath('//td[@class="Celda_Borde_Centro"]/text()').extract()

        item = BcraItem()
        # The requested date is the value of the FechaCons query parameter
        item['date'] = response.url.split('=')[1]
        item['Reservas'] = getFloatVal(values[0])
        item['Asistencia'] = getFloatVal(values[1])
        item['BaseMonetaria'] = getFloatVal(values[2])
        item['Circulacion'] = getFloatVal(values[3])
        item['BilletesPublico'] = getFloatVal(values[4])
        item['EfectivoFinanciero'] = getFloatVal(values[5])
        item['DepositosBCRA'] = getFloatVal(values[6])
        item['LEBAC'] = getFloatVal(values[7])
        item['DepositosFinancieras'] = getFloatVal(values[8])
        item['CuentasCorrientes'] = getFloatVal(values[9])
        item['CajasAhorro'] = getFloatVal(values[10])
        item['APlazo'] = getFloatVal(values[11])
        item['CEDROS'] = getFloatVal(values[12])
        item['OtrosDepositos'] = getFloatVal(values[13])
        item['PrestamosAPrivados'] = getFloatVal(values[14])
        item['TasasInteresEntrePrivadas'] = getFloatVal(values[15])
        item['TasasInteres30Dias'] = getFloatVal(values[16])
        item['BADLAR'] = getFloatVal(values[17])
        item['TasasLebac'] = getFloatVal(values[18])
        item['CambioRef'] = getFloatVal(values[19])
        item['CER'] = getFloatVal(values[20])

        if item['Reservas'] is not None:
            return item
        else:
            raise Exception("No data on date " + item['date'])
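
To get a feel for what "fail gracefully" means here, the conversion helper can be tried on its own. This is a small sketch with example inputs I picked myself, not values taken from the site. The last case is worth knowing about: a value with a thousands separator would come back as None, because replacing the comma leaves two dots in the string.

```python
def getFloatVal(value):
    # Values use a decimal comma; return None when the conversion fails
    try:
        return float(value.replace(',', '.'))
    except Exception:
        return None


print(getFloatVal("31316"))     # 31316.0
print(getFloatVal("8,6305"))    # 8.6305
print(getFloatVal(""))          # None (empty cell)
print(getFloatVal(None))        # None (no .replace on None)
print(getFloatVal("1.234,56"))  # None ("1.234.56" is not a valid float)
```

If the page ever showed thousands separators, the helper would need to strip them before converting; I have not checked whether that happens on this table.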

The spider returns a "BcraItem"... but we haven't defined what that is! So we need to open the file bcra/items.py and define it:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BcraItem(scrapy.Item):
    date = scrapy.Field()
    Reservas = scrapy.Field()
    Asistencia = scrapy.Field()
    BaseMonetaria = scrapy.Field()
    Circulacion = scrapy.Field()
    BilletesPublico = scrapy.Field()
    EfectivoFinanciero = scrapy.Field()
    DepositosBCRA = scrapy.Field()
    LEBAC = scrapy.Field()
    DepositosFinancieras = scrapy.Field()
    CuentasCorrientes = scrapy.Field()
    CajasAhorro = scrapy.Field()
    APlazo = scrapy.Field()
    CEDROS = scrapy.Field()
    OtrosDepositos = scrapy.Field()
    PrestamosAPrivados = scrapy.Field()
    TasasInteresEntrePrivadas = scrapy.Field()
    TasasInteres30Dias = scrapy.Field()
    BADLAR = scrapy.Field()
    TasasLebac = scrapy.Field()
    CambioRef = scrapy.Field()
    CER = scrapy.Field()

Then, to run the crawl, just type:

scrapy crawl bcra

And if you want to write the results to a file, you can use one of the built-in output formats, or look into Scrapy's exporters:

scrapy crawl bcra -o bcra.json -t json
scrapy crawl bcra -o bcra.csv -t csv
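
Once you have a JSON file, getting it ready for charting is mostly a matter of parsing the dates and sorting, since the items are not guaranteed to be written in chronological order. Below is a minimal Python 3 sketch of that step. SAMPLE_OUTPUT is a made-up two-item excerpt in the shape the JSON exporter produces (a list of objects with the item fields), and load_items is a helper name I chose for illustration.

```python
import json
from datetime import datetime

# Made-up excerpt standing in for the contents of bcra.json
SAMPLE_OUTPUT = """[
{"date": "03/02/2015", "Reservas": 31300.0, "CER": null},
{"date": "02/02/2015", "Reservas": 31316.0, "CER": null}
]"""


def load_items(text):
    # Parse the exported JSON, turn dd/mm/YYYY strings into dates, sort by date
    items = json.loads(text)
    for item in items:
        item["date"] = datetime.strptime(item["date"], "%d/%m/%Y").date()
    return sorted(items, key=lambda item: item["date"])


items = load_items(SAMPLE_OUTPUT)
for item in items:
    print(item["date"].isoformat(), item["Reservas"])
# 2015-02-02 31316.0
# 2015-02-03 31300.0
```

With a real crawl you would read the text from the file instead, for example with open("bcra.json").read().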

I hope this helps you get started with data scraping!
You can see this repository at:
https://github.com/Bluelytics/bcra_scraper

Any comments and/or questions are welcome!