Web scraping with a Raspberry Pi and Python

Posted by mantonel on June 20, 2016. Last update on October 16, 2016.
Python Raspberry Pi Web scraping

Introduction

This tutorial was done with Raspbian Jessie installed on a fresh new Raspberry Pi 3.

If it is a fresh install, do not forget to expand the file system and reboot. If you can plug in an Ethernet cable, do it (it will be much faster)!
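On Raspbian Jessie you can expand the file system with raspi-config:

sudo raspi-config

Then choose "Expand Filesystem" (the exact menu label may vary between image versions), exit and reboot.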

First we are going to update the system.

sudo apt-get update
sudo apt-get upgrade

We are going to use Python 3 for this tutorial.

python3 --version
Python 3.4.2

To avoid being annoyed by the screensaver, let's install xscreensaver (this is not required but practical).

sudo apt-get -y install apt-transport-https
sudo apt-get -y install xscreensaver

You can then disable the screensaver from the GUI (Menu/Preferences/Screensaver) or by editing the file ~/.xscreensaver: set the "mode" line to "off" (this file seems to appear only after you start xscreensaver from the GUI for the first time).
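For reference, the relevant line in ~/.xscreensaver looks like this (just an excerpt, the real file contains many other settings):

mode: off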

Simple web scraping with urllib

We are going to start by using urllib to get the source code of the page "http://mantonel.com".

Save the following code into a file and execute it; opening "output.txt" in write mode creates it in the same folder, so there is no need to create it beforehand:

#!/usr/bin/env python3

from urllib import request

# This line downloads the source code of http://mantonel.com
urlrequest = request.urlopen('http://mantonel.com')

# Write the output to output.txt
# (read() returns bytes, so decode them instead of using str(),
# which would keep the b'...' wrapper and escaped newlines)
with open('output.txt', 'w') as file:
    file.write(urlrequest.read().decode('utf-8'))
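If you saved the script as, say, scrape.py (any name works), run it with:

python3 scrape.py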

The file "output.txt" contains the source code of the page, you can verify it thanks to your favorite web browser.

Let's try with another website, for example 500px (by the way this website has some awesome pictures).
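For example, here is the same script with only the URL changed:

#!/usr/bin/env python3

from urllib import request

# Same as before, but for 500px
urlrequest = request.urlopen('https://500px.com/popular')

with open('output.txt', 'w') as file:
    file.write(urlrequest.read().decode('utf-8'))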

If you now compare "output.txt" with the source code shown by your web browser, you should notice they are different. Why? My website does not use JavaScript to generate HTML code, but 500px does.

Executing javascript when web scraping

To execute JavaScript we are going to need a web browser, Firefox (Iceweasel), and a Python library, Selenium, to drive it:

sudo apt-get -y install iceweasel
sudo python3 -m pip install selenium==2.53.5

You might have noticed I didn't install the latest version of Selenium; that's because, at the time of writing, Selenium 3.0.1 doesn't work with Firefox 45.4.0.
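If you want to check which version of Selenium actually got installed:

python3 -c "import selenium; print(selenium.__version__)"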

Now you can execute this code:

#!/usr/bin/env python3

from selenium import webdriver

# Create the web browser instance
browser = webdriver.Firefox()

# Web browser fullscreen (optional)
# You can also use browser.set_window_size(w, h).
browser.maximize_window()

# Load the page (this executes its JavaScript)
browser.get('https://500px.com/popular')

# Write the output to output.txt
# (page_source must be read before closing the browser)
with open('output.txt', 'w') as file:
    file.write(browser.page_source)

# Close the browser
browser.close()

If you are executing the script from a remote terminal, do not forget to execute this command line before the script:

export DISPLAY=:0

It can sometimes be useful to modify the size of the window. Here Firefox is maximized to get more pictures displayed (so that we get more source code).

Firefox is pretty slow on a Raspberry Pi so do not worry if it takes a few seconds. When executing the script, if you installed xscreensaver and plugged your Raspberry Pi into a screen, you should see Firefox opening, loading the page and then closing.

As Firefox takes a bit of time to start, if you want to visit several web pages, do it within the same instance of Firefox, as in the sketch below.
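Here is a minimal sketch (the URLs are just examples) that reuses a single Firefox instance for several pages and saves each page source to its own file:

#!/usr/bin/env python3

from selenium import webdriver

# Example pages to visit (placeholders, replace with your own)
urls = ['https://500px.com/popular', 'https://500px.com/fresh']

# Start Firefox once and reuse it for every page
browser = webdriver.Firefox()

for index, url in enumerate(urls):
    browser.get(url)
    # One output file per page: output0.txt, output1.txt, ...
    with open('output{}.txt'.format(index), 'w') as file:
        file.write(browser.page_source)

browser.close()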

NB: sometimes, the first time I execute this script on a fresh new install, it doesn't work and stops at the line "webdriver.Firefox()". If that happens, try starting Firefox once (from the GUI or with the following command) and then execute the script again:

firefox

Headless web scraping

Now we are going to do the same but without displaying Firefox: the web browser is going to be executed in a virtual display. We need the Xvfb package, which provides a display server like X11 but without any screen output; the data is only kept in memory. This way you do not need a screen or an X server running. PyVirtualDisplay is a Python wrapper to use Xvfb.

sudo apt-get -y install xvfb
sudo python3 -m pip install pyvirtualdisplay
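Since PyVirtualDisplay just starts an Xvfb process for you, you can also check that Xvfb itself works by launching one by hand (the display number ":1" and the geometry are arbitrary examples):

Xvfb :1 -screen 0 1920x1080x24 &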

Now you can execute the following code:

#!/usr/bin/env python3

from selenium import webdriver
from pyvirtualdisplay import Display

# Create the virtual display
display = Display(visible=0, size=(1920, 1080))
# Start the display
display.start()

# Create the web browser instance
browser = webdriver.Firefox()

# Web browser fullscreen (optional)
# You can also use browser.set_window_size(w, h).
browser.maximize_window()

# Load the page (this executes its JavaScript)
browser.get('https://500px.com/popular')

# Write the output to output.txt
# (page_source must be read before closing the browser)
with open('output.txt', 'w') as file:
    file.write(browser.page_source)

# Close the browser, then stop the virtual display
# (don't stop the display while the browser is still running)
browser.close()
display.stop()

The script should give the same result as before but without showing the web browser on screen.

Headless web scraping without an X server

Finally, we are going to do the same once again but without the desktop GUI. Let's deactivate the X server:

sudo raspi-config

Navigate to "Boot Options" then choose "Console Autologin" (choose "Desktop Autologin" if you want to revert this change later) and "Ok". Restart the Raspberry Pi with:

sudo reboot

When the reboot is done, you should only see the console prompt. Execute the script; you should get the same result.

To conclude

Web scraping is interesting with a Raspberry Pi because, unlike your PC, it's cheap and does not require much power, so you can leave it plugged in forever.

If you are looking for something faster, take a look at PhantomJS. I will maybe make another tutorial on PhantomJS if I can get it to work on my Raspberry Pi.

I guess you are now ready to scrape the whole web (or maybe not).


1 Comment

bend94@gmail.com January 1, 2018
Hi, I am looking for a similar scraping script to be able to catch information coming from an HTTPS site. This HTTPS site needs a login first with user + password. Can you advise me how to do so? Thanks