Scrapy - your neighborhood spider!

Spider? You are kidding :)

Well, spiders are all over the internet. They crawl websites and retrieve information readily available on the wild web.

From Wikipedia:

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Why spider?

Crawlers are software programs that parse various websites and retrieve information for indexing purposes. They build up a stack of data and make it available to users whenever required. Each website carries some amount of meta information which facilitates SEO. Crawlers scrape such information from websites and store it, so that whenever a user requests something that matches the data in the stack, the relevant URLs are returned. In fact, this is how search engines use crawlers to present us with a list of URLs for a query of interest.
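
To make "scraping meta information" a bit more concrete, here is a minimal sketch of my own (not part of the spider we build below) that uses Scrapy's Selector to pull the meta description out of a made-up page:

from scrapy import Selector

# A tiny HTML snippet standing in for a real page
html = '<html><head><meta name="description" content="A friendly neighborhood spider."></head></html>'

sel = Selector(text=html)
# XPath picks out the content attribute of the description meta tag
print(sel.xpath('//meta[@name="description"]/@content').get())
# -> A friendly neighborhood spider.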

These programs operate 24/7 to serve internet users. Just imagine how tedious it would be if a human had to do the same task manually, day after day. Hence the automation: crawlers take over the job of scraping websites and building the stack.

Developing a spider

Developing a basic spider is fairly simple. Scrapy is a Python framework built to facilitate exactly this, and with it one can build crawlers of varying complexity.

Below is a very quick demonstration of how I developed a basic version of a spider that takes a user-provided query string and retrieves the relevant links from Google.

Without further ado, let's get started. First things first, import the relevant libs:

import scrapy
import re                                        # used later to pull target URLs out of Google's redirect links
from scrapy.linkextractors import LinkExtractor  # extracts the links contained in a response
import sys
from scrapy import Selector

Assign the base Google search URL to a variable.

gURL = 'https://www.google.com/search?q='
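
A note on the query format: whatever gets appended to this URL has to be URL-encoded (spaces become +, and so on). The examples further down hard-code strings like what+is+ethical+hacking; if you'd rather build them from plain text, a quick sketch using the standard library:

from urllib.parse import quote_plus

gURL = 'https://www.google.com/search?q='       # same base URL as above
QUERY = quote_plus('what is ethical hacking')   # spaces become '+'
print(gURL + QUERY)
# -> https://www.google.com/search?q=what+is+ethical+hacking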

Let's define the spider class now.

class webSpider(scrapy.Spider):

    name = 'webCrawler'
    start_urls = []
    QUERYSTRING = ''

webSpider is the name of my spider class, and webCrawler is the name I have given to my crawler; within the Scrapy framework, the crawler is referred to by this name. start_urls is the list of URLs that Scrapy crawls, and QUERYSTRING is my Google query.

Scrapy can be invoked from within its own framework and tested via Scrapy's shell. For more information, please refer to the official Scrapy docs located here.
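
If you want to poke at a page interactively before writing any code, the shell is launched from the command line and then behaves like a normal Python prompt with a response object preloaded. A rough sketch of a session (the exact selectors depend on the page you fetch):

# launched with:  scrapy shell 'https://www.google.com/search?q=scrapy'
# inside the shell, `response` is already populated:
response.status                            # HTTP status of the fetched page
response.css('a::attr(href)').getall()     # every href on the page
response.xpath('//title/text()').get()     # the page title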

I personally like to keep my code modular and execute it as a script. It gives me a lot of flexibility: if I ever want to integrate it with some other program, doing so would be fairly simple.

I'll now define the __init__ method to facilitate invocation from an external script.

def __init__(self, QUERY, *argv, **kwargv):
    super(webSpider, self).__init__(*argv, **kwargv)
    self.QUERYSTRING = QUERY
    self.start_urls = [gURL + QUERY]

I can now pass the Google QUERY from an external script and load the full search URL into start_urls. When Scrapy sees start_urls, it requests each URL in the list and initiates the crawl.

My objective now is to retrieve the search results Google returns. To do this, I am defining a method called parse. See below:

def parse(self, response):

    xlink = LinkExtractor()
    link_list = []

    for link in xlink.extract_links(response):
        if len(str(link)) > 200 or self.QUERYSTRING in link.text:
            surl = re.findall('q=(http.*)&sa', str(link))
            if surl:
                link_list.extend(surl)

    print(link_list)

Like I mentioned earlier, Scrapy loads the URLs listed in start_urls and issues standard HTTP requests. This method parses the response, iterates through each link retrieved by LinkExtractor, checks whether the Google query we are interested in is present in the link text, extracts the target URL via regex, and stores it in a list for future use.
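
The regex relies on the way Google wraps each result: the real destination is tucked inside a redirect link as the q= parameter, followed by &sa. A standalone sketch of that extraction (the sample link below is a simplified, made-up example of the redirect format):

import re

# A simplified stand-in for one of Google's redirect-style result links
sample = '/url?q=https://en.wikipedia.org/wiki/Ethical_hacking&sa=U&ved=abc123'

# Same pattern as in parse(): grab everything between 'q=' and '&sa'
print(re.findall('q=(http.*)&sa', sample))
# -> ['https://en.wikipedia.org/wiki/Ethical_hacking']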

This is practically the most basic form of a crawler that searches for a user query and retrieves the links. The snippets of code above can be pasted into a file named myspider.py (or any name of your choice), as shown assembled below.
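
For reference, here is how those snippets fit together in myspider.py once __init__ and parse are indented as methods of the class (no new logic, just the pieces above assembled; the unused sys and Selector imports are left out):

import scrapy
import re
from scrapy.linkextractors import LinkExtractor

gURL = 'https://www.google.com/search?q='

class webSpider(scrapy.Spider):

    name = 'webCrawler'
    start_urls = []
    QUERYSTRING = ''

    def __init__(self, QUERY, *argv, **kwargv):
        super(webSpider, self).__init__(*argv, **kwargv)
        self.QUERYSTRING = QUERY
        self.start_urls = [gURL + QUERY]

    def parse(self, response):
        xlink = LinkExtractor()
        link_list = []

        for link in xlink.extract_links(response):
            if len(str(link)) > 200 or self.QUERYSTRING in link.text:
                surl = re.findall('q=(http.*)&sa', str(link))
                if surl:
                    link_list.extend(surl)

        print(link_list)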

Like I mentioned earlier, I want to execute this crawler from an external Python script. To do that, paste the code snippet below into a .py file and name it whatever you want.

#!/usr/bin/env python3

from scrapy.crawler import CrawlerProcess
from myspider import webSpider

QUERY = 'what+is+ethical+hacking'

process = CrawlerProcess()
process.crawl(webSpider, QUERY)   # QUERY is forwarded to webSpider.__init__
process.start()                   # blocks until the crawl finishes

With this script in place, every time you have a query to search on Google, all you have to do is update the variable QUERY with your search string and execute it. Note that the spider class we wrote earlier has to be imported into this file. The links retrieved can then be used for further scraping whenever needed.
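
If editing the QUERY variable for every search feels clunky, a small variation (my own tweak, not part of the original script) reads the query from the command line and URL-encodes it on the way in:

#!/usr/bin/env python3

import sys
from urllib.parse import quote_plus
from scrapy.crawler import CrawlerProcess
from myspider import webSpider

# usage: python3 <name-of-this-file>.py "what is ethical hacking"
QUERY = quote_plus(' '.join(sys.argv[1:]) or 'what is ethical hacking')

process = CrawlerProcess()
process.crawl(webSpider, QUERY)
process.start()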

You are now all set! Execute the script and automate your Google searches like a pro :)

Where am I heading?

Honestly, even I don't know where I am heading with this. I am using it as a play project to brush up my scripting skills in Python. Well, now that I have asked myself, I'd probably want to use it to automate my internet searches and build a list of links for queries of interest.

If I end up doing something interesting with this, I'll be sure to let you know. If you have recommendations for an interesting application, please feel free to drop a note in the comments and/or connect with me.

Until then, Cheers!