Scrapy to corpus!

When I wrote my previous blog post on a very basic Scrapy implementation, I had no clue where to apply it beyond web crawling. In fact, my implementation wasn't like the standard ones either: it scrapes the web and retrieves links from Google based on the user's query, which a typical crawler doesn't do.

There isn't much value in merely storing those links in a user-accessible location like a database or a flat file for future reference. The program's worth is determined by its application.

That said, while surfing the wild web I ran into something called a corpus, and it immediately struck me: could I build a corpus using Scrapy?

Corpus

A corpus is a collection of texts on any topic whatsoever. It could be a topic-centric collection specific to a subject, or a collection of all and sundry topics under the sun. It's usually stored in digital format in databases for subsequent use in NLP applications.

From Wikipedia:

In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Why a corpus?

A corpus is the most basic and critical building block of a Natural Language Processing (NLP) application. To develop an NLP application, we need a corpus of written or spoken natural language material. It is the input from which the application retrieves and processes the facts and figures associated with the use case in development.

How to create a corpus?

I will be using my earlier implementation of Scrapy to build a custom corpus. Note: this is not something you'd see an NLP application developer do. They would most likely use existing corpora in their program. Only a crazy developer would end up building a corpus on the fly. So, hang on and enjoy the madness.

In this post I will not write out the entire program, but only the snippets of code that need to be added to my Scrapy implementation to achieve it.

First things first, import the required class as shown below.

from scrapy import Selector

This Scrapy class helps extract data from any HTML source.
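
Here's a minimal, self-contained sketch of how Selector works on raw HTML. The HTML string and XPath here are purely illustrative, not part of my crawler:

from scrapy import Selector

# Illustrative HTML only; in the crawler the Selector wraps a live response
html = "<body><div><p>Hello</p><p>corpus</p></div></body>"
sel = Selector(text=html)

# XPath pulls out the text nodes inside the <p> tags
print(sel.xpath('//p/text()').extract())
# ['Hello', 'corpus']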

body = ""

Initialize a variable named body in the webSpider class of my Scrapy implementation. It will store the content of the web page being scraped.
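
For context, here's roughly where that variable sits. The class name webSpider and the QUERYSTRING attribute come from my earlier implementation; the spider name and query value below are just examples:

import scrapy

class webSpider(scrapy.Spider):
    name = "webspider"                # assumed spider name for this sketch
    QUERYSTRING = "machine learning"  # the user's search query (example value)
    body = ""                         # will hold the scraped page content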

def getBody(self, response):
    # Wrap the response in a Selector so we can run XPath queries on it
    sel = Selector(response)
    # Grab every text node inside <p> tags nested under <div>s in <body>
    body = "".join(sel.xpath('//body//div//p//text()').extract()).strip()
    print(body)
    # Name the output file after the search query, e.g. "machinelearning.txt"
    textfile = self.QUERYSTRING.replace(" ", "") + ".txt"
    # Append mode, so text from every scraped page accumulates in one file
    with open(textfile, "a+") as f:
        f.write(body)

The getBody function will extract text from the HTML source (in this case the web page being scraped) from <p> tags nested inside <div> tags within <body>, as shown below, and save it to a file.

<body>
       <div>
             <p>.......</p>
       </div>
       <div>
             <p>.....</p>
       </div>
       .....
</body>

In fact, it'll extract text from anywhere in the HTML source that matches the above pattern. If there are multiple matching tags, it'll extract from all of them and append the text to the .txt file.
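
One side effect worth knowing: joining the matches with an empty string runs the paragraphs together. A quick standalone illustration with made-up HTML:

from scrapy import Selector

# Two separate <div><p> blocks; the XPath in getBody matches both
html = """<body>
  <div><p>First paragraph.</p></div>
  <div><p>Second paragraph.</p></div>
</body>"""
sel = Selector(text=html)
print("".join(sel.xpath('//body//div//p//text()').extract()).strip())
# First paragraph.Second paragraph.

Joining with a space or newline instead of "" would keep the paragraphs apart.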

for url in link_list:
    yield scrapy.Request(url, callback=self.getBody)

Finally, add the above at the end of the parse function to loop through the list of links, make an HTTP request to each, and hand the response to the custom getBody function above to yield the content.
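
To make the wiring concrete, here's a rough sketch of the parse function after that addition. How link_list gets built from the Google results page is simplified here and may differ from the original implementation:

import scrapy

class webSpider(scrapy.Spider):
    # ... attributes and getBody as above ...

    def parse(self, response):
        # In my implementation link_list holds the result URLs pulled from
        # the Google results page; this XPath is a simplification
        link_list = response.xpath('//a/@href').extract()
        # Request each link and hand the response over to getBody
        for url in link_list:
            yield scrapy.Request(url, callback=self.getBody)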

And that's it! Try it out for yourself to see how it works. If you need to see what the entire code looks like, give me a shout either in the comments section below or on Twitter.

What next?

Now that I have a mechanism for creating a corpus, I could use it in a chatbot-type use case where I converse with it. I could ask queries, and it could either respond from the existing corpus or retrieve information using this implementation and then respond.

Stay tuned for more updates on this implementation.