Scraping response from a Form submission using Scrapy

There have been times when I've wanted to keep track of content on the web, specifically changes in that content. Python Requests plus regex is the usual way to go, but it requires a lot of boilerplate. I guess that's why people write libraries. Anyway, Scrapy has a bunch of utilities that automate the process. Here's an example of how to scrape information from the response to a form submission.

I am using this to track the status of my USCIS case. I intend to deploy it to a Django-Celery server instance and have Tasker (on my Android) pull it and give me an update every morning. More on that later.

So let's define a simple container for the information. Here's my container.

# items.py
import scrapy

class UscisItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    timestamp = scrapy.Field()  # set per-item at parse time; a class-level datetime.now() would be frozen at import time
    case_number = scrapy.Field()
    status_headline = scrapy.Field()
    status_details = scrapy.Field()
    links = scrapy.Field()
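Scrapy items behave like dicts, so populating one is straightforward. A quick sketch of what that looks like (the receipt number here is made up):

item = UscisItem()
item['case_number'] = 'ABC1234567890'  # hypothetical receipt number
item['status_headline'] = 'Case Was Received'
print(dict(item))  # {'case_number': 'ABC1234567890', 'status_headline': 'Case Was Received'}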

Now, to create the spider. The spider is what crawls the URLs and extracts the data.

import datetime

import scrapy
from scrapy.http import FormRequest
from ..items import UscisItem

USCIS_CASE_NUMBER = "SOME_STRING"

class USCISCaseStatusSpider(scrapy.Spider):
    name = 'case_status'
    allowed_domains = ['uscis.gov']
    start_urls = ['https://egov.uscis.gov/casestatus/landing.do']

    def parse(self, response):
        # Fill in the case status form and submit it
        yield FormRequest.from_response(response,
                                        formname="caseStatusForm",
                                        formdata={"appReceiptNum": USCIS_CASE_NUMBER},
                                        callback=self.parseUSCISCaseResponse)

    def parseUSCISCaseResponse(self, response):
        item = UscisItem()
        item['timestamp'] = datetime.datetime.now()
        item['case_number'] = USCIS_CASE_NUMBER
        item['status_headline'] = response.xpath('/html/body/div[2]/form/div/div[1]/div/div/div[2]/div[3]/h1/text()').extract()
        item['status_details'] = response.xpath('/html/body/div[2]/form/div/div[1]/div/div/div[2]/div[3]/p/text()').extract()
        yield item  # without this, Scrapy never collects the item
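One caveat: those absolute XPaths came straight out of the browser's "Copy XPath", so they'll break the moment USCIS rearranges a div. A relative selector anchored on something stable is sturdier. I haven't verified the exact markup, so treat the class name below as an assumption and check the page source first:

# Hypothetical selector: assumes the status block carries a class like "current-status-sec".
# Inspect the live page and adjust before relying on it.
item['status_headline'] = response.xpath(
    '//div[contains(@class, "current-status-sec")]//h1/text()').extract_first()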

And BOOM, done. Running scrapy crawl case_status launches the spider, and each yielded UscisItem carries the scraped data. Of course, the next step is to insert it into a DB if you want and do whatever with it. I'll build that later.
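If you do go the DB route, Scrapy's item pipelines are the natural hook. Here's a minimal sketch using SQLite; the table name and schema are my own invention, and you'd enable the class via ITEM_PIPELINES in settings.py:

# pipelines.py
import sqlite3

class UscisSqlitePipeline:
    def open_spider(self, spider):
        # One connection for the whole crawl
        self.conn = sqlite3.connect('uscis.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS case_status '
            '(timestamp TEXT, case_number TEXT, headline TEXT, details TEXT)'
        )

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO case_status VALUES (?, ?, ?, ?)',
            (str(item.get('timestamp')), item.get('case_number'),
             str(item.get('status_headline')), str(item.get('status_details'))),
        )
        self.conn.commit()
        return item  # pass the item along to any later pipelines

    def close_spider(self, spider):
        self.conn.close()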

Oh, interestingly, if you don't want to deploy it on a remote machine, you could just cron it up and run it every morning.
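For cron it helps to have a single script to invoke. You can run the spider programmatically with Scrapy's CrawlerProcess; just run it from the project directory so get_project_settings() can find your scrapy.cfg:

# run_case_status.py -- a small runner for cron to invoke every morning
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('case_status')  # spider name from USCISCaseStatusSpider.name
process.start()               # blocks until the crawl finishes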

And of course, since I am on Mac OS X, I prefer pretty notifications. I use pync for that sort of thing. The repo contains fixes that haven't been published to PyPI, so it's better to just clone the repo and install from there.

So in the end our callback just displays a notification with the status headline. Note the extract_first() call below: it returns a single string rather than a list, which is what Notifier.notify expects for the message.

from pync import Notifier

def parseUSCISCaseResponse(self, response):
    item = UscisItem()
    item['case_number'] = USCIS_CASE_NUMBER
    item['status_headline'] = response.xpath('/html/body/div[2]/form/div/div[1]/div/div/div[2]/div[3]/h1/text()').extract_first()
    item['status_details'] = response.xpath('/html/body/div[2]/form/div/div[1]/div/div/div[2]/div[3]/p/text()').extract()
    Notifier.notify(item['status_headline'], title='USCIS Case Status')
    yield item

That should be fun, seeing changes every morning. It's trivial, but powerful.