Python Web Search Agent

So there was this web page, nothing more than a text file in fact, which I was compelled to check regularly. I had ordered something online from a company whose web-store design was still in the dark ages, and their “method” of letting customers know whether or not their products had shipped was to periodically update a web-viewable text file with a public list of all the order numbers that had recently shipped. Very private and secure, I know.

No, no, they couldn’t possibly email this information to you — you had to check the page manually. And if your products were backordered, as mine were, you’d be checking often.

Well, that’s just stupid, I thought, and since I had a computer sitting right in front of me, it seemed like a good idea to automate the process.

So I cobbled together a little Python script on my Ubuntu box, which uses the pycURL interface to grab the URL in question, parse it, search it for my order number, and then fire me off an email (using sendmail) if and when it finds anything.

I automated the script to run hourly using cron, like so:

crontab -e

and then included the following line in my crontab file:

# m h  dom mon dow   command
7 * * * * /path/to/searchagent.py

Seven minutes after the hour (for luck), every hour, every day. (One gotcha: cron runs the script directly via its shebang line, so remember to chmod +x it first.)

On the hunch that this script might be useful to somebody else, I decided to post it here. Fair warning to all: this is a huge hack, and I’m sure it needs work. Also, there’s probably a better method that I’ve completely overlooked. Comments, corrections, and suggestions are welcome, as always.

Here’s the script:

#!/usr/bin/env python

#############################
# Module:   searchagent.py
# Author:   Brian D. Wendt
# Date:     2008/08/15
# Version:  Draft 0.3
'''
Searches for a specified string in a supplied webpage,
then sends an email when found.

Requires Linux 'sendmail' and 'pycurl' module.
On Ubuntu, try 'sudo apt-get install sendmail python-pycurl'
'''
################################

import os
import pycurl

#####################################################
### Edit these variables! ###########################
#####################################################

# webpage to search
url = 'http://reddit.com/'

# string to find in the webpage
query_string = "LOLcats"

# email message particulars
to_name    = "Harry S. Noob"
to_email   = "hsnoob@example.com"
from_name  = "Python Search Agent"
from_email = "searchagent@example.com"
subject    = "We've Found Something!"
body       = "Your search query %s was found in a scan of %s." % (query_string, url)

#####################################################
### The mail-sending function (requires sendmail) ###
#####################################################

def sendmail(to_name, to_email, from_name, from_email, subject, body):
    # full path to sendmail
    mailerdaemon = "/usr/sbin/sendmail"
    # logfile when email sent (so you only send once!)
    email_log = "searchagent.log"
    # format the message
    mail = "To: \"%s\" <%s>\nFrom: \"%s\" <%s>\nSubject: %s\n\n%s" % (to_name, to_email, from_name, from_email, subject, body)
    # check to see if the email logfile exists
    try:
        logfile = open(email_log, 'r')
        email_sent = logfile.readline()
    except IOError:
        email_sent = ""
    # if no mail's been sent, send one
    if email_sent == "":
        # open a pipe to sendmail and write message to the pipe
        print "Sending email..."
        p = os.popen("%s -t" % mailerdaemon, 'w')
        p.write(mail)
        exitcode = p.close()
        if exitcode:
            print "Oops! There was an error: %s" % exitcode
        else:
            print "Mail sent!"
            # log the sending, so you don't send again
            log = open(email_log, 'w')
            log.write("MAIL SENT")
            log.close()
    else:
        print "Email already sent!  Exiting..."

#####################################################
### Read the webpage using pyCURL ###################
#####################################################

# some CURL options (pose as a Mozilla browser)
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0)'
headers = ['Cache-control: max-age=0', 'Pragma: no-cache', 'Connection: Keep-Alive']

# create temporary textfile for results
tempfile = "TEMP-HTML.txt"
outputfile = open(tempfile, "w")

# keep track of the result count and results list
results_count = 0
results_list = []

# set up CURL object
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HEADER, 1)
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.CONNECTTIMEOUT, 40)
c.setopt(pycurl.TIMEOUT, 300)
c.setopt(pycurl.FILE, outputfile)
c.perform()
outputfile.close()

# now search the file line by line
infile = open(tempfile, "r")
for line in infile.readlines():
    search = line.find(query_string)
    if search != -1:
        # store results in a list, if desired
        # results_list.append(line.strip())
        results_count += 1
infile.close()

# if there's any result, send an email
if results_count > 0:
    print "Found a match.  ",
    sendmail(to_name, to_email, from_name, from_email, subject, body)
else:
    print "No matches found."

# delete temporary textfile
os.remove(tempfile)
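As for that better method I mentioned: if you'd rather skip the pycurl and sendmail dependencies entirely, the same idea can be sketched with nothing but the standard library — urllib fetches the page and smtplib hands the message to a local mail server. This is a rough sketch written for modern Python 3, so it won't drop into the script above as-is; the function names and the localhost SMTP assumption are mine, not part of the original. (The run-once log file is left out for brevity.)

```python
import smtplib
import urllib.request
from email.mime.text import MIMEText


def count_matches(text, query):
    """Count how many lines of `text` contain `query`."""
    return sum(1 for line in text.splitlines() if query in line)


def fetch_page(url):
    """Fetch `url` and return the response body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0)"})
    with urllib.request.urlopen(request, timeout=40) as response:
        return response.read().decode("utf-8", errors="replace")


def send_mail(from_addr, to_addr, subject, body, host="localhost"):
    """Hand the message to an SMTP server (an assumption: one is
    listening on localhost, as sendmail does when installed)."""
    message = MIMEText(body)
    message["Subject"] = subject
    message["From"] = from_addr
    message["To"] = to_addr
    with smtplib.SMTP(host) as server:
        server.send_message(message)


if __name__ == "__main__":
    url = "http://reddit.com/"
    query_string = "LOLcats"
    if count_matches(fetch_page(url), query_string) > 0:
        send_mail("searchagent@example.com", "hsnoob@example.com",
                  "We've Found Something!",
                  "Your search query %s was found in a scan of %s."
                  % (query_string, url))
```

No temporary file needed this way, either — the page body stays in memory and gets searched directly.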
