Case information from the Las Vegas Municipal Court can be accessed through an ASP.NET Web Form. We’ll be doing a bit of scraping.
That form takes a single argument, 16 characters or fewer in length. I decided to tackle this with the
mechanize library. First run, based on the usual mechanize boilerplate:
import mechanize

br = mechanize.Browser()
br.set_all_readonly(False)    # allow everything to be written to
br.set_handle_robots(False)   # no robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:18.104.22.168) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
And fetch our target URL:
target = 'https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx'
response = br.open(target)
print response.read()
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head><title> City of Las Vegas Court Case Lookup </title><link rel="stylesheet" type="text/css" href="http://www.lasvegasnevada.gov/includes/stylesheetSmall.css" /> [...]
We know that there’s only one form, but let’s see:
for form in br.forms():
    print "Form name:", form.name
    print form
Form name: None
<POST https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx application/x-www-form-urlencoded
  <HiddenControl(__EVENTTARGET=) (readonly)>
  <HiddenControl(__EVENTARGUMENT=) (readonly)>
  <HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
  <HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
  <TextControl(txt_CaseNo=)>
  <SubmitControl(btn_GetCase=Get Report) (readonly)>>
As we expected, a single form. We’ll select this form and iterate through the controls:
br.form = list(br.forms())[0]
for control in br.form.controls:
    print control
    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
<HiddenControl(__EVENTTARGET=) (readonly)>
type=hidden, name=__EVENTTARGET value=
<HiddenControl(__EVENTARGUMENT=) (readonly)>
type=hidden, name=__EVENTARGUMENT value=
<HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
type=hidden, name=__VIEWSTATE value=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ
<HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
type=hidden, name=__EVENTVALIDATION value=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML
<TextControl(txt_CaseNo=)>
type=text, name=txt_CaseNo value=
<SubmitControl(btn_GetCase=Get Report) (readonly)>
type=submit, name=btn_GetCase value=Get Report
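For reference, what mechanize assembles on submit is a single application/x-www-form-urlencoded POST body carrying those hidden ASP.NET state fields plus our input. A quick sketch (the `__VIEWSTATE` and `__EVENTVALIDATION` values are truncated here for readability):

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# The ASP.NET postback fields from the form above, values abbreviated.
fields = [
    ('__EVENTTARGET', ''),
    ('__EVENTARGUMENT', ''),
    ('__VIEWSTATE', '/wEPDwULLTE3...'),
    ('__EVENTVALIDATION', '/wEWAwLdxI...'),
    ('txt_CaseNo', 'C1002073A'),
    ('btn_GetCase', 'Get Report'),
]
body = urlencode(fields)
print(body)
```

Note that the server rejects a submission whose `__EVENTVALIDATION` doesn't match, which is why we let mechanize carry the values through rather than building the request by hand.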
More of the same. Let's create a single test case number and submit our query:
test_num = 'C1002073A'
br["txt_CaseNo"] = test_num
response = br.submit()
print response.read()
print br.response().read()
Which isn't what we wanted: the report page never materializes, because the lookup depends on JavaScript that mechanize won't execute. Time to switch to Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
I needed to figure out exactly how Selenium works, so I started with a rough approximation of the example on Read the Docs:
url = 'https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx'
p = webdriver.FirefoxProfile()
p.set_preference("webdriver.log.file", "/tmp/firefox_console")
driver = webdriver.Firefox(p)
driver.get(url)
assert "City of Las Vegas Court Case Lookup" in driver.title
Which works fine.
It would appear that our trouble getting the second page is that the lookup spits out its results via a call to window.open, the problem being that it identifies the pop-up by window.title. Looking up the pop-up window via the page title is brittle at best, so I just iterated over all the window handles and matched URLs before writing the page source to disk.
Got a new browser window with Selenium, so now let’s define a function to save our report to disk…
def get_lvmc_case_report(caseNumber):
    driver = webdriver.Firefox()
    driver.get(url)
    casenum = driver.find_element_by_name('txt_CaseNo')
    casenum.send_keys(caseNumber)
    casenum.send_keys(Keys.RETURN)
    driver.implicitly_wait(30)  # implicitly_wait takes seconds, not a millisecond string
    for handle in driver.window_handles:
        driver.switch_to_window(handle)
        if driver.current_url == 'https://secure2.lasvegasnevada.gov/defendantreport/report.aspx':
            filename = caseNumber + '.html'
            outfile = open(filename, 'w')
            outfile.write(driver.page_source)
            outfile.close()
            print "Wrote %s" % filename
    driver.quit()
In order to test the scraper we'll need some case numbers. A multitude of arrest/mugshot sites exist, so that's not particularly difficult. For this project I'll use Jailbase, since it produces an archived snapshot of the inmate entry used to source each arrestee's page, as well as an API we may make use of later. The question then becomes whether the case-number pattern holds for citations, i.e. those seizures that did not culminate in a custodial arrest. Some manual checking turns up at least two "CA"-numbers from non-custodial arrests, so it appears to.
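The numbers seen so far fit a simple shape: a 'C' prefix, a numeric body, and a letter suffix. A small sanity-check helper could screen candidates before we burn a request on them; note the digit range here is a guess extrapolated from the handful of examples above, not a documented format:

```python
import re

# Guessed pattern for LVMC case numbers, based only on observed examples:
# 'C', six to eight digits, one trailing uppercase letter.
CASE_NO = re.compile(r'^C\d{6,8}[A-Z]$')

def looks_like_case_number(s):
    return bool(CASE_NO.match(s))

print(looks_like_case_number('C1002073A'))    # True
print(looks_like_case_number('C101200474A'))  # False: nine-digit body
```

Interestingly, this would already have flagged the malformed entry in our test list below, though a number can of course be well-formed and still absent from the DB.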
Also, reviewing our manual downloads, I noticed that the case numbers appear to form an arithmetic progression. I manually queried a sequence of case numbers and all were hits, which simplifies matters considerably: we can use a list comprehension to generate candidate numbers and be reasonably sure we'll get hits.
test_nums = ['C1012474A', 'C101200474A', 'C1096047A']
test_seq = ['C' + str(x) + 'A' for x in range(1096048, 1096051, 1)]
test_seq
['C1096048A', 'C1096049A', 'C1096050A']
The second entry in test_nums is a bad case number, one not found in the DB. When this happens the server raises an alert message and never opens a report.aspx window, so the function fails relatively gracefully.
Now, to avoid hammering the city's server, let's introduce a random wait between requests:
from random import randint
from time import sleep

for test_num in test_nums:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1, 10))
C1012474A
Wrote C1012474A.html
C101200474A
C1096047A
Wrote C1096047A.html
A quick inspection of the files shows HTML tables, which was what we wanted. Now, let’s try the sequential set:
for test_num in test_seq:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1, 10))
C1096048A
Wrote C1096048A.html
C1096049A
Wrote C1096049A.html
C1096050A
Wrote C1096050A.html
Everything else looks good. We'll run this over range(1010001, 1099263) for a total of 89,262 queries, covering cases dating from 26 February 2010 to 3 August 2013. I'll shorten the sleep-time upper bound to three seconds, cutting the expected sleep overhead from roughly five and a half days to about two.
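A quick back-of-the-envelope check on those figures, counting the sleep overhead only (page loads and Firefox startup add more on top):

```python
# Expected crawl duration from the random sleeps alone.
# randint(a, b) is uniform over [a, b], so its mean is (a + b) / 2.
n_queries = 89262

mean_sleep_original = (1 + 10) / 2.0  # seconds, randint(1, 10)
mean_sleep_reduced = (1 + 3) / 2.0    # seconds, randint(1, 3)

days = lambda seconds: seconds / 86400.0
print(round(days(n_queries * mean_sleep_original), 1))  # ~5.7 days
print(round(days(n_queries * mean_sleep_reduced), 1))   # ~2.1 days
```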
I'll run it overnight and on Sunday, and start work on parsing the tables into a database tomorrow.
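As a preview of that parsing step, a crude first pass at pulling cell text out of a report table. The sample markup here is invented, since I haven't inspected the real report layout yet; a proper HTML parser should replace these regexes once the table structure is known:

```python
import re

# Stand-in for a saved report page; the real layout may differ.
sample = """<table>
  <tr><td>Case No.</td><td>C1096048A</td></tr>
  <tr><td>Charge</td><td>Example charge</td></tr>
</table>"""

# Pull each row, then each cell's text within that row.
rows = re.findall(r'<tr>(.*?)</tr>', sample, re.S)
records = [re.findall(r'<td[^>]*>(.*?)</td>', row, re.S) for row in rows]
print(records)  # [['Case No.', 'C1096048A'], ['Charge', 'Example charge']]
```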
- Should add gzip compression on file save.
- Progress bar or a counter would be nice.
- The connection at home is spotty, and I may need/want to put this on EC2, so making the script take the range from the command line is the next logical step.
- Running this in a second X session keeps it out of the way.
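The gzip and command-line-range items are easy to sketch now. A minimal outline, with hypothetical names of my own choosing (save_report_gz would slot into get_lvmc_case_report in place of the plain open/write):

```python
import argparse
import gzip

def save_report_gz(case_number, html):
    # gzip-on-save: store each report compressed instead of as raw HTML.
    filename = case_number + '.html.gz'
    with gzip.open(filename, 'wb') as outfile:
        outfile.write(html.encode('utf-8'))
    return filename

def parse_args(argv=None):
    # Take the numeric range from the command line, so the crawl can be
    # split across machines (e.g. the EC2 box mentioned above).
    parser = argparse.ArgumentParser(description='Scrape LVMC case reports')
    parser.add_argument('start', type=int)
    parser.add_argument('stop', type=int)
    return parser.parse_args(argv)

# Demo with an explicit argv; a real run would fall back to sys.argv.
args = parse_args(['1010001', '1010006'])
case_numbers = ['C%dA' % x for x in range(args.start, args.stop)]
print('%s ... %s (%d cases)' % (case_numbers[0], case_numbers[-1], len(case_numbers)))

# Round-trip one compressed save to confirm nothing is lost.
fname = save_report_gz(case_numbers[0], '<html>stub</html>')
with gzip.open(fname, 'rb') as f:
    roundtrip = f.read().decode('utf-8')
```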
And we finally start getting data, after a public records request for that very same data took seven weeks to be denied. The script took about half an hour to figure out.