Scraping the LV Municipal Court database
Case information from the Las Vegas Municipal Court can be accessed through an ASP.NET Web Form. We’ll be doing a bit of scraping.
Case lookup.
Case lookup is available here. That page contains a frame with a single form element.
That form takes a single argument, 16 characters or fewer in length. I was going to tackle this with the mechanize library. First run, based on the scraperwiki mechanize cheat sheet:
import mechanize
br = mechanize.Browser()
br.set_all_readonly(False) # allow everything to be written to
br.set_handle_robots(False) # no robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
And get our target url:
target = 'https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx'
response = br.open(target)
print response.read()
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
City of Las Vegas Court Case Lookup
</title><link rel="stylesheet" type="text/css" href="http://www.lasvegasnevada.gov/includes/stylesheetSmall.css" />
[...]
We know that there’s only one form, but let’s see:
for form in br.forms():
    print "Form name:", form.name
    print form
Form name: None
<POST https://secure2.lasvegasnevada.gov/defendantreport/Default.aspx application/x-www-form-urlencoded
<HiddenControl(__EVENTTARGET=) (readonly)>
<HiddenControl(__EVENTARGUMENT=) (readonly)>
<HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
<HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
<TextControl(txt_CaseNo=)>
<SubmitControl(btn_GetCase=Get Report) (readonly)>>
As we expected, a single form. We’ll select this form and iterate through the controls:
br.form = list(br.forms())[0]
for control in br.form.controls:
    print control
    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
<HiddenControl(__EVENTTARGET=) (readonly)>
type=hidden, name=__EVENTTARGET value=
<HiddenControl(__EVENTARGUMENT=) (readonly)>
type=hidden, name=__EVENTARGUMENT value=
<HiddenControl(__VIEWSTATE=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ) (readonly)>
type=hidden, name=__VIEWSTATE value=/wEPDwULLTE3MDc4NTU3NDBkZBFZvEXEM/AfxRuZRWuEEWJrOA5E0fGHGTe0sOVSnEAZ
<HiddenControl(__EVENTVALIDATION=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML) (readonly)>
type=hidden, name=__EVENTVALIDATION value=/wEWAwLdxISPDgL43ua9AQK3leriC2s/4orhhWHuQloqbi0JGVVNeFua4wyBGgXiQ+jkAsML
<TextControl(txt_CaseNo=)>
type=text, name=txt_CaseNo value=
<SubmitControl(btn_GetCase=Get Report) (readonly)>
type=submit, name=btn_GetCase value=Get Report
More of the same. Create a single test number and submit our query:
test_num = 'C1002073A'
br["txt_CaseNo"] = test_num
response = br.submit()
print response.read()
[...]
<div>
<input name="txt_CaseNo" type="text" value="C1002073A" maxlength="16" id="txt_CaseNo" style="font-size:12px;height:20px;width:175px;" />
<input type="submit" name="btn_GetCase" value="Get Report" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btn_GetCase&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" id="btn_GetCase" class="formButton" /><br />
<span id="validMessage" class="alertMssg" style="display:none;">Please enter a case ID</span>
<span id="lblError"></span>
</div>
[...]
Which isn’t what we wanted: the form comes back with our case number filled in, but there is no report.
Selenium deals with the JavaScript for us
It would appear that the problem here is JavaScript. I’m too lazy to figure out what’s happening, so I’m going to just throw in a grenade: Selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
I needed to figure out exactly how Selenium works, so here is a rough approximation of the example on Read the Docs:
url = target  # the Default.aspx lookup page opened earlier
p = webdriver.FirefoxProfile()
p.set_preference("webdriver.log.file", "/tmp/firefox_console")
driver = webdriver.Firefox(p)
driver.get(url)
assert "City of Las Vegas Court Case Lookup" in driver.title
Which works fine.
It would appear that our issue with getting the second page is that the lookup spits its report out via a call to window.open, the problem being that it uses null as the window.title. Looking up the pop-up window via the page title is brittle at best, so I just pulled all the window handles into a for loop and matched URLs before writing the page source to disk.
Got a new browser window with Selenium, so now let’s define a function to save our report to disk…
def get_lvmc_case_report(caseNumber):
    driver = webdriver.Firefox()
    driver.get(url)
    casenum = driver.find_element_by_name('txt_CaseNo')
    casenum.send_keys(caseNumber)
    casenum.send_keys(Keys.RETURN)
    driver.implicitly_wait(30)  # timeout is given in seconds
    # The report opens in a pop-up, so check every window for its URL.
    for handle in driver.window_handles:
        driver.switch_to_window(handle)
        if driver.current_url == 'https://secure2.lasvegasnevada.gov/defendantreport/report.aspx':
            filename = caseNumber + '.html'
            outfile = open(filename, 'w')
            outfile.write(driver.page_source)
            outfile.close()
            print "Wrote %s" % filename
    driver.quit()
In order to test the scraper we’ll need some case numbers. There exists a multitude of arrest/mugshot sites, so that’s not particularly difficult. For this project I’ll be using Jailbase, given that it produces an archived snapshot of the inmate entry used to source each arrestee’s page, as well as an API which we may make use of later. The question then becomes whether or not the case number pattern holds true for citations, i.e. those seizures that did not culminate in a custodial arrest. Some manual checking shows at least two “CA”-numbers from non-custodial arrests.
Also, reviewing our manual downloads leads me to notice that the case numbers seem to be an arithmetic progression. I manually queried a sequence of case numbers and all were hits so that simplified matters considerably. We can use a list comprehension to generate the case numbers and be reasonably sure that we’ll get hits.
test_nums = ['C1012474A','C101200474A', 'C1096047A']
test_seq = ['C' + str(x) + 'A' for x in range(1096048, 1096051, 1)]
test_seq
['C1096048A', 'C1096049A', 'C1096050A']
So test_nums[1] is a bad case number, one not found in the DB. When this happens the server raises an alert message and fails to open a report.aspx window. This fails relatively gracefully.
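To make that miss explicit instead of silent, the window matching could be pulled out into a helper that returns None when no report window opened. This is only a minimal sketch: find_report_source is a hypothetical name, and it assumes the same older Selenium driver methods (window_handles, switch_to_window, current_url, page_source) that the function above uses.

```python
# Hypothetical helper, not part of the original script.
REPORT_URL = 'https://secure2.lasvegasnevada.gov/defendantreport/report.aspx'


def find_report_source(driver, report_url=REPORT_URL):
    """Return the page source of the report window, or None if none opened.

    `driver` is assumed to behave like the Selenium driver used above.
    """
    for handle in driver.window_handles:
        driver.switch_to_window(handle)  # older Selenium spelling, as above
        if driver.current_url == report_url:
            return driver.page_source
    # No window matched: the server showed its alert instead of a report.
    return None
```

A batch run could then record the case number as a miss whenever this returns None. Newer Selenium releases spell the window switch as driver.switch_to.window(handle), so the call would need adjusting there.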
Now, to avoid hammering the city’s server let’s introduce a random wait, then use a for loop:
from random import randint
from time import sleep
for test_num in test_nums:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1,10))
C1012474A
Wrote C1012474A.html
C101200474A
C1096047A
Wrote C1096047A.html
A quick inspection of the files shows HTML tables, which was what we wanted. Now, let’s try the sequential set:
for test_num in test_seq:
    print test_num
    get_lvmc_case_report(test_num)
    sleep(randint(1,10))
C1096048A
Wrote C1096048A.html
C1096049A
Wrote C1096049A.html
C1096050A
Wrote C1096050A.html
W5.
Everything else looks good. We’ll run this over range(1010001, 1099263, 1) for a total of 89,262 queries dating from 26 February 2010 to 3 August 2013. I’ll shorten the sleep time upper bound to three seconds, cutting five days down to a day and a half.
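For a back-of-the-envelope check on the run time, the sketch below counts only the sleeps and ignores browser startup and page loads, so a real run will take longer. It assumes the loop keeps calling sleep(randint(1, upper)) once per query; sleep_days is a hypothetical helper name.

```python
QUERIES = 1099263 - 1010001  # size of range(1010001, 1099263)
SECONDS_PER_DAY = 24 * 60 * 60


def sleep_days(queries, upper, lower=1):
    """Expected total sleep, in days, for sleep(randint(lower, upper)) per query."""
    # randint includes both endpoints, so the mean is the midpoint.
    mean_sleep = (lower + upper) / 2.0
    return queries * mean_sleep / SECONDS_PER_DAY


for upper in (10, 3):
    print("upper bound %2d s: about %.1f days of sleeping"
          % (upper, sleep_days(QUERIES, upper)))
```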
Run it overnight and on Sunday and start work on parsing the tables into a database tomorrow.
Final notes:
- Should add gzip compression on file save.
- Progress bar or a counter would be nice.
- The connection at home is spotty, and I may need/want to put this on EC2, so making the script take the range from the command line is the next logical step.
- Running this in a second X session keeps it out of the way.
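The first three notes could be sketched roughly as follows, using only the standard library. Everything here is an assumption rather than the script that was actually run: the helper names (parse_range, case_numbers, save_gzipped, progress, main) are hypothetical, and fetch stands in for a function like get_lvmc_case_report reworked to return the page source, or None on a miss, instead of writing the file itself.

```python
import argparse
import gzip
import os
import sys


def parse_range(argv):
    """Read the start (inclusive) and stop (exclusive) numbers from argv."""
    parser = argparse.ArgumentParser(description='Scrape a range of case numbers.')
    parser.add_argument('start', type=int, help='first number, e.g. 1010001')
    parser.add_argument('stop', type=int, help='one past the last number')
    args = parser.parse_args(argv)
    if args.start >= args.stop:
        parser.error('start must be less than stop')
    return args.start, args.stop


def case_numbers(start, stop):
    """Build case numbers in the C<number>A pattern used above."""
    return ['C%dA' % n for n in range(start, stop)]


def save_gzipped(case_number, html, directory='.'):
    """Write the report as <case_number>.html.gz and return the path."""
    path = os.path.join(directory, case_number + '.html.gz')
    with gzip.open(path, 'wb') as outfile:
        outfile.write(html.encode('utf-8'))
    return path


def progress(done, total):
    """Format a simple counter such as '3/10 (30.0%)'."""
    return '%d/%d (%.1f%%)' % (done, total, 100.0 * done / total)


def main(argv, fetch, directory='.'):
    """Fetch every case in the range; return how many reports were saved.

    `fetch` is a hypothetical callable: case number in, HTML or None out.
    """
    start, stop = parse_range(argv)
    numbers = case_numbers(start, stop)
    written = 0
    for done, number in enumerate(numbers, 1):
        html = fetch(number)
        if html is not None:
            save_gzipped(number, html, directory)
            written += 1
        # Overwrite a single status line rather than scrolling.
        sys.stderr.write('\r' + progress(done, len(numbers)))
    sys.stderr.write('\n')
    return written
```

Called as main(sys.argv[1:], fetch) from a script, this would take the range as two positional arguments, which makes it easy to split the full run across machines or restart it from wherever a spotty connection dropped.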
And we finally start getting data, after making a public records request for that very same data that took seven weeks to deny. Script took about a half-hour to figure out.