Selenium, the most underrated tool for #opendata

In one of my last projects at AxisPhilly*, I was working with Julia Bergman on transparency and looking at open records requests.

The open records request website is a classic case of a government website: well intentioned, but with horribly formed data. There’s no clean URL for each set of results, so you can’t link directly to any of them. So when Julia asked about getting all the records pulled down into a digital format, I asked Casey if he had any ideas.

Turns out that earlier this year, Brian Abelson wrote a scraper using Selenium at BCNI, though I didn’t totally get the theory at the time. Now I’m hoping this post will help someone who’s never used Selenium (or has only seen it used for, say, integration tests) see the benefit.

Here’s the deal. Selenium is a browser automation tool. Meaning, you can write a script and have it go do things for you. Computers, right? So instead of clicking the “next” button, you tell Selenium to do so. I also really like that it does launch a browser instance, so you can watch what it’s doing. Of course, if you prefer running it “headless” (meaning you don’t have to/get to watch it), Chris Le just published a post on doing that, using the Node.js Selenium package.

So, what does that look like?

I blatantly “forked” (ripped off of) Brian’s example. This is definitely not elegant, but it’s brute forcing its way through pagination to save all the pages to text files, which are easily manipulated. With just a little more work, you could loop through each row and directly save to CSV, if you wanted.
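Brian’s original script isn’t reproduced here, so what follows is only a minimal sketch of the same brute-force idea: save the page, click “next”, repeat until there is no “next” left. The function names, the `max_pages` and `pause_seconds` parameters, and the output file naming are all made up for illustration, and the “next” selector depends entirely on the site’s markup.

```python
import time
from pathlib import Path

CSS = "css selector"  # the same string selenium exposes as By.CSS_SELECTOR


def save_all_pages(driver, next_selector, out_dir, max_pages=500, pause_seconds=1.0):
    """Write each results page to a text file, clicking "next" until it is gone."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page_number in range(1, max_pages + 1):
        path = out_dir / f"page_{page_number:04d}.txt"
        path.write_text(driver.page_source, encoding="utf-8")
        saved.append(path)
        # find_elements returns an empty list when nothing matches,
        # which is our signal that we are on the last page.
        next_buttons = driver.find_elements(CSS, next_selector)
        if not next_buttons:
            break
        next_buttons[0].click()
        # Crude wait for the next page to load; an explicit wait would be sturdier.
        time.sleep(pause_seconds)
    return saved


def run_with_firefox(start_url, next_selector, out_dir):
    """Hypothetical entry point: needs the selenium package and Firefox installed."""
    from selenium import webdriver  # imported lazily so the helper above has no hard dependency

    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        return save_all_pages(driver, next_selector, out_dir)
    finally:
        driver.quit()
```

Calling `run_with_firefox` with the real start URL and whatever selector matches the site’s “next” link would open a browser window you can watch as it pages through. The `max_pages` cap is there only so a misbehaving “next” link can’t loop forever.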

Here’s why this was interesting enough that I made a note to write about it: I feel like lots of journalists and open data folks are into Python, and it’s so, so easy to target things on the page when you use the find_element_by_css_selector method. And if you checked out Chris’s post, you saw that you can also use Selenium in JavaScript, or in a bunch of other languages, in fact.
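To make the CSS selector point concrete, here is a rough sketch of the “loop through each row and save to CSV” idea. It is not the script I actually ran: `rows_to_csv` and both selector arguments are hypothetical, and the right selectors depend on how the results table is marked up. Newer Selenium releases spell the lookup as `find_elements(By.CSS_SELECTOR, ...)` rather than `find_element_by_css_selector`, so the sketch uses that form.

```python
import csv

CSS = "css selector"  # the same string selenium exposes as By.CSS_SELECTOR


def rows_to_csv(driver, row_selector, cell_selector, csv_path):
    """Write the text of every matching table row to a CSV file; return the row count."""
    rows = driver.find_elements(CSS, row_selector)
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            # Each row is itself searchable, so cells are looked up relative to it.
            cells = [cell.text.strip() for cell in row.find_elements(CSS, cell_selector)]
            if cells:  # skip rows with no matching cells, e.g. a header row of <th>
                writer.writerow(cells)
                written += 1
    return written
```

With a live driver sitting on a results page, something like `rows_to_csv(driver, "table tr", "td", "records.csv")` would dump that page’s rows, assuming the results really do live in a plain table.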

* On AxisPhilly

I don’t know a clear way to write these things out on le internet, but many folks on the internet know by now that I’ve left AxisPhilly. Sean and Erika have spoken openly about how some of the drama there shook out, but all I will say from my end is that I was so happy to get the time and opportunity to work solely on open source, journalistic projects. I was especially happy to work with awesome devs like Casey and Jeff and to be in the same room collaborating with great journalists. I start my next gig soon, so there’ll be an announcement of sorts when that happens, I suppose. I wish AxisPhilly the best of luck, and hopefully they can continue to put out some good work.
