Scraping Pages with jQuery
May 4th, 2007
Last night at 1am I was suddenly overwhelmed by the urge to develop a Dashboard widget for fetching movie schedules off kino.de.
Of course, kino.de doesn’t offer an RSS feed for the schedules. That would be too easy, wouldn’t it?
To create a real challenge though, kino.de uses stone-age HTML.
I didn’t know how many tables a sane person could nest around the tiniest bits of HTML to create a page. There’s almost literally a table around every single tag in the code of that page. Holy shit.
“Good luck parsing that with regular expressions”, I thought.
Fortunately, just the day before, I had discovered jQuery, a JavaScript framework with strong support for finding DOM nodes via CSS, XPath and some custom selectors. The tricky part now was to get jQuery to access the DOM tree of the schedule page on kino.de.
There are several ways to do this but after a while I figured out a neat trick I want to share with you.
First, you need to fetch the document using XMLHttpRequest. To allow the widget to send XMLHttpRequests anywhere on the net, you need to edit your widget’s Info.plist and add the following key/value pair:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
…
<key>AllowNetworkAccess</key>
<true/>
…
</dict>
</plist>
Now you can fetch and access a document’s contents:
var req = new XMLHttpRequest()
req.onreadystatechange = function() {
  if (req.readyState == 4) { // 4 = request complete
    var contents = req.responseText
    … // Process contents here, see below
  }
}
req.open("GET", URL, true)
req.send('')
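If you want the fetch step to be a bit more defensive, here is a minimal sketch that wraps it in a function and checks the HTTP status before handing over the source. The name fetchPage, its callbacks and the optional XHR parameter are my own assumptions for illustration, not part of the widget described here. The XHR parameter only exists so the function can be tried with a stub outside a widget.

```javascript
// Hypothetical helper; the name and signature are assumptions, not an existing API.
// XHR is optional and defaults to the global XMLHttpRequest of the widget runtime.
function fetchPage(url, onSuccess, onError, XHR) {
  var Ctor = XHR || XMLHttpRequest;
  var req = new Ctor();
  req.onreadystatechange = function () {
    if (req.readyState !== 4) return; // 4 = request finished
    if (req.status === 200) {
      onSuccess(req.responseText);    // raw HTML source of the page
    } else {
      onError(req.status);            // anything else is treated as a failure
    }
  };
  req.open("GET", url, true);         // true = asynchronous
  req.send(null);
}
```

Inside the widget you would call it as fetchPage(URL, function (contents) { /* process here */ }, function (status) { /* report the failure */ }).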
Now contents holds the HTML source of the page you just fetched. To parse it with WebKit and make its DOM tree available to jQuery, the following does the trick:
var d = document.createElement('div')
d.innerHTML = '<div id="root">' + contents + '</div>'
This inserts the complete HTML into a div tag that’s not part of your actual widget. HTML and BODY are automatically stripped. Nice. You may wonder about the <div id="root"> thing there. You’ll need that if you actually want to use jQuery’s XPath selectors to their full potential.
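The two lines above can also be packaged as a small helper, sketched below. The names wrapInRoot and toDetachedTree are hypothetical, and passing the document in as a parameter is just my assumption to keep the function easy to try out in isolation.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// Wrap the fetched markup in the root div used to emulate absolute paths.
function wrapInRoot(html) {
  return '<div id="root">' + html + '</div>';
}

// doc is the document used to create the container,
// normally the widget's own global `document`.
function toDetachedTree(html, doc) {
  var container = doc.createElement('div'); // never appended to the widget's DOM
  container.innerHTML = wrapInRoot(html);   // the rendering engine parses the markup here
  return container;
}
```

In the widget this would be called as var d = toDetachedTree(contents, document), and d is then the context you hand to jQuery.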
Since you want to query the fetched document and not your widget’s DOM, you’ll have to use the created div element as the context for jQuery’s $() function. If you use a context though, it’s no longer possible to select by absolute path.
If you want to query all tables on the top level of the fetched document, $("/table") won’t work. $("table") on the other hand returns not only top-level tables but every single table in the document (that’s 54 in the kino.de example. FIFTY-FOUR!).
Now the root-div is just a trick to emulate absolute paths within the context. Since your original document is enclosed by it, you can use $("div#root/table") to query the top-level tables.
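A caveat for anyone reading this with a newer jQuery: as far as I know, the XPath-style / selectors were moved out of the jQuery core into a separate plugin in version 1.2. Assuming such a version, the same query can be sketched with the CSS child selector instead. The function name topLevelTables is hypothetical, and taking $ as a parameter is only my assumption to keep the example self-contained.

```javascript
// Hypothetical helper; $ is the jQuery function, container is the detached
// div that holds the fetched page wrapped in <div id="root">.
function topLevelTables($, container) {
  // "div#root > table" matches only tables that are direct children of the
  // root div, which corresponds to the top level of the fetched document.
  return $("div#root > table", container);
}
```

With the div from above you would call topLevelTables($, d), while $("table", d) would still return every nested table.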