office space: Crawling AJAX in practice. Part 2

Part 1

XULRunner + Crowbar

This is the only working solution I managed to build. I'm sure you'll find descriptions of how to install and prepare XULRunner and Crowbar elsewhere, so here is a short walkthrough:

Get XULRunner and check out a fresh Crowbar build from SVN:

svn checkout http://simile.mit.edu/repository/crowbar/trunk/

I was installing everything on RHEL 4.2, so I ran into problems with the GTK+ libraries: they are shared and can be neither updated nor removed. The evolution28 packages came to the rescue: evolution28-pango, evolution28-glib2, evolution28-cairo.
Untar XULRunner and move the Crowbar trunk folder so that it sits at the same level as XULRunner:

[root@host _CRAWL]# ll
drwxr-xr-x 5 root root 4096 Nov 12 15:03 trunk
drwxr-xr-x 11 root root 4096 Sep 26 06:44 xulrunner

Change to the xulrunner directory and perform the installation. This produces an application.ini file, which is later used as the parameter file.

./xulrunner --install-app ../trunk/xulapp

Now get an X session; VNC is a good solution. Export the library path variable with the required libraries (in case you're stuck with them too), change to trunk/xulapp, and get XULRunner up and running:

export LD_LIBRARY_PATH=/usr/evolution28/lib
cd trunk/xulapp
../../xulrunner/xulrunner application.ini

You will now get two windows: Crowbar and an unnecessary debug console, which can be closed (VNC delete option). Crowbar can now be reached from any browser on port 10000. We will abuse this port in the next part.

Getting the page source

We now need a URL for the desired page. Sadly, I was unable to find a way to follow links using Crowbar. Maybe some other tool should be used to provide the links, for example Perl's WWW::Mechanize.
Let's say we have a direct link; I was playing around with ubs.com. Using cURL we finally get the page source:

curl -s --data "url=https://wb1.ubs.com/app/ABU/3/QCoreWeb/GRT_3_Aggregation/gvu/pg_mi/?grt_locale=en_US/&delay=3000" http://127.0.0.1:10000

-s - silent mode, no progress output;
--data - sends the string that follows as POST data;
"url=&delay=1000" - the page URL and the delay in milliseconds that Crowbar waits while the content is prepared, 5-10 seconds (5000-10000) is normal;
http://127.0.0.1:10000 - Crowbar proxy listening on port 10000.

cURL does have GET and POST mechanisms, but I was unable to get a proper link and perform a POST on pages that don't have CGI.
To save the output to a file, append > ubs.html to the end of the command.
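
To avoid retyping the request, the cURL call can be wrapped in a small shell function. This is only a sketch: the name crowbar_fetch and the CURL override variable are my own inventions, while the url and delay parameters and port 10000 come from the command above. The URL is passed through as-is, exactly like in the original command.

```shell
# Hypothetical helper: ask a local Crowbar instance to render a page.
# The body runs in a subshell ( ... ), so the variables stay local.
# CURL may be overridden, e.g. with a function that only prints its arguments.
crowbar_fetch() (
    url=$1
    delay=${2:-3000}                        # milliseconds Crowbar waits before returning the page
    endpoint=${3:-http://127.0.0.1:10000}   # Crowbar port from the walkthrough
    "${CURL:-curl}" -s --data "url=${url}&delay=${delay}" "$endpoint"
)
```

Usage would look like: crowbar_fetch "https://wb1.ubs.com/app/ABU/3/QCoreWeb/GRT_3_Aggregation/gvu/pg_mi/?grt_locale=en_US/" 5000 > ubs.html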

Viewing the data

The cURL output has one problem: it is garbled. You lose most of the newlines from the page source, which means we are going to need one more tool. You can try opening the page with Firefox; all the data is there, not so nicely laid out, but there. And all we need is the data. Using the stream editor sed, we transform the page into a viewable form. A command line that suits my test pages:

cat ubs.html | sed 's|/b>||g' | sed 's/b>/>/g' | sed 's|</|<|g' | sed 's|<|<|g' | sed 's/br&/> -e :a -e '/\/TD>$/N; s/\/TD>\n/\/TD>/; ta' | sed 's/<\/TABLE/\n<\/TABLE/g' > ubs_eol.html

I simply clean up some Unicode characters and split tables and rows onto separate lines. The page is now ready to be parsed. An example of a more advanced filter, for a source file with more Unicode characters and more tabs and spaces:

less ubs.html | tr ',' '.' | sed 's/\cM/è/g' | sed 's/\cW/æ/g' | sed 's|/b>||g' | sed 's/b>/>/g' | sed 's|</|<|g' | sed 's|<| <|g' | sed 's/br&/> a" -e "P;D" | sed 's//TD>/g' | sed -e :a -e '/\/TD>$/N;s/\/TD>\n/\/TD>/; ta' | sed 's/ \n<\/TABLE/\n<\/TABLE/g' > ubs_eol.html
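
The row-splitting idea on its own can be sketched in a few lines. This is a simplified illustration, not the exact pipeline above: the function name split_rows is made up, it assumes GNU sed (which accepts \n in the replacement text), and it assumes upper-case tags like those in my test pages.

```shell
# Hypothetical helper: put every table row on its own line.
# Assumes GNU sed, where \n in the replacement inserts a newline.
split_rows() {
    sed 's|</TR>|</TR>\n|g'
}
```

For example, printf '%s' '<TABLE><TR><TD>a</TD></TR><TR><TD>b</TD></TR></TABLE>' | split_rows leaves each TR on a separate line, with the closing TABLE tag on a line of its own.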

Next I use a custom C filter to get only the data, that is, words and numbers. The filter splits the whole page by tables (TABLE and DIV tags) and reads everything between the tag delimiters ">" and "<". The filter is parameterized and returns just the table and column we need. An example request:

./filter -d -t 29 -c 3 | tr -d [=\'=] | grep Name -v > OUTPUT.csv

-d - use DIV tags for more accurate table division;
-t - show only table 29;
-c - show column 3 (always prints column #1 plus required one);

The output follows:

DAX,4336.73
DJ EURO STO,2257.67
DJ Industr ,8048.57
NASDAQ Comb,1465.17
Nikkei 225,8023.31
S&P 500,823.36
SMI,5382.44
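
The C filter itself is not listed here. As a rough stand-in, the same idea (pick table N, print column 1 plus the requested column) can be approximated with awk. Everything in this sketch is an assumption: the name table_cols, the positional arguments instead of -t and -c, and the requirement that each table row already sits on one line with upper-case TABLE and TD tags. It does not handle DIV-based division.

```shell
# Hypothetical stand-in for the C filter (not the original code).
# Usage: table_cols TABLE_NUMBER COLUMN_NUMBER < page.html
table_cols() {
    awk -v tbl="$1" -v col="$2" '
        /<TABLE/ { n++ }                   # count tables as they open
        n == tbl && /<TD/ {
            line = $0
            gsub(/<\/TD>/, "\t", line)     # mark the end of each cell with a tab
            gsub(/<[^>]*>/, "", line)      # drop all remaining tags
            k = split(line, cell, "\t")
            if (col <= k) print cell[1] "," cell[col]
        }
    '
}
```

With the cleaned page as input, something like table_cols 29 3 < ubs_eol.html | grep -v Name would play the role of the request above.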

Working examples

No comments.

Thursday, January 15, 2009
