Part 1
XULRunner + Crowbar
This is the only working solution I managed to build; I'm sure you'll find descriptions elsewhere on how to install and prepare XULRunner and Crowbar. A short walk-through:
Get XULRunner and check out a fresh Crowbar build from SVN:
svn checkout http://simile.mit.edu/repository/crowbar/trunk/
I was installing this on RHEL 4.2, so I had some problems with the GTK+ libraries: they are shared and can be neither updated nor removed. The Evolution28 packages came to the rescue: evolution28-pango, evolution28-glib2 and evolution28-cairo.
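If your box has yum configured, pulling them in should be as simple as the line below (an assumption on my side - depending on your channels you may have to hunt down the RPMs manually instead):
yum install evolution28-pango evolution28-glib2 evolution28-cairo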
Untar XULRunner and move the Crowbar trunk folder so that it sits at the same level as XULRunner:
[root@host _CRAWL]# ll
drwxr-xr-x 5 root root 4096 Nov 12 15:03 trunk
drwxr-xr-x 11 root root 4096 Sep 26 06:44 xulrunner
Change dir to xulrunner and perform the installation. This will produce an application.ini file, which is later used as the parameter file:
./xulrunner --install-app ../trunk/xulapp
Now get an X session; VNC is a good solution. Change dir to trunk/xulapp, export the library path variable with the required libraries (in case you're stuck with them too), and get XULRunner up and running:
export LD_LIBRARY_PATH=/usr/evolution28/lib
cd trunk/xulapp
../../xulrunner/xulrunner application.ini
You will now get two windows: Crowbar and an unnecessary debug console, which can be closed (VNC delete option). Crowbar can now be accessed with any browser on port 10000. We will abuse this port in the next part.
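In case it is not obvious how to conjure up that X session: one way is a VNC server (assuming a vncserver package is installed; display :1 is an arbitrary pick):
vncserver :1
export DISPLAY=:1
And a quick sanity check that Crowbar really listens on port 10000 (any URL and a short delay will do):
curl -s --data "url=http://www.example.com&delay=1000" http://127.0.0.1:10000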
Getting the page source
We now need a URL to the desired page; sadly, I was unable to find a way to follow links using Crowbar itself. Some other tool could be used to provide the links, for example Perl's WWW::Mechanize.
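If all you have is the shell, one crude way to harvest links is to pull a page through Crowbar and rip out the href attributes (a sketch - it assumes GNU grep and absolute links, and the URL is a placeholder):
curl -s --data "url=http://www.example.com&delay=3000" http://127.0.0.1:10000 | grep -o 'href="[^"]*"' | sed 's/^href="//;s/"$//' > links.txt
The resulting links.txt can then be fed back into curl one URL at a time.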
Let's say we have a direct link; I was playing around with ubs.com. Using cURL we finally get the page source:
curl -s --data "url=https://wb1.ubs.com/app/ABU/3/QCoreWeb/GRT_3_Aggregation/gvu/pg_mi/?grt_locale=en_US/&delay=3000" http://127.0.0.1:10000
-s - stands for silent mode;
--data - the POST data follows;
"url=...&delay=..." - the URL and the delay in milliseconds Crowbar gets to prepare the content; 5-10 seconds is normal;
http://127.0.0.1:10000 - the Crowbar proxy listening on port 10000.
cURL certainly has GET and POST mechanisms of its own, but I was unable to construct a proper link and perform a POST against pages that have no CGI behind them.
The output is saved to a file by adding > ubs.html to the end of the command.
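Putting it together, the whole fetch with the output going straight to a file:
curl -s --data "url=https://wb1.ubs.com/app/ABU/3/QCoreWeb/GRT_3_Aggregation/gvu/pg_mi/?grt_locale=en_US/&delay=3000" http://127.0.0.1:10000 > ubs.html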
Viewing the data
The cURL output has one problem: it is garbled. You lose most of the newlines from the page source, which means we are going to need one more tool. You can try opening the page with Firefox - all the data is there, not so nicely laid out, but there. And all we need is the data. Using the stream editor sed we transform the page into a viewable form. A command line that suits my test pages:
cat ubs.html | sed 's|</b>||g' | sed 's|<b>||g' | sed 's|&lt;|<|g' | sed 's|&gt;|>|g' | sed 's|<br>|\n|g' | sed -e :a -e '/\/TD>$/N; s/\/TD>\n/\/TD>/; ta' | sed 's/<\/TABLE/\n<\/TABLE/g' > ubs_eol.html
I simply clean up some Unicode characters and split tables and rows onto new lines. The page is now prepared to be parsed. An example of a more advanced filter, for source files with more Unicode characters and more tabs and spaces:
less ubs.html | tr ',' '.' | sed 's/\cM/è/g' | sed 's/\cW/æ/g' | sed 's|</b>||g' | sed 's|<b>||g' | sed 's|&lt;|<|g' | sed 's|&gt;|>|g' | sed 's|<br>|\n|g' | sed 's|[ \t]*</TD>|</TD>|g' | sed -e :a -e '/\/TD>$/N; s/\/TD>\n/\/TD>/; ta' | sed 's/[ \t]*<\/TABLE/\n<\/TABLE/g' > ubs_eol.html
Next I use a custom C filter to get only the data - words and numbers. The filter splits the whole page by tables (TABLE and DIV tags) and reads everything between a closing tag mark ">" and the next "<". The filter is parameterised and gives us just the table and column we need.
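The filter's source is not listed here, so below is a stripped-down sketch of the idea as I read it - everything in it is my assumption except the option names, which match the example request that follows. Build with cc -std=gnu99 -o filter filter.c.

#define _GNU_SOURCE   /* for strcasestr() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int use_div = 0, want_table = -1, want_col = 0;

    /* minimal option parsing: -d, -t N, -c N (-t is required) */
    for (int i = 1; i < argc; i++) {
        if (!strcmp(argv[i], "-d")) use_div = 1;
        else if (!strcmp(argv[i], "-t") && i + 1 < argc) want_table = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-c") && i + 1 < argc) want_col = atoi(argv[++i]);
    }

    char line[8192];
    int table = 0; /* current table counter */

    while (fgets(line, sizeof line, stdin)) {
        /* every TABLE (or DIV, with -d) opening tag starts a new table */
        if (strcasestr(line, "<TABLE") || (use_div && strcasestr(line, "<DIV")))
            table++;
        if (table != want_table)
            continue;

        /* collect the text chunks sitting between '>' and the next '<' */
        int col = 0;
        char *p = strchr(line, '>');
        while (p) {
            char *start = p + 1;
            char *end = strchr(start, '<');
            if (!end)
                break;
            *end = '\0';
            /* count only chunks that contain something visible */
            char *s = start;
            while (*s == ' ' || *s == '\t' || *s == '\r' || *s == '\n')
                s++;
            if (*s) {
                col++;
                if (col == 1)
                    printf("%s", s);        /* column #1 is always printed */
                else if (col == want_col)
                    printf(",%s", s);       /* plus the requested column */
            }
            p = strchr(end + 1, '>');
        }
        if (col)
            putchar('\n');
    }
    return 0;
}

Because the sed pass above already joined every row onto a single line, the sketch can get away with treating each input line as one table row.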
An example request:
./filter -d -t 29 -c 3 < ubs_eol.html | tr -d "[='=]" | grep -v Name > OUTPUT.csv
-d - use DIV tags for more accurate table division;
-t - show only table 29;
-c - show column 3 (the filter always prints column #1 plus the requested one).
The output follows:
DAX,4336.73
DJ EURO STO,2257.67
DJ Industr ,8048.57
NASDAQ Comb,1465.17
Nikkei 225,8023.31
S&P 500,823.36
SMI,5382.44
Working examples
No comments.
