Thursday, January 15, 2009

Crawling AJAX in practice. Part 2

Part 1

XULRunner + Crowbar


This is the only working solution I managed to build. I'm sure you can find detailed descriptions of how to install and prepare XULRunner and Crowbar elsewhere, so here is a short walkthrough:

Get XULRunner, then check out a fresh Crowbar build from SVN:

svn checkout http://simile.mit.edu/repository/crowbar/trunk/

I was installing on RHEL 4.2, so I had some problems with the GTK+ libraries: they are shared, so they can be neither updated nor removed. The Evolution28 packages came to the rescue: evolution28-pango, evolution28-glib2, evolution28-cairo.
Untar XULRunner and move the Crowbar trunk folder so that it sits at the same level as XULRunner:

[root@host _CRAWL]# ll
drwxr-xr-x 5 root root 4096 Nov 12 15:03 trunk
drwxr-xr-x 11 root root 4096 Sep 26 06:44 xulrunner


Change dir to xulrunner and perform the installation. This produces an application.ini file, which is later passed to XULRunner as a parameter.

./xulrunner --install-app ../trunk/xulapp

Now get an X session; VNC is a good solution. Export LD_LIBRARY_PATH so that it points at the required libraries (in case you're stuck with them too), change dir to trunk/xulapp, and get XULRunner up and running:

export LD_LIBRARY_PATH=/usr/evolution28/lib
cd trunk/xulapp
../../xulrunner/xulrunner application.ini


You will now get two windows: Crowbar and an unnecessary debug console, which can be closed (via the VNC delete option). Crowbar can now be reached from any browser on port 10000. We will abuse this port in the next part.

Getting the page source

We now need a URL for the desired page. Sadly, I was unable to find a way to follow links using Crowbar. Maybe some other tool should be used to supply the links, for example Perl's WWW::Mechanize.
Let's say we have a direct link; I was playing around with ubs.com. Using cURL, we finally get the page source.

curl -s --data "url=https://wb1.ubs.com/app/ABU/3/QCoreWeb/GRT_3_Aggregation/gvu/pg_mi/?grt_locale=en_US/&delay=3000" http://127.0.0.1:10000

-s - silent mode, no progress output;
--data - the POST data follows;
"url=...&delay=1000" - the page URL and the delay in milliseconds that Crowbar waits for the content to be ready, 5-10 seconds is normal;
http://127.0.0.1:10000 - Crowbar proxy listening on port 10000.
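
If you call Crowbar often, the request can be wrapped in a small shell function. This is only a sketch: the name crowbar_fetch and the CROWBAR_URL variable are made up for illustration, and it assumes that Crowbar URL-decodes the POST body like a regular form handler (not verified here). It uses curl's --data-urlencode so that "&" and "?" inside the target URL do not get mixed up with the "&" that separates url and delay.

```shell
#!/bin/sh
# Address of the running Crowbar instance (port 10000, as above).
# CROWBAR_URL is a hypothetical variable name used only in this sketch.
CROWBAR_URL="${CROWBAR_URL:-http://127.0.0.1:10000}"

# crowbar_fetch PAGE_URL [DELAY_MS]
# Hypothetical helper: prints the rendered page source on stdout.
crowbar_fetch() {
    page_url=$1
    delay_ms=${2:-3000}   # default delay of 3000 ms, as in the example above
    # --data-urlencode escapes "&" and "?" inside the target URL,
    # so they are not mistaken for separators in the POST body.
    curl -s \
        --data-urlencode "url=$page_url" \
        --data "delay=$delay_ms" \
        "$CROWBAR_URL"
}
```

Usage would look like crowbar_fetch "https://example.com/page" 5000 > page.html, where the URL is a placeholder.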

cURL supports both GET and POST, but I was unable to build a proper link and perform a POST on pages that don't use CGI.
To save the output to a file, append > ubs.html to the end of the command.
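
Since Crowbar itself would not follow links for me, one possible workaround is to pull the links out of the saved page and feed them back one by one. The function below is a rough sketch, not something I ran against ubs.com: extract_links is a made-up name, it only catches absolute http(s) links written as href="..." with double quotes, and it relies on grep -o, which GNU and BSD grep have but POSIX does not require.

```shell
#!/bin/sh
# extract_links FILE
# Hypothetical helper: list the absolute http(s) links found in a saved page.
extract_links() {
    # grep -o prints every match on its own line
    # (a GNU/BSD extension, not part of POSIX grep).
    grep -o 'href="http[^"]*"' "$1" |
        sed -e 's/^href="//' -e 's/"$//' |   # strip the attribute wrapper
        sort -u                              # drop duplicate links
}
```

Running extract_links ubs.html prints one URL per line, and each of them can then be sent to Crowbar with the same curl request as above.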

Viewing the data

The cURL output has one problem: it is garbled. You lose most of the newlines from the page source, which means one more tool is needed. You can try opening the page in Firefox: all the data is there, not so nicely laid out, but there. And all we need is the data. Using the stream editor sed, we transform the page into a viewable form. A command line that suits my test pages:

cat ubs.html | sed 's|/b>||g' | sed 's/b>/>/g' | sed 's|</|<|g' | sed 's|<|<|g' | sed 's/br&/> -e :a -e '/\/TD>$/N; s/\/TD>\n/\/TD>/; ta' | sed 's/<\/TABLE/\n<\/TABLE/g' > ubs_eol.html

I simply clean up some Unicode characters and split tables and rows onto separate lines. The page is now ready to be parsed. An example of a more advanced filter, for a source file with more Unicode characters, tabs and spaces:

less ubs.html | tr ',' '.' | sed 's/\cM/è/g' | sed 's/\cW/æ/g' | sed 's|/b>||g' | sed 's/b>/>/g' | sed 's|</|<|g' | sed 's|<| <|g' | sed 's/br&/> a" -e "P;D" | sed 's//TD>/g' | sed -e :a -e '/\/TD>$/N;s/\/TD>\n/\/TD>/; ta' | sed 's/ \n<\/TABLE/\n<\/TABLE/g' > ubs_eol.html
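
The core idea in both commands, getting each table row onto its own line, can be shown in isolation. This is a simplified sketch and not a reconstruction of the pipelines above: split_rows is a made-up name, and it only inserts a newline after every closing TR and TABLE tag. The newline is written as a backslash followed by a real line break, which also works in sed versions that do not understand \n in the replacement.

```shell
#!/bin/sh
# split_rows
# Hypothetical helper: reads HTML on stdin and puts a newline after
# every closing TR and TABLE tag (upper or lower case).
split_rows() {
    sed -e 's/<\/[Tt][Rr]>/&\
/g' -e 's/<\/[Tt][Aa][Bb][Ll][Ee]>/&\
/g'
}
```

For example, split_rows < ubs.html > ubs_rows.html leaves one table row per line, where ubs_rows.html is just a placeholder file name.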

Next I use a custom C filter to get only the data, that is, words and numbers. The filter splits the whole page by tables (TABLE and DIV tags) and reads everything between the tag delimiters ">" and "<". The filter is parameterized and gives us just the table and column we need. An example request:

./filter -d -t 29 -c 3 | tr -d [=\'=] | grep Name -v > OUTPUT.csv

-d - use DIV tags for more accurate table division;
-t - show only table 29;
-c - show column 3 (always prints column #1 plus required one);

The output follows:

DAX,4336.73 DJ
EURO STO,2257.67

DJ Industr ,8048.57
NASDAQ Comb,1465.17
Nikkei 225,8023.31
S&P 500,823.36
SMI,5382.44
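
The C filter itself is not listed here. As a rough stand-in for its column selection only (no table numbering, no DIV handling), the same "column #1 plus the requested one" output can be approximated with awk. Everything in this sketch is an assumption: the name extract_column is invented, the input is expected to have one table row per line, and cells are expected to be closed with a TD end tag.

```shell
#!/bin/sh
# extract_column N
# Hypothetical stand-in for the column part of the C filter:
# reads one table row per line on stdin and prints "cell 1,cell N".
extract_column() {
    awk -v col="$1" '{
        gsub(/<\/[Tt][Dd]>/, "\t")   # a closing TD marks the end of a cell
        gsub(/<[^>]*>/, "")          # drop every remaining tag
        sub(/\t$/, "")               # no empty field after the last cell
        n = split($0, cell, "\t")
        if (n >= col) print cell[1] "," cell[col]   # skip rows that are too short
    }'
}
```

Fed a row with three cells holding DAX, some middle value and 4336.73, extract_column 3 prints DAX,4336.73. Rows with fewer cells than requested are skipped.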


Working examples


Monday, January 12, 2009

se7en impressions on se7en

Let's face it: people use Windows basically because of games and the simplicity of hardware management. You can list more "because" and "because not" reasons, but every time you will come back to these two facts. Every new Windows interface (NOT operating system) arrives with a huge marketing bubble. Here are my 7 impressions of Windows 7.

Size matters. Yes, the first thing I did was check the size of the Windows folder. And yes, it's a BAAAAD impression: 7.6G. It's the same result as for Vista. The magic winsxs folder is again the biggest eater with 4+G. It's a very bad thing compared to XP and common sense.

Services. Even worse. From approx. 133 (Vista) up to 146 (7). But the settings are a little bit better: the Tablet service is stopped, Defender is started but not in active scan mode. You get roughly what you had with XP SP2 after reading through and stopping all the unused services. Not all of them are stopped this time, though, and there are more of them.

Processes. The first good impression? 28 processes after boot-up compared to 40 on Vista. Where is the trick?

Wallpapers. It's one shitty collection of shitty wallpapers. Why are they out of focus? Ever heard of contrast?

Performance. I installed 7 on MS Virtual Server, which itself performs poorly, but 7 runs nicely, especially when it comes to memory usage. I can't believe it: I now have 270 MB free out of 512. Maybe the MS programming department asked its coders to clean up some code left over from Windows 2000. Or maybe they finally tried writing the code themselves instead of relying on a Studio wizard.
File deletion performs as badly as in Vista. Bad bad bad MS.
For a better test we need some real hardware.

Control Panel. Hmm, I've seen this before. Hmm, G... no, it can't be. Gn.. o.. No, I can't believe it. Face it, MS, you are copying more and more with every release. You don't get details, and the icons are arranged in columns. MS forgot to add grouping like in the Gnome control panel. Oops, I said it out loud.

Explorer. Oh-oh. No white space to right-click on in Explorer. Damn you, I've seen this before. Nooo, it's Gnome again. When I first had to choose between KDE and Gnome, I read this: "Gnome is more like Windows environment, Windows users will have a better experience and intuitiveness". Well, now it seems that Windows Explorer is more like Nautilus. Shame on you.
The file system is still full of dead links for XP backward compatibility.
Those big icons everywhere are not there because you are using a handicapped version of 7; they are there on purpose: 7 is touch-screen oriented.

Conclusions. 7 is a fat-ass Vista clone with reduced memory usage (or cleaned-up code) and a better-looking Gnome skin. I had no chance to test hardware issues and program compatibility, but I'm guessing it's the same Vista engine. Hey, there is a new Paint interface, is that why 7 was released??