office space: 2015

Wednesday, December 9, 2015

ImageMagick and ghostscript delegate problem: sh gs command not found

PROBLEM

Recent ImageMagick update kicked a little in the nuts. I mainly convert PDF to PNG and then use them for OCR. Here is how it welcomed me after an update to 6.7.2:

sh: gs: command not found
convert: missing an image filename

It works ok from a command line, but fails from Java or any external environments. At first the error may look like a command line syntax change, but it isnt, dont start moving your flags places or in-front/after PDF file - this is not the problem here.

SOLUTION

1) # which gs
2) locate delegates.xml file in your system
3) open the file for editing (# nano /etc/ImageMagick/delegates.xml)
4) find all occurrences of "gs", which will be simply put in html quotes: <...>command=""gs" -q <..> and replace it with the output of you "which gs" result, in my case it looked like this: <..>command=""/usr/local/bin/gs" -q <..>. Leave the surrounding syntax intact.

After that just save the file and run you external command again. There will be around 10 occurrences, replace them all to be on the safe side, I suspect that its only PS configurations that had to be changed.

Wednesday, November 25, 2015

SSL certificate hash transition from SHA1 to SHA2. ORA-28857

Abstract

The SHA-1 cryptographic hash algorithm has been known to be considerably weaker than it was designed to be since at least 2005. As part of their SHA-2 migration plan, Google, Microsoft and Mozilla have announced that they will stop trusting SHA-1 SSL certificates. Google will begin phasing out trust in SHA-1 certificates by the end of 2014, while Microsoft and Mozilla will begin phasing out trust for SHA-1 certificates in 2016.

Problem(s)

If you are using Google Chrome - some https websites or services using SHA1 will not be opened and you will likely get an error displayed in article image. Another situation that you may bump into is Oracle Wallet. Looks like Oracle up to 11.2.0.2 doesn’t support SHA-2 based certificates. If you have a 10.x wallet and website changed its certificate to SHA2 - you will not be able to import it - you will get an error "some trusted certificates could not be installed". See image attached.
Also when connecting to such a site using UTL_HTTP.REQUEST you will end up with an ORA error: "Request Failed: ORA-28857: Unknown SSL error".

Suggestions

Upgrade Java, Tomcat and Oracle.

Dates to remember

November 2014 - SHA-1 SSL Certificates expiring any time in 2017 will show a warning in Chrome 39.
December 2014 - SHA-1 SSL Certificates expiring after May 31, 2016 will show a warning in Chrome 40.
January 2015 - SHA-1 SSL Certificates expiring any time in 2016 will show a warning in Chrome 41.
January 1, 2016 - Microsoft ceases to trust Code Signing Certificates that use SHA-1.
January 1, 2017 - Mozilla Firefox and Microsoft ceases to trust SSL Certificates that use SHA-1.

Services currently supporting SHA2

Apache server 2.0.63+ with OpenSSL 0.9.8o+
Java based servers using Java 1.4.2+
OpenSSL based servers using OpenSSL 0.9.8o+
Oracle Wallet Manager 11.2.0.1+
Oracle Weblogic 10.3.1+

Thursday, October 29, 2015

Larry Ellison on self

“When you write a program for Android, you use the Oracle Java tools for everything, and at the very end you push a button and say, ‘Convert this to Android format.'” - Larry Ellison, 2013, CBS interview.

Oracle 10g on Centos 6.7 - x86 issues: ntcontab.o, snmccolm.o, ORA-27125

Abstract

Its getting harder and harder to install good old 10g on newer Centos versions. Last attempt on 6.5 was semi-problematic, with some extra packages missing, 6.7 deployment was even more challenging.

System specifications

Centos version (/etc/issue): CentOS release 6.7 (Final)
Oracle version: Version 10.2.0.1.0 Production (10201_database_linux_x86_64.cpio)

Problems and solutions

#1: Error invoking target 'ntcontab.o' of makefile

This error occurred around 65% in installation progress, aborting is not an option - more errors will follow and in the end whole process fails. Did some testing and it appears that one process is building the file, next one is instantly deleting it:

# cp /misc/oracle/product/10.2.0/db_1/lib32/ntcontab.o /misc/oracle/product/10.2.0/db_1/lib/

But in the end it looks like it was one of the following (where not needed in 6.5):

# yum install libaio-devel.i686 -y
# yum install zlib-devel.i686 -y
# yum install glibc-devel -y
# yum install glibc-devel.i686 -y
# yum install libaio-devel -y
# yum install ksh -y
# yum install glibc-headers

#2: /misc/oracle/database/product/10.2.0/db_1/sysman/lib/snmccolm.o: could not read symbols: File in wrong format

It might be that it may be solved with some more x86 packages thrown into the pile, I was not able to find the exact culprit. Many of the sources online simply telling to ignore this and fix it with 10.2.0.4 patch. Thats what we'll do: Ignore.

#3: ORA-27125: unable to create shared memory segment

Reason is unknown, it was thrown by DBCA with continuous installation process.
The solution is very simple, first check the oracle user group information:

[oracle@storage] $ id oracle
uid = 500 (oracle) gid = 502 (oinstall) groups = 502 (oinstall), 501 (dba)
[oracle@storage] $ more /proc/sys/vm/hugetlb_shm_group
0

Execute the following command as root, the dba group is added to the system kernel:

[oracle@storage] $ echo 501 > /proc/sys/vm/hugetlb_shm_group

Continue with DBCA, step will fail, but then retry DBCA, the problem disappeared and database was created.

#4: bonus problem, not Centos 6.7 related: You do not have enough free disk space to create the database

My bad was that I had 12TB storage mounted, looks like I missed the 10g storage requirement: 400MB, but less then 2TB. Found no other solution then to unmount /dev/sdb1, shrink it with gparted to get ~500GB space, ext4 it and mount the new device /dev/sdb2 to another mount point /oracle.

Conclusion

Move to 11i or 12c, its about time.

Monday, October 5, 2015

Server monitoring recipe with SNMP: Observium + Nagios

Objective

Everyone having at least couple of servers, even a single server, would want to monitor it eventually. Some time ago I used MRTG for all that, but as the needs expanded I could do less and less with it and in the end it even became too complicated to use. MRTG is powerful, yet vulnerable to simple server restarts - you have to remap your pins.

Recipe

Will jump to it right away: the best option currently is Observium + Nagios. Will tell about the first one in a separate paragraph, it might suffer from an early death some day, but currently its a good tool for the job. I ended up using two tools because Nagios has a very good alerting system, but lacks interfaces and as you probably guessed already Observium has interfaces, but lacks alerting system.

Observium

The peckers behind this tool are pretty questionable. Some time ago they had a fundraising campaign to collect some doe and implement an alerting system. After funds were raised - they removed the promised functionality from the Community release and made it part of their paid version. Money is money, but hey, Internet knows everything. Further more I tried to communicate with them on Facebook - all my page messages and comments where removed and all PM's ignored.
Nevermind the folks, their tool is good for one thing - drawing nice charts:

Nagios

Where Observium fails - Nagios can help. Its an open source project, no need to tell more. It lacks interfaces and historical information (excluding payed plugins and extensions), but it has a powerful alerting system. Just setup a couple of users with emails and you are done:

Conclusions

Nagios allows you to receive an email in the middle of the forest when your backup drive hits a warning limit while Observium helps you analyze and plan you infrastructure, workloads and record historical events. I can now see that admin still hasnt added memory to our webserver and I asked for that a week ago.

Saturday, August 29, 2015

Gmail attachment cleanup. Cleanup large gmail letters

Even though Gmail gives you quite a lot of storage, for years I have developed a habit to clean up the trash. It just feels right.

Type in "size:20000000" for filtering mails with attachments >20MB. Seach query is in bytes, filter by any size limit you want. "size:5000000" for emails with larger then 5MB attachments.

Once you find the monstrous mails - you can either delete it all or just one of the messages. If you open the conversation - large messages are always expanded. This is just great if you have some development material which is out of date, but you would still like to save the conversation for later. Just select "Delete this message" from message tools.

Friday, July 10, 2015

Cuba Libre

Here is another break from all the typing.

----------------------------------------------------
* 2 oz (50ml) white rum  Place the ice into a highball glass.
* 1 lime                 Pour over the rum. Some more rum.
* Cola                   Cut the lime in quarters. Squeeze
* Ice                    one lime in (optional). Drop all the
                         limes in. Top it up with Cola.
-------------------------------------------------------------

More Java grants on Oracle. ORA-29532 java.io.FilePermission

Abstract

I got a simple file system writer/reader, it starts with Oracle Directory alias and continues generating folders using organization number and some bits of date. Alias part is static, the rest.. ..is suppose to be generated infinitely. Not including mount, ownership and permission details, basically your main folder and subfolders have to be fully available to user running Oracle.

Short spec
Oracle Directory: /attachments/ (alias ATTACHMENTS)
Organization id: 301
Todays date monthly token: 0715
Schema in use: AWS

Error

Lets start with stack trace:

<...>

java.security.AccessControlException: the Permission (java.io.FilePermission /attachments/301/0715/19871_head.txt write) has not been granted to AWS. The PL/SQL to grant this is dbms_java.grant_permission( 'AWS', 'SYS:java.io.FilePermission', '/attachments/301/0715/19871_head.txt', 'write' )

<...>

oracle.jdbc.driver.OracleSQLException: ORA-29532: Java call terminated by uncaught Java exception: java.security.AccessControlException: the Permission (java.io.FilePermission /attachments/301/0715/19871_head.txt write) has not been granted to AWS. The PL/S
QL to grant this is dbms_java.grant_permission( 'AWS', 'SYS:java.io.FilePermission', '/attachments/301/0715/19871_head.txt', 'write' )

<...>

Possible fixes

Thing is you need write permissions in your Oracle dir, but in this case its recursive and never ending. I start with this:

BEGIN
dbms_java.grant_permission( 'AWS', 'SYS:java.io.FilePermission', '/attachments/*', 'write' );
END;

Small bit that made me spend couple of hours was recursive Java grant, just use dash "-" instead of "*" and grant will be valid for all your subdirectories:

BEGIN
dbms_java.grant_permission( 'AWS', 'SYS:java.io.FilePermission', '/attachments/-', 'write' );
END;

Just in case you need more then write - use full fleet of file permission types:

BEGIN
dbms_java.grant_permission( 'AWS', 'SYS:java.io.FilePermission', '/attachments/-', 'read,write,delete' );
END;

Thursday, July 2, 2015

Leap second bug 2015. Linux/Centos, 100% CPU: Java, Oracle, OPMN, Tomcat

Oh dear, looks like there are services having serious issues with the Leap Second added last night. Read more about Leap Second 2015 in Wiki. Fix is simple:

# service ntpd stop; date -s "`date`";service ntpd start;

or

# /etc/init.d/ntpd stop; date -s "`date`"; /etc/init.d/ntpd start;

The problem occurred on an older Java/Oracle running webserver. All CPU's went 100% high. All services that had anything to do with JVM have gone bonkers: Tomcat, OPMN, Oracle, WebCache.

At first I disabled services that where failing and where not so important, but then all the others jumped to 100% CPU. It took some minutes before the situation was clear - all stuck services had one thing in common - JAVA. Once they went down, CPU went to idle. Ones who where prepared for this day did that 3 years ago. Happy restarting all the lazy admins.

Friday, June 26, 2015

Java source compilation in Oracle

Its a pretty lame subject, I had very limited access to customers box and was not able to use my toys on Toad. Amazingly it took a while to get things in one place and there are a couple of problematic points along the way.

Abstract

A couple of Java source packages residing in Oracle. Need them transferred and compiled.

Compiling Java source in Oracle

ALTER JAVA SOURCE OWNER.JavaSource COMPILE;
ALTER JAVA CLASS OWNER.JavaClass RESOLVE;

After I tried to compile the sources, compile returned success, resolve though returned nothing. This means we have errors or dependencies. My Java source is called "ReceiveMail", name stands for itself..

Debugging Java source compile

Actually its very easy, you will easily find additional filters for your query. All_errors/user_errors/dba_errors tables dont have timestamp, but its not needed, they store recent errors (or warnings in case of Java source):

SELECT * FROM user_errors WHERE name = 'RECEIVEMAIL';

Compiling Java source in Oracle example

There is one more (un)expected twist, normally Oracle creates objects in UPPERCASE, just thats not the case for Java source. Standard is respected. This also led to a stupid one hour delay because I was not able to locate the source which was already compiled and valid.

ALTER JAVA SOURCE RECEIVEMAIL COMPILE;
ALTER JAVA CLASS "ReceiveMail" RESOLVE;

Dont forget the correct "ClassName" in correct case. The resolve command can actually also compile the source I did not performed a full scale analysis on when and how, but two other packages that I had got automatically compiled while I was playing with resolve command.

Use this select to see the state of you Java objects:

SELECT object_name, object_type, status FROM user_objects WHERE object_type like '%JAVA%';

You will see if there are any more filters needed, I only had 8 packages, so it was enough. After a successful compilation objects status changed to valid.

Extra grants

If you got this far and your sources are working, then thats it. I was missing one more extra grant for my schema which is called 'INVOICE', you must have appropriate permissions to run this and choose your own schema instead of 'INVOICE':

BEGIN
dbms_java.grant_permission('INVOICE', 'SYS:java.util.PropertyPermission', '*', 'read,write' );
dbms_java.grant_permission( 'INVOICE', 'SYS:java.net.SocketPermission', 'pop.server.com:*', 'accept, connect, resolve' );
END;

Friday, May 15, 2015

PDF page count on Linux commandline

# pdfinfo yourpdffilename.pdf | grep Pages | awk -F: '{print $2}' | tr -d '[:blank:]'

pdfinfo on Centos is provided by package poppler-utils

Wednesday, April 22, 2015

Shell: loop through multiple files and pass them to the script

Short intro

Script file ./main.sh - bash script, doing the htm/html file processing. HTML - is a folder name. Problem: pass all the htm files to the script file as parameters for processing.

Passing files as parameters

# cd HTML
# ll
-rwxr-xr-x 1 oracle root 25828 Apr 13 15:03 filter
-rwxr-xr-x 1 oracle root 9955 Apr 13 15:03 filter.c
-rw-r--r-- 1 oracle root 2142 Apr 22 08:37 loader.ctl
-rw-r--r-- 1 oracle root 9792 Apr 22 08:46 loader.log
-rwxr-xr-x 1 oracle root 2648 Apr 22 10:04 main.sh
-rw-r--r-- 1 root root 781 Apr 22 09:49 10L31JKYRF5UH4.htm
-rw-r--r-- 1 root root 641 Apr 22 09:49 10L31JUKER1WP1.htm
-rw-r--r-- 1 root root 904 Apr 22 09:49 10L31JULIT5LI3.htm
-rw-r--r-- 1 root root 858 Apr 22 09:49 10L31JUOER5GT3.htm
-rw-r--r-- 1 root root 683 Apr 22 09:49 10L31JUPEM9TH9.htm
........................
# for FILE in *.htm; do ./main.sh $FILE; done;

Tuesday, April 21, 2015

TIFF: alternatiff, npapi, Chrome 42

Sad news from Chrome

April 2015

In April 2015 (Chrome 42) NPAPI support will be disabled by default in Chrome and we will unpublish extensions requiring NPAPI plugins from the Chrome Web Store.

September 2015

In September 2015 (Chrome 45) we will remove the override and NPAPI support will be permanently removed from Chrome. Installed extensions that require NPAPI plugins will no longer be able to load those plugins.

Temporary solution

Sad news to plugin users, in my case its Alternatiff. Got 9 other plugins which are not so painful. After recent update to Chrome v42 Alternatiff plugin is not longer loaded. Check it here: chrome://plugins.
There is a temporary fix for now, found in Java page: https://java.com/en/download/faq/chrome.xml#npapichrome

Here is what you do:

1. In your URL bar, enter:
chrome://flags/#enable-npapi
2. Click the Enable link for the Enable NPAPI configuration option.
3. Click the Relaunch button that now appears at the bottom of the configuration page.

Monday, April 20, 2015

SQL Loader: multiple tables, multiple problems

Short intro

Continuous development of Ajax Crawler importer led to data import problems. Oracle SQL Loader was used in all previous versions of my crawler, just this time its a multi-table structure in both: datafile and database. Spent almost 2 weeks on the simple subject and after someone pointed out the solution I was not able to find more then 2 references online, so hopefully its the third one for you: "position(1)" - you must reset the loader if you are importing into multiple tables even though this directive looks like a fixed length argument. Table and datafile examples are shortened, just to give you the explanation of control file.

Tables

Im importing invoices into three tables: inv_invoices_imp, inv_invoice_lines_imp, inv_invoice_comments_imp. Inv_invoices_imp contains invoice header, inv_invoice_lines_imp contains accounting and invoice line data, inv_invoice_comments_imp - user comments.

Datafile

Datafile is a HTML file, containing 6 different tables/blocks: some text, header table, some text, lines table, comments table, some text. One datafile contains one invoice data.

SQL Loader config

Oracle SQL Loader can read, parse and load almost any type of data. Any separation, fixed or delimiter separated, single or multiple sources and destinations. For out case we got multistructured datafile and three different destination tables. Options used: truncate table, skip rows, conditional rows, fillers, sequences, foreign keys. If you got stuck with SQL Loader loading only empty lines and having no errors in log files here is a checklist: column names, data types, missed separators, encoding. Only full list I was able to find is here, except the position(1) part.

Full loader.ctl file

OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
INTO TABLE inv_invoices_imp
TRUNCATE
--APPEND
WHEN (1:1) = 'H'
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
(dummy0 FILLER, VENDOR_NUM, ACCOUNT_NUM, VENDOR_NAME, VENDOR_ORG, INVOICE_NUM, INVOICE_DATE "to_date(:INVOICE_DATE,'MM/DD/YYYY')",
DUE_DATE "to_date(:DUE_DATE,'MM/DD/YYYY')", VALUTA, AMOUNT "to_number(:AMOUNT,'99999999999.9999')", VALUTA_EX "to_number(:VALUTA_EX,'99999999999.9999')",
AMOUNT_NOK "to_number(:AMOUNT_NOK,'99999999999.9999')", KID, BILAGSNR, dummy1 FILLER, dummy2 FILLER, TAX "to_number(:TAX,'99999999999.9999')",
dummy3 FILLER, dummy4 FILLER, dummy5 FILLER, dummy6 FILLER, dummy7 FILLER, dummy8 FILLER, dummy9 FILLER, DERESREF, dummy10 FILLER,
dummy11 FILLER, dummy12 FILLER, INVOICE_ID EXPRESSION "INV_INVOICES_IMP_SEQ.nextval"
)
INTO TABLE inv_invoice_lines_imp
TRUNCATE
WHEN (1:1) = 'L'
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
(dummy0 FILLER POSITION(1), S1, S1_NAME, S2, S3, BELOP "to_number(:BELOP, '999999999999.9999')", DESCRIPTION, VAT_ID,
VAT_AMOUNT "to_number(:VAT_AMOUNT, '999999999999.9999')", BELOP_NOK "to_number(:BELOP_NOK, '999999999999.9999')",
S4, S5, S6, S7, dummy1 FILLER, FAKTURAID EXPRESSION "INV_INVOICES_IMP_SEQ.currval", ID EXPRESSION "INV_INVOICE_LINES_IMP_SEQ.nextval"
)
INTO TABLE inv_invoice_comments_imp
TRUNCATE
WHEN (1:1) = 'C'
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
(dummy FILLER POSITION(1), CUSER, ACTION, CDATE, DESCR, FAKTURAID EXPRESSION "INV_INVOICES_IMP_SEQ.currval",
ID EXPRESSION "INV_INVOICE_COMMENTS_IMP_SEQ.nextval"
)

Previous problems and config explanation

UTF8 - charset spec, I suggest you use it even though your file and database are unicode.
FILLER - useful argument, column name going with it may not exist in database table, for the same table - there cant be duplicates, so use them like dummy1, dummy2, etc. If you dont know it yet - here you specify order of your data in a datafile using destination table columns. use FILLER on the data columns that you want to skip.
to_date, to_number - a must use if your destination column is numeric or date, suggest importing them all as VARCHAR2 at first, then converting to desired datatype and checking them one by one.
EXPRESSION .NEXTVAL - will mention this one, you dont have to have this one in datafile, but its essential for creating foreign key relation with other related tables.
POSITION(1) - hopefully its the directive you are here for. Its used twice in two related tables and placed after the first column in setup. When loading into more than one table, the position has to be reset for each table after the first one, using POSITION(1) with the first field, even though it looks like fixed length directive. If you miss this directive you will end up with nice empty table lines with sequences and foreign keys, no errors in log file. With some luck you might see "all fields were null" message - but you must be very lucky - usually because of some other related error.
EXPRESSION .CURRVAL - not much magical, but here is how you establish relation with your parent table. Hopefully your data complexity is similar. I'm also using child tables ID sequence in control file just to show you the full view. Child ID generation is only needed if you use conventional data load path.

Sample datafile (chopped)

<...>
H;LeverandÃ¸rnr;Bankkontonr;LeverandÃ¸rnavn;Organisasjonsnr;Fakturanr;Fakturadato;Forfallsdato;Valuta;FakturabelÃ¸p;Valutakurs;FakturabelÃ¸pNOK;KID;Bilagsnr;Scannebatch;Duplikat;MvabelÃ¸p;NettobelÃ¸p;Fakturatype;Val.dok;Selskapskode;Selskap;Refusjon postnr sted;refusjon Land;Deres Ref;Refusjon navn;Refusjon adresse;
H;40013;62190581506;TUR-RETUR AS - NO 870 989 587;870989587;105358;2/19/2015;3/1/2015;NOK;5064;1;5064;103071053583;
80746991;;N;403;4661;1;;FT;GatoFly AS;;;;;;
L;Konto;Kontonavn;Avdeling;Prosjekt;BelÃ¸p;Bilagstekst;MVA-kode;MVA belÃ¸p;BelÃ¸p NOK;Anlegg;Produkt;Salgssted;Kanal;Sats
L;7135;Reisekostnader;4500;1400;220;Nye FT. opphold H.Hernes 26-28.2/1-3.3;0;0;220;;;;;0
L;7135;Reisekostnader;4500;1400;4844;Nye FT. opphold H.Hernes 26-28.2/1-3.3;1D;358.81;4844;;;;;8
L;Fakturahistorikk
C;Bruker;Handling;Dato;Kommentarer
C;BTIP Connector ;Lagret av BTIPC ;2/24/2015 11:12:12 AM ;E-invoice saved by BTIPC
C;brigde ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;Autosirk- referanse blank
C;brigde ;Grunnlagsdata endret ;2/24/2015 11:28:04 AM ;fakturatype-1
C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;matchSupplierAccount. match pÃ¥ konto.40013
C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;Endret flytstatus
C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;setCompName OK.
<...>

As you can see data is semicolon separated, first column is destination identifier, date and number separators are visible as well. Data has some crap text lines, but does not matter now. No external ID's or references are used. Invoice line import lines ("L") can be anywhere in the file, does not matter that now they are between header and comments. First column and some others are marked as FILLER in control file.

HTML to datafile

Will reveal some more cards for you. Data file was an old HTML file, it was missing end-tags, using a couple of self aspired tags, data formatting was also not very handy. Here is full source of my bash script used to prepare the file for reading.

#!/bin/sh
file="$1"
echo processing $file

echo converting to unicode
cat $file | iconv -f utf-16 -t utf-8 > "$file".out
echo done

echo HTML cleanup
less "$file".out | tr ',' '.' | sed 's/ //g' | sed 's/\cM//g' | sed 's/\cW//g' | sed 's/<\/TR>/<\/TR> /g' | sed 's| sed 's/ / \n/g' | sed ':a;N;$!ba;s|\n
sed 's/ / \n/g' > "$file".clean

echo cleanup complete

echo header and lines separation
./filter -t 2 -c 2 -f "$file".clean > "$file".tmp
./filter -t 3 -f "$file".clean > "$file".lines
./filter -t 4 -f "$file".clean > "$file".comments
echo done separating

echo transposing headers
cols=2; for((i=1;i<=$cols;i++)); do awk -F ";" 'BEGIN{ORS=";";} {print $'$i'}' "$file".tmp | tr '\n' ' '; echo; done > "$file".header
echo transposed

echo cleanup
rm "$file".out -rf
rm "$file".clean -rf
rm "$file".tmp -rf
echo cleaned up

echo single file

sed -e 's/^/H;/' "$file".header > "$file".out

sed -e 's/^/L;/' "$file".lines >> "$file".out

sed -e 's/^/C;/' "$file".comments >> "$file".out

echo joined

echo sql loader start

sqlldr schema/******@sid data="$file".out control=loader.ctl discard="$file".discard

echo loaded

Bash script usage is simple:

# script.sh data_file.html

Conversion explanations

Conversion - my html file was encoded in utf16 so first step is to get some readable file instead of binary looking one.

HTML cleanup - examples and more explanations are available in previous post Crawling AjAx part 2. In this case I had to add the missing end-tags, generate data separators, move some new lines forth and back to have a readable file.

Filter - a modified HTML table selection script. Source is also available in Crawling AjAx part 2. This script picks desired table and column data from a formated HTML file.

Transposition - new problem, header table data is vertical, lines and comments - horizontal. Have to separate header and make the data horizontal as well.

Last steps - transposed data is joined back to a working file, each table data gets a distinctive line marker to be used with SQL Loader. Last step - SQL Loader call. You can skip the cleanup step to see the temporary working files if needed.

Contact

Contact me simakas[at]gmail.com for details or original source code if needed.

office space