Efficient Data Load using Java with Oracle

Getting your data from its source (where it is generated) to its destination (where it is used) can be very challenging, especially when you have performance and data-size constraints (how is that for a general statement?).

The standard Extract-Transform-Load sequence explains what is involved in any such Source -> Destination data-transfer at a high level.

We have a data-source (a file, a database, a black-box web-service) from which we need to ‘extract’ data, then we need to ‘transform’ it from source format to destination format (filtering, mapping etc.) and finally ‘load’ it into the destination (a file, a database, a black-box web-service).

In many situations, using a commercial third-party data-load tool or a data-loading component integrated with the destination (e.g. SQL*Loader) is not a viable option. The scenario is further complicated if the data-load task itself is a big one (say upwards of 500 million records within 24 hours).

One example of the above situation is loading data into a software product using a ‘data loader’ specific to it. Such ‘customized’ data-loaders allow the decoupling of the product’s internal data schema (i.e. the ‘Transform’ and ‘Load’ steps) from the source format (i.e. the ‘Extract’ step).

The source format can then remain fixed (a good thing for the customers/end users) and the internal data schema can be changed down the line (a good thing for product developers/designers), simply by modifying the custom data-loader sitting between the product and the data source.

In this post I will describe some of the issues one can face while designing such a data-loader in Java (1.6 and upwards) for an Oracle (11g R2 and upwards) destination. This is not a comprehensive post on efficient Java or Oracle optimization. This post is based on real-world experience designing and developing such components. I am also going to assume that you have a decent ‘server’ spec’d to run a large Oracle database.

Preparing the Destination 

We prepare the Oracle destination by making sure our database is fully optimized to handle large data-sizes. Below are some of the things that you can do at database creation time:

– Make sure you use BIGFILE table-spaces for large databases; a BIGFILE table-space uses a single, very large data-file, which makes storage management simpler for big data sets (a minimal example follows this list).

– Make sure you have large enough data-files for the TEMP and SYSTEM table-spaces.

– Make sure the constraints, indexes and primary keys are defined properly as these can have a major impact on performance.
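As a rough illustration of the first point, a BIGFILE table-space might be created with DDL like the following. The table-space name, data-file path and sizes are made up for the example, and a DBA would normally run this from a SQL client rather than from the loader itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateBigfileTablespaceSketch {

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details and DBA credentials.
        Connection connection = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "system", "password");
        Statement stmt = connection.createStatement();
        try {
            // One very large, auto-extending data-file instead of many small ones.
            stmt.execute("CREATE BIGFILE TABLESPACE LOAD_DATA "
                    + "DATAFILE '/u01/oradata/ORCL/load_data01.dbf' "
                    + "SIZE 10G AUTOEXTEND ON NEXT 1G");
        } finally {
            stmt.close();
            connection.close();
        }
    }
}
```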

For further information on Oracle database optimization at creation time you can use Google (yes! Google is our friend!).

 

Working with Java and Using JDBC

This is the first step to welcoming the data into your system. We need to extract the data from the source using Java, transform it, and then use JDBC to inject it into Oracle (using the product-specific schema).

There are two separate interfaces for the Java component here:

1) Between Data Source and the Java Code

2) Between the Java Code and the Data Destination (Oracle)

Between Data Source and Java Code

Let us use a CSV (comma-separated values) format data-file as the data-source. This will add a bit of variety to the example.

Using a ‘BufferedReader’ (java.io) one can easily read gigabyte-sized files line by line. This works best if each line of the CSV contains one data row, so that we can read, process and discard each line in turn. Not having to hold more than a line at a time in memory keeps the application’s memory footprint small.
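As an illustration, here is a minimal sketch of that read-process-discard loop; the file name, the column layout and the processRow method are assumptions made up for the example:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvReaderSketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical source file; one data row per CSV line is assumed.
        BufferedReader reader = new BufferedReader(new FileReader("source_data.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // A simple split is enough only if values never contain commas;
                // otherwise use a proper CSV parser.
                String[] columns = line.split(",");
                processRow(columns); // 'transform' step: map, filter, validate
            }
        } finally {
            reader.close(); // release the file handle even if processing fails
        }
    }

    private static void processRow(String[] columns) {
        // Placeholder: map source columns to the destination schema here.
    }
}
```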

Between the Java Code and the Destination

The second interface is where things get really interesting: making Java work efficiently with Oracle via JDBC. The most important feature when inserting data into the database, the one you cannot do without, is the batched prepared statement. Using prepared statements (PS) without batching is like taking two steps forward and ten steps back; in fact, using a PS without batching can be worse than using normal statements. Therefore always use PSs, batch them together and execute them as a batch (using the executeBatch method).

A point about the Oracle JDBC drivers: make sure the batch size is reasonable (i.e. less than 10K). With certain versions of the Oracle JDBC driver, if you create a very large batch, the batched insert can fail silently while you are left feeling pleased that you just loaded a large chunk of data in a flash. You will discover the problem only if you check the row count in the database after the load.
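Below is a minimal sketch of such a batched insert. The DESTINATION_TABLE name, its two columns and the already-transformed ‘rows’ list are assumptions for the example, and the batch size of 500 is just an arbitrary ‘reasonable’ value well under the ~10K ceiling discussed above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInsertSketch {

    private static final int BATCH_SIZE = 500; // modest batch size, well below 10K

    // 'rows' holds the already-transformed data: one String[] per destination row.
    // The caller is assumed to have called connection.setAutoCommit(false).
    public static void load(Connection connection, List<String[]> rows) throws SQLException {
        PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO DESTINATION_TABLE (ID, NAME) VALUES (?, ?)");
        try {
            int pending = 0;
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setString(2, row[1]);
                ps.addBatch();          // queue the insert instead of executing it now
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();  // send the whole batch to Oracle in one go
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();      // flush the final, partially filled batch
            }
            connection.commit();
        } finally {
            ps.close();
        }
    }
}
```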

If the data-load involves sequential updates (i.e. a mix of inserts, updates and deletes), batching can still be used without destroying data integrity. Create separate batches for the insert, update and delete prepared statements and execute them in the following order (see the sketch after this list):

  1. Insert batches
  2. Update batches
  3. Delete batches
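A minimal sketch of that ordering is shown below; the three prepared statements are assumed to have been created and filled (via addBatch) elsewhere in the loader:

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderedBatchSketch {

    // Each PreparedStatement is assumed to already have its queued rows added
    // with addBatch(); this method only enforces the execution order.
    public static void flushInOrder(PreparedStatement insertPs,
                                    PreparedStatement updatePs,
                                    PreparedStatement deletePs) throws SQLException {
        insertPs.executeBatch(); // 1. inserts first
        updatePs.executeBatch(); // 2. then updates
        deletePs.executeBatch(); // 3. deletes last
    }
}
```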
One drawback of using batches is that if one statement in a batch fails, the whole batch fails (the driver reports this via a BatchUpdateException), which makes it hard to detect exactly which statement failed (another reason to keep batch sizes small, around 100).

Constraints and Primary Keys

The Constraints and Primary Keys (CoPs) on the database act as the gatekeepers of the destination. The data-load program is like a truck driving into the destination with a lot of cargo (data).

CoPs can either be disabled while the data-load is carried out, or they can remain on. If we disable them during the data-load, then when re-enabling them we can either have Oracle validate the existing data against them, or ignore the existing data and enforce them only for new operations.

Whether CoPs are enabled or disabled, and whether post-load validation of the existing data is carried out, can have a major effect on the total data-load time. We have three main options when it comes to CoPs and data loading:
  1. Obviously the quickest option, in terms of our data-load, is to drive the truck through the gates (CoPs disabled) and dump the cargo (data) at the destination, without stopping for a check at the gate or after unloading (CoPs enabled for future changes but existing data not validated). This is only possible if the contract with the data-source provider puts the full responsibility for data accuracy with the source.
  2. The slowest option will be if the truck is stopped at the gates (CoPs enabled), unloaded and each cargo item examined by the gatekeepers (all the inserts checked for CoPs violations) before being allowed inside the destination.
  3. A compromise between the two (i.e. the middle path) would be to allow the truck to drive into the destination (CoPs disabled), unload it, and check the cargo at the time of transferring it to the destination (CoPs enabled after the load and existing data validated).
The option chosen depends on the specific problem and the various data-integrity requirements. It might be easier to validate the data file ‘in memory’ before an expensive data-load process is carried out, in which case we can use the first option.
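On the Java side, disabling and re-enabling CoPs is just a matter of issuing DDL over JDBC, as in the rough sketch below. The table and constraint names are hypothetical, and ENABLE NOVALIDATE vs ENABLE VALIDATE corresponds to skipping or performing the post-load check described above:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ConstraintToggleSketch {

    // Hypothetical table and constraint names, for illustration only.
    public static void disableBeforeLoad(Connection connection) throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            stmt.execute("ALTER TABLE DESTINATION_TABLE DISABLE CONSTRAINT DEST_PK");
        } finally {
            stmt.close();
        }
    }

    public static void enableAfterLoad(Connection connection, boolean validateExistingData)
            throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            // VALIDATE checks the data already in the table (option 3);
            // NOVALIDATE enforces the constraint only for future changes (option 1).
            String mode = validateExistingData ? "ENABLE VALIDATE" : "ENABLE NOVALIDATE";
            stmt.execute("ALTER TABLE DESTINATION_TABLE " + mode + " CONSTRAINT DEST_PK");
        } finally {
            stmt.close();
        }
    }
}
```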
Indexes

We need indexes and primary keys for performing updates and deletes (try a large table update or delete with and without indexes, then thank God for indexes!).

If your data-load consists only of inserts and you are loading data into an empty or nearly empty table (relative to the amount of data being loaded), it might be a good idea to drop any indexes on it before starting the load.

This is because as data is inserted into a table, any indexes on it are updated as well, which takes additional time. If, on the other hand, the table already contains a lot of data compared to the size of the new data being loaded, the time saved by dropping the indexes will be lost when rebuilding them.

After the data is loaded we need to rebuild any dropped indexes and re-enable CoPs. Be warned that rebuilding indexes and re-enabling CoPs can be very time consuming and can take a lot of SYSTEM and TEMP space.
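A minimal sketch of the drop-and-rebuild approach, again over JDBC and with hypothetical index, table and column names:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class IndexRebuildSketch {

    // Hypothetical index, table and column names, for illustration only.
    public static void dropBeforeLoad(Connection connection) throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            stmt.execute("DROP INDEX DEST_NAME_IDX");
        } finally {
            stmt.close();
        }
    }

    public static void recreateAfterLoad(Connection connection) throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            // Building the index once, after all rows are in place, is usually
            // cheaper than maintaining it row by row during the load.
            stmt.execute("CREATE INDEX DEST_NAME_IDX ON DESTINATION_TABLE (NAME)");
        } finally {
            stmt.close();
        }
    }
}
```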
Logging

Oracle, being a ‘safe’ database, maintains a ‘redo’ log so that in case of a database failure we can perform recovery and return the database to its original state. For certain operations (such as direct-path inserts and index creation) this logging can be minimized by using the NOLOGGING option, which can give a significant performance boost.

A major drawback of using NOLOGGING is that you lose the ability to ‘repeat’ (recover) the operations performed while this option is set. When using this option it is very important to take a database backup before and after the load process.

NOLOGGING is something that should be used judiciously and with a lot of planning to handle any side-effects.
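For example, one might switch the destination table to NOLOGGING for the duration of the load and back to LOGGING afterwards, as in the sketch below (NOLOGGING mainly benefits direct-path operations such as INSERT /*+ APPEND */ and index builds). The table name is hypothetical, and backups before and after the load are assumed:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class NologgingLoadSketch {

    // Hypothetical table name; backups before and after the load are assumed.
    public static void prepareForLoad(Connection connection) throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            // Minimize redo generation for direct-path operations on this table.
            stmt.execute("ALTER TABLE DESTINATION_TABLE NOLOGGING");
        } finally {
            stmt.close();
        }
    }

    public static void restoreAfterLoad(Connection connection) throws SQLException {
        Statement stmt = connection.createStatement();
        try {
            // Switch normal redo logging back on once the bulk load is finished.
            stmt.execute("ALTER TABLE DESTINATION_TABLE LOGGING");
        } finally {
            stmt.close();
        }
    }
}
```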
Miscellaneous

There are several other, more exotic techniques for improving large data loads on the Oracle side, such as partitioned tables, but these require more than ‘basic’ changes to the destination database.

Data-loading optimization for ‘big’ data is like a journey without end. I will keep updating this post as I discover new things. Please feel free to share your comments and suggestions with me!