Javascript: Playing with Prototypes – II

Let us continue the discussion about Prototypes in Javascript and show the different ways in which inheritance can work. Inheritance is very important because whether you are trying to extend the jQuery framework or trying to add custom event sources in Node.JS, you will need to extend an existing JS object.

Let us remember the most important mantra in JS – “nearly everything interesting is an object, even functions”

Objects are mutable, primitives (e.g. strings) are NOT!

Let us first introduce the example. There is a base object: Person which has two properties ‘id’ and ‘age’ and getter/setter methods for these. We want to create a child object: Student, which should inherit the id and age properties from Person and add its own read-only ‘student id’ property.

[codesyntax lang=”javascript”]

/*
Base object: Person
*/
function Person()
{
  this.id = 0;
  this.age = 0;
}

/*
Add set/get methods for Age and Id
*/
Person.prototype.setId = function(id)
{

  this.id = id;
};

Person.prototype.getId = function()
{
  return this.id;
};

Person.prototype.setAge = function(age)
{

  this.age = age;
};

Person.prototype.getAge = function()
{
  return this.age;
};


/*
Child object Student which should extend properties and methods from Person
*/
function Student(sid)
{
  this.sid = sid;

  /*
  Constructor for Person (to be safe)
  */
  Person.call(this);
  /*
  Student Id getter
  */
  Student.prototype.getSid = function()
  {
    return this.sid;
  }
}

[/codesyntax]

 

There are different ways (patterns) of implementing ‘inheritance’ (referred to below as Inheritance Methods):

  • Pattern 1: Student.prototype = Object.create(Student);
  • Pattern 2: Student.prototype = Object.create(Person.prototype);
  • Pattern 3: Student.prototype = new Person;

Below is the snippet of code we use to probe what happens in each of the three cases. Two instances of Student are created (s1 and s2). Then we examine the prototypes and assign values to some of the properties.

[codesyntax lang=”javascript”]

<Inheritance Method: one of the three options above>

var s1 = new Student(101);
var s2 = new Student(102);

console.log("S1",s1);
console.log("S2",s2);
console.log("Proto S1",Object.getPrototypeOf(s1));
console.log("Proto S2",Object.getPrototypeOf(s2));
if (Object.getPrototypeOf(s1) == Object.getPrototypeOf(s2)) {
  console.log("Compare prototypes:",true);
}

console.log("\n\n");

s1.setId(1);
s1.setAge(30);
console.log("S1",s1.getAge());

s2.setId(2);

console.log("Compare Id S1:S2",s1.getId(),s2.getId());

s2.setAge(20);
console.log("S2 set age 20");

console.log("S1 age",s1.getAge());
console.log("S2 age",s2.getAge());

[/codesyntax]

 

Let us look at what happens in each case:

1) Student.prototype = Object.create(Student);

Output:

[codesyntax lang=”php”]

S1: { sid: 101, id: 0, age: 0 }
S2: { sid: 102, id: 0, age: 0 }
Proto S1: { getSid: [Function] }
Proto S2: { getSid: [Function] }
Compare prototypes: true


/Users/azaharmachwe/node_code/NodeTest/thisTest.js:73
s1.setId(1);
^
TypeError: Object #<Object> has no method 'setId'
at Object.<anonymous> (/Users/azaharmachwe/node_code/NodeTest/thisTest.js:73:4)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:901:3

[/codesyntax]

 

The surprising result is that an exception is thrown: there is no method ‘setId’ on the Student instance. This means that inheritance did not work. We can confirm this by looking at the prototypes of the S1 and S2 instances: only the getter for the student id defined in the Student object is present. We have not inherited any of the methods from Person.

But if we look at the list of attributes we see ‘id’ and ‘age’ present. So it seems the attributes were acquired somehow.

If we look at the way we define the Person object, the ‘id’ and ‘age’ attributes are added to the instance (i.e. we use this.id) whereas the accessor methods are added on the prototype. So even when we create an instance of Student with Student.prototype = Object.create(Student), the call to Person.call(this) in the Student constructor still sets the attributes correctly, because they are defined at the instance level.

If the Person.call(this) line is removed then you will only see the Student-level attribute (‘sid’).

 

2) Student.prototype = Object.create(Person.prototype);

Output:

[codesyntax lang=”php”]

S1: { sid: 101, id: 0, age: 0 }
S2: { sid: 102, id: 0, age: 0 }
Proto S1: { getSid: [Function] }
Proto S2: { getSid: [Function] }
Compare prototypes: true



S1 30
Compare Id S1:S2 1 2
S2 set age 20
S1 age 30
S2 age 20

[/codesyntax]

No errors this time.

So we see both S1 and S2 instances have the correct attributes (Person + Student), the prototypes for both contain the getter defined in Student, and both share the same prototype. More interesting is the fact that we can set ‘age’ and ‘id’ on them as well, showing that the attribute setters/getters have been inherited from Person.

But why can’t we see the get/set methods for ‘age’ and ‘id’ on the Student prototype? The reason is that with the call to Object.create with the Person.prototype parameter we chain the prototype of Person with that of Student. To see the get/set methods for ‘age’ and ‘id’ that the Student instance is using add the following line to the probe commands:

console.log(">>", Student.prototype.__proto__);

This proves that the object is inheriting these methods at the prototype level and not at the object level. This is the recommended pattern for inheritance.
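To make the recommended pattern concrete, here is a condensed, self-contained sketch of pattern (2) (a trimmed version of the Person/Student example above):

```javascript
// Minimal sketch of pattern 2: chain the prototypes, not instances
function Person() {
  this.age = 0;
}
Person.prototype.setAge = function (age) { this.age = age; };

function Student(sid) {
  Person.call(this);   // initialise Person's instance attributes
  this.sid = sid;
}
Student.prototype = Object.create(Person.prototype);   // pattern 2
Student.prototype.constructor = Student;               // repair the constructor link

var s = new Student(101);
s.setAge(30);                                            // inherited via the chain
console.log(s.hasOwnProperty('age'));                    // true: attribute on the instance
console.log(Student.prototype.hasOwnProperty('setAge')); // false: found further up the chain
console.log(s instanceof Person);                        // true
```

Note the extra line restoring the constructor property: replacing the whole prototype object wipes it out, and while nothing in this example depends on it, repairing the link is good hygiene.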

3) Student.prototype = new Person;

This is a method you may see in some examples out there, but it is not the recommended style. The reason is that in this case you are linking the prototype of Student to an instance of Person. Therefore all the instance variables of the super-type end up on the sub-type’s shared prototype.

Output:

[codesyntax lang=”php”]

S1: { sid: 101 }
S2: { sid: 102 }
Proto S1: { id: 0, age: 0, getSid: [Function] }
Proto S2: { id: 0, age: 0, getSid: [Function] }
Compare prototypes: true



S1 30
Compare Id S1:S2 1 2
S2 set age 20
S1 age 30
S2 age 20

[/codesyntax]

Note the presence of the ‘id’ and ‘age’ attributes, with default values, in the prototypes of S1 and S2. If the attributes are of array or object type (instead of a primitive type, as in this case), we can get all kinds of weird, difficult-to-debug behaviour because every instance shares the same mutable value. This is exactly the situation with frameworks where a base object needs to be extended to add custom functionality. I came across this issue while trying to create a custom Node.JS event source.
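To see that ‘weird behaviour’ with a mutable attribute, here is a minimal sketch (using hypothetical Base/Child objects) of what pattern (3) does to an array attribute:

```javascript
// Why pattern 3 bites with mutable attributes: the array lives on the
// shared prototype, so every instance sees every push
function Base() {
  this.tags = [];            // mutable instance attribute
}
function Child() {}          // note: no Base.call(this) here
Child.prototype = new Base;  // pattern 3: prototype is an *instance* of Base

var a = new Child();
var b = new Child();
a.tags.push('from-a');       // resolves to the single array on the prototype
console.log(b.tags);         // [ 'from-a' ] – the change leaks across instances!
```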

Wrong way to extend: A Node.JS example

I have seen many Node.JS custom event emitter examples that use pattern number (3). The correct pattern to use is pattern (2). Let us see why.

The code below extends the Node.JS EventEmitter (from the ‘events’ module) to create a custom event emitter. Then two instances of this custom event emitter are created, with different event handling callback functions defined for each. This allows us to clearly identify which instance handled an event.

In the end we cause the custom event to fire on both the instances.

[codesyntax lang=”javascript”]

var ev = require("events");

/*
Create a custom event emitter by extending the Node.JS event emitter
*/
function myeventemitter(id)
{
  this.id = id;
  ev.EventEmitter.call(this);
}
/*
Try different ways of extending
*/

myeventemitter.prototype = new ev.EventEmitter;

myeventemitter.prototype.fire = function()
{
  console.log('\nFire',this.id);
  this.emit('go',this.id);
}

/*
Initialise two instances of the custom event emitter
*/
var myee1 = new myeventemitter("A");
var myee2 = new myeventemitter("B");

/*
Define callbacks on the custom event ('go')
*/
myee1.on('go',function(id)
{
  console.log("My EE1: Go event received from",id);
});

myee2.on('go',function(id)
{
  console.log("My EE2: Go event received from",id);
});

/*
Cause the custom event to fire on both the custom event emitters
*/
myee1.fire();
myee2.fire();

/*
Dump the prototype of our custom event emitter
*/
console.log(myeventemitter.prototype);

[/codesyntax]

Note we are using pattern (3) to extend the EventEmitter:

myeventemitter.prototype = new ev.EventEmitter;

We expect that custom events fired on instance 1 will result in the event handling function on instance 1 being called. The same thing should happen for instance 2. Let us look at the actual output:

[codesyntax lang=”javascript”]

Fire A
My EE1: Go event received from A
My EE2: Go event received from A

Fire B
My EE1: Go event received from B
My EE2: Go event received from B
{ domain: null,
_events: { go: [ [Function], [Function] ] },
_maxListeners: 10,
fire: [Function] }

[/codesyntax]

This looks wrong! When we cause instance 1 to fire its custom event it actually triggers the event handling functions in both the instances! Same happens when we try with instance 2.

The reason, as you may have already guessed, is that when we use pattern (3) we actually attach the object that holds the individual event handling functions (variable name: _events) to the shared prototype. This can be seen in the above output.

Therefore both instances of the custom event emitter will have the same set of event handling functions registered because there is only one such set.

To correct this, just switch the extension pattern to (2), i.e. myeventemitter.prototype = Object.create(ev.EventEmitter.prototype). The output then becomes:

[codesyntax lang=”javascript”]

Fire A
My EE1: Go event received from A

Fire B
My EE2: Go event received from B

{ fire: [Function] }

[/codesyntax]

The output now looks correct. Only the instance-specific callback function is called, and the prototype no longer stores the event handling functions. Each instance of the custom event emitter therefore has its own store of event handling functions.

Bristol Government: Open Data Initiative

Bristol City Council (BCC) is now publishing some of their data sets online as part of the Open Data initiative.
This is a VERY positive move and I too hope that this leads to the development of ‘new’ solutions to the city’s problems.
More information can be found here: https://opendata.bristol.gov.uk

The Tableau Viz below uses the Quality of Life Indicators data from across Bristol. This is available from the BCC website. The data set has a set of questions (about 540) asked across the different wards in Bristol (about 35) on a yearly basis starting from 2005 till 2013. Obviously data is not available across all the dimensions, for example the question:
“% respondents who travel for shopping by bus” for the Redland ward is available only from 2006-2010.

The raw data from the Open Data website was processed using Apache Spark’s Java Libraries. This was then dumped into a data file which was imported into Tableau.

Link: Dashboard

The heat map below plots the regression slope of the survey results over the years (beta) against the Questions and Wards.
Criteria for beta calculation: minimum three years data should be available.


Heat Map Beta Ques/Ward

Horizontal Web-app Scaling with Nginx and Node.JS

One highly touted advantage of using Node.JS is that it makes applications easy to scale. This is true to an extent especially when it comes to web-apps.

A stateless request-response mechanism lends itself to parallelisation. This is as easy as spinning up another instance of the request handling process on the same or different machine.

Where stateful request-response is required (say, to maintain session information), then to scale up, the ‘state’ must be shared safely across the different instances of the request handling processes. This separates the ‘functional’ aspects of the request handling mechanism from the side-effect related code.

To tie in all the different web-app instances under a single public address and to load-balance across them we need a ‘reverse-proxy’. We will use Nginx for this.

Software needed:

  • Nginx (v 1.7.10)
  • Node.JS (v 0.10.12)

First let us setup the Nginx configuration:

[codesyntax lang=”javascript”]

events {
	worker_connections 768;
}

http {

	upstream localhost {
		server 127.0.0.1:18081;
		server 127.0.0.1:18082;
		server 127.0.0.1:18083;
	}
	server {
		listen 80;
		
		location / {
			proxy_pass http://localhost;
		}
	}
}

[/codesyntax]

 

More info about setting up and running Nginx – http://wiki.nginx.org/CommandLine

This configuration sets up the public address as localhost:80, with three private serving instances on the same machine at ports 18081, 18082 and 18083.

Let us also create a serving process in Node.JS using the Express framework:

[codesyntax lang=”javascript”]

var express = require("express");

var app = express();

var name = process.argv[2];
var PORT = process.argv[3] || 18080;

console.log("Server online: ",name,":",PORT);

app.get("/", function(request,response)
        {
           console.log("You have been served by: ",name,"on",PORT);

           response.write("Served by :"+name+" on "+PORT);
           response.end();
        });

app.listen(PORT);

[/codesyntax]

 

This takes in server name and port as the arguments.

We will spin up three instances of this serving process on the same machine, using the port numbers from the Nginx config.

If we name the above as server.js then the instances can be spun up as:

node server.js <server_name> <port>

*Make sure you use the correct port (as provided in the Nginx config file).

Screen Shot 2015-03-22 at 01.51.15

 

Then just point your browser to localhost:80 and you should see:

Screen Shot 2015-03-22 at 01.56.33

 

Press refresh multiple times and you should see your request being served by different instances of the web-app. Nginx uses ‘round-robin’ load-balancing by default, so you should see each of the instances serving one after the other (almost!), as below.

Screen Shot 2015-03-22 at 01.56.45 Screen Shot 2015-03-22 at 01.58.15

 

Scaling out is as simple as spinning up a new instance and adding its IP and port to the Nginx configuration and reloading it.
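For example, adding a fourth instance on another machine would grow the upstream block like this (the new address is illustrative):

```nginx
upstream localhost {
    server 127.0.0.1:18081;
    server 127.0.0.1:18082;
    server 127.0.0.1:18083;
    server 192.168.0.5:18084;   # newly spun-up instance on another machine
}
```

After editing the config, `nginx -s reload` picks up the change gracefully without dropping in-flight connections.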

 

Understanding the NodeJS EventLoop

The EventLoop is the secret sauce in any NodeJS based app.

It provides the ‘magical’ async behaviour and takes away the extra pain involved in explicit thread based parallelisation. On the flip side you have to account for the resulting single threaded JavaScript engine that processes the callbacks from the EventLoop. If you don’t then the traditional style of writing ‘blocking’ code can and will trip you over!

libuv runs an EventLoop which works through the queue of events and executes the associated JS callback functions (on a single thread).

You can have multiple event sources (Event Emitters in NodeJS land) running in libuv on multiple threads (e.g. doing file I/O and socket I/O at the same time) that put events on the queue. But there is always ONE thread executing JS, so only one of those events can be ‘handled’ at a time (i.e. its associated JS callback function executed).

Keeping this in mind let us look at a few such ‘natural’ errors where the code looks fine to the untrained eye but the expected output is not produced.

1) Wave bye bye to While Loops with Flags!

A common scenario is a while loop controlled by a flag variable. If you wanted to read from the console until the user types ‘exit’, you would write something like this using blocking functions:

[codesyntax lang=”php”]

while (command != 'exit')

    // Do something with the command

    command = reader.nextLine()

end while

[/codesyntax]

It will work because the loop will always be blocked till the nextLine() method executes and gives us a valid value for the command or throws an exception.

If you try to do the same in NodeJS using the async functions you might be tempted to rewrite it as below. First we register a callback function which will trigger when the enter key is hit on the console, receiving the full line typed as a parameter. We promptly put this into the global command variable and finish. After setting up the callback, we start an infinite loop waiting for ‘exit’. In case the command is undefined (null) we just loop again (‘burning rubber’, so to speak).

[codesyntax lang=”php”]

var command = null;

// Register a callback function
reader.on('data', function (data) { command = data; });

while (command != 'exit') {

  if (command != null) {

    // Do something with the command

    command = null;
  }

}

[/codesyntax]

Unfortunately this code will never work. Any guesses what will be the output? If you guessed that it will go into an infinite loop with command always equal to ‘null’ you are correct!

The reason is very simple: JS code in NodeJS is processed by a single thread. In this case that single thread will be kept busy going through the while loop. Thus it will never get a chance to handle the console input event by executing the callback. Thus command will always stay ‘null’.

This can be fixed by removing the while loop.

[codesyntax lang=”php”]

var command = null;

// Register a callback function
reader.on('data', function (data) {

  command = data;

  if (command == 'exit') {
    process.exit();
  }

  /*
  Here we can either parse the command and perform the
  required action

  OR

  we can emit a custom event which all the available command
  processors listen for, but only the target command processor
  responds to
  */

});

[/codesyntax]

 2) Forget the For Loop (at least long running ones)

This next case is a complex one, because it is very hard to figure out whether it’s the for loop that’s to blame. The symptoms may not show up all the time, and they may not even show up in the output of your app. They can also change depending on things like the hardware configuration and the configuration of any database servers your code is interacting with.

Let us take a simple example of inserting a fixed-length array of data items into a database. If the insert function is blocking, the following code will work as expected.

[codesyntax lang=”php”]

for (var i = 0; i < data.length; i++) {
  database.insert(data[i]);
}

[/codesyntax]

In case the insert function is non-blocking (e.g. NodeJS) then we can experience all kinds of weird behaviour depending on the length of the array, such as incomplete insertions, sporadic exceptions and even instances where everything works as expected!

In case of the while loop example, the JS thread is blocked forever so no callbacks are processed. In case of for loops, the JS thread is blocked till the loop finishes running. This means in our example if we are using non-blocking insert the loop will execute rapidly without waiting for the insert to complete. Instead of blocking, the insert operation will generate an event on completion.

This is part of the reason why NodeJS applications can get a lot of work done without resorting to explicit thread management.

If the array is big enough we can end up flooding the receiver leading to buffer overflows along the way and resulting in dropped inserts. In some cases if the array is not that big the system may behave normally.

The question of how big an array can we deal with is also difficult to answer. It changes from case to case, as it depends on the hardware, the configuration of the target database (e.g. buffer sizes) and so on.

The solution involves getting rid of the long-running for loop and using events and callbacks. This throttles the insert rate by making the inserts sequential (i.e. the next insert is triggered only when the previous one has completed).

[codesyntax lang=”php”]

var count = 0;

// Callback function to insert the next data item
function insertOnce() {

  if (count > MAX_COUNT) {
    /*
    Exit by closing any external connections (e.g. database)
    and clearing any timers. Ending the process by force is
    another option but it is not recommended.
    */
    return;
  }

  database.insert(data[count], function () {
    // Called once the current data item has been inserted
    event_listener.emit('inserted');
  });

  count++;
}

// Call insertOnce on the 'inserted' event
event_listener.on('inserted', insertOnce);

// Start the insertion by doing the first insert manually
insertOnce();

[/codesyntax]

 3) Are we done yet?

 Blocking is not always a bad thing. It can be used to track progress because when a function returns you know it has completed its work one way or the other.

One way to achieve this in NodeJS is to use some kind of global counter variable that counts down to zero or up to a fixed value. Another way is to set and clear timers in case you are not able to get a count value. This technique works well when you have to monitor the progress of a single stage of an operation (e.g. inserting data into a database, as in our example above).
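A minimal sketch of the counter technique (the TOTAL value and the simulated inserts are illustrative):

```javascript
// Sketch of counter-based completion tracking for a single stage
var TOTAL = 3;
var completed = 0;
var stageDone = false;

function onInserted() {
  completed++;
  if (completed === TOTAL) {
    stageDone = true;   // safe to kick off the next stage here
  }
}

// simulate three asynchronous inserts completing
setImmediate(onInserted);
setImmediate(onInserted);
setImmediate(onInserted);
```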

But what if we have multiple stages that we want to execute strictly one after the other? For example:

1) Load raw data into database

2) Calculate max/min values

3) Use max/min values to normalise raw data and insert into a new set of tables

There are some disadvantages with this approach:

1) Counters and timers add unwanted bulk to your code

2) Global variables are easy to override accidentally especially when using simple names like ‘count’

3) Your code begins to look like a house with permanent scaffolding around it

Furthermore, once you detect that one stage has finished, how do you proceed to the next stage?

Do you get into callback hell and just start with the next stage there and then, ending up with a single code file with all three stages nested within callbacks (Answer: No!)?

Do you try and break your stages into separate code files and use spawn/exec/fork to execute them (Answer: Yes)?

It is a rather dull answer but it makes sure you don’t have too much scaffolding in any one file.

Javascript: Playing with Prototypes – I

The popularity of Javascript (JS) has skyrocketed ever since it made the jump from the browser to the server-side (thank you Node.JS). Therefore a lot of the server-side work previously done in Java and other ‘core’ languages is now done in JS. This has resulted in a lot of Java developers (like me) taking a keen interest in JS.

Things get really weird when you try and map a ‘traditional’ OO language (like Java) to a ‘prototype’ based OO language like JS. Not to mention functions that are really objects and can be passed as parameters.

That is why I thought I would explore prototypes and functions in this post with some examples.

Some concepts:

1) Every function is an object! Let us see, with an example, the way JS treats functions.

[codesyntax lang=”javascript” lines=”normal”]
function Car(type) {
    this.type = type;
    //New function object is created
    this.getType = function()
    {
        return this.type;
    };
}

//Two new Car objects
var merc = new Car("Merc");
var bmw = new Car("BMW");
/*
 * Functions should be defined once and reused
 * but this proves that the two Car objects
 * have their own instance of the getType function
 */
if(bmw.getType == merc.getType)
{
    console.log(true);
}
else
{
    //Output is false
    console.log(false);
}
[/codesyntax]

The output of the above code is ‘false’ thereby proving the two functions are actually different ‘objects’.

 

2) Every function (as it is also an object) can have properties and methods. By default each function is created with a ‘prototype’ property which points to a special object that holds properties and methods that should be available to instances of the reference type.

What does this really mean? Let us change the previous example to understand what’s happening. Let us play with the prototype object and add a function to it which will be available to all the instances.

[codesyntax lang=”javascript” lines=”normal”]

function Car(type) {
   this.type = type;
}

Car.prototype.getType = function()
{
    return this.type;
}

//Two new Car objects
var merc = new Car("Merc");
var bmw = new Car("BMW");

/*
 * Functions should be defined once and reused
 * This proves that the two Car objects
 * have the same instance of the getType function
 */
if(bmw.getType == merc.getType)
{
    //Output is true
    console.log(true);
}
else
{
    console.log(false);
}

[/codesyntax]

We added the ‘getType’ function to the prototype object for the Car function. This makes it available to all instances of the Car function object. Therefore we can think of the prototype object as the core of a Function object. Methods and properties attached to this core are available to all the instances of the function Object.

This core object (i.e. the prototype) can be manipulated in different ways to support OO behaviour (e.g. Inheritance).

 

3) Methods and properties can be added to either the core (prototype) or the instance. This enables method overriding, as shown in the example below.

[codesyntax lang=”javascript” lines=”normal”]

function Car() {
    
}

//Adding a property and function to the prototype
Car.prototype.type = "BLANK";

Car.prototype.getType = function()
{
    return this.type;
}

//Two new Car objects
var merc = new Car();
var bmw = new Car();

//Adding a property and a function to the INSTANCE (merc)
merc.type = "Merc S-Class";
merc.getType = function()
{
    return "I own a "+this.type;
}

//Output
console.log("Merc Type: ", merc.getType());
console.log("BMW Type: ", bmw.getType());
console.log("Merc Object: ",merc);
console.log("BMW Object: ",bmw);

[/codesyntax]

 

The output:

Merc Type:  I own a Merc S-Class

> This shows that the ‘getType’ on the instance is being called.

BMW Type:  BLANK

> This shows that the ‘getType’ on the prototype is being called.

Merc Object:  { type: ‘Merc S-Class’, getType: [Function] }

> This shows the ‘merc’ object structure in JSON format. We see the property and function on the instance.

BMW Object:  {}

> This shows the ‘bmw’ object structure in JSON format. We see there are no properties or functions attached to the instance.
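The same instance-versus-prototype split can be checked programmatically with hasOwnProperty:

```javascript
// hasOwnProperty distinguishes instance members from prototype members
function Car() {}
Car.prototype.getType = function () { return this.type; };

var merc = new Car();
merc.type = 'Merc S-Class';

console.log(merc.hasOwnProperty('type'));     // true  – lives on the instance
console.log(merc.hasOwnProperty('getType'));  // false – found on the prototype
console.log('getType' in merc);               // true  – the chain still resolves it
```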

Thoughts on Error Handling

Most code has natural boundaries as defined by classes, functions and remote interfaces.

The execution path for a program creates a chain of calls across these boundaries, tears it down as the calls complete and again builds it up as new calls are made.

All is well till one of the calls does not complete successfully. Then an exception is thrown which travels all the way up the chain and somewhere along the line it comes across your code. Or maybe it was a call to your code that does not complete successfully!

What to do when this happens? How to handle the exception?

Do you log it and carry on, stop execution and bomb out, or just carry on pretending nothing is wrong?

There is no single right answer to this question, just a set of good options that you get to pick from:

1) Log a warning message

This option is easy to understand and easier to forget while writing code. It should be combined with all the other options to give better visibility.

The key to effective logging is first choosing the right Logging API and then using it correctly! It is common for software to have too little or too much logging, or to misuse Error Levels, where Level ERROR gives a trickle of messages while Level INFO floods the logs. Level WARN is often bypassed and Level DEBUG often misused for ‘machine-gun’ logging.

For secure systems logging should be done carefully so as to not expose any information in an unencrypted log file (e.g. logging user credentials, database server access settings etc.).

Use Level ERROR for when you cannot continue with normal execution (e.g. required data files are missing or required data is not valid)

Use Level WARN for when you can continue but with limited functionality (e.g. not able to connect to remote services – waiting to retry)

Use Level INFO for when you want to inform the user about interesting events (like successfully established a connection or processed a certain number of records)

Use Level DEBUG for when you want to peek under the hood of the application (like logging properties used to initiate a connection or requests sent/response received – beware this is not very secure if logged to a general-access plain text file)

This option should be used no matter which of the other options is chosen. There is nothing as annoying as an application failing with just an error message and nothing in the logs or seeing an exception flash on the console a second before it closes.

2) Return a constant neutral value

In case of a problem we return a constant neutral value and carry on as if nothing happened. For example if we are supposed to return a Set of objects (either from our code or by calling another method) and we are unable to do that for some reason then you can return a blank Set with no items – this would be a constant Set variable which is returned as a neutral value.

For the code that calls this method, we absorb the exception propagation. The only way the calling code can detect any problems is if it treats the returned ‘neutral’ value as an ‘illegal value’. It can use one of the options presented here or ignore it and carry on.

Best Practice: If you are using a neutral constant return value(s) in case of an error make sure you do two things; log the error internally for your reference and make sure if it is an API method you document the fact. This will make sure anyone who calls your code knows the constant neutral value(s) and can treat them as illegal if required.
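A minimal sketch of this option (the loadItems function and its callback parameter are hypothetical stand-ins for your own code):

```javascript
// Absorb the exception and return a frozen, documented neutral value
var EMPTY_RESULT = Object.freeze([]);

function loadItems(source) {
  try {
    return source();          // normal path
  } catch (e) {
    console.error('loadItems failed:', e.message);   // always log it
    return EMPTY_RESULT;      // constant neutral value
  }
}
```

Freezing the neutral value guards against a caller accidentally mutating the shared constant.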

Another way to use a neutral constant value is to define a max and min range for the return value. In case the actual value is above the max or below the min value then replace it with the relevant constant value (MAX_VALUE or MIN_VALUE).
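A small sketch of the range-based variant (the bounds are illustrative):

```javascript
// Replace out-of-range values with the relevant constant bound
var MIN_VALUE = 0;
var MAX_VALUE = 100;

function clampToRange(value) {
  if (value < MIN_VALUE) return MIN_VALUE;   // below range: neutral minimum
  if (value > MAX_VALUE) return MAX_VALUE;   // above range: neutral maximum
  return value;
}

console.log(clampToRange(-5));   // 0
console.log(clampToRange(250));  // 100
console.log(clampToRange(42));   // 42
```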

3) Substitute previous/next valid piece of data

In case of a problem we return the last known or next available valid value. This is fairly useful at the edge of your system where you are dealing with data streams or large quantities of data where it is required that all calls return valid data and not throw any exceptions or revert to constant values (for example a stream of currency data where one call to the remote service fails). You would want to also provide a neutral constant value as well in case there are issues at the beginning where no valid values are present.

For the calling code this provides no mechanism to detect exceptions down the chain; the called code that implements this behaviour absorbs all exceptions. That is why it is really useful at the edge of your system, when dealing with remote services, databases and files. If you use this technique, make sure you log the fact that you are skipping invalid values until you get a valid one, or that you have not been able to get a new valid value and are re-using the previous one. That will let you detect issues with the remote systems and inform the user (e.g. database login credentials not valid, remote service unavailable, a few data file entries corrupt) while keeping your internal code stable.

Also make sure you document this behaviour properly!
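A sketch of this substitution at the system edge (currentRate and NEUTRAL_RATE are hypothetical names standing in for a remote currency feed):

```javascript
// Re-use the last known valid value; fall back to a neutral constant
// before the first valid value arrives
var NEUTRAL_RATE = 1.0;
var lastValidRate = NEUTRAL_RATE;

function currentRate(fetched) {
  if (typeof fetched === 'number' && !isNaN(fetched)) {
    lastValidRate = fetched;       // remember the new valid value
  } else {
    console.warn('Invalid rate received, re-using', lastValidRate);  // log it!
  }
  return lastValidRate;
}
```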

4) Return an error code response

This is fairly useful when building a remote or packaged API for external consumption especially when indicating internal errors which the user can do little about. Some examples include: an internal service is no longer responding, internal file I/O errors, issues related to memory management on the remote system etc.

Error codes make it easier for users to log trouble tickets with the help-desk.

Once with the help-desk the trouble ticket can then be routed based on the error code (e.g. does O&M Team just need to restart a failed service or is this a memory leak issue which needs to be passed on to the Dev Team).

We should be careful not to return error codes for issues that can be resolved by the user. In those cases a descriptive error message is the way to go.

As an example: assume you have a form which takes in personal details of the user and then uses one or more remote services to process that data.

– For form validations (email addresses, telephone numbers etc.) we should return a proper descriptive error message.

– For issues related to network connectivity (remote service not reachable) we should return a proper descriptive error message.

– For issues related to the remote service which the user cannot do anything about (as described earlier) the error code should be returned with link to the help-desk contact details and perhaps more information (maybe an auto generated trouble ticket id – see next section).
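To make the split concrete, here is a hedged sketch of a result type for that form. The class name, message texts and the code value 5002 are all made up for illustration:

```java
/*
Illustrative sketch of the rule above: user-fixable problems get a
descriptive message and no code, internal problems get an error code
plus a help-desk pointer.
*/
class SignupResult {
    final boolean ok;
    final String userMessage;   // shown to the user
    final Integer errorCode;    // only set for internal failures

    private SignupResult(boolean ok, String msg, Integer code) {
        this.ok = ok; this.userMessage = msg; this.errorCode = code;
    }

    static SignupResult success() { return new SignupResult(true, "OK", null); }

    // User can fix this: descriptive message, no error code
    static SignupResult invalidInput(String field) {
        return new SignupResult(false, "Please check the " + field + " you entered.", null);
    }

    // User cannot do anything about this: error code for the help-desk
    static SignupResult internalError(int code) {
        return new SignupResult(false,
            "Something went wrong on our side. Quote error " + code
            + " when contacting the help-desk.", code);
    }
}
```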

5) Call an error processing routine/service

This is one where we detect an error response and call an error processing routine or service. This is especially useful not just for complex rule-based logging but also for automatic error reporting, trouble ticket creation, service performance management, self-monitoring etc.

It is often useful to have a service that encapsulates error handling logic rather than have your catch block or return value checks peppered with if-else blocks.

In this case the error response or exception is passed on to a service or routine that encapsulates the error processing logic. Some of the things that such a service or routine might do:

– Decide which log file to log the error in

– Decide the level of the error and create self-monitoring events and/or change life-cycle state of the system (restart, soft-shutdown etc.)

– Interface with trouble ticketing systems (e.g. when you get a major exception in Windows 7 OS it offers to send details to Microsoft)

– Interface with performance monitoring systems to report the health of the service
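A minimal sketch of such a routine. The class, the Action enum and the severity rules are my own illustration of the idea, not a real API:

```java
import java.io.IOException;

/*
Illustrative sketch of a central error processor: one place that decides
how an exception is handled and whether the system should change state,
instead of if-else chains in every catch block. The mapping rules here
are made up.
*/
class ErrorProcessor {
    enum Action { LOG_ONLY, RAISE_TICKET, SOFT_SHUTDOWN }

    Action process(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            // Critical internal fault: change life-cycle state
            return Action.SOFT_SHUTDOWN;
        }
        if (t instanceof IOException) {
            // Likely a remote system / file issue: worth a trouble ticket
            return Action.RAISE_TICKET;
        }
        return Action.LOG_ONLY;   // everything else: rule-based logging
    }
}
```

The calling code's catch block then shrinks to a single `processor.process(e)` call, and the routing rules live in one place.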

6) Shutdown (Fail-fast)

This means that the system is shut down or made unavailable as soon as any exception of significance is detected.

This behaviour is often required from critical pieces of software which should not work in a degraded state (so called mission critical software). For example you don’t want the auto-pilot of an A380 to work when it is getting internal errors while performing I/O. You want to kill that instance and switch over to a secondary system or warn the pilot and immediately transfer control to manual.

This is also very important for systems that deal with sensitive data such as online-banking applications (it is better to be unavailable to process online payments than to provide an unreliable service). Users might accept a ‘Site Down’ notice but they will definitely NOT accept incorrect processing of their online payment instructions.

From the example above, because we failed fast and made the banking web-site unavailable we did not allow the impact of the error to spread to the user’s financial transaction.

 

Java: Getting a webpage using java.net

There is an easy way to get a web page as an HTML string. In fact there are two basic ways to do this using the java.net package. Why would we need this?

Some use-cases include:

  • Getting information from a web-page for text aggregation (say from a news site)
  • Creating a ‘page filter’ which pre-processes web-pages to strip out unsafe content

For both the examples we will use the java.net.URL and java.net.URLConnection classes from the java.net package.

 

Type 1: Encapsulated Side-effects

The first example opens up the URL connection and obtains an InputStream from it. A BufferedReader wraps the InputStream from the URL which is then read into a StringBuilder. Then we return the String containing the HTML page source (if all goes well).

The code listing:

 

[Code listing: java.net.1]
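The approach described above can be sketched as follows. The class and method names are mine, and the stream-reading helper is split out so it works with any InputStream:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

/*
Sketch of Type 1: open the URL, wrap its InputStream in a BufferedReader
and collect the lines into a StringBuilder. The network side-effect stays
inside getPageAsString; the caller only ever sees a String.
*/
class PageFetcher {

    static String getPageAsString(String address) throws IOException {
        InputStream in = null;
        try {
            in = new URL(address).openConnection().getInputStream();
            return readAll(in);
        } finally {
            if (in != null) {
                in.close();   // always release the stream
            }
        }
    }

    // Pure stream-to-String helper, usable with any InputStream
    static String readAll(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        StringBuilder page = new StringBuilder();   // avoid '+' in the loop
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n');
        }
        return page.toString();
    }
}
```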

Few points to remember:

– Always use a StringBuilder instead of the ‘+’ concatenation operator in a loop. Strings are immutable, which means every use of ‘+’ creates a new String object; StringBuilder appends into a character array instead.

– Always make sure to have proper exception handling and a finally block which closes the InputStream.

This code encapsulates the ‘side-effect’ of reading from a remote URL and returns the page as String data.

Type 2: Exposed Side-effects

The second example is very simple. It is a sub-set of the code shown in the previous section.

In this second example we simply open the URL connection and obtain an InputStream from it which we return to the calling program. The responsibility of using the InputStream to get the page source is left to the calling function. This is especially useful if you want to work directly with an InputStream instead of a String representation. One such example is when using a parser to parse the page source.

The big disadvantage of this method is that it exposes the side-effect related code to the main application. For example, if the Internet connection or the server goes down while the InputStream is being read, the calling application will encounter an error and may behave unpredictably.

[Code listing: java.net.2]
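A sketch of this second, exposed-side-effect variant (class and method names are mine):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/*
Sketch of Type 2: just open the connection and hand the raw InputStream
back. The caller now owns the stream - it must close it and must cope
with I/O errors that occur mid-read.
*/
class PageStreamFetcher {
    static InputStream getPageAsStream(String address) throws IOException {
        return new URL(address).openConnection().getInputStream();
    }
}
```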

The way to get the best of both worlds (encapsulated side-effects and providing an InputStream to a calling function) is to use Example 1 and return a String object which can then be converted into a ‘byte stream’.
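That conversion is a one-liner with ByteArrayInputStream (the wrapper class is mine, and UTF-8 is an assumed charset; use the page's actual encoding in practice):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/*
Sketch: turn the String returned by the Type 1 approach back into a
byte stream for callers that want an InputStream (e.g. a parser).
*/
class StringToStream {
    static InputStream toStream(String page) {
        return new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
    }
}
```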

Living with ActiveMQ JMS – Nice Features and Weird Errors

Apache ActiveMQ is the bad-boy of all the JMS servers out there. Firstly it is ‘free’. Secondly it is very good. For those who think free does not go with good (like I did) – well, let’s just say life got a bit sweeter for them.

Auto-failover

ActiveMQ provides built-in failover handling. Failover handling is very important for any kind of JMS application. Failover handling means deciding what to do when something bad happens which is outside your control.

For example your application has subscribed to a Topic but the JMS server drops the connection or the network link goes down. Your application is left with a JMS error and a dead connection to deal with.

There are several things you might want to do in that case, such as:

– Try reconnecting after some time and keep on trying till you can reconnect

– Switch over to another JMS server (if present)

We might also need to configure things like how soon to reconnect, which URL to treat as primary (in case of switching).

As ActiveMQ provides this functionality ‘out-of-the-box’ it makes life easier. The way to implement this is very simple as well. We just add the failover settings to the Naming URI in the JMS Connection settings.

A normal JMS Naming URI looks like: tcp://hostname:port (e.g. tcp://localhost:6666)

If we wanted to use the ActiveMQ specific failover we change the JMS Naming URI as:

For a single server auto-reconnect –

failover://(tcp://hostname:port) or failover:(tcp://hostname:port)
*Try both the versions (with and without the //) and see what works in your specific case.

In this case if the connection goes down for some external reason, ActiveMQ will try and reconnect.

For a backup-server auto-switching –

failover:(tcp://primary:61616,tcp://secondary:61616) or 
failover://(tcp://primary:61616,tcp://secondary:61616)
*Again try both the versions (with and without the //) and see what works in your specific case.

In this case if the primary connection goes down then ActiveMQ will try to switch to the secondary. In case both servers are active (e.g. for load-balancing) we may want to choose randomly between the two; in a primary-secondary setup we may want to start with the primary and keep the secondary for failover. We can control this with the ‘randomize’ option:

failover:(tcp://primary:61616,tcp://secondary:61616)?randomize=false

Here randomize=false means the primary URI will be tried first.
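As an illustration, the failover URI can be composed and handed to ActiveMQ's connection factory like this. The helper class is mine; ActiveMQConnectionFactory comes from the ActiveMQ client library and is therefore shown only in a comment:

```java
/*
Sketch: build a failover: URI from broker URLs, optionally pinning the
primary with randomize=false as described above.
*/
class FailoverUri {
    static String build(boolean randomize, String... brokers) {
        StringBuilder uri = new StringBuilder("failover:(");
        for (int i = 0; i < brokers.length; i++) {
            if (i > 0) uri.append(',');
            uri.append(brokers[i]);
        }
        uri.append(')');
        if (!randomize) uri.append("?randomize=false");
        return uri.toString();
    }

    // Typical use (requires the ActiveMQ client jar on the classpath):
    // ConnectionFactory cf = new org.apache.activemq.ActiveMQConnectionFactory(
    //         FailoverUri.build(false, "tcp://primary:61616", "tcp://secondary:61616"));
    // Connection conn = cf.createConnection();  // reconnect/switch handled by the transport
}
```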

Check the URL in the reference section for more configurations.

[Further Reference: http://activemq.apache.org/failover-transport-reference.html]

Weird Errors

javax.jms.JMSException: Failed to build body from bytes. Reason: java.io.StreamCorruptedException: invalid type code: 09

If you see this error in your application when calling the getObject method on an ObjectMessage, it means something has gone wrong when you tried to de-serialize the object at your end.

This error is thrown when the Java IO library does not recognize the type code it finds in the object data.

This is most often caused by mismatched ActiveMQ libraries between the sender (which serializes the object) and receiver (which de-serializes it).

If you have no control over the sending application then it’s time to hold your head in your hands and cry. You just learned a valuable lesson – if you are using JMS to decouple your applications then ALSO use a neutral message format (like XML) instead of serialized objects.

Efficient Data Load using Java with Oracle

Getting your data from its source (where it is generated) to its destination (where it is used) can be very challenging, especially when you have performance and data-size constraints (how is that for a general statement?).

The standard Extract-Transform-Load sequence explains what is involved in any such Source -> Destination data-transfer at a high level.

We have a data-source (a file, a database, a black-box web-service) from which we need to ‘extract’ data, then we need to ‘transform’ it from source format to destination format (filtering, mapping etc.) and finally ‘load’ it into the destination (a file, a database, a black-box web-service).

In many situations, using a commercial third-party data-load tool or a data-loading component integrated with the destination  (e.g. SQL*Loader) is not a viable option. This scenario can be further complicated if the data-load task itself is a big one (say upwards of 500 million records within 24 hrs.).

One example of the above situation is when loading data into a software product using a ‘data loader’ specific to it. Such ‘customized’ data-loaders allow the decoupling of the products’ internal data schema (i.e. the ‘Transform’ and ‘Load’ steps) from the source format (i.e. the ‘Extract’ step).

The source format can then remain fixed (a good thing for the customers/end users) and the internal data schema can be changed down the line (a good thing for product developers/designers), simply by modifying the custom data-loader sitting between the product and the data source.

In this post I will describe some of the issues one can face while designing such a data-loader in Java (1.6 and upwards) for an Oracle (11g R2 and upwards) destination. This is not a comprehensive post on efficient Java or Oracle optimization. This post is based on real-world experience designing and developing such components. I am also going to assume that you have a decent ‘server’ spec’d to run a large Oracle database.

Preparing the Destination 

We prepare the Oracle destination by making sure our database is fully optimized to handle large data-sizes. Below are some of the things that you can do at database creation time:

– Make sure you use BIGFILE table-spaces for large databases. A BIGFILE table-space uses a single large data-file and provides efficient storage for large databases.

– Make sure you have large enough data-files for TEMP and SYSTEM table-space.

– Make sure the constraints, indexes and primary keys are defined properly as these can have a major impact on performance.

For further information on Oracle database optimization at creation time you can use Google (yes! Google is our friend!).

 

Working with Java and Using JDBC

This is the first step to welcoming the data into your system. We need to extract the data from the source using Java, transform it and then use JDBC to inject it into Oracle (using the product-specific schema).

There are two separate interfaces for the Java component here:

1) Between Data Source and the Java Code

2) Between the Java Code and the Data Destination (Oracle)

Between Data Source and Java Code

Let us use a CSV (comma-separated values) format data-file as the data-source. This will add a bit of variety to the example.

Using a ‘BufferedReader’ (java.io) one can easily read gigabyte-size files line by line. This works best if each line of the CSV contains one data row, so that we can read, process and discard each line in turn. Never storing more than a line at a time in memory keeps your application’s memory footprint small.
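A sketch of that read-process-discard loop. The RowHandler callback and the naive comma split are my simplifications; real CSV with quoted fields needs a proper parser:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/*
Sketch: one line is held in memory at a time, split into fields and
handed to a row handler, then discarded. Wrap a FileReader in the
BufferedReader to stream a large file; a StringReader works for tests.
*/
class CsvStreamLoader {
    interface RowHandler { void onRow(String[] fields); }

    static int load(Reader source, RowHandler handler) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        int rows = 0;
        while ((line = reader.readLine()) != null) {   // one data row per line
            handler.onRow(line.split(","));            // process...
            rows++;                                    // ...then discard the line
        }
        return rows;
    }
}
```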

Between the Java Code and the Destination

The second interface is where things get really interesting: making Java work efficiently with Oracle via JDBC. The one feature you cannot do without when inserting data is batched prepared statements. Using Prepared Statements (PS) without batching is like taking two steps forward and ten steps back – in fact it can be worse than using normal statements. Therefore always use PSs, batch them together and execute them as a batch (using the executeBatch method).

A point about the Oracle JDBC drivers: make sure the batch size is reasonable (i.e. less than 10K). With certain versions of the Oracle JDBC driver, a very large batched insert can fail silently while you are left feeling pleased that you just loaded a large chunk of data in a flash. You will discover the problem only if you check the row count in the database after the load.

If the data-load involves sequential updates (i.e. a mix of inserts, updates and deletes) then batching can still be used without destroying data integrity. Create separate batches for the insert, update and delete prepared statements and execute them in the following order:

  1. Insert batches
  2. Update batches
  3. Delete batches
One drawback of using batches is that if a statement in the batch fails, the whole batch fails, which makes it hard to detect exactly which statement failed (another reason to use small batch sizes of ~100).
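A hedged sketch of batched inserts with a small batch size. The table, columns and BATCH_SIZE value are illustrative, and conn is assumed to be an open JDBC connection with auto-commit disabled:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/*
Sketch: add rows to a prepared statement batch and flush with
executeBatch every BATCH_SIZE rows, then flush the remainder.
*/
class BatchInsert {
    static final int BATCH_SIZE = 100;   // small batches ease failure diagnosis

    static void insertRows(Connection conn, String[][] rows) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO customer (id, name) VALUES (?, ?)");
        try {
            int pending = 0;
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {   // flush a full batch
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();               // flush the remainder
            }
            conn.commit();
        } finally {
            ps.close();
        }
    }

    // Number of executeBatch calls for a given row count (pure helper)
    static int flushCount(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }
}
```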
Constraints and Primary Keys
The Constraints and Primary Keys (CoPs) on the database act as the gatekeepers at the destination. The data-load program is like a truck driving into the destination with a lot of cargo (data).
CoPs can either be disabled while the data-load is carried out or they can remain on. In case we disabled them during the data-load, when re-enabling them we can have Oracle check the existing data against them, or ignore the existing data and only enforce them for new operations.
Whether CoPs are enabled or disabled, and whether post-load validation of existing data is carried out, can have a major effect on the total data-load time. We have three main options when it comes to CoPs and data loading:
  1. Obviously the quickest option, in terms of our data-load, is to drive the truck through the gates (CoPs disabled) and dump the cargo (data) at the destination, without stopping for a check at the gate or after unloading (CoPs enabled for future changes but existing data not validated). This is only possible if the contract with the data-source provider puts the full responsibility for data accuracy with the source.
  2. The slowest option will be if the truck is stopped at the gates (CoPs enabled), unloaded and each cargo item examined by the gatekeepers (all the inserts checked for CoPs violations) before being allowed inside the destination.
  3. A compromise between the two (i.e. the middle path) would be to allow the truck to drive into the destination (CoPs disabled), unload it and check the cargo as it is transferred to the destination (CoPs enabled after the load and existing data validated).
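In Oracle these options map onto ALTER TABLE ... DISABLE / ENABLE NOVALIDATE / ENABLE VALIDATE CONSTRAINT. A sketch of issuing them over JDBC (table and constraint names are made up):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/*
Sketch of the gatekeeper options as Oracle DDL. ENABLE NOVALIDATE
enforces the constraint for new changes only (option 1); ENABLE
VALIDATE also checks the rows already loaded (option 3).
*/
class ConstraintToggle {

    // Build the ALTER TABLE statement for a given action
    static String ddl(String table, String action, String constraint) {
        return "ALTER TABLE " + table + " " + action + " CONSTRAINT " + constraint;
    }

    static void run(Connection conn, String ddl) throws SQLException {
        Statement st = conn.createStatement();
        try {
            st.execute(ddl);
        } finally {
            st.close();
        }
    }

    /* Typical sequence around a trusted bulk load (option 1). */
    static void trustedLoad(Connection conn, Runnable load) throws SQLException {
        run(conn, ddl("customer", "DISABLE", "customer_pk"));
        load.run();   // the bulk insert happens here
        run(conn, ddl("customer", "ENABLE NOVALIDATE", "customer_pk"));
    }
}
```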
The option chosen depends on the specific problem and the various data-integrity requirements. It might be easier to do the data file validation ‘in memory’ before an expensive data-load process is carried out and then we can use the first option.
Indexes
We need indexes and primary keys for performing updates and deletes (try a large table update or delete with and without indexes – then thank God for indexes!).
If your data load consists of only inserts and you are loading data into an empty or nearly empty table (w.r.t. amount of data being loaded), it might be a good idea to drop any indexes on it before starting the load.
This is because as data is inserted into a table, any indexes on it are updated as well, which takes additional time. If the table already contains a lot of data compared to the size of the new data being loaded, then the time saved by dropping indexes will be lost when rebuilding them.
After the data is loaded we need to rebuild any dropped indexes and re-enable CoPs. Be warned that re-building indexes and re-enabling CoPs can be very time consuming and can take a lot of SYSTEM and TEMP space.
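The drop-then-rebuild step can be sketched as DDL builders. Index, table and column names are illustrative, and the statements would be executed with a JDBC Statement as usual; the optional NOLOGGING clause on the rebuild relates to the redo-logging trade-off discussed in the next section:

```java
/*
Sketch of the drop-then-rebuild approach for an insert-only load into a
near-empty table: drop the index before the load, recreate it after.
*/
class IndexRebuild {
    // DDL to run before the load
    static String dropDdl(String index) {
        return "DROP INDEX " + index;
    }

    // DDL to run after the load; NOLOGGING skips redo generation for the build
    static String rebuildDdl(String index, String table, String column, boolean nologging) {
        String ddl = "CREATE INDEX " + index + " ON " + table + " (" + column + ")";
        return nologging ? ddl + " NOLOGGING" : ddl;
    }
}
```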
Logging
Oracle, being a ‘safe’ database, maintains a ‘redo’ log so that in case of a database failure we can perform recovery and return the database to its original state. This logging can be disabled by using the nologging option which can lead to a significant performance boost in case of inserts and index creations.
A major drawback of using nologging is that you lose the ability to ‘repeat’ any operations performed while this option is set. When using this option it is very important to take a database backup before and after the load process.
Nologging is something that should be used judiciously and with a lot of planning to handle any side-effects.
Miscellaneous 
There are several other exotic techniques for improving large data loads on the Oracle side, such as partitioned tables. But these require more than ‘basic’ changes to the destination database.
Data-loading optimization for ‘big’ data is like a journey without end. I will keep updating this post as I discover new things. Please feel free to share your comments and suggestions with me!