There is an easy way to get a web page as an HTML string. In fact, there are two basic ways to do this using the java.net package. Why would we need this?
Some use-cases include:
- Getting information from a web-page for text aggregation (say from a news site)
- Creating a ‘page filter’ which pre-processes web-pages to strip out unsafe content
For both examples we will use the java.net.URL and java.net.URLConnection classes from the java.net package.
Type 1: Encapsulated Side-effects
The first example opens the URL connection and obtains an InputStream from it. A BufferedReader wraps the InputStream (by way of an InputStreamReader), and the content is read line by line into a StringBuilder. Then we return the String containing the HTML page source (if all goes well).
The code listing:
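A minimal sketch of the steps described above. The class name PageFetcher, the method name getPageSource, the choice to return null on failure, and the UTF-8 charset are all assumptions made for illustration, not details from the original listing:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

final class PageFetcher {

    // Hypothetical helper: returns the page source as a String,
    // or null if the page could not be read (an assumed convention).
    public static String getPageSource(URL url) {
        StringBuilder source = new StringBuilder();
        BufferedReader reader = null;
        try {
            URLConnection connection = url.openConnection();
            InputStream in = connection.getInputStream();
            // Assumption: the page is UTF-8 encoded.
            reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                source.append(line).append('\n');
            }
            return source.toString();
        } catch (IOException e) {
            // The side-effect failure stays inside this method.
            return null;
        } finally {
            // Closing the reader also closes the underlying InputStream.
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if close() fails.
                }
            }
        }
    }
}
```

A caller would pass in a java.net.URL for the page it wants and check the result for null before using it.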
A few points to remember:
- Use a StringBuilder instead of the ‘+’ concatenation operator when building the result in a loop. Strings are immutable, so every ‘+’ creates a new String object, whereas a StringBuilder appends to a single resizable internal buffer.
- Always have proper exception handling, and close the InputStream in a finally block (or with try-with-resources).
This code encapsulates the ‘side-effect’ of reading from a remote URL and returns the data as a String.
Type 2: Exposed Side-effects
The second example is very simple. It is a subset of the code shown in the previous section.
In this second example we simply open the URL connection and obtain an InputStream from it, which we return to the calling program. The responsibility of using the InputStream to get the page source is left to the calling function. This is especially useful if you want to work directly with an InputStream instead of a String representation, for example when feeding the page source to a parser.
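The second approach might look roughly like the sketch below. The names PageStreamProvider and openPageStream are hypothetical, chosen only for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

final class PageStreamProvider {

    // Hypothetical helper: opens the connection and hands the raw stream
    // to the caller. The caller must read it, close it, and deal with any
    // IOException thrown while reading.
    public static InputStream openPageStream(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        return connection.getInputStream();
    }
}
```

Because the stream is returned open, the caller would normally wrap the call in try-with-resources so the stream is closed even if reading fails.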
The big disadvantage of this method is that it exposes the side-effect-related code to the main application. For example, if the Internet connection or the server goes down while the InputStream is being read, the calling application will encounter an error that it has to handle itself; if it does not, it may behave unpredictably.
The way to get the best of both worlds (encapsulated side-effects and providing an InputStream to a calling function) is to use the first approach (Type 1) and return a String, which can then be converted into a byte stream, for instance by wrapping its bytes in a java.io.ByteArrayInputStream.
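As a rough illustration of that conversion, the sketch below wraps an already-fetched page source in an in-memory stream. The class and method names are hypothetical, and UTF-8 is an assumed encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PageStreams {

    // Hypothetical helper: turns a page source String into an InputStream
    // backed by memory, so reading it cannot fail because of the network.
    static InputStream toInputStream(String pageSource) {
        byte[] bytes = pageSource.getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(bytes);
    }
}
```

The trade-off is that the whole page is held in memory before the caller sees any of it, which is usually acceptable for ordinary HTML pages.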