Java: Getting a webpage using

There is an easy way to get a web page as an HTML string. In fact, there are two basic ways to do this using only the standard library. Why would we need this?

Some use-cases include:

  • Getting information from a web-page for text aggregation (say from a news site)
  • Creating a ‘page filter’ which pre-processes web-pages to strip out unsafe content

Both examples use the basic URL and connection classes from the standard library.


Type 1: Encapsulated Side-effects

The first example opens the URL connection and obtains an InputStream from it. A BufferedReader wraps the InputStream (through an InputStreamReader), and the content is read line by line into a StringBuilder. Finally, we return the String containing the HTML page source (if all goes well).

The code listing:
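A minimal sketch of this approach is shown here. The class name `PageFetcher`, the method name `getPage`, the choice to return null on failure, and the UTF-8 charset are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Hypothetical helper: returns the page source, or null if anything goes wrong.
    static String getPage(String address) {
        InputStream in = null;
        try {
            // Open the URL connection and obtain the InputStream.
            in = new URL(address).openConnection().getInputStream();

            // Assumption: the page is UTF-8 encoded.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));

            // Collect the page line by line in a StringBuilder.
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } catch (IOException e) {
            // A malformed URL, network failure, or server error all end up here.
            return null;
        } finally {
            // Always close the stream, even when reading failed part-way.
            if (in != null) {
                try {
                    in.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if close() itself fails.
                }
            }
        }
    }
}
```

A caller would simply write something like `String html = PageFetcher.getPage(address);` and check the result for null before using it.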

A few points to remember:

– Use a StringBuilder instead of the ‘+’ concatenation operator when building the page in a loop. Strings are immutable, so every ‘+’ creates a new String object, whereas a StringBuilder appends to an internal, resizable buffer.

– Always make sure to have proper exception handling and a finally block which closes the InputStream (a try-with-resources statement, available since Java 7, achieves the same thing).

This code encapsulates the ‘side-effect’ of reading from a remote URL and returns plain String data.

Type 2: Exposed Side-effects

The second example is very simple. It is a sub-set of the code shown in the previous section.

In this second example we simply open the URL connection and obtain an InputStream from it which we return to the calling program. The responsibility of using the InputStream to get the page source is left to the calling function. This is especially useful if you want to work directly with an InputStream instead of a String representation. One such example is when using a parser to parse the page source.
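A minimal sketch of this second approach could look as follows. The class name `PageStream` and the method name `openPage` are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class PageStream {

    // Hypothetical helper: opens the connection and hands the raw stream to the caller.
    // The caller is responsible for reading the stream, closing it,
    // and handling any IOException raised along the way.
    static InputStream openPage(String address) throws IOException {
        return new URL(address).openConnection().getInputStream();
    }
}
```

The calling code would typically wrap the call in a try-with-resources statement, for example `try (InputStream in = PageStream.openPage(address)) { ... }`, so that the stream is closed even if parsing fails.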

The big disadvantage of this method is that it exposes the side-effect related code to the main application. For example, if the Internet connection or the server goes down while the InputStream is being read, the calling application receives an IOException that it must handle itself; otherwise it may behave unpredictably.

The way to get the best of both worlds (encapsulated side-effects and providing an InputStream to a calling function) is to use the Type 1 approach and return a String object, which can then be converted into a ‘byte stream’.
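The conversion from a String to a byte stream can be sketched with a ByteArrayInputStream. The class name `PageBytes`, the method name `toStream`, and the UTF-8 encoding are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PageBytes {

    // Hypothetical helper: wraps an already-downloaded page in an in-memory stream.
    // Reading from this stream never touches the network, so it cannot fail
    // because of a dropped connection.
    static InputStream toStream(String page) {
        // Assumption: UTF-8 is an acceptable encoding for the consumer of the stream.
        return new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
    }
}
```

The trade-off is that the whole page is held in memory before the parser sees any of it, which is usually acceptable for ordinary web pages.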
