In the previous posts I have discussed Shared State and State Change:
Let us put all this to good use in this post!
A standard use-case is where we have a large data file (e.g. CSV like this: Price Paid Data) that needs to be crunched where each row is transformed to a new row (there is no aggregation going on). The simplest processing paradigm is to do everything sequentially.
Step 1: Read next row from file
Step 2: Process row
Step 3: Write row to output file
Step 4: Go to 1 if not End Of File
At the end of the process we end up with a file having the same number of rows but different set of columns.]
This is easy to code up (few lines of Python!) but is impossible to scale. So as our file grows in size, longer we have to wait. Reading and writing in sequence will increase linearly with the size of the file. More processing we need to do in each loop longer the wait becomes.
If we have to process the file again and again (to apply different aggregations and transformations) things become more difficult.
If the large data file is ‘read-only’ things become easier to process. We can do parallel reads without any fear. Even if we do sequential reads we can process each row in parallel and do sequential batching in the aggregate (which in this case is a simple row write) – where we don’t write one row but a batch of rows. This can be seen in Figure 1 (top). This is still not the best option. The iteration time will increase as the file size increases.
The real power comes from the use of a distributed file system, which means that we can fully parallelize the pipeline till the aggregation step (where we still batch write rows). This can be done by breaking file into blocks or chunks so we can iterate in parallel.
The chunking of the file is still a sequential step but it needs to be done only once as the file is loaded into the File System. Then we can perform any operation.
I have attempted (as a means of learning Rust) writes a single iterator based program with parallel processing of the transformation function and batched writer.
There is a pitfall here, we can’t keep adding handlers as the overhead of parallelization will start to creep up. At the end of the day the writer is a single threaded processor so it cannot be scaled up and can only handle a given set of handlers at a time (depending on the production rate of the handler).
There is a way of parallelizing the writer as well. We can produce a chunked file as output. This allows us to run parallel iterate-transform-write pipelines on each chunk. This can be shown in Figure 2.
The file for the Rust program is below.