
Performance Gains in the Enterprise via Parallel I/O

December 17, 2010

In recent years, chip manufacturers have stated that CPU speeds cannot continue to increase. Instead, hardware manufacturers have begun to create chips with more cores, making them available on lower-priced hardware. The responsibility now falls to the software industry to take advantage of these multiple-core processors.

There is currently a plethora of parallelization frameworks that take advantage of multi-core machines. Most examples, however, focus on scientific or computationally intensive operations such as finding prime numbers, ray tracing, or matrix operations. Although these are great ways of using the extra horsepower that multiple cores provide, they do not directly help the average enterprise application.

Unfortunately, enterprise development has traditionally not required a multi-threaded development focus for the majority of applications. Consequently, development teams in most enterprises are not well versed in multi-threaded programming. Therefore, just introducing the requirement for parallel I/O via multi-threaded programming with no automated guidance in the form of frameworks and tools is a recipe for disaster.

In this whitepaper, we look at how developers can take advantage of multi-threaded code to reduce the response time of an expense application by 70% by simply applying parallel I/O concepts without changing the current code structure or requiring new training for the development team. Through the use of a parallel I/O framework, development teams can create multi-threaded applications without having to deal with the complex programming of multi-threaded code.

Understanding the Service Pattern

In the life of an enterprise, it is not uncommon to acquire, partner with, or simply use services from different sources to add to the value chain by offering more services on existing data. In most cases, these new, value-added services depend on a number of data sources or other services that are distributed across a diverse geography and run on different technology stacks.

Figure 1 depicts an enterprise service pattern that can be commonly found across many different verticals.

Figure 1 High Level Overview of System

We can roughly categorize the responsibilities of such services into four parts:

  1. Define the data to be used from disparate systems
  2. Orchestrate the data aggregation process
  3. Run business logic on the aggregated data
  4. Format and return to the caller

Fortunately, or as some might point out unfortunately, step 3 (running of business logic on aggregated data) is not the most time-consuming operation in this list of tasks. Instead, the majority of the time is spent waiting for other services to return the data required for the business logic execution stage.

Sometimes, the result returned from one dependent service call in a multi-service call scenario is enough to stop the processing of the entire request. In these cases, we ideally would like to catch the situation early on and return to the caller immediately in order not to waste any computing resources. In essence, what needs to be parallelized in most typical scenarios is not the business logic execution itself but the data retrieval from different sources. In order to achieve this, we will be looking at parallel I/O.

Although the idea for parallel I/O is not a new one, it usually involves dealing with multi-threaded programming and therefore is not widely used in the industry.

Multi-threaded/parallel programming is a paradigm shift for most developers, and a complex one at that. Creating robust, scalable parallel applications with a set of developers not familiar with multi-threaded programming is an additional challenge that most development projects should avoid if possible. What is needed is a solution that takes advantage of overlapped I/O and that can be easily implemented by developers not familiar with the complexities of parallel programming.

How Can We Enable Developers to Use Parallel I/O?

When working on a project that will be maintained by another team, it is imperative to take into consideration what the team is familiar with and what will make their lives easier from a maintenance point of view. In our experience, to help a development team that had little experience with parallel programming, we decided what was needed was a framework that allowed the programmer to work in familiar ways. The following is a list of goals that became our guiding principles in creating a parallelization framework:

  1. The current application should be able to use the framework without structural changes to the code base.
  2. The library should not leak threading or parallelization concepts to the developer.
  3. The following types of functionality should be supported at the very minimum:
       1. Defining a data-retrieval task to be executed in parallel I/O
       2. Starting the task
       3. Collecting the result
       4. Canceling a task that has not been run
       5. Preserving the order across tasks for data dependency situations

There are many different solutions available on different platforms that achieve some/all of the required goals above. Instead of simply jumping into an existing framework, it would be more beneficial to understand the building blocks of such a framework and how they interact with each other. In the next section, we explore just that.

Building Blocks of Parallel I/O

There are several very important building blocks when enabling parallel I/O at a high level. They can be categorized as follows:

  1. Tasks are the logical groupings of steps that you want to execute in a parallel manner. For our purposes, we are going to use only I/O-related code in our tasks, although we could also intertwine some business-logic code to provide pre-processing steps such as decryption, translation, etc. in these tasks as well.
  2. Futures hold the return values of tasks. They also represent a value that will be returned from the task at a later point in time. If the task has not completed, the future does not contain the actual return value of the service but merely gives a reference to it such that when the data is available, it will be used.
  3. Parallel Execution Unit is the actual mechanism that runs the I/O tasks in parallel and returns the results.
  4. Synchronization Objects help ensure that the tasks are being ordered the way they should be and that the futures are returned to the caller as they are available.
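
As a rough illustration, the building blocks above map onto Python's standard `concurrent.futures` module roughly as follows. Python is used here only as a convenient notation; the function name `fetch_expenses`, the system names, and the simulated delay are assumptions made for this sketch and are not part of the library this paper describes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_expenses(system_name):
    """Task: a hypothetical I/O-bound call (the sleep stands in for a remote service)."""
    time.sleep(0.05)
    return f"expenses from {system_name}"


# Parallel execution unit: a pool of worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    # Futures: submit() returns immediately, before the task has finished.
    futures = [pool.submit(fetch_expenses, name)
               for name in ("travel", "card", "payroll")]

    # Synchronization: result() blocks until each value is available, and
    # iterating in submission order preserves the order of the tasks.
    results = [future.result() for future in futures]
```

The three calls overlap, so the total wait is close to the slowest single call rather than the sum of all three.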

Figure 2 below gives an overview of different components in the library.

Figure 2 Overview of different components in the library

a. How the Building Blocks Work Together

i. Translation of Tasks into Execution Units

From the parallel I/O Library point of view, each task is the smallest unit of execution. Therefore, the very first thing the library needs to do is to translate the tasks into the target platform’s smallest execution units. Depending on the platform and language you work with, it might be light-weight threads, threads, processes or events. Again, depending on the number of physical resources on the machine, these units of execution might be executed by a smaller set of physical execution units. Re-using some threads/processes might be necessary to avoid thrashing.

Once the units of execution are defined, the next thing to do is to make sure they are run in the specified order. Using thread-safe queues is a good practice to enable the correct ordering of tasks.
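
One possible shape for this step, sketched in Python under stated assumptions: a fixed number of reusable worker threads pull tasks from a thread-safe FIFO queue. The helper name `run_tasks` and the default worker count are hypothetical, and a real library would also need the error handling and cancellation discussed in the following sections.

```python
import queue
import threading


def run_tasks(tasks, worker_count=2):
    """Run zero-argument callables on a fixed set of reusable worker threads.

    Tasks are taken from a thread-safe FIFO queue, so they are started in
    the order given; each result is stored at its task's position.
    """
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))

    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker thread exits
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the number of threads is fixed, submitting many tasks does not create many threads; the same few workers are reused until the queue is empty.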

ii. Returning the Task Results (Futures) to the Caller

Tasks will internally implement the mechanism to return the futures (values). Futures by default need to be returned as soon as the job starts, and at this point the results of the tasks are not ready. Thus, it is necessary to implement a mechanism that will block the caller when the caller explicitly requests the future’s value while the task is still running, and will return the result as soon as the task completes.

It is also advisable to return the exceptions to the callers when they are ready to handle them, instead of while other calls are going on; this simplifies the usage of the library. In order to do this, tasks internally need to capture any exceptions and re-throw them when the future’s result is explicitly requested by the caller.

To simplify the collection of results from many tasks, it may be desirable to return the results as an enumeration through which the caller can iterate to collect them in the order expected.
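
The behavior described in this section can be sketched with a small hand-written future. This is a minimal illustration in Python, assuming one thread per task for simplicity; the names `SimpleFuture`, `start`, and `results_in_order` are hypothetical and do not come from the library described here.

```python
import threading


class SimpleFuture:
    """Hypothetical minimal future: holds a task's value or its exception."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None
        self._error = None

    def _run(self, task):
        try:
            self._value = task()
        except Exception as exc:
            self._error = exc  # capture now, re-raise when the caller asks
        finally:
            self._done.set()

    def result(self):
        self._done.wait()  # blocks only while the task is still running
        if self._error is not None:
            raise self._error
        return self._value


def start(task):
    """Start the task on its own thread and return its future immediately."""
    future = SimpleFuture()
    threading.Thread(target=future._run, args=(task,)).start()
    return future


def results_in_order(futures):
    """Yield results in the order the futures were given."""
    for future in futures:
        yield future.result()
```

A failing task does not interrupt the caller while other calls are in flight; the exception surfaces only when `result()` is called on that task's future.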

iii. Checkpoints and Cancellations

Sometimes, if one of N tasks fails, there might be no reason to execute the rest of the tasks even if they are already defined and started. In a scenario like this, what are needed are checkpoints and cancellations. Checkpoints are explicit points at which the current task can be cancelled. These can come in two flavors:

  • System Defined Checkpoints: this is mainly right before the task is about to be executed.
  • Developer Defined Checkpoints: explicit points in the code at which the developer indicates a cancellation is possible.

Once the checkpoints are defined, a signaling mechanism is needed so that the caller can set a cancellation flag. Each checkpoint observes the value of this flag and cancels the operation if the caller has requested it.
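
A minimal Python sketch of the two checkpoint flavors, assuming a shared flag built on `threading.Event`. The names `CancellationFlag`, `TaskCancelled`, `run_task`, and `fetch_and_map` are hypothetical and chosen only for illustration.

```python
import threading


class TaskCancelled(Exception):
    """Raised at a checkpoint when cancellation has been requested."""


class CancellationFlag:
    """Hypothetical flag shared by the caller and the tasks."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        # Called by the caller, for example after another task has failed.
        self._event.set()

    def checkpoint(self):
        # Observes the flag and stops the task if cancellation was requested.
        if self._event.is_set():
            raise TaskCancelled()


def run_task(task, flag):
    flag.checkpoint()      # system-defined checkpoint: right before execution
    return task(flag)


def fetch_and_map(flag):
    raw = "raw response"   # stands in for the remote call
    flag.checkpoint()      # developer-defined checkpoint: before the mapping step
    return raw.upper()
```

A task that has not started yet is stopped by the system-defined checkpoint; a task that is already running stops at its next developer-defined checkpoint.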

b. How Developers Consume the Overlapped I/O Library

Now that we know what the building blocks are, let’s take a look at how a developer uses them to make parallel I/O work.

  1. First, the developer registers the I/O tasks she wants to parallelize.
  2. She indicates the order in which she wants to retrieve the results.
  3. She starts the tasks.
  4. The system collects and processes the results.
  5. The system handles exceptions and returns the final result to the caller.
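
The steps above could be exposed through a small facade such as the following Python sketch. The class `ParallelIO` and its methods `register`, `order`, `start`, and `results` are hypothetical names invented for this illustration; `start` is assumed to be called before `results`.

```python
from concurrent.futures import ThreadPoolExecutor


class ParallelIO:
    """Hypothetical facade; the caller never touches threads or futures."""

    def __init__(self):
        self._tasks = {}     # name -> zero-argument callable
        self._order = []     # names, in the order results are wanted
        self._futures = {}
        self._pool = None

    def register(self, name, task):
        self._tasks[name] = task
        self._order.append(name)   # default: registration order

    def order(self, *names):
        self._order = list(names)

    def start(self):
        self._pool = ThreadPoolExecutor(max_workers=max(1, len(self._tasks)))
        self._futures = {name: self._pool.submit(task)
                         for name, task in self._tasks.items()}

    def results(self):
        """Return (name, value) pairs in the requested order.

        An exception raised inside a task surfaces here, when its
        result is collected.
        """
        with self._pool:   # shuts the pool down once collection is finished
            return [(name, self._futures[name].result())
                    for name in self._order]


io = ParallelIO()
io.register("travel", lambda: ["flight"])   # step 1: register tasks
io.register("card", lambda: ["dinner"])
io.order("card", "travel")                  # step 2: choose result order
io.start()                                  # step 3: start
collected = io.results()                    # steps 4 and 5: collect
```

The calling code reads like ordinary sequential code, which is the point of the second design goal listed earlier.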

Figure 3 Sequence of events for assisted parallel I/O

Performance Gain

The biggest reason to use a parallel I/O library is to increase the performance of your enterprise application. To demonstrate the benefits, let’s look at a fairly common scenario and see how parallel I/O improves the performance of the system.

A service is being developed that processes expense reports filed by a user every month. The system gets the data from three different systems and processes the results. In order to retrieve the data from the different systems, ID translation is done, services are called, and the responses are mapped to an internal XML document structure that the service can process.

Since the data lives on different servers, the response times of the services that return the data range from 3 seconds to 7 seconds. Internal processing of each report takes 2 seconds because each item is checked against the allowance categories and approved.

Currently, the code collects and processes the data sequentially. The current implementation produces the following run-time.

Although this is a non-critical task, a department that employs 5,000 people needs to make sure that it can accommodate expense report processing within 24 hours for all employees. The department does not want to incur extra initial cost and maintenance overhead by purchasing more servers that will only be useful for this task one day every month. If a parallel library were used, the developers would be able to write code that consumes the data in a parallel fashion without changing their programming habits. If such a model were used, the following run-time would be obtained:

In this scenario, we have saved about 15 seconds (60%) of the original 25-second response time without adding significant complexity to the code base and without changing the programming model.
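
To relate these figures to the 24-hour requirement, the arithmetic can be worked through as follows. The sketch assumes that each of the 5,000 employees files one report and that reports are processed one after another on a single server; neither assumption is stated in the scenario, so the resulting hours are indicative only.

```python
EMPLOYEES = 5_000          # from the scenario
SEQUENTIAL_SECONDS = 25    # response time per report with sequential I/O
SAVED_SECONDS = 15         # time saved per report with parallel I/O
WINDOW_HOURS = 24          # processing window required by the department

parallel_seconds = SEQUENTIAL_SECONDS - SAVED_SECONDS   # 10 seconds
saving = SAVED_SECONDS / SEQUENTIAL_SECONDS             # 0.6, i.e. 60%


def batch_hours(seconds_per_report):
    # Assumption: one report per employee, processed one at a time.
    return EMPLOYEES * seconds_per_report / 3600


sequential_hours = batch_hours(SEQUENTIAL_SECONDS)   # about 34.7 hours
parallel_hours = batch_hours(parallel_seconds)       # about 13.9 hours
```

Under these assumptions, the sequential version would miss the 24-hour window on a single server, while the parallel version would fit within it.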

Sample Code Using the Parallel I/O Library

It is imperative to note that the programming model to which the developers are accustomed should be maintained as much as possible to ease adoption. In the sequential model, the following code would be used.
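
A minimal sketch of how the two models might compare, written in Python with the standard `concurrent.futures` module standing in for the library described here. The system names, the `fetch` and `process` helpers, and the scaled-down latencies (hundredths of a second rather than seconds) are assumptions made for illustration; they are not the paper's own listings or measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical latencies, scaled down from seconds to hundredths of a second.
LATENCY = {"travel": 0.03, "card": 0.05, "payroll": 0.07}


def fetch(system):
    time.sleep(LATENCY[system])   # stands in for the remote service call
    return f"{system} data"


def process(documents):
    return sorted(documents)      # stands in for the business logic


def sequential_report():
    # Each call waits for the previous one: total wait is the sum of latencies.
    documents = [fetch(name) for name in LATENCY]
    return process(documents)


def parallel_report():
    # The calls overlap: total wait is close to the slowest single call.
    with ThreadPoolExecutor(max_workers=len(LATENCY)) as pool:
        documents = list(pool.map(fetch, LATENCY))
    return process(documents)
```

The two functions have the same structure and return the same result; only the line that gathers the documents differs.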

By hiding the aspects of tasks, threads, synchronization and exceptions, the parallel I/O library would enable the developer to write code that runs more than twice as fast. In the next whitepaper, we will look at how we can start creating a library like the one described above.
