We have been asked multiple times about the difference between CuLab - GPU Toolkit for LabVIEW and the NI GPU Analysis Toolkit. Both are intended to accelerate computations on the GPU in LabVIEW, but they are quite different. To demonstrate that difference, we took an example from the NI GPU Analysis Toolkit, implemented the same functionality with CuLab, and compared the two in terms of simplicity, performance, and functionality.
Note: the NI GPU Analysis Toolkit does not have a shorter name, so we will call it "lvgpu" here, as this name identifies the toolkit's folder in "vi.lib" and/or "examples".
Heat Equation Solver Example
As a reference we took the "Heat Equation Solver" example that ships with the lvgpu installer. The example can be found in NI Example Finder, as shown below.
Out of the box, this example shows ~500us execution times (snapshot below).
There are many ways to improve the performance, which is a topic for another blog post. For this post, we made some modifications to the original code in order to measure real GPU performance and make a fair comparison.
The modifications are as follows:
Separate the display part from the calculation loop to avoid its influence on the performance measurement.
Add graphs to display execution times.
Clean up the block diagram for presentation here.
The final version of the block diagram can be seen below.
The implementation of the same functionality based on CuLab is shown in the next snapshot.
Simplicity
When comparing the two implementations, the CuLab version is significantly simpler and more concise.
Three main sections can be identified in the code: initialization, execution, and resource freeing. Let's go through these sections and see how they compare in terms of simplicity.
Initialization
As can be seen from the snapshots below, the initialization section in the CuLab version is much cleaner and more in line with how it would be done in native LabVIEW.
Execution
Let's zoom into these sections for both implementations.
The CuLab version clearly requires fewer wires and is easier to understand. Notice that we used the axpy (y = a*x + y) function to fuse the multiplication with the final addition, both to simplify the code and to gain a small performance improvement.
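Since the LabVIEW diagrams cannot be reproduced in text, here is a rough NumPy sketch of what the fused update amounts to. This is our own illustration, assuming the heat-equation step applies a Laplacian via a matrix-vector product and then updates the temperature vector; `heat_step_fused` is a hypothetical name, not a toolkit VI.

```python
import numpy as np

def heat_step_fused(A, u, c):
    """One explicit heat-equation step, u <- u + c * (A @ u).

    The scale-and-add is the axpy pattern (y = a*x + y) with x = A @ u,
    y = u, a = c. On the GPU a single axpy call fuses these two
    operations; NumPy stands in here to show the data flow only.
    """
    x = A @ u          # Laplacian applied to the temperature vector
    u += c * x         # fused scale-and-add: axpy(a=c, x=x, y=u)
    return u

# Tiny 1-D Laplacian (Dirichlet boundaries) as a dense matrix
n = 5
A = -2 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
u = np.zeros(n)
u[n // 2] = 1.0      # initial heat spike in the middle
u = heat_step_fused(A, u, c=0.1)
```

The fused form saves one pass over the data compared to computing the product and the addition separately, which is where the small performance gain comes from.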
Resource Management and Freeing
Another important aspect is the resource management and freeing.
In the case of lvgpu, the user must "manually" allocate and keep track of all the resources (GPU memory, device/stream references, BLAS/FFT contexts, etc.) in order to use them later in the code and free them at the end.
In the case of CuLab, this is just a single VI call; no additional wiring is required (except error wires, of course).
CuLab achieves this simplicity by handling most of the descriptive information about GPU-related data, as well as the initialization of the various CUDA libraries, automatically within the API VIs. This simplifies the toolkit and allows for more efficient and scalable code.
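To make the contrast concrete for readers without LabVIEW, the two resource-management styles can be sketched in Python. All names here (`DeviceBuffer`, `gpu_context`) are illustrative stand-ins, not a real API: the first function mimics lvgpu's explicit track-and-free pattern, the second mimics CuLab's internally tracked allocations with a single cleanup step.

```python
from contextlib import contextmanager

class DeviceBuffer:
    """Stand-in for a GPU allocation that must be explicitly freed."""
    def __init__(self, size):
        self.size, self.freed = size, False
    def free(self):
        self.freed = True

# lvgpu style: the caller keeps every handle and frees each one.
def manual_style():
    buf_a, buf_b = DeviceBuffer(1024), DeviceBuffer(1024)
    try:
        pass  # ... computation using buf_a, buf_b ...
    finally:
        buf_a.free()
        buf_b.free()
    return buf_a.freed and buf_b.freed

# CuLab style: allocations are tracked internally, and one cleanup
# call (here, leaving the context) releases everything at once.
@contextmanager
def gpu_context():
    allocated = []
    try:
        yield lambda size: allocated.append(DeviceBuffer(size)) or allocated[-1]
    finally:
        for buf in allocated:
            buf.free()

def automatic_style():
    with gpu_context() as alloc:
        buf_a, buf_b = alloc(1024), alloc(1024)
        # ... computation ...
    return buf_a.freed and buf_b.freed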
Functionality
As can be seen from the execution sections of the code (snapshots above), lvgpu does not provide all the functions needed to run the entire algorithm on the GPU, so some of the code has to be executed on the CPU.
Note: lvgpu was not designed for full functionality. It is a reference example/framework to which additional functionality can be added by implementing custom CUDA kernels and/or wrappers.
CuLab, on the other hand, provides a wide range of functions out of the box, making GPU code development much simpler.
In addition, the lack of functionality in lvgpu has a significant impact on performance, as demonstrated in the next section.
Performance
Let's now compare the performance of the two implementations.
Below are the timing results for the lvgpu- and CuLab-based implementations, respectively.
The lvgpu-based implementation takes ~134us (snapshot above) to complete a single iteration of the algorithm, while the CuLab-based implementation takes ~84us (snapshot below), making it roughly 60% faster.
There are several reasons for this, which we explain below.
Reason #1 - Functionality
As discussed earlier, lvgpu does not provide all the functions necessary for the algorithm, so some of the code must run on the CPU. For example, the code portion below runs on the CPU in order to complete the required functionality.
Running this portion of the code on the CPU should not, by itself, have a large influence on performance, as it is not the computationally heaviest part of the algorithm. More important is its consequence, which is the next reason for the slow performance.
Reason #2 - Data Movement
Being unable to run the whole computation on the GPU causes performance issues, as data must be copied from the GPU to the CPU for processing and then sent back to the GPU.
As you can see from the snippet below, having partial results on the CPU causes data to be pushed back to the GPU on every iteration (code highlighted in red).
Data movement between the GPU and CPU is costly and drastically slows down the overall performance.
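The structural difference between the two loops can be sketched as follows. This is our own illustration, not toolkit code: `to_device`/`to_host` are hypothetical stand-ins for host-device transfers (in a real program they would be cudaMemcpy-style operations, simulated here as array copies), so the sketch shows only how often data crosses the bus, not the actual cost.

```python
import numpy as np

# Hypothetical stand-ins for host<->device transfers, simulated as copies.
def to_device(a):  # host -> GPU
    return a.copy()

def to_host(a):    # GPU -> host
    return a.copy()

def solve_with_round_trips(A, u, c, iters):
    """lvgpu-style loop: part of each iteration runs on the CPU, so the
    vector crosses the bus twice per iteration."""
    d_A = to_device(A)
    for _ in range(iters):
        h_x = to_host(d_A @ to_device(u))  # GPU part, then copy back
        u = u + c * h_x                    # CPU part on host data
    return u

def solve_on_device(A, u, c, iters):
    """CuLab-style loop: everything stays on the device; data is copied
    once in and once out, regardless of the iteration count."""
    d_A, d_u = to_device(A), to_device(u)
    for _ in range(iters):
        d_u = d_u + c * (d_A @ d_u)        # entire update on the GPU
    return to_host(d_u)
```

Both loops compute identical results; the difference is that the first performs two transfers per iteration while the second performs two transfers total, which is where the per-iteration gap comes from.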
Reason #3 - Implementation Efficiency
To demonstrate the performance difference for identical functions, we created another benchmark based on the "Heat Equation Solver", stripping the reference example down to only the Matrix-Vector multiplications. Below are the respective implementations in lvgpu and CuLab.
Their corresponding performance results are shown below.
CuLab was designed with efficiency in mind.
This is demonstrated by the fact that it is ~40% faster than lvgpu (78us vs. 110us) when comparing implementations of the same Matrix-Vector multiplication functionality.
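For readers who want to reproduce the shape of such a measurement, the reduced benchmark's per-iteration work is a single dense matrix-vector product, and a micro-benchmark along these lines (many repetitions, median per-iteration time) is how we collect timings. The sketch below is a CPU/NumPy stand-in we wrote for illustration; a GPU version would wrap the toolkit's gemv call and synchronize the device before each timestamp.

```python
import time
import numpy as np

def time_matvec(n=1024, iters=200):
    """Median per-iteration time of a dense matrix-vector product."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)
    samples = []
    y = x
    for _ in range(iters):
        t0 = time.perf_counter()
        y = A @ x                          # the timed kernel
        samples.append(time.perf_counter() - t0)
    return float(np.median(samples)), y    # median filters out jitter

per_iter_s, y = time_matvec()
```

Taking the median rather than the mean keeps one-off stalls (first-call overhead, OS scheduling) from skewing the per-iteration figure.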
Summary
Overall, when using an accelerator like a GPU to speed up your code, it is important to have a simple and intuitive API that keeps the code readable and concise. CuLab was designed to meet all three criteria discussed here: simplicity, functionality, and performance. It is rich in functionality and easy to use, without compromising performance.
This is the first part of the benchmark series. In the next blog post we will implement a Multi-Channel FFT example and see what results can be achieved there.
Code Repository
The LabVIEW project used in this demo can be downloaded and tested from our GitHub repository.
System Setup
Below are the specifications of the PC used to run these experiments.
Support Information
For any questions or assistance with CuLab or related projects, please contact us at info@ngene.co.