This post has been on my to-do list for a long time, and I’m excited to finally have the time to write it. This topic is particularly interesting to me, and I hope you find it useful.
It all started while I was testing some of our internal data preprocessing pipelines and began thinking about the performance of the normalization step. Image normalization is a straightforward process: subtract the mean and divide by the standard deviation. These operations are well-suited for parallel processing, which is where GPUs excel. However, factors like data types and data transfer between the CPU and GPU also play a significant role.
My goal is to compare the performance differences between CPU and GPU for image normalization and explore the impact of data types and data transfer. Let’s get started!
Setup
Let’s start by creating some test data and importing the necessary libraries:
import torch
from torchvision.transforms import v2
from torch.profiler import profile, record_function, ProfilerActivity, schedule
from torch.utils.data import Dataset, DataLoader
batch_size = 10
height, width = 720, 1280
# Create test images as uint8
img_uint8 = torch.randint(0, 256, (batch_size, 3, height, width), dtype=torch.uint8)
# Create a float version since v2.Normalize expects float input
img_float = img_uint8.float()
# ImageNet normalization values
mean = torch.tensor([123.675, 116.28, 103.53])
std = torch.tensor([58.395, 57.12, 57.375])
To profile our operations, we’ll use PyTorch’s built-in profiler. Admittedly, this is mostly because I wanted to play with it, and this looked like a good opportunity. We’ll define a custom profiling function to simplify the process:
def custom_profile(func, activities, times=10):
    with profile(activities=activities, record_shapes=True) as p:
        for _ in range(times):
            func()
            p.step()
    print(p.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Basic CPU Normalization
Let’s start by implementing our normalization function and profiling it on the CPU:
def my_normalize(image, mean, std):
    # Reshape mean and std for broadcasting
    mean = mean.view(-1, 1, 1)
    std = std.view(-1, 1, 1)
    return (image - mean) / std
custom_profile(
    lambda: my_normalize(img_uint8, mean, std),
    activities=[ProfilerActivity.CPU]
)
The profiler reveals some interesting insights:
----------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::sub 36.69% 643.926ms 67.74% 1.189s 118.902ms 10
aten::div 32.25% 565.990ms 32.25% 565.990ms 56.599ms 10
aten::to 0.00% 73.817us 31.06% 545.093ms 54.509ms 10
aten::_to_copy 0.02% 324.441us 31.05% 545.019ms 54.502ms 10
aten::copy_ 31.01% 544.341ms 31.01% 544.341ms 54.434ms 10
aten::empty_strided 0.02% 354.039us 0.02% 354.039us 35.404us 10
aten::view 0.01% 225.351us 0.01% 225.351us 11.268us 20
----------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.755s
- The basic operations (`aten::sub` and `aten::div`) account for most of the execution time.
- A significant portion of time is also spent on type-conversion operations (`aten::to`, `aten::_to_copy`).
- These conversions occur because the input is `uint8`, while the mean and standard deviation are `float`s. So PyTorch converts the input to `float` before performing the operations. We can confirm this directly, as shown below.
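As a quick aside, we don’t even need the profiler to confirm this promotion: `torch.result_type` reports the dtype PyTorch will use to compute an operation on the given inputs.
# Check the dtype PyTorch promotes a uint8/float operation to
print(torch.result_type(img_uint8, mean))  # torch.float32: the uint8 input is upcast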
Now, let’s compare the performance when using `float` as input instead:
custom_profile(
    lambda: my_normalize(img_float, mean, std),
    activities=[ProfilerActivity.CPU]
)
-------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
-------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::sub 50.19% 569.169ms 50.19% 569.169ms 56.917ms 10
aten::div 49.79% 564.562ms 49.79% 564.562ms 56.456ms 10
aten::view 0.02% 203.964us 0.02% 203.964us 10.198us 20
-------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.134s
This is much faster! Here’s our first takeaway: always use the correct data types to avoid unnecessary conversions. Casting between types is costly and can significantly impact performance.
At this point, I was curious how my simple function would stack up against PyTorch’s built-in normalization, so I couldn’t resist comparing them. Note that I have used the new torchvision transforms v2 API:
torch_normalize = v2.Normalize(mean, std)
custom_profile(lambda: torch_normalize(img_float), activities=[ProfilerActivity.CPU])
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::sub 87.41% 785.070ms 87.41% 785.070ms 78.507ms 10
aten::div_ 12.21% 109.692ms 12.21% 109.692ms 10.969ms 10
aten::item 0.30% 2.692ms 0.33% 2.921ms 32.458us 90
aten::is_nonzero 0.02% 136.106us 0.32% 2.899ms 96.634us 30
aten::_local_scalar_dense 0.03% 229.645us 0.03% 229.645us 2.552us 90
aten::view 0.02% 169.587us 0.02% 169.587us 8.479us 20
aten::empty 0.02% 161.253us 0.02% 161.253us 8.063us 20
aten::to 0.00% 14.746us 0.00% 14.746us 0.737us 20
aten::lift_fresh 0.00% 12.286us 0.00% 12.286us 0.614us 20
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 898.176ms
We can spot a few extra operations because PyTorch checks that the types and dimensions are correct (along with other performance optimizations, as seen in the source code).
You might notice a small difference: there’s an `aten::div_` operation instead of `aten::div`. This happens because PyTorch’s `Normalize` transform performs the division in-place, while our custom function does not. Still, the main operations are the same, and the performance is very similar.
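If we wanted to mirror that behavior, a minimal sketch of an in-place variant could look like this (`my_normalize_` is my own hypothetical helper, not torchvision’s actual implementation):
def my_normalize_(image, mean, std):
    mean = mean.view(-1, 1, 1)
    std = std.view(-1, 1, 1)
    result = image - mean  # the subtraction still allocates a new tensor
    result.div_(std)       # in-place division shows up as aten::div_ in the trace
    return result
And a quick sanity check (my addition) that our out-of-place function matches the built-in transform numerically:
assert torch.allclose(my_normalize(img_float, mean, std), torch_normalize(img_float))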
This means we’ve mostly achieved the same behavior. Great! Now, let’s see how it runs on the GPU.
GPU Implementation
We only need a simple adjustment to run our code on the GPU: move our tensors there and make sure the operations execute on the device. Additionally, I’ve included CUDA in the profiler activities:
img_gpu = img_float.cuda()
mean_gpu = mean.cuda()
std_gpu = std.cuda()
custom_profile(
    lambda: my_normalize(img_gpu, mean_gpu, std_gpu),
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
)
The profiler output shows the following:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 85.85% 20.758ms 85.85% 20.758ms 20.758ms 0.000us 0.00% 0.000us 0.000us 1
aten::sub 10.09% 2.439ms 13.14% 3.177ms 317.717us 10.450ms 48.68% 10.450ms 1.045ms 10
cudaLaunchKernel 3.31% 799.561us 3.31% 799.561us 39.978us 0.000us 0.00% 0.000us 0.000us 20
aten::div 0.55% 133.436us 0.81% 195.034us 19.503us 11.019ms 51.32% 11.019ms 1.102ms 10
aten::view 0.21% 50.311us 0.21% 50.311us 2.516us 0.000us 0.00% 0.000us 0.000us 20
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 10.450ms 48.68% 10.450ms 1.045ms 10
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 11.019ms 51.32% 11.019ms 1.102ms 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 24.181ms
Self CUDA time total: 21.469ms
There are a few new operations in the trace:
- `cudaLaunchKernel`: a CUDA runtime API function that launches a CUDA kernel on the GPU. This is the mechanism used to execute GPU code asynchronously.
- `cudaDeviceSynchronize`: forces the CPU to wait until all previously launched GPU operations are complete. The small sketch below makes this asynchrony visible.
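Here’s a minimal sketch of that behavior (a throwaway tensor, with timings that will vary by GPU): the arithmetic returns almost immediately because the kernels are only queued, and `torch.cuda.synchronize()` is what actually waits for them to finish.
import time

x = torch.randn(batch_size, 3, height, width, device="cuda")
start = time.time()
y = (x - 0.5) / 2.0  # kernels are queued asynchronously; this returns right away
launch_time = time.time() - start
torch.cuda.synchronize()  # block until the queued kernels have finished
total_time = time.time() - start
print(f"launch: {launch_time * 1e3:.2f} ms, after sync: {total_time * 1e3:.2f} ms")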
The key observation here is that the `sub` and `div` operations take significantly less time on the GPU than on the CPU: roughly 21ms of CUDA time here versus about 1.1s of CPU time earlier, a roughly 50x difference in this run. This is because the GPU can parallelize these element-wise operations across many cores, making it much faster than the CPU. This supports our initial hypothesis that the GPU should be faster for this kind of operation.
But… is there anything else we can improve? Can we make this optimization even better?
The Importance of Data Transfer
So this is where things get interesting. We have seen that using the correct data types is crucial for performance. Avoiding unnecessary conversions can save a lot of time. But have you considered the time it takes to move data between the CPU and GPU? Let’s start with a simple exploration:
def custom_pipeline(img, mean, std):
    # Move the image to the GPU, then normalize it there
    img_cuda = img.to("cuda")
    return my_normalize(img_cuda, mean, std)
custom_profile(
    lambda: custom_pipeline(img_float, mean_gpu, std_gpu),
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
)
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 1.02% 2.999ms 98.06% 288.017ms 28.802ms 0.000us 0.00% 267.229ms 26.723ms 10
aten::_to_copy 0.07% 213.702us 97.03% 285.018ms 28.502ms 0.000us 0.00% 267.229ms 26.723ms 10
aten::copy_ 0.09% 271.131us 96.91% 284.653ms 28.465ms 267.229ms 92.59% 267.229ms 26.723ms 10
cudaMemcpyAsync 96.68% 283.985ms 96.68% 283.985ms 28.398ms 0.000us 0.00% 0.000us 0.000us 10
aten::div 0.09% 278.727us 0.97% 2.859ms 285.949us 10.986ms 3.81% 10.986ms 1.099ms 10
cudaMalloc 0.83% 2.424ms 0.83% 2.424ms 2.424ms 0.000us 0.00% 0.000us 0.000us 1
cudaDeviceSynchronize 0.66% 1.952ms 0.66% 1.952ms 1.952ms 0.000us 0.00% 0.000us 0.000us 1
aten::sub 0.18% 514.698us 0.26% 771.397us 77.140us 10.402ms 3.60% 10.402ms 1.040ms 10
cudaLaunchKernel 0.14% 410.469us 0.14% 410.469us 20.523us 0.000us 0.00% 0.000us 0.000us 20
cudaStreamSynchronize 0.14% 396.956us 0.14% 396.956us 39.696us 0.000us 0.00% 0.000us 0.000us 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 293.730ms
Self CUDA time total: 288.617ms
It’s interesting to see that `cudaMemcpyAsync` alone accounts for 96.68% of the total CPU time! That’s a lot!
Can we avoid this data transfer? Unfortunately no, the data must be on the GPU to be processed. But can we optimize it? Yes! We can do this by transferring less data.
How? By using the correct data types! `uint8` is a quarter of the size of `float32` (1 byte vs. 4 bytes). So, we can move the data to the GPU as `uint8`, convert it to `float32` there, and then normalize it. If memory bandwidth is the bottleneck, this should be faster!
def custom_pipeline_uint8(img, mean, std):
    # Transfer as uint8 (4x less data), then convert to float32 on the GPU
    img_cuda = img.to("cuda").float()
    return my_normalize(img_cuda, mean, std)
custom_profile(
    lambda: custom_pipeline_uint8(img_uint8, mean_gpu, std_gpu),
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
)
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 1.09% 935.260us 95.88% 82.642ms 4.132ms 0.000us 0.00% 62.728ms 3.136ms 20
aten::_to_copy 0.31% 264.719us 94.80% 81.707ms 4.085ms 0.000us 0.00% 62.728ms 3.136ms 20
aten::copy_ 0.42% 365.007us 94.21% 81.197ms 4.060ms 62.728ms 74.50% 62.728ms 3.136ms 20
cudaMemcpyAsync 93.10% 80.242ms 93.10% 80.242ms 8.024ms 0.000us 0.00% 0.000us 0.000us 10
cudaDeviceSynchronize 3.42% 2.949ms 3.42% 2.949ms 2.949ms 0.000us 0.00% 0.000us 0.000us 1
cudaStreamSynchronize 0.45% 385.047us 0.45% 385.047us 38.505us 0.000us 0.00% 0.000us 0.000us 10
cudaLaunchKernel 0.43% 370.317us 0.43% 370.317us 12.344us 0.000us 0.00% 0.000us 0.000us 30
aten::sub 0.26% 221.024us 0.37% 317.485us 31.748us 10.457ms 12.42% 10.457ms 1.046ms 10
aten::empty_strided 0.28% 244.689us 0.28% 244.689us 12.234us 0.000us 0.00% 0.000us 0.000us 20
aten::div 0.17% 142.558us 0.24% 210.994us 21.099us 11.010ms 13.08% 11.010ms 1.101ms 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 86.192ms
Self CUDA time total: 84.195ms
And it is indeed faster! This is a great optimization! We often focus on how efficient the GPU is at computations (which is true), but we should pay more attention to the data being transferred between the CPU and GPU and look for ways to optimize that!
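A quick back-of-the-envelope check (my own arithmetic, not profiler output) shows why this works: the uint8 payload is a quarter of the float32 one, which lines up with `cudaMemcpyAsync` dropping from ~284ms to ~80ms in the traces above.
n_elements = batch_size * 3 * height * width  # 27,648,000 values per batch
print(f"float32 transfer: {n_elements * 4 / 1e6:.1f} MB")  # ~110.6 MB
print(f"uint8 transfer:   {n_elements * 1 / 1e6:.1f} MB")  # ~27.6 MB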
Although the torch profiler has been very useful here, let’s now compare the total wall-clock time of both pipelines using a classic timer:
import time

def timer(func):
    start = time.time()
    func()
    # CUDA work is asynchronous, so wait for it to finish before stopping the clock
    torch.cuda.synchronize()
    return time.time() - start
gpu_time_from_float = timer(lambda: custom_pipeline(img_float, mean_gpu, std_gpu))
gpu_time_from_uint = timer(lambda: custom_pipeline_uint8(img_uint8, mean_gpu, std_gpu))
print(f"Time transferring floats: {gpu_time_from_float:.4f}s")
print(f"Time transferring uint8: {gpu_time_from_uint:.4f}s")
print(f"Speedup: {gpu_time_from_float / gpu_time_from_uint:.2f}x")
Time transferring floats: 0.0306s
Time transferring uint8: 0.0083s
Speedup: 3.69x
This is something to always keep in mind when optimizing your data pipeline, and I’m pretty sure it can help many people out there tune their inference pipelines!
I imagine the astute reader might be wondering: hasn’t PyTorch been designed so that data loading and preprocessing on the CPU overlap with computation on the GPU, keeping the GPU from sitting idle while waiting for data? That’s true. During training, if we can fully parallelize data preprocessing on the CPU with model execution on the GPU, we can achieve optimal throughput. However, when serving a model, we typically only need to load a few images on demand and apply normalization. In such cases, we can leverage the optimizations discussed in this post.
Conclusion
In this post, we’ve explored the performance differences between CPU and GPU when normalizing images in PyTorch, and examined how data transfer and data types influence preprocessing efficiency. Key takeaways for optimizing your pipeline include:
- Always use the correct data types to avoid unnecessary conversions.
- Minimize the amount of data transferred between the CPU and GPU.
- Optimize your data pipeline to take full advantage of the GPU’s parallel processing capabilities.