Vector Addition in CUDA (CUDA C/C++ program for Vector Addition)

We will contrive a simple example to illustrate threads and how we use them to code with CUDA C. Imagine having two lists of numbers where we want to sum corresponding elements of each list and store the result in a third list. If you have any background in linear algebra, you will recognize this operation as summing two vectors. For example, with a = (1, 2, 3) and b = (10, 20, 30), the element-wise sum is c = (11, 22, 33).

This article gives you a detailed description of vector addition in CUDA C. It will also show you how to launch a kernel in CUDA with a variable number of threads, and the limitations on doing so.

If you are not clear on the idea of the thread and block architecture, or on how to choose between them, please go through this link.

CPU vector sums

First we’ll look at one way this addition can be accomplished with traditional C code:

#include <stdio.h>

#define N 10

void add( int *a, int *b, int *c )
{
    int tid = 0;    // this is CPU zero, so we start at zero

    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 1;   // we have one CPU, so we increment by one
    }
}

int main( void )
{
    int a[N], b[N], c[N];

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    add( a, b, c );

    // display the results
    for (int i = 0; i < N; i++)
    {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    return 0;
}


Most of this code bears almost no explanation, but we will briefly look at the add() function to explain why we overly complicated it. One would typically code this in a slightly simpler manner, like so:

void add( int *a, int *b, int *c )
{
    for (int i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
}


Our slightly more convoluted method was intended to suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, one could change the increment to 2 and have one core initialize the loop with tid = 0 and another with tid = 1. The first core would add the even-indexed elements, and the second core would add the odd-indexed elements. This amounts to executing the following code on each of the two CPU cores:

CPU CORE 1

void add( int *a, int *b, int *c )
{
    int tid = 0;    // this core starts at the even indices

    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 2;   // two cores share the loop, so we increment by two
    }
}




CPU CORE 2

void add( int *a, int *b, int *c )
{
    int tid = 1;    // this core starts at the odd indices

    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}
Of course, doing this on a CPU would require considerably more code than we have included in this example. You would need to provide a reasonable amount of infrastructure to create the worker threads that execute the function add() as well as make the assumption that each thread would execute in parallel, a scheduling assumption that is unfortunately not always true.
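To make that concrete, here is a minimal sketch of what such infrastructure might look like using POSIX threads. This is our own illustration, not part of the original example; the add_args struct and add_worker function are invented names:

#include <stdio.h>
#include <pthread.h>

#define N 10

// Arguments for one worker: where it starts and how far it strides.
struct add_args {
    int start;       // 0 for core 1, 1 for core 2
    int stride;      // 2, because two workers share the loop
    int *a, *b, *c;
};

static void *add_worker( void *p )
{
    struct add_args *args = (struct add_args *)p;
    for (int tid = args->start; tid < N; tid += args->stride)
        args->c[tid] = args->a[tid] + args->b[tid];
    return NULL;
}

int main( void )
{
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }

    struct add_args even = { 0, 2, a, b, c };   // even indices
    struct add_args odd  = { 1, 2, a, b, c };   // odd indices

    pthread_t t0, t1;
    pthread_create( &t0, NULL, add_worker, &even );
    pthread_create( &t1, NULL, add_worker, &odd );
    pthread_join( t0, NULL );
    pthread_join( t1, NULL );

    for (int i = 0; i < N; i++)
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );

    return 0;
}

Even in this tiny sketch, creating and joining the threads is as much code as the actual work, and nothing guarantees the two workers truly run at the same time; that is exactly the overhead and uncertainty the GPU version below lets us avoid.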





GPU vector sums

We can accomplish the same addition very similarly on a GPU by writing add() as a device function. This should look similar to code you saw in the previous article. But before we look at the device code, we present main(). Although the GPU implementation of main() is different from the corresponding CPU version, nothing here should look new:



Using only threads

#include <stdio.h>
#include <stdlib.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
      {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}



const int N = 10;

// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
      // Get the index of this thread within the block
      int tid = threadIdx.x;

      if ( tid < N )    // check the boundary condition for the threads
            dev_c[tid] = dev_a[tid] + dev_b[tid];

}


int main (void)
{

      //Host array
      int Host_a[N], Host_b[N], Host_c[N];

      //Device array
      int *dev_a , *dev_b, *dev_c ;

      //Allocate the memory on the GPU
      HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

      // fill the host input arrays on the CPU
      for ( int i = 0; i <N ; i++ )
      {
            Host_a[i] = -i ;
            Host_b[i] = i*i ; 
      }

      //Copy Host array to Device array
      HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
      HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

      //Make a call to GPU kernel
      Vector_Addition <<< 1, N  >>> (dev_a , dev_b , dev_c ) ;

      //Copy back to Host array from Device array
      HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

      //Display the result
      for ( int i = 0; i<N; i++ )
                  printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

      //Free the Device array memory
      cudaFree (dev_a) ;
      cudaFree (dev_b) ;
      cudaFree (dev_c) ;

      system("pause");
      return 0 ;

}

In this code, we are launching the kernel with two numbers in the angle brackets:

Vector_Addition<<<1,N>>>( dev_a, dev_b, dev_c );

What gives?

Recall that we left those two numbers in the angle brackets unexplained; we stated vaguely that they were parameters to the runtime that describe how to launch the kernel. The first number represents the number of parallel blocks in which we would like the device to execute our kernel, and the second represents the number of threads in each block. In this case, we're launching a single block and passing the value N as the number of threads. For example, if we launch with Vector_Addition<<<1,2>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel. We call each of these parallel invocations a thread. With Vector_Addition<<<1,256>>>(), you would get 256 threads running on the GPU. Parallel programming has never been easier. But this raises an excellent question: the GPU runs N copies of our kernel code, but how can we tell from within the code which thread is currently running? This question brings us to the second new feature of the example, the kernel code itself. Specifically, it brings us to the variable threadIdx.x:

If you recall the CPU-based example with which we began, you will recall that we needed to walk through indices from 0 to N-1 in order to sum the two vectors. Since the runtime system is already launching a kernel in which each thread will have one of these indices, nearly all of this work has already been done for us. Because we're something of a lazy lot, this is a good thing. It affords us more time to blog, probably about how lazy we are.

The last remaining question to be answered is, why do we check whether tid is less than N? It should always be less than N, since we’ve specifically launched our kernel such that this assumption holds. But our desire to be lazy also makes us paranoid about someone breaking an assumption we’ve made in our code. Breaking code assumptions means broken code.
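For instance (a hypothetical variant of our own, not code from the example above), if you ever round the launch size up to a convenient multiple, the guard is all that stands between the extra threads and memory that does not belong to them:

// Suppose we launch a full warp of 32 threads even though N is only 10.
// Threads 10..31 fail the (tid < N) test in the kernel and do nothing,
// instead of writing past the end of dev_c.
Vector_Addition<<< 1, 32 >>>( dev_a, dev_b, dev_c );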

Using only blocks

We can write the same addition so that each of the N copies of the kernel runs as a separate block containing a single thread. Only two things change from the threads-only version: the kernel indexes by blockIdx.x instead of threadIdx.x, and the launch configuration becomes <<<N, 1>>> instead of <<<1, N>>>.

#include <stdio.h>
#include <stdlib.h>

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
      {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}



const int N = 10;

// CUDA Kernel for Vector Addition
__global__ void Vector_Addition ( const int *dev_a , const int *dev_b , int *dev_c)
{
      // Get the index of this block within the grid
      int tid = blockIdx.x;

      if ( tid < N )    // check the boundary condition for the blocks
            dev_c[tid] = dev_a[tid] + dev_b[tid];

}


int main (void)
{

      //Host array
      int Host_a[N], Host_b[N], Host_c[N];

      //Device array
      int *dev_a , *dev_b, *dev_c ;

      //Allocate the memory on the GPU
      HANDLE_ERROR ( cudaMalloc((void **)&dev_a , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_b , N*sizeof(int) ) );
      HANDLE_ERROR ( cudaMalloc((void **)&dev_c , N*sizeof(int) ) );

      // fill the host input arrays on the CPU
      for ( int i = 0; i <N ; i++ )
      {
            Host_a[i] = -i ;
            Host_b[i] = i*i ; 
      }

      //Copy Host array to Device array
      HANDLE_ERROR (cudaMemcpy (dev_a , Host_a , N*sizeof(int) , cudaMemcpyHostToDevice));
      HANDLE_ERROR (cudaMemcpy (dev_b , Host_b , N*sizeof(int) , cudaMemcpyHostToDevice));

      //Make a call to GPU kernel
      Vector_Addition <<< N, 1  >>> (dev_a , dev_b , dev_c ) ;

      //Copy back to Host array from Device array
      HANDLE_ERROR (cudaMemcpy(Host_c , dev_c , N*sizeof(int) , cudaMemcpyDeviceToHost));

      //Display the result
      for ( int i = 0; i<N; i++ )
                  printf ("%d + %d = %d\n", Host_a[i] , Host_b[i] , Host_c[i] ) ;

      //Free the Device array memory
      cudaFree (dev_a) ;
      cudaFree (dev_b) ;
      cudaFree (dev_c) ;

      system("pause");
      return 0 ;

}


Note
If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line const int N = 10; to 100 or 500 to launch hundreds of parallel blocks or threads. Be warned, though: no block may contain more than 512 threads (1024 on hardware of compute capability 2.0 and later), and on older hardware no dimension of the grid of blocks may exceed 65,535. These are hardware-imposed limits, so you will start to see failures if you attempt launches that exceed them. In the next articles, we will see how to work within these limitations.
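If you would rather query the exact limits of your own GPU than memorize them, a small sketch like the following (our own addition, using the standard cudaGetDeviceProperties call) prints them at runtime:

#include <stdio.h>

int main( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );   // query device 0

    printf( "Max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "Max grid dimensions:   %d x %d x %d\n",
            prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );

    return 0;
}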

Using both blocks and threads
For a better implementation, read this article. As a quick preview, the sketch below shows the indexing pattern it uses.
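This is our own sketch of the standard combined-index pattern, not code taken from the linked article:

// Kernel using both blocks and threads: the global index combines
// the thread's position within its block and the block's position
// within the grid.
__global__ void Vector_Addition( const int *dev_a, const int *dev_b, int *dev_c )
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)    // same boundary check as before
        dev_c[tid] = dev_a[tid] + dev_b[tid];
}

// Launch enough blocks of 128 threads each to cover all N elements:
// Vector_Addition<<< (N + 127) / 128, 128 >>>( dev_a, dev_b, dev_c );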


Got Questions?
Feel free to ask me any question; I'd be happy to walk you through it step by step!



33 comments:

  1. Hello there,
    I'm just a newbie in CUDA programming, and my program shows that my CPU is faster than the GPU. Do you happen to know anything about that? Thank you.
  2. Hello there,

    There may be lots of reasons for that; some of them are:
    1. Your GPU is not busy all the time; you should launch at least 192 threads to keep the GPU busy.
    2. You are transferring a lot of data but computing relatively little on the GPU; the transfers increase the total time.
    3. Your thread block and grid dimensions are not configured optimally; try to fine-tune your block size and grid size.

    You can measure how much time you are spending on computation and on transferring data between the GPU and CPU.
    Use this link for performance metrics:
    http://cuda-programming.blogspot.in/2013/01/how-to-implement-performance-metrics-in.html

    You can also profile your code in the Visual Profiler to learn about kernel execution times.
    Thank you

    Replies
    1. You can also refer to this link (at the end of the post):
      Using both blocks and threads
      For a better implementation, read this article.
  3. Thanks for the reply. I figured out what my problem was, and the link you shared was really helpful. Thanks, bro!

    Replies
    1. Good to hear that your problem got solved; you're most welcome. Keep visiting us, and please put your suggestions in the comment box. Thank you.
  4. I think there's a mistake in the "CPU CORE 1" snippet above. It should be tid += 2 instead of tid += 1; after all, you're moving forward 2 elements at a time on both CPU cores.
  5. The code shows "no CUDA-capable device found". If I remove HANDLE_ERROR and execute, I get the output, but it is garbage data.
  6. The note you have at the bottom is incorrect. The maximum grid size is 65535 x 65535 x 65535, and each block may have no more than 1024 threads in total.
  7. Nitin bhaiya, you copied most of it from the CUDA by Example book... anyway, good going, bhaiya.
  8. This blog is awesome, and I have learned a lot about programming from it. The best thing about this blog is that you go from beginner to expert level.
  9. For a loop iteration that sums corresponding elements of two arrays and places the sum in a third array (C[i] = A[i] + B[i]), how do I initialize the input when adding two large double arrays (>= 1K elements)?
  10. Nitin bhaiya, can I call you bhaiya?
  11. Such news should be made more widely available.
