date: 2022-08-22 16:39:25

tags: cuda

CUDA

A CUDA kernel function is declared with the `__global__` qualifier:

__global__ void functionName(){}

or

void __global__ functionName(){}


Hello World

Hello.cu

```c++
#include <cstdio>

__global__ void GPU_print(){
    printf("Hello World\n");
}

int main(int argc, char **argv){
    GPU_print<<<2,2>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Compile:

nvcc Hello.cu -o Hello

Execute:

./Hello

Result:

Hello World

Hello World

Hello World

Hello World

Why four lines? `<<<2,2>>>` launches a grid of 2 blocks with 2 threads each, so the kernel body runs 2 × 2 = 4 times.
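To see which thread produced which line, the built-in variables `blockIdx` and `threadIdx` can be printed. A minimal sketch (the kernel name `who_am_i` is made up for illustration):

```c++
#include <cstdio>

__global__ void who_am_i(){
    // Each of the 4 threads prints its own coordinates.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(){
    who_am_i<<<2,2>>>();        // 2 blocks x 2 threads = 4 lines of output
    cudaDeviceSynchronize();    // wait for the device-side printf to flush
    return 0;
}
```

The four lines appear in no guaranteed order, since blocks and threads are scheduled independently.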


CUDA kernels run on the device, so we must keep host data and device data apart.

Let's look at the CUDA memory functions!

cudaMalloc(

					ptr, address of the pointer to fill in: for `int *ptr`, pass `(void**)&ptr`

					size, number of bytes, e.g. `sizeof(type) * n`

				)

After the call, `ptr` points to memory allocated on the device.

cudaMemcpy(

				dst, copy data to dst

				src, copy data from src

				size, number of bytes to copy

				kind, direction (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, ...)

				)

cudaDeviceSynchronize()

Block the host until all previously issued device work has finished.
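Putting the three calls together, a minimal host-to-device round trip might look like this (the kernel `add_one` is made up for illustration):

```c++
#include <cstdio>

__global__ void add_one(int *a, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main(){
    const int n = 8;
    int h[n] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *d;
    cudaMalloc((void**)&d, sizeof(int) * n);                    // allocate on device
    cudaMemcpy(d, h, sizeof(int) * n, cudaMemcpyHostToDevice);  // host -> device
    add_one<<<1, n>>>(d, n);
    cudaMemcpy(h, d, sizeof(int) * n, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d);
    printf("%d %d\n", h[0], h[n - 1]);   // prints "1 8"
    return 0;
}
```

`cudaMemcpy` synchronizes implicitly here, so no explicit `cudaDeviceSynchronize()` is needed before reading `h`.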


The GPU has a lot in common with the CPU, and having several different kinds of memory is one of those things.

dynamic global memory
```c++
double *a;
const int M = 1000;
cudaMalloc((void**)&a, sizeof(double) * M);
cudaMemcpy(/* ... */);
kernel<<<1,1>>>(a);
```
static global memory
Declared at global scope with `__device__`:

```c++
__device__ double a[5];
__device__ int b;
```

The compiler must know the size of a static global variable at compile time, and it does not need to be passed to the kernel as an argument:

`kernel<<<1,1>>>()` is enough, no `kernel<<<1,1>>>(a)` needed.

constant memory via kernel arguments
```c++
int a = 1;
kernel<<<1,1>>>(a);
```

Kernel arguments are read-only inside the kernel, and their total size is limited to 4KB.

constant memory via `__constant__`
```c++
__constant__ int a = 1;
```

Read-only in kernels; total constant memory is 64KB on most NVIDIA GPUs.

How do we copy data to/from constant memory (or any `__device__` symbol)?

cudaMemcpyFromSymbol(

											dst, copy data to dst (host pointer)

											symbol, the device symbol to copy from

											count, number of bytes

											kind, cudaMemcpyDeviceToHost (the default)

											)

cudaMemcpyToSymbol(

										symbol, the device symbol to copy to

										src, copy data from src (host pointer)

										count, number of bytes

										kind, cudaMemcpyHostToDevice (the default)

										)
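A small sketch combining `__constant__` with `cudaMemcpyToSymbol` (the kernel `scale` and the array `coeff` are made up for illustration):

```c++
#include <cstdio>

__constant__ float coeff[4];          // lives in device constant memory

__global__ void scale(float *x){
    x[threadIdx.x] *= coeff[threadIdx.x];
}

int main(){
    float h_coeff[4] = {1.f, 2.f, 3.f, 4.f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host -> constant memory

    float h_x[4] = {10.f, 10.f, 10.f, 10.f};
    float *d_x;
    cudaMalloc((void**)&d_x, sizeof(h_x));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    scale<<<1,4>>>(d_x);
    cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    printf("%.0f %.0f %.0f %.0f\n", h_x[0], h_x[1], h_x[2], h_x[3]);  // 10 20 30 40
    return 0;
}
```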
static shared memory
```c++
__global__ void kernel(){
    __shared__ int a[10];   // size fixed at compile time
}
```

Shared memory is readable and writable; each block gets its own copy, shared by all threads of that block (typically 48KB per block by default).

dynamic shared memory
Outside, the size is passed as the third launch parameter:

kernel<<<1,1,sharedMemorySize>>>()

Inside, the array is declared extern with no size:

extern __shared__ double a[]

Note: it must be `a[]`, not `*a`; a pointer is not an array.
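A sketch of dynamic shared memory in use, summing one block's values (the kernel `block_sum` is made up for illustration):

```c++
#include <cstdio>

// Sum n floats within a single block, using dynamically sized shared memory.
__global__ void block_sum(const float *x, float *out, int n){
    extern __shared__ float buf[];   // size supplied at launch time
    int t = threadIdx.x;
    buf[t] = (t < n) ? x[t] : 0.f;
    __syncthreads();
    // Tree reduction: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1){
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0) *out = buf[0];
}

int main(){
    const int n = 8;
    float h[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    float *d, *d_out, result;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    block_sum<<<1, n, n * sizeof(float)>>>(d, d_out, n);  // 3rd launch arg = shared bytes
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d); cudaFree(d_out);
    printf("%.0f\n", result);   // 36
    return 0;
}
```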

register memory
Fast and small; plain local variables in a kernel are normally placed in registers:
```c++
__global__ void kernel(){
    int a = 1;         // lives in a register
    const int b = 1;
}
```
dynamic unified memory
```c++
double *x, *y;
const int M = sizeof(double) * 10000;
cudaMallocManaged((void**)&x, M);
cudaMallocManaged((void**)&y, M);
*x = 1;   // unified memory is accessible from both host and device
*y = 2;
kernel<<<1,1>>>(x, y);
```
static unified memory
```c++
__device__ __managed__ int ret[1000];
__device__ __managed__ int a;

int main(){
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("%d\n", a);
}
```
Free Memory
`cudaFree(void* ptr)`

Atomic operations

They look like a form of synchronization, but they are not: an atomic operation makes a single read-modify-write indivisible, without ordering or synchronizing threads.
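A sketch with `atomicAdd` (the kernel `count_all` is made up for illustration): without the atomic, concurrent `*counter += 1` updates would lose increments.

```c++
#include <cstdio>

__global__ void count_all(int *counter){
    atomicAdd(counter, 1);   // indivisible read-modify-write; no thread ordering implied
}

int main(){
    int *c;
    cudaMallocManaged((void**)&c, sizeof(int));
    *c = 0;
    count_all<<<4, 64>>>(c);     // 256 threads each add 1
    cudaDeviceSynchronize();
    printf("%d\n", *c);          // 256: every increment is counted exactly once
    cudaFree(c);
    return 0;
}
```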


CUDA Stream

Let's see the functions!

type: cudaStream_t stream

cudaStreamCreate(&stream)

cudaStreamDestroy(stream)

cudaStreamSynchronize(stream), wait for all work in the stream to finish

cudaStreamQuery(stream), check whether the stream has finished; returns cudaSuccess or cudaErrorNotReady

launch a kernel in a stream: kernel<<<gridSize, blockSize, shared_size, stream>>>();

cudaMemcpyAsync(

								void *dst

								const void *src

								size_t count

								enum cudaMemcpyKind kind

cudaStream_t stream, pass 0 to use the default stream

								)

Which data can be used with cudaMemcpyAsync? The host buffer must be page-locked (pinned) memory, allocated with:

cudaMallocHost(void **ptr, size_t size)

cudaHostAlloc(void **ptr, size_t size, unsigned int flags)

With flags == cudaHostAllocDefault, cudaHostAlloc is equivalent to cudaMallocHost.

Free with cudaFreeHost(void *ptr).
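Tying streams and pinned memory together, a sketch of an asynchronous copy–compute–copy pipeline (the kernel `add_one` is made up for illustration):

```c++
#include <cstdio>

__global__ void add_one(float *x, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.f;
}

int main(){
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost((void**)&h, sizeof(float) * n);   // pinned host memory
    cudaMalloc((void**)&d, sizeof(float) * n);
    for (int i = 0; i < n; ++i) h[i] = 0.f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations go to the same stream: they run in order with
    // respect to each other, but asynchronously with respect to the host.
    cudaMemcpyAsync(d, h, sizeof(float) * n, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, sizeof(float) * n, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait for the whole pipeline
    printf("%.0f\n", h[n - 1]);      // 1

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```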