date: 2022-08-22 16:39:25
tags: cuda
CUDA
A CUDA kernel function is declared with the `__global__` qualifier:
__global__ void functionName(){}
or, equivalently:
void __global__ functionName(){}
Hello World
Hello.cu
```c++
#include <stdio.h>

__global__ void hello()
{
    printf("Hello World\n");
}

int main()
{
    hello<<<2, 2>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
Compile:
nvcc Hello.cu -o Hello
Execute:
./Hello
Result:
Hello World
Hello World
Hello World
Hello World
Why four lines? `<<<2, 2>>>` launches a grid of 2 blocks with 2 threads each, so the kernel body runs 2 × 2 = 4 times.
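To see which thread produced each line, the kernel can print its block and thread indices. A minimal sketch (the kernel name `hello_ids` and the format string are my own):

```c++
#include <stdio.h>

__global__ void hello_ids()
{
    // Each of the 2 x 2 = 4 threads prints its own coordinates.
    printf("Hello World from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello_ids<<<2, 2>>>();
    cudaDeviceSynchronize();  // wait for the device-side printf output
    return 0;
}
```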
CUDA kernels run on the device, so we have to keep host data and device data apart.
Let's look at the basic CUDA memory functions!
cudaMalloc(
(void**)&ptr, for `int *ptr`, pass `(void**)&ptr`
size, number of bytes, e.g. sizeof(ptr_type) * count
)
`ptr` then points to memory allocated on the device.
cudaMemcpy(
dst, data is copied to dst
src, data is copied from src
size, how many bytes
kind, the direction (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, ...)
)
cudaDeviceSynchronize()
Waits until all previously issued operations on the device have finished.
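Putting these three calls together, here is a minimal sketch of the usual host-to-device round trip (the kernel `add_one` and the array contents are my own choices):

```c++
#include <stdio.h>

__global__ void add_one(int *a, int n)
{
    // One thread per element, assuming enough threads are launched.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main()
{
    const int n = 8;
    const int size = n * sizeof(int);
    int h[n] = {0, 1, 2, 3, 4, 5, 6, 7};  // host data

    int *d;
    cudaMalloc((void**)&d, size);                    // allocate on the device
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);  // host -> device

    add_one<<<1, n>>>(d, n);
    cudaDeviceSynchronize();                         // wait for the kernel

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d);

    for (int i = 0; i < n; ++i) printf("%d ", h[i]); // prints 1 2 ... 8
    printf("\n");
    return 0;
}
```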
The GPU has a lot in common with the CPU, and having several kinds of memory is one of those common points.
dynamic global memory
```c++
double *a;
const int M = sizeof(double) * 1000;
cudaMalloc((void**)&a, M);
cudaMemcpy(....);
kernel<<<1, 1>>>(a);
```
static global memory
Declared at file scope, global and static:
```c++
__device__ double a[5];
__device__ int b;
```
The compiler must know the size of a static global variable at compile time, and it need not be passed to the kernel as an argument:
no `kernel<<<1,1>>>(a);` is needed; kernels reference it directly.
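A minimal sketch of that direct access (the initializer and the kernel name `show` are my own):

```c++
#include <stdio.h>

__device__ int b = 42;  // static global (device) variable

__global__ void show()
{
    // The kernel references `b` directly; it is never passed as an argument.
    printf("b = %d\n", b);
}

int main()
{
    show<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the device-side printf output
    return 0;
}
```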
constant memory, passed as a kernel argument
```c++
int a = 1;
kernel<<<1, 1>>>(a);
```
Kernel arguments like this are read-only inside the kernel, and their total size is limited to 4 KB.
constant memory via `__constant__`
```c++
__constant__ int a = 1;
```
Read-only inside kernels; total constant memory is 64 KB on most NVIDIA GPUs.
How do we copy data from/to a constant (or other device) symbol?
cudaMemcpyFromSymbol(
dst, data is copied to dst
symbol, the device symbol the data comes from
count, how many bytes
offset, byte offset into the symbol, default 0
kind, default cudaMemcpyDeviceToHost
)
cudaMemcpyToSymbol(
symbol, the device symbol the data goes to
src, data is copied from src
count, how many bytes
offset, byte offset into the symbol, default 0
kind, default cudaMemcpyHostToDevice
)
static shared memory
```c++
__global__ void kernel()
{
    const int n = 10;     // the size must be a compile-time constant
    __shared__ int a[n];
}
```
Shared memory is readable and writable, with roughly 48-64 KB per block depending on the GPU; all threads of the same block see the same shared-memory values.
dynamic shared memory
outside: `kernel<<<1, 1, sharedMemorySize>>>()`, where the third launch parameter gives the shared-memory size in bytes
inside:
`extern __shared__ double a[];`
Note: it must be `a[]`, not `*a`; a pointer is not an array.
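A minimal complete sketch of a dynamically sized shared array (the kernel `reverse` and the data are my own):

```c++
#include <stdio.h>

__global__ void reverse(int *d, int n)
{
    extern __shared__ int s[];  // size set by the launch configuration
    int t = threadIdx.x;
    s[t] = d[t];                // stage into shared memory
    __syncthreads();            // wait until all threads have written
    d[t] = s[n - 1 - t];        // read back in reverse order
}

int main()
{
    const int n = 8;
    int h[n] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *d;
    cudaMalloc((void**)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverse<<<1, n, n * sizeof(int)>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < n; ++i) printf("%d ", h[i]);  // prints 7 6 5 ... 0
    printf("\n");
    return 0;
}
```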
register memory
Registers are fast and small. Ordinary local variables inside a kernel live in registers:
__global__ void kernel(){
    int a = 1;
    const int b = 1;
}
dynamic unified memory
```c++
double *x, *y;
const int M = sizeof(double) * 10000;
cudaMallocManaged((void**)&x, M);
cudaMallocManaged((void**)&y, M);
*x = 1;  // unified memory is directly accessible from host code
*y = 2;
kernel<<<1, 1>>>(x, y);
```
static unified memory
```c++
__device__ __managed__ int ret[1000];
__device__ __managed__ int a;

int main()
{
    kernel<<<1, 1>>>();
    // ...
}
```
Free memory
`cudaFree(void* ptr)`
Atomic operations
They look like synchronization, but no real synchronization happens: an atomic read-modify-write is indivisible, yet it neither orders nor blocks other threads.
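A minimal sketch with `atomicAdd` (the counter variable and the launch shape are my own):

```c++
#include <stdio.h>

__device__ __managed__ int total;  // unified memory, visible to host and device

__global__ void count()
{
    // Without atomicAdd, concurrent `total += 1` would lose updates.
    atomicAdd(&total, 1);
}

int main()
{
    total = 0;
    count<<<2, 128>>>();           // 256 threads increment concurrently
    cudaDeviceSynchronize();
    printf("total = %d\n", total); // prints 256
    return 0;
}
```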
CUDA Stream
Let's look at the stream functions!
type: `cudaStream_t stream`
cudaStreamCreate(&stream)
cudaStreamDestroy(stream)
cudaStreamSynchronize(stream), wait until all work in the stream has finished
cudaStreamQuery(stream), check whether the stream has finished; returns cudaSuccess or cudaErrorNotReady
run in a stream: kernel<<<gridSize, blockSize, shared_size, stream>>>();
cudaMemcpyAsync(
void *dst,
const void *src,
size_t count,
enum cudaMemcpyKind kind,
cudaStream_t stream, pass 0 to use the default stream
)
Which data can be copied with cudaMemcpyAsync? The host buffer should be pinned (page-locked) memory, allocated with:
cudaMallocHost(void **ptr, size_t size)
cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
If flags == cudaHostAllocDefault, cudaHostAlloc is equivalent to cudaMallocHost.
Free: `cudaFreeHost(void* ptr)`
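Putting pinned memory, a stream, and async copies together, a minimal sketch (the kernel `scale` and the sizes are my own):

```c++
#include <stdio.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    float *h, *d;
    cudaMallocHost((void**)&h, size);  // pinned host memory, needed for true async copies
    cudaMalloc((void**)&d, size);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are queued on the same stream and run in order,
    // without blocking the host in between.
    cudaMemcpyAsync(d, h, size, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, size, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);     // wait for the whole pipeline
    printf("h[0] = %f\n", h[0]);       // prints 2.000000

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```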