Copying struct pointers from Device to Host
I am trying to copy the variable `input` from device to host using `cudaMemcpy`, but I am getting `GPUassert: invalid argument` on `gpuErrchk(cudaMemcpy(list, d_list, sizeof(uint8_t*)*3, cudaMemcpyDeviceToHost));`. Could you please let me know why?
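For what it's worth, the usual cause of `invalid argument` here is that `d_list` is not itself a device allocation (or the copy direction does not match where the pointers actually live). Below is a minimal sketch of a pattern that does work, assuming `d_list` is a `cudaMalloc`'d array of three device pointers; the `fill` kernel and the buffer names are illustrative only:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cuda_runtime.h>

// Error-checking macro as referenced in the question.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Illustrative kernel: store three device buffer pointers into d_list.
__global__ void fill(uint8_t** d_list, uint8_t* a, uint8_t* b, uint8_t* c) {
    d_list[0] = a;
    d_list[1] = b;
    d_list[2] = c;
}

int main() {
    uint8_t* list[3];            // host-side destination for the device pointers
    uint8_t** d_list = nullptr;
    uint8_t *a, *b, *c;

    gpuErrchk(cudaMalloc(&a, 16));
    gpuErrchk(cudaMalloc(&b, 16));
    gpuErrchk(cudaMalloc(&c, 16));

    // d_list itself must be a device allocation for a DeviceToHost copy to be valid.
    gpuErrchk(cudaMalloc(&d_list, sizeof(uint8_t*) * 3));
    fill<<<1, 1>>>(d_list, a, b, c);
    gpuErrchk(cudaDeviceSynchronize());

    // Copies the pointer values themselves; the buffers they point to stay on the device.
    gpuErrchk(cudaMemcpy(list, d_list, sizeof(uint8_t*) * 3, cudaMemcpyDeviceToHost));
    return 0;
}
```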
Does the CUDA standard library have `shared_ptr`s that can be used in device code?
I should have asked this before starting work on the reference-counting CUDA backend for Spiral. I know the library has them for host code, but I am wondering about device kernels specifically. Spiral already has a reference-counting backend for C, and now even one for CUDA, but the approach makes a lot more sense for C than it does for C++. It would be very easy to break the various ref-counting compiler passes through the use of macros in the new CUDA backend, so leaving the memory management to the C++ compiler and making use of `shared_ptr` might be a better option.
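As far as I know, libcu++ (`cuda::std`) does not currently provide a `shared_ptr` usable in device code, so a hand-rolled intrusive count is one fallback. A minimal sketch, assuming device-side `malloc`/`free` from the device heap; `RefArray` and the helper names are hypothetical:

```cuda
#include <cstdlib>

// Hypothetical intrusive ref-counted array for device code.
struct RefArray {
    int refc;
    unsigned long len;
    float* data;
};

__device__ RefArray* ref_array_new(unsigned long len) {
    RefArray* a = (RefArray*)malloc(sizeof(RefArray));  // device heap allocation
    a->refc = 1;
    a->len = len;
    a->data = (float*)malloc(len * sizeof(float));
    return a;
}

__device__ void ref_array_incref(RefArray* a) {
    atomicAdd(&a->refc, 1);  // atomic, so threads can share ownership safely
}

__device__ void ref_array_decref(RefArray* a) {
    if (atomicSub(&a->refc, 1) == 1) {  // old value 1: this was the last reference
        free(a->data);
        free(a);
    }
}

__global__ void demo() {
    RefArray* a = ref_array_new(8);
    ref_array_incref(a);  // a second owner appears...
    ref_array_decref(a);  // ...and releases
    ref_array_decref(a);  // last reference gone: both allocations are freed
}
```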
How do I use flexible array members in CUDA structs?

```cuda
typedef struct {
    int refc;
    unsigned long len;
    float ptr[];
} Array0;

extern "C" __global__ void entry0() {}
```

```
error: incomplete type is not allowed
      float ptr[];
            ^
```

I am working on a reference counting backend for the Spiral language, and in the host C backend that is already there, I am using flexible array members […]
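Flexible array members are a C99 feature, and device code is compiled as C++, which is presumably why the member is rejected here. One common workaround, sketched below under the assumption that the header and payload share a single allocation, is to drop the flexible member and compute the payload address from the header size; the `ptr()` helper is hypothetical:

```cuda
struct Array0 {
    int refc;
    unsigned long len;
    // The payload lives immediately after the header in the same allocation,
    // e.g. cudaMalloc(&a, sizeof(Array0) + len * sizeof(float)).
    __host__ __device__ float* ptr() {
        return reinterpret_cast<float*>(reinterpret_cast<char*>(this) + sizeof(Array0));
    }
};

extern "C" __global__ void entry0(Array0* a) {
    if (threadIdx.x < a->len) a->ptr()[threadIdx.x] = 0.0f;
}
```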
nvrtc is not limiting register usage
I'm trying to limit the number of registers used to increase the occupancy of my kernel. I'm compiling the kernel at runtime using the options […]
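If I'm reading the NVRTC docs right, a per-thread register cap can be passed as the `--maxrregcount=<N>` compile option; annotating the kernel with `__launch_bounds__` is the other common lever. A minimal sketch (the architecture flag and the cap of 32 are placeholder values):

```cuda
#include <nvrtc.h>
#include <cstdio>

int main() {
    const char* src =
        "extern \"C\" __global__ void kernel(float* out) {\n"
        "    out[threadIdx.x] = threadIdx.x;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kernel.cu", 0, nullptr, nullptr);

    const char* opts[] = {
        "--gpu-architecture=compute_80",  // adjust for the target device
        "--maxrregcount=32",              // cap registers per thread
    };
    nvrtcResult res = nvrtcCompileProgram(prog, 2, opts);

    // Print the compile log so any rejected options are visible.
    size_t logSize;
    nvrtcGetProgramLogSize(prog, &logSize);
    if (logSize > 1) {
        char* log = new char[logSize];
        nvrtcGetProgramLog(prog, log);
        printf("%s\n", log);
        delete[] log;
    }
    nvrtcDestroyProgram(&prog);
    return res == NVRTC_SUCCESS ? 0 : 1;
}
```

After loading the generated PTX, querying `cuFuncGetAttribute` with `CU_FUNC_ATTRIBUTE_NUM_REGS` should confirm whether the cap actually took effect.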
Getting a weird error while trying to set up GPU acceleration with CUDA
```
2024-06-03 19:27:33.5829116 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 2 Memcpy nodes are added to the graph tf2onnx for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-06-03 19:27:33.6039646 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 1 Memcpy nodes are added to the graph tf2onnx for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-06-03 19:27:33.6248139 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-06-03 19:27:33.6326379 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
```
CUDA dynamic allocation in device code, and copy to device: is it possible?
I pass a struct pointer to a device function, and this device function edits this pointer.
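Device code can allocate with `new`/`malloc` from the device heap and write the result back through a pointer-to-pointer argument; a minimal sketch follows (the `Node`/`make_node` names are illustrative). One caveat worth noting: device-heap pointers cannot be passed to host-side `cudaMemcpy`, so data the host needs should be staged through a `cudaMalloc`'d buffer instead.

```cuda
#include <cstdio>

struct Node { int value; };

// The device function allocates and edits the caller's pointer in place.
__device__ void make_node(Node** out, int v) {
    Node* n = new Node;  // device-side new draws from the device heap
    n->value = v;
    *out = n;
}

__global__ void kernel() {
    Node* n = nullptr;
    make_node(&n, 42);
    printf("node value: %d\n", n->value);
    delete n;  // must be freed on the device, not with cudaFree
}

int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```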
What is CGA in the CUDA programming model?
Hi, I understand CTA, which stands for cooperative thread array. But what is a CGA, and what is the relationship between a CTA and a CGA? I haven't found a document that explains these well.
When is `shfl.sync` fast?
Using the `.idx` option of `shfl.sync`, it is possible to arbitrarily permute registers between threads in a single warp. The hope is that by using `shfl.sync`, you can avoid storing and then loading data from shared memory.
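For reference, the `__shfl_sync` intrinsic with an explicit source lane is what lowers to `shfl.sync.idx`; a minimal permutation sketch (here a lane reversal, though any source index works):

```cuda
#include <cstdio>

// Each thread fetches the register value held by another lane of the same
// warp, with no shared-memory round trip.
__global__ void reverse_warp() {
    int lane = threadIdx.x & 31;
    int value = lane * 10;  // per-thread register value
    int src = 31 - lane;    // arbitrary source lane for the permutation
    int got = __shfl_sync(0xffffffffu, value, src);
    printf("lane %2d got %3d from lane %2d\n", lane, got, src);
}

int main() {
    reverse_warp<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```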
ImportError: libc10_cuda.so (help!)
When I was fine-tuning an LLM, the following error occurred: