Use caller CUDA stream for D2H and H2D copies (#20498)#20498
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 PendingAs of commit 533e5de with merge base 55a71e6 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531. |
This PR needs a
|
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
3d8da75 to
07765c3
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
| cudaError_t err = cudaSuccess; | ||
| const auto caller_stream = executorch::extension::cuda::getCallerStream(); | ||
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream); | ||
| } else { | ||
| err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice); | ||
| } |
| // TODO: validate caller stream device matches index. | ||
| // For now assert single-GPU case. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_host_to_device only supports device 0, got %d", | ||
| static_cast<int>(index)); |
| // TODO: validate caller stream device matches index. | ||
| // For now assert single-GPU case. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_device_to_host only supports device 0, got %d", | ||
| static_cast<int>(index)); |
| cudaStream_t s; | ||
| ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess); | ||
| executorch::extension::cuda::CallerStreamGuard g(s); | ||
|
|
||
| CudaAllocator& a = CudaAllocator::instance(); | ||
| auto res = a.allocate(256, 0); | ||
| ASSERT_TRUE(res.ok()); | ||
| void* d = res.get(); | ||
| std::vector<uint8_t> h(256, 7); | ||
| // should take async branch internally, still return Ok | ||
| EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok); | ||
| a.deallocate(d, 0); | ||
| cudaStreamDestroy(s); |
| cudaStream_t s; | ||
| ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess); | ||
| executorch::extension::cuda::CallerStreamGuard g(s); | ||
|
|
||
| CudaAllocator& a = CudaAllocator::instance(); | ||
| auto res = a.allocate(256, 0); | ||
| ASSERT_TRUE(res.ok()); | ||
| void* d = res.get(); | ||
| std::vector<uint8_t> h_src(256, 5), h_dst(256, 0); | ||
| ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok); | ||
| EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok); | ||
| EXPECT_EQ(h_src, h_dst); | ||
|
|
||
| a.deallocate(d, 0); | ||
| cudaStreamDestroy(s); |
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
07765c3 to
b316b71
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
b316b71 to
98081dc
Compare
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| // TODO: validate caller stream device matches index. | ||
| // For now assert index is -1 or 0. | ||
| ET_CHECK_OR_RETURN_ERROR( | ||
| index == -1 || index == 0, | ||
| InvalidArgument, | ||
| "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d", | ||
| static_cast<int>(index)); |
| cudaError_t err = cudaSuccess; | ||
| const auto caller_stream = executorch::extension::cuda::getCallerStream(); | ||
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream); | ||
| // We don't synchronize the stream here because the caller is expected to |
| if (caller_stream) { | ||
| err = cudaMemcpyAsync( | ||
| dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream); | ||
| if (err == cudaSuccess) { | ||
| err = cudaStreamSynchronize(*caller_stream); | ||
| } |
| ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok); | ||
| EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok); | ||
| EXPECT_EQ(h_src, h_dst); |
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
98081dc to
1e001a5
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531
c657616 to
fd2e388
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
fd2e388 to
9bfc44e
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
32968c0 to
056c25c
Compare
Summary: Pull Request resolved: pytorch#20498 CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
056c25c to
96b9452
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
96b9452 to
f042310
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
f042310 to
0dd28cc
Compare
0dd28cc to
f042310
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
f042310 to
6ec0025
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531
6ec0025 to
4b6fde9
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. - Added CIs for unit tests. Reviewed By: Gasoonjia Differential Revision: D109590531
4b6fde9 to
5bfd1b4
Compare
Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. - Added CIs for unit tests. Reviewed By: Gasoonjia Differential Revision: D109590531
5bfd1b4 to
533e5de
Compare
Summary:
CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via
getCallerStream()),copy_host_to_deviceandcopy_device_to_hostusecudaMemcpyAsync. When no caller stream is set, the synchronouscudaMemcpypath is used as before.Additionally:
dst/srcreturnError::InvalidArgumentinstead of aborting incudaMemcpy, and zero-byte copies returnError::Okearly.//executorch/extension/cuda:caller_streamdependency in TARGETS.extension_cudadependencies to CMakeLists.txt.test_cuda_allocatorwith coverage for sync/async paths and error handling.Reviewed By: Gasoonjia
Differential Revision: D109590531