vllm.v1.worker.gpu.pp_handler ¶
Pipeline Parallelism handler for V2 Model Runner.
PPHandler ¶
Pipeline parallelism handler for Model Runner V2.
Manages sampled token synchronization between PP ranks. Only instantiated when PP is enabled (pp_size > 1).
Source code in vllm/v1/worker/gpu/pp_handler.py
maybe_broadcast_sampled_tokens ¶
maybe_broadcast_sampled_tokens(
sampler_output: SamplerOutput,
num_sampled: Tensor,
num_rejected: Tensor,
) -> None
Broadcast sampled tokens from the last PP rank to all other ranks.
No-ops if this is not the last rank.
Broadcasts sampled_token_ids [num_reqs, max_sample_len], num_sampled [num_reqs], and num_rejected [num_reqs] to support both regular decode and speculative decoding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sampler_output | SamplerOutput | SamplerOutput from sampling. | required |
| num_sampled | Tensor | Number of accepted tokens per request. | required |
| num_rejected | Tensor | Number of rejected tokens per request. | required |
Source code in vllm/v1/worker/gpu/pp_handler.py
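The rank check and the broadcast payload can be sketched with a hypothetical pure-Python stand-in: plain lists model the tensors, and a shared dict models the torch.distributed broadcast. All names and signatures below are illustrative, not the actual vLLM API.

```python
def maybe_broadcast_sampled_tokens(rank, last_rank, wire,
                                   sampled_token_ids, num_sampled, num_rejected):
    """Last PP rank publishes its sampling results; other ranks no-op.

    Hypothetical sketch: `wire` stands in for a collective broadcast.
    """
    if rank != last_rank:
        return  # no-op on non-last ranks; they receive instead
    # Payload mirrors the documented shapes:
    wire["sampled_token_ids"] = sampled_token_ids  # [num_reqs, max_sample_len]
    wire["num_sampled"] = num_sampled              # [num_reqs]
    wire["num_rejected"] = num_rejected            # [num_reqs]
```

Broadcasting num_sampled and num_rejected alongside the token IDs is what lets the same path serve both regular decode (max_sample_len = 1) and speculative decoding, where some drafted tokens are rejected.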
maybe_receive_sampled_tokens ¶
maybe_receive_sampled_tokens(
num_reqs: int, max_sample_len: int = 1
) -> tuple[Tensor, Tensor, Tensor] | None
Receive sampled tokens broadcast by the last PP rank.
Returns None if this is the last rank, which samples rather than receives.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_reqs | int | Number of requests in the batch. | required |
| max_sample_len | int | Maximum number of tokens sampled per request (1 for regular decode, >1 for speculative decoding). | 1 |
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor, Tensor] \| None | None if called on the last rank. Otherwise, tuple of (sampled_tokens, num_sampled, num_rejected). |
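The receiving side of the same pattern can be sketched as the mirror of the broadcast above, again with a hypothetical dict standing in for the collective and plain lists for tensors (illustrative names only, not the vLLM implementation):

```python
def maybe_receive_sampled_tokens(rank, last_rank, wire, num_reqs, max_sample_len=1):
    """Non-last PP ranks read the broadcast payload; the last rank gets None.

    Hypothetical sketch mirroring maybe_broadcast_sampled_tokens above.
    """
    if rank == last_rank:
        return None  # the last rank sampled locally; nothing to receive
    sampled = wire["sampled_token_ids"]
    # Shapes match the documented contract: [num_reqs, max_sample_len]
    assert len(sampled) == num_reqs
    assert all(len(row) == max_sample_len for row in sampled)
    return sampled, wire["num_sampled"], wire["num_rejected"]
```

Note that the caller must pass the same num_reqs and max_sample_len on every rank, since non-last ranks have no sampler output of their own to infer the shapes from.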