Monitoring Metrics

April 29, 2026

Once FastDeploy is running, its service status can be monitored continuously through metrics. The port for the metrics service is set at startup with the metrics-port parameter.

| Category | Metric Name | Type | Description | Unit |
|---|---|---|---|---|
| Request | fastdeploy:requests_number | Counter | Total number of received requests | count |
| Request | fastdeploy:request_success_total | Counter | Number of successfully processed requests | count |
| Request | fastdeploy:num_requests_running | Gauge | Number of requests currently running | count |
| Request | fastdeploy:num_requests_waiting | Gauge | Number of requests currently waiting | count |
| Latency | fastdeploy:time_to_first_token_seconds | Histogram | Time to generate the first token (TTFT) | s |
| Latency | fastdeploy:time_per_output_token_seconds | Histogram | Time interval between generated tokens (TPOT) | s |
| Latency | fastdeploy:e2e_request_latency_seconds | Histogram | End-to-end request latency distribution | s |
| Latency | fastdeploy:request_inference_time_seconds | Histogram | Time spent in the RUNNING phase | s |
| Latency | fastdeploy:request_queue_time_seconds | Histogram | Time spent in the WAITING phase | s |
| Latency | fastdeploy:request_prefill_time_seconds | Histogram | Time spent in the Prefill phase | s |
| Latency | fastdeploy:request_decode_time_seconds | Histogram | Time spent in the Decode phase | s |
| Token | fastdeploy:prompt_tokens_total | Counter | Total number of processed prompt tokens | count |
| Token | fastdeploy:generation_tokens_total | Counter | Total number of generated tokens | count |
| Token | fastdeploy:request_prompt_tokens | Histogram | Prompt token count per request | count |
| Token | fastdeploy:request_token_ratio | Histogram | Token generation rate per request | count |
| Token | fastdeploy:request_generation_tokens | Histogram | Generation token count per request | count |
| Token | fastdeploy:request_params_max_tokens | Histogram | Distribution of max_tokens per request | count |
| Batch | fastdeploy:available_batch_size | Gauge | Number of additional requests that can be inserted during Decode | count |
| Batch | fastdeploy:batch_size | Gauge | Actual batch size during inference | count |
| Batch | fastdeploy:max_batch_size | Gauge | Maximum batch size configured at service startup | count |
| KV Cache | fastdeploy:cache_config_info | Gauge | Cache configuration info of the inference engine | count |
| KV Cache | fastdeploy:hit_req_rate | Gauge | Prefix cache hit rate at the request level | % |
| KV Cache | fastdeploy:hit_token_rate | Gauge | Prefix cache hit rate at the token level | % |
| KV Cache | fastdeploy:cpu_hit_token_rate | Gauge | CPU-side token-level prefix cache hit rate | % |
| KV Cache | fastdeploy:gpu_hit_token_rate | Gauge | GPU-side token-level prefix cache hit rate | % |
| KV Cache | fastdeploy:prefix_cache_token_num | Counter | Total number of tokens in prefix cache | count |
| KV Cache | fastdeploy:prefix_gpu_cache_token_num | Counter | Total number of prefix cache tokens on GPU | count |
| KV Cache | fastdeploy:prefix_cpu_cache_token_num | Counter | Total number of prefix cache tokens on CPU | count |
| KV Cache | fastdeploy:available_gpu_block_num | Gauge | Available GPU blocks in cache (including unreleased prefix blocks) | count |
| KV Cache | fastdeploy:free_gpu_block_num | Gauge | Number of free GPU blocks in cache | count |
| KV Cache | fastdeploy:max_gpu_block_num | Gauge | Total number of GPU blocks initialized at startup | count |
| KV Cache | fastdeploy:max_cpu_block_num | Gauge | Total number of CPU blocks initialized at startup | count |
| KV Cache | fastdeploy:available_gpu_resource | Gauge | Ratio of available GPU blocks to total GPU blocks | % |
| KV Cache | fastdeploy:gpu_cache_usage_perc | Gauge | GPU KV cache utilization | % |
| KV Cache | fastdeploy:send_cache_failed_num | Counter | Total number of cache send failures | count |
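
The Histogram metrics listed above are, in the Prometheus exposition format, published as cumulative bucket counters plus a running sum and count. The following is a minimal sketch of how a mean and an approximate quantile could be derived from such data. The helper names are hypothetical, and the bucket boundaries a given FastDeploy build uses are not stated here, so any values passed in are the caller's own.

```python
from bisect import bisect_left


def histogram_mean(total_sum: float, total_count: float) -> float:
    """Mean observation, i.e. the histogram's sum divided by its count."""
    return total_sum / total_count if total_count else 0.0


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    The last bucket is expected to be (+inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    counts = [count for _, count in buckets]
    i = bisect_left(counts, rank)  # first bucket whose cumulative count >= rank
    upper, count = buckets[i]
    if upper == float("inf"):
        # The rank falls in the overflow bucket: report the highest finite bound.
        return buckets[i - 1][0] if i > 0 else float("nan")
    lower, prev = buckets[i - 1] if i > 0 else (0.0, 0.0)
    if count == prev:
        return upper
    return lower + (upper - lower) * (rank - prev) / (count - prev)
```

Because the estimate interpolates within a bucket, its precision is limited by the bucket width; it is an approximation, not an exact percentile.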

Accessing Metrics

  • Access URL: http://localhost:8000/metrics
  • Metric format: Prometheus
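
As an illustration of reading this endpoint without extra dependencies, the sketch below fetches the URL above and parses the Prometheus text format using only the Python standard library. The function names `parse_metrics` and `fetch_metrics` are hypothetical, and the parser is deliberately simplified (it ignores optional timestamps and does not unescape label values). Exposed sample names may also carry suffixes such as `_bucket`, `_sum`, or `_count` for histograms.

```python
import re
from urllib.request import urlopen

# One sample line: name{label="value",...} value   (the label set is optional)
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> dict[str, list[tuple[dict[str, str], float]]]:
    """Parse Prometheus text into {sample_name: [(labels, value), ...]}."""
    samples: dict[str, list[tuple[dict[str, str], float]]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.setdefault(name, []).append((labels, float(value)))
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics",
                  timeout: float = 5.0) -> dict[str, list[tuple[dict[str, str], float]]]:
    """Fetch the metrics endpoint once and return the parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

With a running service, `fetch_metrics().get("fastdeploy:num_requests_running")` would return the current gauge samples, if the metric is exposed under that exact name. For production monitoring, a Prometheus server scraping the same URL is the more usual setup.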

Trace Events

FastDeploy writes structured trace events to trace.log at key stages of request processing, which helps diagnose per-request latency bottlenecks. Each trace log entry contains fields such as timestamp (milliseconds), request_id, event, and stage.
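
To show how these fields can be used for latency analysis, here is a minimal sketch that pairs matching `*_START` and `*_END` events per request and reports the elapsed milliseconds. It assumes the entries have already been parsed into dictionaries with the keys `timestamp`, `request_id`, and `event`. The on-disk layout of trace.log is not described here, so the parsing step is left out, and `span_durations` is a hypothetical helper name.

```python
from collections import defaultdict


def span_durations(events: list[dict]) -> dict[str, dict[str, int]]:
    """Return {request_id: {span_name: duration_ms}} for matched START/END pairs.

    A span name is the event name without its _START or _END suffix, so
    PREPROCESSING_START and PREPROCESSING_END form the span "PREPROCESSING".
    Events without a matching partner are ignored.
    """
    open_spans: dict[tuple[str, str], int] = {}
    durations: dict[str, dict[str, int]] = defaultdict(dict)
    for entry in sorted(events, key=lambda e: e["timestamp"]):
        request_id, name, ts = entry["request_id"], entry["event"], entry["timestamp"]
        if name.endswith("_START"):
            open_spans[(request_id, name[: -len("_START")])] = ts
        elif name.endswith("_END"):
            span = name[: -len("_END")]
            begin = open_spans.pop((request_id, span), None)
            if begin is not None:
                durations[request_id][span] = ts - begin
    return dict(durations)
```

Point events such as FIRST_TOKEN_GENERATED have no suffix pair and are skipped by this sketch; time to first token would need a separate difference between two chosen events.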

Common Events (Mixed / All Instances)

| Stage | Event | Description |
|---|---|---|
| PREPROCESSING | PREPROCESSING_START | API Server begins preprocessing the request |
| PREPROCESSING | PREPROCESSING_END | Engine receives the request, preprocessing complete |
| SCHEDULE | REQUEST_SCHEDULE_START | Request enters the scheduling flow |
| SCHEDULE | REQUEST_QUEUE_START | Request enters the scheduling queue |
| SCHEDULE | REQUEST_QUEUE_END | Request dequeued from the scheduling queue |
| SCHEDULE | RESOURCE_ALLOCATE_START | Resource allocation begins for the request |
| SCHEDULE | PREPARE_PREFIX_CACHE_START | Prefix cache block matching begins |
| SCHEDULE | PREPARE_PREFIX_CACHE_END | Prefix cache block matching complete |
| SCHEDULE | RESOURCE_ALLOCATE_END | Resource allocation complete |
| SCHEDULE | REQUEST_SCHEDULE_END | Scheduling flow complete |
| PREFILL | INFERENCE_START | Request sent to GPU for inference |
| PREFILL | FIRST_TOKEN_GENERATED | First token generated |
| DECODE | DECODE_START | Enters Decode phase |
| DECODE | INFERENCE_END | Inference complete (all tokens generated) |
| DECODE | PREEMPTED | Request preempted |
| DECODE | RESCHEDULED_INFERENCE_START | Preempted request rescheduled for execution |
| POSTPROCESSING | WRITE_CACHE_TO_STORAGE_START | Begins writing KV Cache to external storage |
| POSTPROCESSING | WRITE_CACHE_TO_STORAGE_END | KV Cache written to external storage |
| POSTPROCESSING | POSTPROCESSING_START | Post-processing begins |
| POSTPROCESSING | POSTPROCESSING_END | Post-processing complete, response sent |

PD Disaggregation — Prefill (P) Instance Events

| Stage | Event | Description |
|---|---|---|
| SCHEDULE | ASK_DECODE_RESOURCE_START | P begins requesting resources from D (sends ZMQ request) |
| SCHEDULE | ASK_DECODE_RESOURCE_END | P receives resource allocation confirmation from D (with dest_block_ids) |
| PREFILL | PREFILL_INFERENCE_END | P instance Prefill inference complete |
| POSTPROCESSING | CHECK_CACHE_TRANSFER_START | P begins waiting for KV Cache transfer to complete |
| POSTPROCESSING | CHECK_CACHE_TRANSFER_END | KV Cache transfer confirmed, ready to send first token to D |

PD Disaggregation — Decode (D) Instance Events

| Stage | Event | Description |
|---|---|---|
| DECODE | DECODE_PROCESS_PREALLOCATE_REQUEST_START | D begins processing resource allocation request from P |
| DECODE | DECODE_PROCESS_PREALLOCATE_REQUEST_END | D completes resource allocation and returns dest_block_ids to P |
| DECODE | DECODE_PROCESS_PREFILLED_REQUEST_START | D receives first token from P, begins processing Prefilled request |
| DECODE | DECODE_PROCESS_PREFILLED_REQUEST_END | D adds Prefilled request to running queue |
| DECODE | DECODE_INFERENCE_END | D instance Decode inference complete |

Request Lifecycle Sequence

Mixed mode (single instance, full inference):

PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END → INFERENCE_START
→ FIRST_TOKEN_GENERATED → DECODE_START → INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END

PD Disaggregation — Prefill (P) Instance:

PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ ASK_DECODE_RESOURCE_START → ASK_DECODE_RESOURCE_END
→ RESOURCE_ALLOCATE_START → RESOURCE_ALLOCATE_END
→ INFERENCE_START → PREFILL_INFERENCE_END
→ CHECK_CACHE_TRANSFER_START → CHECK_CACHE_TRANSFER_END → [send first token to D]

PD Disaggregation — Decode (D) Instance:

PREPROCESSING_START → PREPROCESSING_END → REQUEST_QUEUE_START → REQUEST_QUEUE_END
→ DECODE_PROCESS_PREALLOCATE_REQUEST_START → DECODE_PROCESS_PREALLOCATE_REQUEST_END
→ [wait for P to complete prefill and transfer KV Cache]
→ DECODE_PROCESS_PREFILLED_REQUEST_START → DECODE_PROCESS_PREFILLED_REQUEST_END
→ INFERENCE_START → DECODE_INFERENCE_END
→ POSTPROCESSING_START → POSTPROCESSING_END
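
The sequences above can also serve as a reference when inspecting a single request's trace. The sketch below is one possible check, not part of FastDeploy: given the event names observed for a request, it reports which expected events are missing and which arrived out of order. `MIXED_SEQUENCE` copies the Mixed mode sequence from this section, `check_lifecycle` is a hypothetical helper, and events outside the expected list are ignored.

```python
MIXED_SEQUENCE = [
    "PREPROCESSING_START", "PREPROCESSING_END",
    "REQUEST_QUEUE_START", "REQUEST_QUEUE_END",
    "RESOURCE_ALLOCATE_START", "RESOURCE_ALLOCATE_END",
    "INFERENCE_START", "FIRST_TOKEN_GENERATED",
    "DECODE_START", "INFERENCE_END",
    "POSTPROCESSING_START", "POSTPROCESSING_END",
]


def check_lifecycle(observed: list[str], expected: list[str]) -> tuple[list[str], list[str]]:
    """Compare observed event names with an expected lifecycle sequence.

    Returns (missing, out_of_order):
      missing      - expected events that never appeared
      out_of_order - events that appeared after a later-stage event
    """
    position = {name: index for index, name in enumerate(expected)}
    seen = [name for name in observed if name in position]
    missing = [name for name in expected if name not in seen]
    out_of_order = [
        current
        for previous, current in zip(seen, seen[1:])
        if position[current] < position[previous]
    ]
    return missing, out_of_order
```

The two PD Disaggregation sequences can be passed as `expected` in the same way. A preempted and rescheduled request may legitimately repeat events, so a non-empty `out_of_order` list is a prompt for closer inspection rather than proof of a fault.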