Jeongchul Kim

NVIDIA Triton Inference Server with MLflow

Jeongchul Kim, 2024. 10. 25. 22:52

Integrating NVIDIA Triton Inference Server with MLflow typically splits the work in two: Triton serves the models, while MLflow handles model versioning, experiment tracking, and metadata management.

1. Model Management and Deployment

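Managing models with MLflow
- Register models in MLflow to manage model versions and track experiment results.
- Save a model with the log_model function of the matching flavor module, for example mlflow.pytorch.log_model(). The model is stored as a run artifact on the local or remote MLflow tracking server, and passing registered_model_name also registers it in the Model Registry.
- Once the model is registered, MLflow manages its versions, so you can later choose which version to deploy to Triton.
- For example, a trained PyTorch model can be logged and registered as follows.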

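import mlflow.pytorch

with mlflow.start_run():
    model = train_model()  # user-defined function that returns the trained model
    # Log under the artifact path "model" and register it as "model_name"
    mlflow.pytorch.log_model(model, "model", registered_model_name="model_name")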

Experiment tracking
- Log training metrics to MLflow to track experiments. For example, mlflow.log_metric("accuracy", accuracy) records one metric value; call it for each metric you want to track.

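Converting the model to a format Triton can serve
- To be served by Triton Inference Server, a model must be in a format Triton supports, such as ONNX, TensorFlow SavedModel, or TorchScript.
- Load the model stored in MLflow, convert it to a supported format, and save the converted model in the Triton model repository.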

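import os

import mlflow.pytorch
import torch

# Load the model from MLflow
model = mlflow.pytorch.load_model("runs:/<run_id>/model")
model.eval()  # switch to inference mode before conversion

# Convert to TorchScript
scripted_model = torch.jit.script(model)

# Create the version directory, then save the model into it
os.makedirs("model_repository/model_name/1", exist_ok=True)
scripted_model.save("model_repository/model_name/1/model.pt")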

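In this example, mlflow.pytorch.load_model() loads the PyTorch model and torch.jit.script() converts it to TorchScript. torch.jit.trace() is an alternative for models without data-dependent control flow; it additionally needs an example input.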

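Configuring the Triton Inference Server model repository
- Set up the model repository directory structure. Each model gets its own directory named after the model, with one numbered subdirectory per version. For example: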
models/
├── model_name_1/
│   └── 1/
│       └── model.onnx
└── model_name_2/
    └── 1/
        └── model.savedmodel/
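
For a TorchScript model, the version directory alone is usually not enough: Triton's PyTorch backend generally also expects a config.pbtxt in the model directory that describes the input and output tensors. The following is a minimal sketch of generating one, assuming a single FP32 image input and a 1000-class FP32 output. The helper name write_pytorch_config, the shapes, and the batch size are illustrative assumptions and do not come from the original post; INPUT__0 and OUTPUT__0 follow the PyTorch backend's tensor naming convention.

```python
from pathlib import Path


def write_pytorch_config(repo_root, model_name, input_dims, output_dims,
                         max_batch_size=8):
    """Create <repo_root>/<model_name>/config.pbtxt for a TorchScript model.

    Hypothetical helper. The dims exclude the batch dimension, because
    max_batch_size > 0 tells Triton to add it implicitly.
    """
    model_dir = Path(repo_root) / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    def dims(values):
        return ", ".join(str(v) for v in values)

    config = f"""name: "{model_name}"
platform: "pytorch_libtorch"
max_batch_size: {max_batch_size}
input [
  {{
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ {dims(input_dims)} ]
  }}
]
output [
  {{
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ {dims(output_dims)} ]
  }}
]
"""
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(config)
    return config_path
```

Calling write_pytorch_config("model_repository", "model_name", [3, 224, 224], [1000]) would place the file next to the numbered version directories. The tensor names in the file have to match the names used by the inference client.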

2. Integrating Triton with MLflow

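Writing an API that loads models from MLflow
- You can write a script that automatically deploys a new model version to the Triton model repository whenever a new version appears in MLflow.
- This code loads the latest model registered in MLflow and updates the Triton model repository with it.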

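import os

import mlflow.pytorch
import torch
from mlflow.tracking import MlflowClient

# Find the newest registered version of the model
client = MlflowClient()
versions = client.search_model_versions("name='model_name'")
latest = max(versions, key=lambda v: int(v.version))
model = mlflow.pytorch.load_model(f"models:/model_name/{latest.version}")

# Convert and save; Triton version directories must have numeric names,
# so the MLflow version number is reused here
version_dir = f"model_repository/model_name/{latest.version}"
os.makedirs(version_dir, exist_ok=True)
torch.jit.script(model.eval()).save(os.path.join(version_dir, "model.pt"))

# Restart Triton if needed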
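
Restarting the server is not the only way to pick up a new version. The following is a minimal sketch, assuming Triton was started with --model-control-mode=explicit, that asks the server to (re)load a model and then waits until it reports ready. The function name reload_model and the timeout values are illustrative; client is assumed to be a tritonclient.http.InferenceServerClient, whose load_model and is_model_ready methods are the only calls used.

```python
import time


def reload_model(client, model_name, timeout_s=30.0, poll_interval_s=0.5):
    """Ask Triton to (re)load a model and wait until it reports ready.

    `client` is assumed to be a tritonclient.http.InferenceServerClient
    connected to a server running with --model-control-mode=explicit.
    Returns True if the model became ready before the timeout.
    """
    client.load_model(model_name)
    deadline = time.monotonic() + timeout_s
    while True:
        if client.is_model_ready(model_name):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

The readiness loop is a defensive check; with it, a deployment script can fail loudly instead of sending requests to a model that never finished loading.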

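Inference requests and tracking
- Send inference requests to the model deployed on Triton and log the results to MLflow so they can be tracked.
- To do this, add logic to the inference client code that logs predictions and metadata to MLflow.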

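import numpy as np
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Attach input data; a random tensor stands in for a real preprocessed image
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("INPUT_NAME", list(input_data.shape), "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT_NAME")]

# Send the inference request
results = triton_client.infer("model_name", inputs=inputs, outputs=outputs)
output_data = results.as_numpy("OUTPUT_NAME")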
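
It can help to know what the client sends over the wire, for instance when debugging with curl or calling Triton from another language. Triton's HTTP endpoint follows the KServe v2 inference protocol, where a request is posted to /v2/models/<model_name>/infer. The sketch below builds such a JSON body with the standard library only. build_infer_request is a hypothetical helper, and the tiny 1x2x2 input is a stand-in for a real image tensor; the Python client may send tensor data in a binary encoding rather than as JSON numbers.

```python
import json


def build_infer_request(input_name, shape, datatype, flat_data, output_name):
    """Build a KServe v2 style JSON body for Triton's HTTP infer endpoint.

    flat_data is the tensor flattened in row-major order; its length must
    equal the product of the dimensions in shape.
    """
    expected = 1
    for dim in shape:
        expected *= dim
    if len(flat_data) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_data)}")
    return {
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": datatype,
            "data": list(flat_data),
        }],
        "outputs": [{"name": output_name}],
    }


body = build_infer_request("INPUT_NAME", [1, 2, 2], "FP32",
                           [0.1, 0.2, 0.3, 0.4], "OUTPUT_NAME")
payload = json.dumps(body)  # request body for POST /v2/models/model_name/infer
```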

3. Managing and Monitoring Inference Metrics with MLflow

Logging metrics in MLflow
- On each inference request, log values such as latency and prediction results as MLflow metrics for monitoring.
- For example, mlflow.log_metric("latency", latency) records the inference latency.

import time
import mlflow

start_time = time.perf_counter()
results = triton_client.infer("model_name", inputs=inputs, outputs=outputs)
latency = time.perf_counter() - start_time  # seconds
output_data = results.as_numpy("OUTPUT_NAME")  # extract the NumPy array

# Log metrics
mlflow.log_metric("latency", latency)
mlflow.log_metric("accuracy", calculate_accuracy(output_data))  # user-defined function
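
A single latency value per request can be noisy, so one option is to time a batch of requests and log summary statistics instead. The sketch below assumes that infer_fn wraps the Triton call and that log_metric is mlflow.log_metric; both are passed in as arguments so the helpers can be tried without a running server. The function names, the metric names, and the nearest-rank 95th percentile are illustrative choices, not part of the original post.

```python
import math
import statistics
import time


def measure_latency(infer_fn, n_requests=20):
    """Call infer_fn n_requests times and return latency stats in seconds."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        infer_fn()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank 95th percentile on the sorted samples
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "latency_mean": statistics.mean(latencies),
        "latency_p50": statistics.median(latencies),
        "latency_p95": latencies[p95_index],
        "latency_max": latencies[-1],
    }


def log_latency_metrics(stats, log_metric):
    """Send each statistic to a logger such as mlflow.log_metric."""
    for name, value in stats.items():
        log_metric(name, value)
```

With a live server, infer_fn could be lambda: triton_client.infer("model_name", inputs=inputs, outputs=outputs), and the resulting dictionary could be passed to log_latency_metrics together with mlflow.log_metric.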

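Building a monitoring dashboard
- The MLflow UI lets you visually monitor model performance and inference results.
- If needed, you can integrate a monitoring tool such as Grafana for more detailed monitoring; Triton exposes Prometheus-format metrics that such tools can consume.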

Following these steps, you can integrate NVIDIA Triton Inference Server with MLflow to manage models effectively and monitor inference performance.

License: Attribution, NonCommercial