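Amazon SageMaker AI 2025 year in review, Part 1: Flexible Training Plans and improved price performance for inference workloads
In 2025, Amazon SageMaker AI extended Flexible Training Plans, its capacity reservation feature, to inference workloads. The change is aimed at deployment delays caused by GPU shortages and at lowering inference costs.
Key points
Flexible Training Plans for inference
SageMaker AI now offers the Flexible Training Plans capacity reservation feature for inference endpoints, so teams can secure predictable GPU availability for large language model (LLM) inference.
A simpler, more transparent reservation workflow
You create a reservation by specifying the instance type, quantity, duration, and time window, and receive an Amazon Resource Name (ARN) for the reserved capacity. Upfront pricing supports accurate budget planning and reduces concerns about infrastructure availability.
Operational flexibility and scalability
During the reservation period you can update model versions and scale the instance count up or down within the reservation limits, covering everything from a conservative initial deployment to full load testing.
Fast registration and automatic loading of LoRA adapters
When you register an adapter against a base inference component, registration completes in under one second and the adapter is loaded automatically on first use.
Memory efficiency setting
The environment variable 'SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY' limits how many adapters are kept in memory, which lets you manage resource efficiency.
Dynamic adapter loading and cost optimization
Adapters are loaded into memory on first invocation, which decouples adapter registration from resource allocation. You pay only for compute that is actively serving inference requests, so multiple LoRA adapters can be managed more cost-effectively.
Broad SageMaker AI enhancements in 2025
New capabilities such as Flexible Training Plans, Multi-AZ high availability, controlled concurrency, and adaptive speculative decoding with EAGLE-3 reduce the operational complexity and infrastructure cost of generative AI while improving scalability and reliability.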
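Impact analysis
This announcement is an important step toward easing unstable GPU supply, which is the biggest bottleneck in LLM inference. It substantially lowers infrastructure risk, particularly for business deployments of large models and for burst processing at peak times. AWS is shifting its value proposition from simply providing resources to guaranteeing predictable capacity, which may differentiate it from competitors.
Editorial comment
When bringing LLMs into a business, the risk that a deployment is postponed because no GPUs are available is very high. Offering reservations as a kind of infrastructure insurance against that risk is a practical and strategic move by AWS.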
In 2025, Amazon SageMaker AI saw dramatic improvements to core infrastructure offerings along four dimensions: capacity, price performance, observability, and usability. In this series of posts, we discuss these various improvements and their benefits. In Part 1, we discuss capacity improvements with the launch of Flexible Training Plans. We also describe improvements to price performance for inference workloads. In Part 2, we discuss enhancements made to observability, model customization, and model hosting.
Flexible Training Plans for SageMaker
SageMaker AI Training Plans now support inference endpoints, extending a powerful capacity reservation capability originally designed for training workloads to address the critical challenge of GPU availability for inference deployments. Deploying large language models (LLMs) for inference requires reliable GPU capacity, especially during critical evaluation periods, limited-duration production testing, or predictable burst workloads. Capacity constraints can delay deployments and impact application performance, particularly during peak hours when on-demand capacity becomes unpredictable. Training Plans can help solve this problem by making it possible to reserve compute capacity for specified time periods, facilitating predictable GPU availability precisely when teams need it most.
The reservation workflow is designed for simplicity and flexibility. You begin by searching for available capacity offerings that match your specific requirements—selecting instance type, quantity, duration, and desired time window. When you identify a suitable offering, you can create a reservation that generates an Amazon Resource Name (ARN), which serves as the key to your guaranteed capacity. The upfront, transparent pricing model helps support accurate budget planning while minimizing concerns about infrastructure availability, so teams can focus on their evaluation metrics and model performance rather than worrying about whether capacity will be available when they need it.
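The search-then-reserve flow described above could be sketched with boto3 roughly as follows. This is a minimal sketch under stated assumptions, not a definitive procedure: it assumes the SageMaker client operations `search_training_plan_offerings` and `create_training_plan`, assumes that `"endpoint"` is the `TargetResources` value used for inference capacity, and uses hypothetical helper names (`build_offering_search`, `reserve_capacity`) and a hypothetical "pick the cheapest offering" rule that is not part of the original post.

```python
from datetime import datetime, timedelta, timezone


def build_offering_search(instance_type, instance_count, window_start,
                          window_end, duration_hours):
    """Build keyword arguments for search_training_plan_offerings."""
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "StartTimeAfter": window_start,
        "EndTimeBefore": window_end,
        "DurationHours": duration_hours,
        # Assumption: "endpoint" selects capacity usable by inference endpoints.
        "TargetResources": ["endpoint"],
    }


def reserve_capacity(client, plan_name, search_params):
    """Search offerings, reserve the cheapest one, and return the plan ARN."""
    offerings = client.search_training_plan_offerings(
        **search_params)["TrainingPlanOfferings"]
    if not offerings:
        raise LookupError("no capacity offering matches the request")
    # Hypothetical selection rule: lowest upfront fee wins.
    cheapest = min(offerings, key=lambda o: float(o["UpfrontFee"]))
    plan = client.create_training_plan(
        TrainingPlanName=plan_name,
        TrainingPlanOfferingId=cheapest["TrainingPlanOfferingId"],
    )
    return plan["TrainingPlanArn"]


# Usage (needs AWS credentials, so it is left commented out):
#   import boto3
#   start = datetime.now(timezone.utc) + timedelta(days=1)
#   params = build_offering_search("ml.p5.48xlarge", 2, start,
#                                  start + timedelta(days=14), 168)
#   arn = reserve_capacity(boto3.client("sagemaker"), "eval-week", params)
```

Passing the client in as an argument keeps the selection logic separate from AWS access, which makes it easy to exercise the flow against a stub before spending money on a real reservation.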
Throughout the reservation lifecycle, teams maintain operational flexibility to manage their endpoints as requirements evolve. You can update endpoints to new model versions while maintaining the same reserved capacity, which supports iterative testing and refinement during evaluation periods. Scaling capabilities help teams adjust instance counts within their reservation limits, supporting scenarios where initial deployments are conservative but higher-throughput testing becomes necessary later. This flexibility helps make sure teams aren’t locked into rigid infrastructure decisions while still being able to benefit from the reserved capacity during critical time windows.
With support for endpoint updates, scaling capabilities, and seamless capacity management, Training Plans help give you control over both GPU availability and costs for time-bound inference workloads. Whether you’re running competitive model benchmarks to select the best-performing variant, performing limited-duration A/B tests to validate model improvements, or handling predictable traffic spikes during product launches, Training Plans for inference endpoints help provide the capacity guarantees teams need with transparent, upfront pricing. This approach is particularly valuable for data science teams conducting week-long or month-long evaluation projects, where the ability to reserve specific GPU instances in advance minimizes the uncertainty of on-demand availability and enables more predictable project timelines and budgets.
For more information, see Amazon SageMaker AI now supports Flexible Training Plans capacity for Inference.
Price performance
Enhancements made to SageMaker AI in 2025 help optimize inference economics through four key capabilities. Flexible Training Plans extend to inference endpoints with transparent upfront pricing. Inference components add Multi-AZ availability and parallel model copy placement during scaling, which help accelerate deployment. EAGLE-3 speculative decoding delivers higher throughput on inference requests. Dynamic multi-adapter inference enables on-demand loading of LoRA adapters.
Improvements to inference components
Generative models only start delivering value when they’re serving predictions in production. As applications scale, inference infrastructure must be as dynamic and reliable as the models themselves. That’s where SageMaker AI inference components come in. Inference components provide a modular way to manage model inference within an endpoint. Each inference component represents a self-contained unit of compute, memory, and model configuration that can be independently created, updated, and scaled. This design helps you operate production endpoints with greater flexibility. You can deploy multiple models, adjust capacity quickly, and roll out updates safely without redeploying the entire endpoint. For teams running real-time or high-throughput applications, inference components help bring fine-grained control to inference workflows. In the following sections, we review three major enhancements to SageMaker AI inference components that make them even more powerful in production environments. These updates add Multi-AZ high availability, controlled concurrency for multi-tenant workloads, and parallel scaling for faster response to traffic surges. Together, they help make running AI at scale more resilient, predictable, and efficient.
Building resilience with Multi-AZ high availability
Every production system faces the same truth: failures happen. A single hardware fault, network issue, or Availability Zone outage can disrupt inference traffic and affect user experience. Now, SageMaker AI inference components automatically distribute workloads across multiple Availability Zones. You can run multiple inference component copies per Availability Zone, and SageMaker AI helps intelligently route traffic to instances that are healthy and have available capacity. This distribution adds fault tolerance at every layer of your deployment.
Multi-AZ high availability offers the following benefits:
- Minimizes single points of failure by spreading inference workloads across Availability Zones
- Automatically fails over to healthy instances when issues occur
- Keeps uptime high to meet strict SLA requirements
- Enables balanced cost and resilience through flexible deployment patterns
For example, a financial services company running real-time fraud detection can benefit from this feature. By deploying inference components across three Availability Zones, traffic can seamlessly redirect to the remaining Availability Zones if one goes offline, helping facilitate uninterrupted fraud detection when reliability matters most.
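The failover behavior in this example can be illustrated with a small toy model. The sketch below is an assumption-laden illustration, not SageMaker AI's actual routing algorithm: `ComponentCopy`, `route`, and the least-in-flight selection rule are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class ComponentCopy:
    """One inference component copy placed in an Availability Zone."""
    az: str
    healthy: bool = True
    in_flight: int = 0


def route(copies):
    """Send one request to the healthy copy with the fewest in-flight requests."""
    candidates = [c for c in copies if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy inference component copy available")
    target = min(candidates, key=lambda c: c.in_flight)
    target.in_flight += 1
    return target


# Three copies across three zones; one zone goes offline.
fleet = [ComponentCopy("az-a"), ComponentCopy("az-b"), ComponentCopy("az-c")]
fleet[1].healthy = False
served_by = [route(fleet).az for _ in range(4)]
```

In this toy run the four requests alternate between the two remaining zones and none reach the offline one, which is the property the fraud detection scenario relies on.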
Parallel scaling and NVMe caching
Traffic patterns in production are rarely steady. One moment your system is quiet; the next, it’s flooded with requests. Previously, scaling inference components happened sequentially: each new model copy waited for the previous one to initialize before starting. During spikes, this sequential process could add several minutes of latency. With parallel scaling, SageMaker AI can now deploy multiple inference component copies simultaneously when an instance and the required resources are available. This helps shorten the time required to respond to traffic surges and improves responsiveness for variable workloads. For example, if an instance needs three model copies, they now deploy in parallel instead of waiting on one another.
Parallel scaling helps accelerate the deployment of model copies onto inference components but does not accelerate the scaling up of models when traffic increases beyond provisioned capacity. NVMe caching helps accelerate model scaling for already provisioned inference components by caching model artifacts and images. NVMe caching’s ability to reduce scaling times helps reduce inference latency during traffic spikes, lower idle costs through faster scale-down, and provide greater elasticity for serving unpredictable or volatile workloads.
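To make the sequential versus parallel difference concrete, the following back-of-the-envelope model compares the two. It is only a sketch: the function names and the 90-second load time are hypothetical, and it assumes every copy can start immediately, which matches the "instance and the required resources are available" condition above.

```python
def sequential_startup_seconds(load_times):
    """Each copy waits for the previous one, so the delays add up."""
    return sum(load_times)


def parallel_startup_seconds(load_times):
    """All copies start together, so the slowest copy sets the delay."""
    return max(load_times, default=0)


# Hypothetical example: three model copies that each take 90 seconds to load.
copies = [90, 90, 90]
sequential = sequential_startup_seconds(copies)  # 270 seconds
parallel = parallel_startup_seconds(copies)      # 90 seconds
```

The gap grows linearly with the number of copies, which is why the benefit is most visible on instances that host several copies of the same model.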
EAGLE-3
SageMaker AI has introduced Extrapolation Algorithm for Greater Language-model Efficiency (EAGLE)-based adaptive speculative decoding to help accelerate generative AI inference. This enhancement supports six model architectures and helps you optimize performance using either SageMaker-provided datasets or your own application-specific data for highly adaptive, workload-specific results. The solution streamlines the workflow from optimization job creation through deployment, making it seamless to deliver low-latency generative AI applications at scale without compromising generation quality.
EAGLE works by predicting future tokens directly from the model’s hidden layers rather than relying on an external draft model, resulting in more accurate predictions and fewer rejections. SageMaker AI automatically selects between EAGLE-2 and EAGLE-3 based on the model architecture, with launch support for LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM, GptOssForCausalLM (EAGLE-3), and Qwen3NextForCausalLM (EAGLE-2).
You can train EAGLE models from scratch, retrain existing models, or use pre-trained models from SageMaker JumpStart, with the flexibility to iteratively refine performance using your own curated datasets collected through features like Data Capture. The optimization workflow integrates seamlessly with existing SageMaker AI infrastructure through familiar APIs (create_model, create_endpoint_config, create_endpoint) and supports widely used training data formats, including ShareGPT and OpenAI chat and completions. Benchmark results are automatically generated during optimization jobs, providing clear visibility into performance improvements across metrics like Time to First Token (TTFT) and throughput, with trained EAGLE models showing significant gains over both base models and EAGLE models trained only on built-in datasets.
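The draft-then-verify idea behind speculative decoding in general can be shown with a toy greedy variant. This is a minimal sketch and not EAGLE itself: EAGLE drafts from the target model's hidden layers, whereas here `draft_next` and `target_next` are hypothetical stand-in callables that map a token list to the next token, and verification is done position by position rather than in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k):
    """Run one draft-then-verify round and return the tokens it yields."""
    # Draft phase: the cheap drafter proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # Verify phase: keep drafted tokens while they match the target's choice.
    accepted, ctx = [], list(context)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # the target's token replaces the reject
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # bonus token when every draft matched
    return accepted


def generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Generate tokens and report how many verification rounds were needed."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


# Toy models: the target counts upward modulo 10; the drafter errs after a 2.
def toy_target(ctx):
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    return 0 if ctx[-1] == 2 else toy_target(ctx)


tokens, rounds = generate(toy_target, toy_draft, [0], max_new_tokens=8)
```

Because every accepted token is one the target would have chosen anyway, the output matches plain greedy decoding with the target, while the number of verification rounds is smaller than the number of generated tokens. That is the sense in which throughput can rise without changing generation quality.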
To run an EAGLE-3 optimization job, run the following command in the AWS Command Line Interface (AWS CLI), replacing each "Enter ..." placeholder with your own value:
aws sagemaker --region us-west-2 create-optimization-job \
    --optimization-job-name "Enter optimization job name" \
    --account-id "Enter AWS account ID" \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": { "ModelName": "Created Model name" }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "Enter custom train data location"
            }
        }
    }' \
    --output-config '{
        "S3OutputLocation": "Enter optimization output location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"
For more details, see Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding to accelerate generative AI inference.
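After submitting an optimization job, you typically wait for it to reach a terminal state before deploying. The helper below is a minimal polling sketch, assuming the boto3 SageMaker client's `describe_optimization_job` operation and its `OptimizationJobStatus` and `FailureReason` response fields. The function name, the polling interval, and the set of terminal states checked are choices made for this example rather than anything prescribed by the original post.

```python
import time

# Statuses treated as final in this sketch.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_optimization_job(client, job_name, poll_seconds=60,
                              max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal state; return (status, reason)."""
    for _ in range(max_polls):
        description = client.describe_optimization_job(
            OptimizationJobName=job_name)
        status = description["OptimizationJobStatus"]
        if status in TERMINAL_STATES:
            return status, description.get("FailureReason")
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Usage (needs AWS credentials, so it is left commented out):
#   import boto3
#   status, reason = wait_for_optimization_job(
#       boto3.client("sagemaker", region_name="us-west-2"), "my-eagle-job")
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting.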
Dynamic multi-adapter inference on SageMaker AI Inference
SageMaker AI enhanced the efficient multi-adapter inference capability introduced at re:Invent 2024, which now supports dynamic loading and unloading of LoRA adapters during inference invocations rather than pinning them at endpoint creation. This enhancement helps optimize resource utilization for on-demand model hosting scenarios.
Previously, the adapters were downloaded to disk and loaded into memory during the CreateInferenceComponent API call. With dynamic loading, adapters are registered using a lightweight, synchronous CreateInferenceComponent API, then downloaded and loaded into memory only when first invoked. This approach supports use cases where you can register thousands of fine-tuned adapters per endpoint while maintaining low-latency inference.
The system implements intelligent memory management, evicting the least popular adapters when resources are constrained. When the number of adapters in memory reaches the limit set by the SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY environment variable, the system automatically unloads inactive adapters to make room for newly requested ones. Similarly, when disk space becomes constrained, the least recently used adapters are evicted from storage. This multi-tier caching strategy facilitates optimal resource utilization across CPU, GPU memory, and disk.
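The eviction behavior can be pictured with a small in-memory cache. The sketch below is an illustration only: it assumes a plain least-recently-used policy for the memory tier (the service's actual policy is not specified beyond the description above), and `AdapterCache`, `get`, and `resident` are hypothetical names.

```python
from collections import OrderedDict


class AdapterCache:
    """Keep at most max_in_memory adapters, evicting the least recently used."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self._loaded = OrderedDict()  # adapter id -> loaded weights
        self.evicted = []             # eviction history, oldest first

    def get(self, adapter_id, load):
        """Return the adapter, loading it on first use and evicting if full."""
        if adapter_id in self._loaded:
            self._loaded.move_to_end(adapter_id)  # mark as recently used
            return self._loaded[adapter_id]
        if len(self._loaded) >= self.max_in_memory:
            victim, _ = self._loaded.popitem(last=False)
            self.evicted.append(victim)
        self._loaded[adapter_id] = load(adapter_id)
        return self._loaded[adapter_id]

    def resident(self):
        """List adapter ids in memory, least recently used first."""
        return list(self._loaded)


# A limit of 2 plays the role of SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY.
cache = AdapterCache(max_in_memory=2)
for name in ["support", "legal", "support", "sales"]:
    cache.get(name, load=lambda adapter_id: f"weights:{adapter_id}")
```

In this run the third request refreshes `support`, so the fourth request evicts `legal`. A frequently invoked adapter stays resident while rarely used ones pay a reload cost on their next call.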
For security and compliance alignment, you can explicitly delete adapters using the DeleteInferenceComponent API. Upon deletion, SageMaker unloads the adapter from the base inference component containers and removes it from disk across the instances, facilitating the complete cleanup of customer data. The deletion process completes asynchronously with automatic retries, providing you with control over your adapter lifecycle while helping meet stringent data retention requirements.
This dynamic adapter loading capability powers the SageMaker AI serverless model customization feature, which helps you fine-tune popular AI models like Amazon Nova, DeepSeek, Llama, and Qwen using techniques like supervised fine-tuning, reinforcement learning, and direct preference optimization. When you complete fine-tuning through the serverless customization interface, the output LoRA adapter weights flow seamlessly to deployment—you can deploy to SageMaker AI endpoints using multi-adapter inference components. The hosting configurations from training recipes automatically include the appropriate dynamic loading settings, helping make sure customized models can be deployed efficiently without requiring you to manage infrastructure or load the adapters at endpoint creation time.
The following steps illustrate how you can use this feature in practice:
- Create a base inference component with your foundation model:
import boto3
sagemaker = boto3.client('sagemaker')
# Create base inference component with foundation model
response = sagemaker.create_inference_component(
    InferenceComponentName='llama-base-ic',
    EndpointName='my-endpoint',
    Specification={
        'Container': {
            'Image': 'your-container-image',
            'Environment': {
                # Upper bound on adapters held in memory at once
                'SAGEMAKER_MAX_NUMBER_OF_ADAPTERS_IN_MEMORY': '10'
            }
        },
        'ComputeResourceRequirements': {
            'NumberOfAcceleratorDevicesRequired': 2,
            'MinMemoryRequiredInMb': 16384
        }
    }
)
- Register your LoRA adapters:
# Register adapter - completes in < 1 second
response = sagemaker.create_inference_component(
    InferenceComponentName='my-custom-adapter',
    EndpointName='my-endpoint',
    Specification={
        'BaseInferenceComponentName': 'llama-base-ic',
        'Container': {
            'ArtifactUrl': 's3://amzn-s3-demo-bucket/adapters/customer-support/'
        }
    }
)
- Invoke your adapter (it loads automatically on first use):
import json
runtime = boto3.client('sagemaker-runtime')
# Invoke with adapter - loads into memory on first call
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',
    InferenceComponentName='llama-base-ic',
    TargetModel='s3://amzn-s3-demo-bucket/adapters/customer-support/',
    ContentType='application/json',
    Body=json.dumps({'inputs': 'Your prompt here'})
)
- Delete adapters when no longer needed:
sagemaker.delete_inference_component(
    InferenceComponentName='my-custom-adapter'
)
This dynamic loading capability integrates seamlessly with the existing inference infrastructure of SageMaker, supporting the same base models and maintaining compatibility with the standard InvokeEndpoint API. By decoupling adapter registration from resource allocation, you can now deploy and manage more LoRA adapters cost-effectively, paying only for the compute resources actively serving inference requests.
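Because registration is lightweight, registering many adapters becomes a simple loop. The sketch below builds the request payloads separately from the AWS call so they can be inspected first. It reuses the request shape from the steps above, while the helper name `adapter_registration_requests` and the adapter names and S3 prefixes in the usage comment are hypothetical.

```python
def adapter_registration_requests(endpoint_name, base_component, adapters):
    """Build one create_inference_component request per adapter.

    adapters maps an inference component name to the S3 prefix of its weights.
    """
    return [
        {
            "InferenceComponentName": name,
            "EndpointName": endpoint_name,
            "Specification": {
                "BaseInferenceComponentName": base_component,
                "Container": {"ArtifactUrl": artifact_url},
            },
        }
        for name, artifact_url in sorted(adapters.items())
    ]


# Usage (needs AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("sagemaker")
#   requests = adapter_registration_requests(
#       "my-endpoint", "llama-base-ic",
#       {"support-adapter": "s3://amzn-s3-demo-bucket/adapters/customer-support/"})
#   for request in requests:
#       client.create_inference_component(**request)
```

Sorting by name keeps the registration order deterministic, which makes repeated runs easier to compare.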
Conclusion
The 2025 SageMaker AI enhancements represent a significant leap forward in making generative AI inference more accessible, reliable, and cost-effective for production workloads. With Flexible Training Plans now supporting inference endpoints, you can gain predictable GPU capacity precisely when you need it—whether for critical model evaluations, limited-duration testing, or handling traffic spikes. The introduction of Multi-AZ high availability, controlled concurrency, and parallel scaling with NVMe caching for inference components helps make sure production deployments can scale rapidly while maintaining resilience across Availability Zones. The adaptive speculative decoding of EAGLE-3 delivers increased throughput without sacrificing output quality, and dynamic multi-adapter inference helps teams efficiently manage more fine-tuned LoRA adapters on a single endpoint. Together, these capabilities help reduce the operational complexity and infrastructure costs of running AI at scale, so teams can focus on delivering value through their models rather than managing underlying infrastructure.
These improvements directly address some of the most pressing challenges facing AI practitioners today: securing reliable compute capacity, achieving low-latency inference at scale, and managing the growing complexity of multi-model deployments. By combining transparent capacity reservations, intelligent resource management, and performance optimizations that help deliver measurable throughput gains, SageMaker AI helps organizations deploy generative AI applications with confidence. The seamless integration between model customization and deployment—where fine-tuned adapters flow directly from training to production hosting—further helps accelerate the journey from experimentation to production.
Ready to accelerate your generative AI inference workloads? Explor