tiny-trtllm
A minimal, educational implementation of TensorRT-LLM's core architecture in roughly 3,000 lines of C++.
Focus
The focus is the system-level C++ runtime, not CUDA kernels: all computation uses TensorRT's built-in layers and cuBLAS. The project demonstrates how TensorRT-LLM's serving infrastructure works:
- Paged KV Cache - Memory-efficient key-value cache management
- Inflight Batching - Dynamic batching for optimal GPU utilization
- TensorRT Plugin Mechanism - Custom operator integration
- C++ Runtime - The infrastructure that ties it all together
Architecture
┌─────────────────────────────────────────────────┐
│ Python API │
│ (TinyLLM class) │
├─────────────────────────────────────────────────┤
│ pybind11 bindings │
├──────────┬──────────┬──────────┬────────────────┤
│ Engine │ KV Cache │Scheduler │ Builder │
│ Runtime │ Manager │(Inflight │ (Network + │
│(TRT exec)│ (Paged) │ Batching)│ Weights) │
├──────────┴──────────┴──────────┴────────────────┤
│ TensorRT Engine + Plugins │
└─────────────────────────────────────────────────┘
Why This Project
Understanding production LLM serving systems requires diving into complex codebases. This project distills the essential concepts into readable, well-documented code that serves as a learning resource for anyone interested in LLM infrastructure.