System Software Engineer, LLM Inference
Position Overview
System Software Engineers in this role work at the intersection of LLM inference optimization and novel hardware bring-up, co-designing software abstractions with the hardware architecture team. You will extend leading open-source inference engines with CXL-aware memory management and build an open software layer that lets any host server use a CXL-attached KV-cache accelerator, along with cryptographic acceleration for confidential LLM inference on sensitive enterprise workloads.
Key Responsibilities
▸ Extend advanced attention mechanisms in leading inference engines for CXL-based block-level KV-cache offloading, enabling seamless hot/cold tiering between local high-bandwidth memory and CXL-attached DDR5 pools on the target hardware platform (a minimal tiering/eviction sketch follows this list).
▸ Design and implement the Open KV Connector (OKC) protocol stack, including host‑side drivers and device‑side firmware, so that inference engines can treat the platform as a first‑class CXL memory expander.
▸ Benchmark and optimize TTFT (time-to-first-token), throughput, and memory utilization on RISC-V / CXL hardware, and build automated performance regression suites (a TTFT measurement sketch also follows this list).
▸ Implement prefix caching, KV quantization (FP8/INT4), and speculative KV eviction policies tuned for CXL latency characteristics.
▸ Integrate fully homomorphic encryption (FHE) and trusted execution environment (TEE) based cryptographic acceleration for confidential inference workloads, including prototypes of homomorphic attention computation using modern open-source FHE frameworks.
▸ Collaborate with hardware architects on near-data processing (NDP) programming models and define software APIs for near-memory compute offload.
▸ Contribute upstream to major LLM inference and RISC‑V software ecosystem projects to establish the company as an open‑source thought leader.
▸ Support POC deployments with large cloud partners, diagnosing and resolving performance bottlenecks in production‑like inference environments.
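For flavor on the tiering and eviction bullets above, the sketch below models a two-tier KV-cache in Python: hot blocks stay in a local HBM pool, LRU victims are demoted to a CXL-attached pool, and cold hits are promoted back on access. This is a minimal illustration under assumed names (TieredKvCache and its methods are hypothetical), not code from vLLM, SGLang, or the OKC stack.

```python
from collections import OrderedDict

class TieredKvCache:
    """Toy hot/cold KV-block tiering: recently used blocks stay in a local HBM pool,
    LRU victims demote to a CXL-attached pool, and cold hits promote back on access."""

    def __init__(self, hbm_capacity: int, cxl_capacity: int):
        self.hbm_capacity = hbm_capacity
        self.cxl_capacity = cxl_capacity
        self.hbm = OrderedDict()   # block_id -> payload, ordered least- to most-recently used
        self.cxl = {}              # block_id -> payload (cold tier)

    def write_block(self, block_id: int, payload: bytes) -> None:
        if block_id not in self.hbm and len(self.hbm) >= self.hbm_capacity:
            victim, data = self.hbm.popitem(last=False)      # evict least-recently-used hot block
            if len(self.cxl) >= self.cxl_capacity:
                raise MemoryError("CXL pool exhausted")      # a real engine would spill or recompute
            self.cxl[victim] = data                          # demote to the CXL tier
        self.hbm[block_id] = payload
        self.hbm.move_to_end(block_id)                       # mark as most recently used

    def read_block(self, block_id: int) -> bytes:
        if block_id in self.cxl:                             # cold hit: promote back into HBM
            self.write_block(block_id, self.cxl.pop(block_id))
        self.hbm.move_to_end(block_id)                       # raises KeyError if block is unknown
        return self.hbm[block_id]
```

A production connector would also need asynchronous DMA between tiers, pinned mappings, and per-request block ownership, but the promote/demote flow above is the core of the policy.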
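For the benchmarking bullet, the following sketch measures TTFT against a streaming, OpenAI-compatible completions endpoint of the kind engines such as vLLM expose; the URL, model name, and prompt are placeholders, not a specific deployment.

```python
import time
import requests

def measure_ttft(base_url: str, model: str, prompt: str, max_tokens: int = 128) -> float:
    """Return seconds from sending the request until the first streamed chunk arrives."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,                        # ask the server to stream tokens as they decode
    }
    start = time.perf_counter()
    with requests.post(f"{base_url}/v1/completions", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                           # treat the first non-empty SSE line as the first token
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")

if __name__ == "__main__":
    # Placeholder endpoint and model name; point at a real OpenAI-compatible server to use this.
    ttft = measure_ttft("http://localhost:8000", "example-model", "Explain CXL in one sentence.")
    print(f"TTFT: {ttft * 1000:.1f} ms")
```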
Required Skills & Experience
▸ 5+ years in systems software with a focus on LLM inference optimization; direct experience with vLLM, TensorRT‑LLM, SGLang, or comparable inference engines is required.
▸ Expert‑level C++ and/or Rust skills, plus strong Python for benchmarking, tooling, and experimentation.
▸ Deep understanding of KV-cache management algorithms: PagedAttention, prefix caching, chunked prefill, speculative decoding, and grouped-query attention (GQA/MQA); a minimal block-table sketch follows this list.
▸ Practical experience with custom memory allocators, NUMA‑aware programming, and direct device‑memory access patterns.
▸ Familiarity with PCIe/CXL‑style driver development, whether in the Linux kernel or user‑space accelerator frameworks.
▸ Working knowledge of FHE schemes (CKKS, BFV/BGV) and TEE‑based confidential computing (e.g., Intel TDX, AMD SEV‑SNP) as applied to ML inference workloads.
▸ Strong debugging skills, including use of perf, flamegraphs, memory profilers, and cycle‑accurate or detailed simulators.
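To make the KV-cache vocabulary above concrete, here is a toy, hypothetical sketch of a PagedAttention-style block manager: per-sequence logical-to-physical block tables, with prefix caching modeled as shared physical blocks under reference counting. Names such as BlockManager and fork_prefix are illustrative and do not mirror vLLM's internal API.

```python
class BlockManager:
    """Toy PagedAttention-style allocator: per-sequence logical-to-physical block tables,
    with prefix sharing (prefix caching / forking) via reference counts instead of copies."""

    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_list = list(range(num_physical_blocks))
        self.ref_count = {}       # physical block id -> number of sequences referencing it
        self.block_tables = {}    # sequence id -> list of physical block ids

    def allocate(self, seq_id: int, num_tokens: int) -> None:
        num_blocks = -(-num_tokens // self.block_size)        # ceil(num_tokens / block_size)
        if num_blocks > len(self.free_list):
            raise MemoryError("out of KV blocks")             # a real engine would evict or preempt
        table = [self.free_list.pop() for _ in range(num_blocks)]
        for block in table:
            self.ref_count[block] = 1
        self.block_tables[seq_id] = table

    def fork_prefix(self, parent_seq: int, child_seq: int) -> None:
        """Share the parent's blocks with a child sequence instead of copying them."""
        table = list(self.block_tables[parent_seq])
        for block in table:
            self.ref_count[block] += 1
        self.block_tables[child_seq] = table

    def free(self, seq_id: int) -> None:
        for block in self.block_tables.pop(seq_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:                    # last reference gone: recycle block
                del self.ref_count[block]
                self.free_list.append(block)
```

Real engines add copy-on-write when a forked sequence begins appending, and second-tier eviction (for example to a CXL pool) hooks in exactly where the free list runs out.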
Preferred Qualifications
▸ Proven open‑source contributions to vLLM, SGLang, or the broader RISC‑V software ecosystem.
▸ Experience with hardware bring‑up, including work on pre‑silicon RTL simulation or early‑silicon platforms.
▸ Background in distributed inference: disaggregated prefill/decode, pipeline parallelism, and tensor parallelism across CXL‑connected nodes.
▸ Familiarity with competing accelerator architectures, with the ability to identify where OKC and the underlying platform must differentiate to drive enterprise adoption.
Locations
▸ California