A Dataflow Compiler and Simulator for Heterogeneous Optical-Digital Architectures
Supervisor
Suitable for
Abstract
Lumai is developing a 3D optical AI accelerator capable of executing matrix–vector operations with significantly higher energy efficiency than conventional digital hardware such as Nvidia GPUs. Instead of relying on digital parallelism, computation is performed through optical dataflow, enabling extremely high throughput at low energy cost.
However, this architecture imposes two critical constraints that current AI models (like Vision Transformers) are not natively designed for:
- Streaming Computation
Data flows continuously through the processor, with limited random access to past activations or memory. Architectures such as State-Space Models (e.g., Mamba) are therefore better aligned than architectures relying on full attention and key–value caches, such as Transformers.
- Extreme Low Precision (Int4) + Analog Noise
To maximize optical efficiency, weights are represented in 4-bit integer format, and computations incur analog noise and non-ideal signal propagation. Many current deep learning models assume FP32/BF16 precision and can degrade significantly under such constraints (a minimal sketch of these effects follows below).
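As a concrete illustration of these constraints, the PyTorch sketch below applies symmetric per-tensor Int4 quantization to a weight matrix and injects Gaussian noise into the matrix–vector product to mimic analog non-idealities. The scale factor, noise level, and layer sizes are illustrative assumptions, not Lumai hardware parameters.

```python
import torch

def quantize_int4(w: torch.Tensor):
    """Symmetric per-tensor Int4 quantization (integer range [-8, 7])."""
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale

def noisy_optical_matvec(w_fp32: torch.Tensor, x: torch.Tensor,
                         noise_std: float = 0.02) -> torch.Tensor:
    """Int4 matrix-vector product with additive Gaussian read-out noise.

    noise_std is a placeholder; a real simulator would calibrate it
    against measured optical signal-to-noise ratios.
    """
    q, scale = quantize_int4(w_fp32)
    y = (q * scale) @ x                      # dequantized Int4 weights
    return y + noise_std * y.abs().mean() * torch.randn_like(y)

# Toy comparison: exact FP32 matvec vs. simulated optical Int4 + noise.
torch.manual_seed(0)
W = torch.randn(256, 256)
x = torch.randn(256)
err = (noisy_optical_matvec(W, x) - W @ x).norm() / (W @ x).norm()
print(f"relative error of Int4 + noise matvec: {err:.3f}")
```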
We are seeking Master's students to explore the algorithmic, software, and hardware co-design challenges associated with this architecture. Given the novelty of running state-space models on optical hardware, these projects have strong potential to lead to publishable research.
Project 2: The Systems Co-Design Path
Title: A Dataflow Compiler and Simulator for Heterogeneous Optical-Digital Architectures
Our accelerator is heterogeneous: it pairs a high-speed Optical Core (for Matrix-Vector Multiplication) with a standard Host CPU or DSP (for non-linearities and control). We need a compiler that automatically partitions a trained PyTorch model to map the right operations to the right processor.
Core Objectives:
- Graph Capture: Develop a tool to trace the computational graph of a VideoMamba model (graph capture and partitioning are sketched together after this list).
- Automated Partitioning: Create a compiler pass that tags operations based on the hardware strengths:
- Optical Core: Static Matrix Multiplications (Int4, High Throughput).
- Host CPU/DSP: Element-wise operations, Activations, and complex State Updates (High Precision, Low Throughput).
- State Memory Management: Design a "State Buffer" strategy to handle the passing of the hidden state between the Optical Core and the Host CPU/DSP across timesteps, minimizing data-transfer penalties (see the second sketch after this list).
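As a starting point for the Graph Capture and Automated Partitioning objectives, the sketch below uses `torch.fx` to trace a small stand-in module (the real target would be a VideoMamba block) and tags each node for the Optical Core or the Host CPU/DSP based on its operation type. The tagging rules and the `OPTICAL`/`HOST` labels are illustrative assumptions, not Lumai's actual compiler interface.

```python
import torch
import torch.nn as nn
import torch.fx as fx

class ToyBlock(nn.Module):
    """Stand-in for a VideoMamba block: two projections and a gating non-linearity."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.in_proj = nn.Linear(d, 2 * d)
        self.out_proj = nn.Linear(2 * d, d)

    def forward(self, x):
        h = self.in_proj(x)                   # static matmul -> optical
        h = torch.nn.functional.silu(h)       # element-wise activation -> host
        return self.out_proj(h)               # static matmul -> optical

# Module types assumed to map well onto the Optical Core (illustrative rule).
OPTICAL_MODULES = (nn.Linear,)

def partition(gm: fx.GraphModule) -> dict:
    """Tag each FX node: static matmul modules -> OPTICAL, everything else -> HOST."""
    assignment = {}
    modules = dict(gm.named_modules())
    for node in gm.graph.nodes:
        is_optical = (node.op == "call_module"
                      and isinstance(modules[node.target], OPTICAL_MODULES))
        assignment[node.name] = "OPTICAL" if is_optical else "HOST"
    return assignment

gm = fx.symbolic_trace(ToyBlock())
for name, device in partition(gm).items():
    print(f"{name:12s} -> {device}")
```

A real pass would also fuse adjacent host-side ops and annotate each optical node with its Int4 quantization parameters, but the rule-based tagging above is the core of the partitioning step.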
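For the State Memory Management objective, a minimal sketch of a hypothetical `StateBuffer` is shown below: it keeps the recurrent hidden state resident on the host, counts the bytes moved per timestep, and would let the compiler compare alternative state-placement strategies. The class and method names are invented for illustration.

```python
import torch

class StateBuffer:
    """Hypothetical buffer holding the SSM hidden state between timesteps.

    Tracks traffic between the Optical Core and the Host CPU/DSP so that
    different placement strategies can be compared by bytes moved.
    """
    def __init__(self, batch: int, d_state: int, dtype=torch.float32):
        self.state = torch.zeros(batch, d_state, dtype=dtype)
        self.bytes_moved = 0

    def to_host(self, optical_output: torch.Tensor) -> torch.Tensor:
        # The Optical Core produced a new contribution; account for the transfer.
        self.bytes_moved += optical_output.numel() * optical_output.element_size()
        return optical_output

    def update(self, decay: torch.Tensor, contribution: torch.Tensor) -> torch.Tensor:
        # Host-side recurrent update: s_t = decay * s_{t-1} + contribution.
        self.state = decay * self.state + self.to_host(contribution)
        return self.state

# One simulated timestep with toy shapes.
buf = StateBuffer(batch=1, d_state=16)
s = buf.update(decay=torch.full((1, 16), 0.9), contribution=torch.randn(1, 16))
print(f"bytes moved this timestep: {buf.bytes_moved}")
```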
Deliverable: A Compiler Prototype and a Performance Visualizer that estimates throughput and energy efficiency (FPS/Watt) for VideoMamba on Lumai hardware compared to Nvidia A100/H100 baselines.
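An analytical first cut of the Performance Visualizer could look like the sketch below: it derives frames per second per watt from per-frame operation counts and per-device throughput/power parameters, ignoring transfer overheads for simplicity. Every numeric value is a placeholder chosen only to exercise the model, not a measured Lumai or Nvidia figure.

```python
from dataclasses import dataclass

@dataclass
class DeviceSpec:
    """Throughput/power parameters for one processor (placeholder values only)."""
    ops_per_second: float   # sustained operations per second
    watts: float            # average power draw

def fps_per_watt(optical_ops: float, host_ops: float,
                 optical: DeviceSpec, host: DeviceSpec) -> float:
    """Simple serial cost model: per-frame time is the sum of both phases."""
    seconds_per_frame = (optical_ops / optical.ops_per_second
                         + host_ops / host.ops_per_second)
    fps = 1.0 / seconds_per_frame
    return fps / (optical.watts + host.watts)

# Illustrative, made-up numbers purely to show the structure of the model.
optical = DeviceSpec(ops_per_second=1e15, watts=50.0)
host = DeviceSpec(ops_per_second=1e12, watts=30.0)
estimate = fps_per_watt(optical_ops=5e9, host_ops=1e8, optical=optical, host=host)
print(f"toy FPS/Watt estimate: {estimate:.2f}")
```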
Why Apply?
- Publication Goal: The intersection of Optical Computing, Mamba/SSMs, and Low-bit Quantization is a "hot topic" in current research. We strongly encourage and will support the submission of this work to top-tier venues such as MLSys, NeurIPS, or ICLR.
- Impact: Your work will directly influence the architecture of a new class of AI hardware.
- Mentorship: Direct collaboration with the Lumai engineering team.