Abstract
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, which poses a barrier for real-time applications in fields such as healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory usage on edge devices such as the Jetson Nano, showing that it is both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2025 |
| Subtitle of host publication | Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track |
| Place of Publication | Kerrville, TX |
| Publisher | Association for Computational Linguistics |
| Pages | 1035-1043 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798891763333 |
| Publication status | Published - 2025 |
| Event | The 2025 Conference on Empirical Methods in Natural Language Processing - Suzhou, China; Duration: 4 Nov 2025 → 9 Nov 2025 |
Conference
| Conference | The 2025 Conference on Empirical Methods in Natural Language Processing |
|---|---|
| Country/Territory | China |
| City | Suzhou |
| Period | 4/11/25 → 9/11/25 |
Fingerprint
Dive into the research topics of 'LLMs on a budget? Say HOLA'. Together they form a unique fingerprint.