
LLMs on a budget? Say HOLA

Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands—posing a barrier for real-time applications in industries like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano—proving both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase
Original language: English
Title of host publication: EMNLP 2025
Subtitle of host publication: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Place of Publication: Kerrville, TX
Publisher: Association for Computational Linguistics
Pages: 1035-1043
Number of pages: 9
ISBN (Electronic): 9798891763333
Publication status: Published - 2025
Event: The 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025

Conference

Conference: The 2025 Conference on Empirical Methods in Natural Language Processing
Country/Territory: China
City: Suzhou
Period: 4/11/25 to 9/11/25
