TY - GEN
T1 - Breaking boundaries
T2 - Scientific satellite events held in conjunction with the International Conference on Service-Oriented Computing (21st : 2023)
AU - Zakershahrak, Mehrdad
AU - Ghodratnama, Samira
PY - 2024
Y1 - 2024
N2 - The deployment of transformer models on edge devices like smartphones and tablets is pivotal for leveraging machine learning benefits in real-world scenarios. However, it brings forth challenges including hardware compatibility, memory efficiency, energy efficiency, and real-time performance. We introduce a versatile Hardware Abstraction Layer (HAL) to (1) bridge pre-trained transformer models with the target hardware for optimized deployment, and (2) incorporate intermediate representations (IR) as a crucial element. The IR facilitates seamless execution of models across diverse hardware backends, ensuring enhanced privacy, security, and functionality, especially in regions with limited internet connectivity. Our HAL, endowed with configurable parameters, dynamic model optimizations, and a modular design, caters to varied performance objectives, offering a unified layer that eases the deployment of IR while focusing on user-specified performance priorities. The main contribution of this work is the introduction of IR within the HAL framework, pushing the frontier in edge-device machine learning deployments to focus on latency, energy efficiency, or memory usage. Our results exhibit that the proposed HAL, with its IR component, significantly trims down deployment time and boosts inference efficiency, without compromising model accuracy on iPhone devices.
KW - Deep Learning Model Optimization
KW - Hardware Abstraction Layer (HAL)
KW - Performance Optimization
KW - Real-Time Performance Monitoring
KW - ultra-low edge-device inference
UR - http://www.scopus.com/inward/record.url?scp=85189757582&partnerID=8YFLogxK
DO - 10.1007/978-981-97-0989-2_6
M3 - Conference proceeding contribution
AN - SCOPUS:85189757582
SN - 9789819709885
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 62
EP - 71
BT - Service-oriented computing
A2 - Monti, Flavia
A2 - Plebani, Pierluigi
A2 - Moha, Naouel
A2 - Paik, Hye-young
A2 - Barzen, Johanna
A2 - Ramachandran, Gowri
A2 - Bianchini, Devis
A2 - Tamburri, Damian A.
A2 - Mecella, Massimo
PB - Springer Nature
CY - Singapore
Y2 - 28 November 2023 through 1 December 2023
ER -