- Chair of Compiler Construction
Hamid Farzaneh

Phone: +49 (0)351 463 43729
Fax: +49 (0)351 463 39995
Visitor's Address: Helmholtzstrasse 18, 3rd floor, BAR III55, 01069 Dresden
Hamid Farzaneh received his bachelor's degree in Computer Engineering from Shiraz University in August 2019, and his master's degree in Computer Systems and Architecture from Shahid Beheshti University in November 2021.
In August 2022, he joined the chair as a research assistant. He works on high-level compiler frameworks (like MLIR) and on optimizing data and computation mapping onto highly heterogeneous systems with mainstream CPUs, FPGAs, SRAM, DRAM, and emerging NVMs and accelerators.
The volume of data processed by modern applications has skyrocketed in recent years and demands significantly higher off-chip memory bandwidth. However, increasing off-chip bandwidth is becoming prohibitively expensive and is tightly constrained by the chip package and system model. To overcome the memory, capacity, and power walls, computer architects are moving to non-von-Neumann system models such as near-memory and in-memory computing. However, the programmability of these systems has received comparatively little attention. Using the power of compilers, I tackle issues in performance, energy efficiency, and hardware/software cooperation for these systems.
In that regard, my current main topics are:
- Working on high-level compiler frameworks (like MLIR) and optimizing data and computation mapping onto heterogeneous systems
- Developing models for managing workloads in heterogeneous systems
Possible student topics include:
- Front-ends for the MLIR Computing-in-Memory (CIM) Compiler
End-to-end compilation flows for CIM-capable systems exist, but interfaces to high-level languages are missing or limited. The goal of this project is to design and implement front-ends that lower high-level languages and descriptions to the CIM compilers.
- Heterogeneous Systems: Mapping and Optimizations
Also, if you have a related topic in mind, please feel free to reach out.
2025
- Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clément Fournier, Lorenzo Chelini, Jeronimo Castrillon, "CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms" (to appear), Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Association for Computing Machinery, Mar 2025. [Bibtex & Downloads]
Bibtex
@InProceedings{khan_asplos25,
author = {Khan, Asif Ali and Farzaneh, Hamid and Friebel, Karl F. A. and Fournier, Clément and Chelini, Lorenzo and Castrillon, Jeronimo},
booktitle = {Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25)},
title = {CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms},
location = {Rotterdam, The Netherlands},
publisher = {Association for Computing Machinery},
series = {ASPLOS '25},
month = mar,
year = {2025},
}

Downloads
No Downloads available for this publication
2024
- Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi] [Bibtex & Downloads]
Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10× and 4.6×, respectively.
Bibtex
@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}

Downloads
2406_Farzaneh_DAC [PDF]
- Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]
Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.
Bibtex
@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}

Downloads
2405_Farzaneh_ASPLOS [PDF]
- Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]
Bibtex
@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}

Downloads
2403_Niemier_DATE [PDF]
- Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]
Bibtex
@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}

Downloads
No Downloads available for this publication
2023
- Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi] [Bibtex & Downloads]
Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems, presenting relevant challenges, especially for machine learning applications.
Bibtex
@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}

Downloads
2309_Henkel_CASES [PDF]
- João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023. [Bibtex & Downloads]
Bibtex
@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}

Downloads
2307_deLima_iMACAW [PDF]