Asif Ali Khan

Dr.-Ing. Asif Ali Khan

E-mail: asif_ali.khan@tu-dresden.de

Phone: +49 (0)351 463 43729

Visitor's Address: Helmholtzstrasse 18, BAR III55

Curriculum Vitae

Asif Ali Khan received his Ph.D. degree from TU Dresden in 2022. He has a background in computer architecture, optimizing compilers, and emerging nonvolatile memory technologies. At the Chair for Compiler Construction, Asif works on systems with heterogeneous memory technologies and their optimization. More specifically, he has developed compilation and simulation tools as well as novel data and instruction placement strategies for racetrack memories. In the context of near-memory and in-memory computing, he has collaborated with others to accelerate real-world applications from the machine learning and bioinformatics domains. Asif has also contributed to developing software abstractions for non-von-Neumann computing paradigms.

Student Thesis Topics

I work at the intersection of optimizing compilers and emerging memory/compute architectures. If you are interested in doing your Bachelor/Master/Diploma thesis in these domains, I can supervise topics similar to the following. If you have a different but related topic in mind, please feel free to reach out and we will talk about it.

Compilation for non-von-Neumann architectures:

Data movement in von-Neumann computing systems can consume as much as 60% of a system's total energy. The novel compute-in-memory (CIM) paradigm completely eliminates data movement by exploiting the physical properties of the memory devices to implement various logic and compute operations in place. Recent research has demonstrated in-memory implementations of all logic operations as well as complex arithmetic such as matrix-matrix multiplication and floating-point operations (see the figure below). Most often, CIM systems employ nonvolatile memory (NVM) technologies such as phase change memory (PCM), resistive RAM (RRAM), magnetic RAM (MRAM), and spintronics-based racetrack memories (RTMs). Each of these NVM technologies has its own strengths and limitations.

In this project, you will develop or extend a compiler that hides or mitigates these limitations and exploits the full potential of these novel architectures. As a starting point, you will use one of our in-house infrastructures targeting such systems.

Requirements: Good knowledge of C/C++
Beneficial: basic compiler knowledge, LLVM, MLIR
Related work: A recent thesis (MA) and a publication
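
To make the compute-in-memory idea above a bit more concrete, here is a minimal sketch in Python. It is a toy model, not one of the in-house infrastructures: the class `ToyCimArray`, its methods, and the `bits_moved` counter are hypothetical names introduced only for illustration. The model counts how many bits cross the memory bus when a bulk bitwise AND is computed on the CPU versus in place inside the array. Real devices have further costs and limitations that this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyCimArray:
    """Toy memory array whose rows are bit-vectors stored as Python ints."""
    row_bits: int
    rows: Dict[int, int] = field(default_factory=dict)
    bits_moved: int = 0  # bits that crossed the (modelled) memory bus

    def write(self, row: int, value: int) -> None:
        # A write from the CPU transfers one full row over the bus.
        self.rows[row] = value & ((1 << self.row_bits) - 1)
        self.bits_moved += self.row_bits

    def read(self, row: int) -> int:
        # A read to the CPU transfers one full row over the bus.
        self.bits_moved += self.row_bits
        return self.rows[row]

    def and_in_place(self, src_a: int, src_b: int, dst: int) -> None:
        # Assumption: the array can combine two rows internally,
        # so no operand or result crosses the bus.
        self.rows[dst] = self.rows[src_a] & self.rows[src_b]


def and_on_cpu(mem: ToyCimArray, src_a: int, src_b: int, dst: int) -> None:
    """Conventional path: fetch both operands, compute, write back."""
    result = mem.read(src_a) & mem.read(src_b)
    mem.write(dst, result)


mem = ToyCimArray(row_bits=8)
mem.rows[0] = 0b11001100  # preload operands without counting traffic
mem.rows[1] = 0b10101010

and_on_cpu(mem, 0, 1, 2)
traffic_cpu = mem.bits_moved       # two reads plus one write

mem.and_in_place(0, 1, 3)
traffic_cim = mem.bits_moved - traffic_cpu  # extra traffic of the in-place op

print(traffic_cpu, traffic_cim)
```

Both paths produce the same result row; only the modelled bus traffic differs. A compiler for such systems would have to decide which operations to map to in-place primitives like `and_in_place`.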

Simulation infrastructure for compute-in-memory systems

To quickly characterize and understand the various tradeoffs in CIM systems (see the description of the project above), it is imperative to have simulation tools that can model them. In this project, you will work on developing a gem5-based simulation infrastructure for CIM systems (similar to the following). The tools will enable the exploration of different NVM technologies at different memory hierarchy levels. Feel free to ping me for more details.

Requirements: Good knowledge of C/C++
Beneficial: basic computer architecture knowledge

Optimizing Data Movement in the Memory Hierarchy:

Data movement in the memory hierarchy is expensive, and memory systems are generally blind to the applications running on them. In this project, you will statically analyze applications to determine which hierarchy level best suits the memory requirements of the input program (existing tools can be used for this) and place its data accordingly. If done correctly, you may guarantee a 100% cache hit rate.

Requirements: Good knowledge of C/C++
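
As a rough illustration of the placement idea above, the following Python sketch picks the smallest hierarchy level whose capacity covers a program's statically known data footprint. It is only a sketch under strong assumptions: the capacities in `HIERARCHY` are hypothetical, the footprint is simply the sum of all array sizes, and effects such as associativity, line granularity, or reuse distance are ignored. The names `Level`, `footprint_bytes`, and `choose_level` are made up for this example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Level(NamedTuple):
    name: str
    capacity_bytes: int


# Hypothetical capacities, ordered from fastest to slowest level.
HIERARCHY: List[Level] = [
    Level("L1", 32 * 1024),
    Level("L2", 1024 * 1024),
    Level("LLC", 32 * 1024 * 1024),
]


def footprint_bytes(arrays: Dict[str, Tuple[int, int]]) -> int:
    """Sum the sizes of all arrays; each entry is (num_elements, element_size)."""
    return sum(count * size for count, size in arrays.values())


def choose_level(arrays: Dict[str, Tuple[int, int]],
                 hierarchy: List[Level] = HIERARCHY) -> str:
    """Return the fastest level that can hold the whole footprint."""
    need = footprint_bytes(arrays)
    for level in hierarchy:
        if need <= level.capacity_bytes:
            return level.name
    return "DRAM"  # nothing on-chip is large enough


# Three 64x64 double-precision matrices, as in C = A * B.
matmul_64 = {"A": (64 * 64, 8), "B": (64 * 64, 8), "C": (64 * 64, 8)}
print(footprint_bytes(matmul_64), choose_level(matmul_64))
```

In a real flow, the array sizes would come from a static analysis of the program rather than a hand-written dictionary, and the placement decision would then be turned into a concrete data layout or allocation.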

If you are interested, do not hesitate to contact me: asif_ali.khan@tu-dresden.de

Publications

2025
João Paulo Cardoso De Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan, "Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs" (to appear), Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC), IEEE, Oct 2025. [Bibtex & Downloads]

Bibtex

@InProceedings{delima_ccmcc25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Marc Dietrich and Jeronimo Castrillon and Asif Ali Khan},
booktitle = {Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC)},
title = {Efficient In-Memory Acceleration of Sparse Block Diagonal {LLM}s},
location = {Dresden, Germany},
publisher = {IEEE},
month = oct,
numpages = {6},
year = {2025},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3852

Anderson Faustino da Silva, Hamid Farzaneh, João Paulo Cardoso De Lima, Asif Ali Khan, Jeronimo Castrillon, "LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems" (to appear), Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Springer-Verlag, Berlin, Heidelberg, Jul 2025. [Bibtex & Downloads]

Abstract
Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.

Bibtex

@InProceedings{dasilva_samos25,
author = {Anderson Faustino da Silva and Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Asif Ali Khan and Jeronimo Castrillon},
booktitle = {Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS)},
date = {2025-07},
title = {{LearnCNM2Predict}: Transfer Learning-based Performance Model for CNM Systems},
location = {Samos, Greece},
organization = {IEEE},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
month = jul,
numpages = {17},
year = {2025},
abstract = {Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3839

João Paulo Cardoso De Lima, Mehran Shoushtari Moghadam, Sercan Aygun, Jeronimo Castrillon, M. Hassan Najafi, Asif Ali Khan, "All-in-memory Stochastic Computing using ReRAM", Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25), Association for Computing Machinery, New York, NY, USA, Jun 2025. [Bibtex & Downloads]

Bibtex

@InProceedings{delima_dac25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Mehran Shoushtari Moghadam and Sercan Aygun and Jeronimo Castrillon and M. Hassan Najafi and Asif Ali Khan},
booktitle = {Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25)},
title = {All-in-memory Stochastic Computing using {ReRAM}},
location = {San Francisco, California},
publisher = {Association for Computing Machinery},
series = {DAC '25},
address = {New York, NY, USA},
month = jun,
numpages = {6},
year = {2025},
}

Downloads

2506_deLima_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3818

Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clement Fournier, Lorenzo Chelini, Jeronimo Castrillon, "CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Volume 4, Association for Computing Machinery, pp. 31–46, Mar 2025. [doi] [Bibtex & Downloads]

Abstract
The rise of data-intensive applications exposed the limitations of conventional processor-centric von-Neumann architectures that struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms achieving orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems is still a serious challenge. Their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), a first end-to-end compilation flow that leverages the hierarchical abstractions to generalize over different CIM and CNM devices and enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level in the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for a real CNM system (UPMEM) and the memristors-based CIM accelerators. We show that Cinnamon, supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.

Bibtex

@InProceedings{khan_asplos25,
author = {Khan, Asif Ali and Farzaneh, Hamid and Friebel, Karl F. A. and Fournier, Clement and Chelini, Lorenzo and Castrillon, Jeronimo},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Volume 4},
title = {CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms},
doi = {10.1145/3622781.3674189},
isbn = {9798400703911},
location = {Rotterdam, The Netherlands},
pages = {31--46},
publisher = {Association for Computing Machinery},
series = {ASPLOS '25},
url = {https://dl.acm.org/doi/pdf/10.1145/3622781.3674189},
abstract = {The rise of data-intensive applications exposed the limitations of conventional processor-centric von-Neumann architectures that struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms achieving orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems is still a serious challenge. Their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), a first end-to-end compilation flow that leverages the hierarchical abstractions to generalize over different CIM and CNM devices and enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level in the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for a real CNM system (UPMEM) and the memristors-based CIM accelerators. We show that Cinnamon, supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.},
month = mar,
numpages = {16},
year = {2025},
}

Downloads

2504_Khan_CINM_ASPLOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3766

Yun-Chih Chen, Tristan Seidl, Nils Hölscher, Christian Hakert, Minh Duy Truong, Jian-Jia Chen, João Paulo C. de Lima, Asif Ali Khan, Jeronimo Castrillon, Ali Nezhadi, Lokesh Siddhu, Hassan Nassar, Mahta Mayahinia, Mehdi Baradaran Tahoori, Jörg Henkel, Nils Wilbert, Stefan Wildermann, Jürgen Teich, "Modeling and Simulating Emerging Memory Technologies: A Tutorial", Feb 2025. [Bibtex & Downloads]

Bibtex

@Article{chen2025_sppsim,
author = {Yun-Chih Chen and Tristan Seidl and Nils Hölscher and Christian Hakert and Minh Duy Truong and Jian-Jia Chen and João Paulo C. de Lima and Asif Ali Khan and Jeronimo Castrillon and Ali Nezhadi and Lokesh Siddhu and Hassan Nassar and Mahta Mayahinia and Mehdi Baradaran Tahoori and Jörg Henkel and Nils Wilbert and Stefan Wildermann and Jürgen Teich},
title = {Modeling and Simulating Emerging Memory Technologies: A Tutorial},
eprint = {2502.10167},
url = {https://arxiv.org/abs/2502.10167},
archiveprefix = {arXiv},
primaryclass = {cs.AR},
year = {2025},
month = feb,
}

Downloads

2502_Chen_SPPSim [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3815

2024
João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-memory High-Radix Counting", arXiv, pp. 1-14, Sep 2024. [Bibtex & Downloads]

Abstract
Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.

Bibtex

@Misc{delima_count2multiply,
author = {Jo{\~a}o Paulo C. de Lima and Benjamin F. Morris III and Asif Ali Khan and Jeronimo Castrillon and Alex K. Jones},
title = {Count2Multiply: Reliable In-memory High-Radix Counting},
pages = {1-14},
publisher = {arXiv},
month=sep,
year={2024},
eprint={2409.10136},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2409.10136},
abstract = {Big data processing has exposed the limits of compute-centric hardware acceleration due to the memory-to-processor bandwidth bottleneck. Consequently, there has been a shift towards memory-centric architectures, leveraging substantial compute parallelism by processing using the memory elements directly. Computing-in-memory (CIM) proposals for both conventional and emerging memory technologies often target massively parallel operations. However, current CIM solutions face significant challenges. For emerging data-intensive applications, such as advanced machine learning techniques and bioinformatics, where matrix multiplication is a key primitive, memristor crossbars suffer from limited write endurance and expensive write operations. In contrast, while DRAM-based solutions have successfully demonstrated multiplication using additions, they remain prohibitively slow. This paper introduces Count2Multiply, a technology-agnostic digital-CIM method for performing integer-binary and integer-integer matrix multiplications using high-radix, massively parallel counting implemented with bitwise logic operations. In addition, Count2Multiply is designed with fault tolerance in mind and leverages traditional scalable row-wise error correction codes, such as Hamming and BCH codes, to protect against the high error rates of existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. We also explore the acceleration potential of racetrack memories due to their shifting properties, which are natural for Count2Multiply, and their high endurance. Compared to the state-of-the-art in-DRAM method, Count2Multiply achieves up to 10x speedup, 3.8x higher GOPS/Watt, and 1.4x higher GOPS/area, while the RTM counterpart offers gains of 10x, 57x, and 3.8x.},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3786

Leonard David Bereholschi, Mikail Yayla, Jian-Jia Chen, Kuan-Hsun Chen, Asif Ali Khan, "Evaluating the Impact of Racetrack Memory Misalignment Faults on BNNs Performance", Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Jun 2024. [Bibtex & Downloads]

Bibtex

@InProceedings{khan_samos25,
author = {Leonard David Bereholschi and Mikail Yayla and Jian-Jia Chen and Kuan-Hsun Chen and Asif Ali Khan},
booktitle = {Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) },
title = {Evaluating the Impact of Racetrack Memory Misalignment Faults on BNNs Performance},
location = {Samos, Greece},
series = {SAMOS XXIV},
month = jun,
year = {2024},
}

Downloads

2407_Bereholschi_SAMOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3780

Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi] [Bibtex & Downloads]

Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10× and 4.6×, respectively.

Bibtex

@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}

Downloads

2406_Farzaneh_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3726

Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]

Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.

Bibtex

@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}

Downloads

2405_Farzaneh_ASPLOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3738

João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024. [Bibtex & Downloads]

Abstract
The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.

Bibtex

@InProceedings{delima_date24,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Luigi Carro and Jeronimo Castrillon},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Full-Stack Optimization for CAM-Only DNN Inference},
location = {Valencia, Spain},
pages = {1-6},
publisher = {IEEE},
series = {DATE'24},
url = {https://ieeexplore.ieee.org/document/10546805},
abstract = {The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von-Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.},
month = mar,
year = {2024},
}

Downloads

2403_deLima_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3701

Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]

Bibtex

@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}

Downloads

2403_Niemier_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3715

Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]

Bibtex

@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3716

Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi] [Bibtex & Downloads]

Bibtex

@Article{khan_ieeecal24,
author = {Asif Ali Khan and Fazal Hameed and Taha Shahroodi and Alex K. Jones and Jeronimo Castrillon},
title = {Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory},
pages = {4pp},
journal = {IEEE Computer Architecture Letters},
month = jan,
publisher = {IEEE},
year = {2024},
doi = {10.1109/LCA.2024.3350701},
url = {https://ieeexplore.ieee.org/document/10409506},
}

Downloads

2401_Khan_IEEECAL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3714

Preston Brazzle, Benjamin F. Morris III, Evan McKinney, Peipei Zhou, Jingtong Hu, Asif Ali Khan, Alex K. Jones, "Towards Error Correction for Computing in Racetrack Memory", 2024. [Bibtex & Downloads]

Bibtex

@misc{cirm-ecc,
title={Towards Error Correction for Computing in Racetrack Memory},
author={Preston Brazzle and Benjamin F. Morris III and Evan McKinney and Peipei Zhou and Jingtong Hu and Asif Ali Khan and Alex K. Jones},
year={2024},
eprint={2407.21661},
archivePrefix={arXiv},
primaryClass={cs.AR},
url={https://arxiv.org/abs/2407.21661},
}

Downloads

2408_Khan_RTMECC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3781


2023
Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi] [Bibtex & Downloads]

Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.

Bibtex

@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}

Downloads

2309_Henkel_CASES [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3654

João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceedings of the 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023. [Bibtex & Downloads]

Bibtex

@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}

Downloads

2307_deLima_iMACAW [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3566

Carlos Escuin, Asif Ali Khan, Pablo Ibáñez-Marín, Teresa Monreal, Jeronimo Castrillon, Víctor Viñals-Yúfera, "Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs", In Proceedings of the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23), IEEE Computer Society, pp. 179–192, Los Alamitos, CA, USA, Mar 2023. [doi] [Bibtex & Downloads]

Abstract
Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks. This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime. Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1% and 1.9% performance, the NVM lifetime can be further increased by 28% and 44%, respectively.

Bibtex

@InProceedings{escuin_hpca23,
author = {Carlos Escuin and Asif Ali Khan and Pablo Ibáñez-Marín and Teresa Monreal and Jeronimo Castrillon and Víctor Viñals-Yúfera},
booktitle = {the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23)},
title = {Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs},
organization = {IEEE},
pages = {179--192},
abstract = {Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks. This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime. Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9\% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1\% and 1.9\% performance, the NVM lifetime can be further increased by 28\% and 44\%, respectively.},
doi = {10.1109/HPCA56546.2023.10070968},
url = {https://doi.ieeecomputersociety.org/10.1109/HPCA56546.2023.10070968},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = mar,
year = {2023},
}

Downloads

2302_Escuin_HPCA [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3427

Asif Ali Khan, Sebastien Ollivier, Fazal Hameed, Jeronimo Castrillon, Alex K. Jones, "DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories", In IEEE Transactions on Computers, IEEE, Mar 2023. [doi] [Bibtex & Downloads]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6%, and 70.8% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2%.

Bibtex

@Article{khan_toc23,
author = {Asif Ali Khan and Sebastien Ollivier and Fazal Hameed and Jeronimo Castrillon and Alex K. Jones},
date = {2023-03},
journal = {IEEE Transactions on Computers},
doi = {10.1109/TC.2023.3257509},
title = {DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6\%, and 70.8\% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2\%.},
month = mar,
numpages = {15},
publisher = {IEEE},
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3524

Karl F. A. Friebel, Asif Ali Khan, Lorenzo Chelini, Jeronimo Castrillon, "Modelling linear algebra kernels as polyhedral volume operations", In Proceedings of the 13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2023. [Bibtex & Downloads]

Bibtex

@InProceedings{friebel_impact23,
author = {Karl F. A. Friebel and Asif Ali Khan and Lorenzo Chelini and Jeronimo Castrillon},
booktitle = {13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Modelling linear algebra kernels as polyhedral volume operations},
location = {Toulouse, France},
url = {https://impact-workshop.org/papers/paper10.pdf},
month = jan,
year = {2023},
}

Downloads

2301_Friebel_IMPACT [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3453


2022
Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi] [Bibtex & Downloads]

Abstract
Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high energy consumption background and refresh energy as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM)–an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68% and 52%, respectively, compared to the state of the art proposed DRAM-based architecture.

Bibtex

@Article{hameed_ieeecal22,
author = {Fazal Hameed and Asif Ali Khan and Sebastien Ollivier and Alex K. Jones and Jeronimo Castrillon},
date = {2022-08},
journal = {IEEE Computer Architecture Letters},
title = {DNA Pre-alignment Filter using Processing Near Racetrack Memory},
abstract = {Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high energy consumption background and refresh energy as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM)--an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68\% and 52\%, respectively, compared to the state of the art proposed DRAM-based architecture.},
month = jul,
numpages = {4},
publisher = {IEEE},
year = {2022},
doi = {10.1109/LCA.2022.3194263},
pages = {1--4},
url = {https://ieeexplore.ieee.org/document/9841612},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3361

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1–14, Jul 2022. [doi] [Bibtex & Downloads]

Bibtex

@Article{hakert_toc22,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
title = {{ROLLED}: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees},
journal = {IEEE Transactions on Computers},
month = jul,
year = {2022},
doi = {10.1109/TC.2022.3197094},
pages = {1--14},
url = {https://ieeexplore.ieee.org/document/9851943},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3365

Carlos Escuin, Asif Ali Khan, Pablo Ibañez, Teresa Monreal, Victor Viñals, Jeronimo Castrillon, "HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches", In Proceedings of DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 53–58, New York, NY, USA, Jun 2022. [doi] [Bibtex & Downloads]

Abstract
Recent years have seen a rising trend in the exploration of non-volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern-day applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, disabling of a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim allows evaluating the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmark suite shows that HyCSim accelerates the simulation time by 24x, compared to the cycle-accurate Gem5 simulator, with high fidelity.

Bibtex

@InProceedings{escuin_rapido22,
author = {Carlos Escuin and Asif Ali Khan and Pablo Iba{\~n}ez and Teresa Monreal and Victor Vi{\~n}als and Jeronimo Castrillon},
booktitle = {DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
date = {2022-06},
title = {HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches},
doi = {10.1145/3522784.3522801},
isbn = {9781450395663},
location = {Budapest, Hungary},
pages = {53--58},
publisher = {Association for Computing Machinery},
series = {DroneSE and RAPIDO '22},
url = {https://doi.org/10.1145/3522784.3522801},
abstract = {Recent years have seen a rising trend in the exploration of non-volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern-day applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies and of disabling a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim allows evaluating the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmark suite shows that HyCSim accelerates the simulation by 24x compared to the cycle-accurate Gem5 simulator, with high fidelity.},
address = {New York, NY, USA},
month = jun,
numpages = {6},
year = {2022},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3327

Asif Ali Khan, "Design and Code Optimization for Systems with Next-generation Racetrack Memories", PhD thesis, TU Dresden, 255 pp., Apr 2022. (3m5 Excellence Award – outstanding dissertation) [Bibtex & Downloads]

Design and Code Optimization for Systems with Next-generation Racetrack Memories

Reference

Asif Ali Khan, "Design and Code Optimization for Systems with Next-generation Racetrack Memories", PhD thesis, TU Dresden, 255 pp., Apr 2022. (3m5 Excellence Award – outstanding dissertation)

Bibtex

@PhdThesis{Khan_PhD,
author = {Khan, Asif Ali},
institution = {TU Dresden},
title = {Design and Code Optimization for Systems with Next-generation Racetrack Memories},
pages = {255 pp.},
month = apr,
year = {2022},
url = {https://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa2-795738},
}

Downloads

2204_Khan_PhD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3353

Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi] [Bibtex & Downloads]

Brain-inspired Cognition in Next Generation Racetrack Memories

Reference

Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi]

Abstract
Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high-dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in a bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors between the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains, as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to a dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.
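The bulk bit-wise operations mentioned in the abstract can be sketched in software as follows. This is a minimal illustration of generic binary HDC, not the in-memory RTM design: binding is modeled as element-wise XOR, bundling as element-wise addition followed by a majority threshold, and similarity as Hamming distance. The vector width `DIM` and all function names are assumptions chosen for the example.

```python
import random

DIM = 2048  # assumed hypervector width; HDC typically uses thousands of bits


def random_hv(rng):
    """Draw a random binary hypervector."""
    return [rng.getrandbits(1) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors with element-wise XOR (self-inverse)."""
    return [x ^ y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundle hypervectors: element-wise addition, then majority threshold.

    Use an odd number of vectors so that no position ends in a tie.
    """
    half = len(vectors) / 2
    return [1 if sum(bits) > half else 0 for bits in zip(*vectors)]


def hamming(a, b):
    """Number of positions in which two hypervectors differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
a, b, c, unrelated = (random_hv(rng) for _ in range(4))
prototype = bundle([a, b, c])
# A bundled prototype stays closer to its members than to an unrelated vector.
print(hamming(prototype, a), hamming(prototype, unrelated))
```

Every operation touches each bit position independently, which is why such workloads map naturally onto hardware that can process many bits in parallel.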

Bibtex

@Article{khan_tecs22,
author = {Asif Ali Khan and Sebastien Ollivier and Stephen Longofono and Gerald Hempel and Jeronimo Castrillon and Alex K. Jones},
title = {Brain-inspired Cognition in Next Generation Racetrack Memories},
abstract = {Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.},
address = {New York, NY, USA},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
month = mar,
numpages = {28},
publisher = {Association for Computing Machinery},
year = {2022},
doi = {10.1145/3524071},
issn = {1539-9087},
url = {https://doi.org/10.1145/3524071},
volume = {21},
number = {6},
articleno = {79},
pages = {79:1--79:28},
}

Downloads

2203_Khan_TECS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3329

2021
Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, vol. 71, no. 9, pp. 2010–2021, Oct 2021. [doi] [Bibtex & Downloads]

Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory

Reference

Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, vol. 71, no. 9, pp. 2010–2021, Oct 2021. [doi]

Bibtex

@Article{multanen_toc21,
author = {Joonas Iisakki Multanen and Kari Hepola and Asif Ali Khan and Jeronimo Castrillon and Pekka J{\"a}{\"a}skel{\"a}inen},
title = {Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory},
doi = {10.1109/TC.2021.3117439},
number = {9},
pages = {2010--2021},
url = {https://ieeexplore.ieee.org/document/9557799},
volume = {71},
abstract = {As performance and energy-efficiency improvements from technology scaling are slowing down, new technologies are being researched in hopes of disrupting results. Domain wall memory (DWM) is an emerging non-volatile technology that promises extreme data density, fast access times and low power consumption. However, DWM access time depends on the memory location distance from access ports, requiring expensive shifting. This causes overheads on performance and energy consumption. In this article, we implement our previously proposed shift-reducing instruction memory placement (SHRIMP) on a RISC-V core in RTL, provide the first thorough evaluation of the control logic required for DWM and SHRIMP and evaluate the effects on system energy and energy-efficiency. SHRIMP reduces the number of shifts by 36\% on average compared to a linear placement in CHStone and Coremark benchmark suites when evaluated on the RISC-V processor system. The reduced shift amount leads to an average reduction of 14\% in cycle counts compared to the linear placement. When compared to an SRAM-based system, although increasing memory usage by 26\%, DWM with SHRIMP allows a 73\% reduction in memory energy and 42\% relative energy delay product. We estimate overall energy reductions of 14\%, 15\% and 19\% in three example embedded systems.},
journal = {IEEE Transactions on Computers},
month = oct,
year = {2021},
}

Downloads

2110_Multanen_TOC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3208

Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674–1686, Aug 2021. [doi] [Bibtex & Downloads]

OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory

Reference

Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674–1686, Aug 2021. [doi]

Bibtex

@Article{khan_tcad21,
author = {Adam Siemieniuk and Lorenzo Chelini and Asif Ali Khan and Jeronimo Castrillon and Andi Drebes and Henk Corporaal and Tobias Grosser and Martin Kong},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
title = {{OCC}: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory},
month = aug,
volume={41},
number={6},
pages={1674--1686},
numpages = {12 pp},
doi = {10.1109/TCAD.2021.3101464},
issn = {1937-4151},
url = {https://ieeexplore.ieee.org/document/9502921},
publisher = {IEEE Press},
year = {2021},
abstract = {Memristive devices promise an alternative approach toward non-Von Neumann architectures, where specific computational tasks are performed within the memory devices. In the machine learning (ML) domain, crossbar arrays of resistive devices have shown great promise for ML inference, as they allow for hardware acceleration of matrix multiplications. But, to enable widespread adoption of these novel architectures, it is critical to have an automatic compilation flow as opposed to relying on a manual mapping of specific kernels on the crossbar arrays. We demonstrate the programmability of memristor-based accelerators using the new compiler design principle of multilevel rewriting, where a hierarchy of abstractions lowers programs level-by-level and performs code transformations at the most suitable abstraction. In particular, we develop a prototype compiler, which progressively lowers a mathematical notation for tensor operations arising in ML workloads, to fixed-function memristor-based hardware blocks.},
}

Downloads

2107_Khan_TCAD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3157

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi] [Bibtex & Downloads]

BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory

Reference

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi]

Bibtex

@InProceedings{khan_dac21,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
booktitle = {Proceedings of the 58th Annual Design Automation Conference (DAC'21)},
title = {{BLO}wing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
doi = {10.1109/DAC18074.2021.9586167},
location = {San Francisco, California},
pages = {1111--1116},
series = {DAC '21},
url = {https://ieeexplore.ieee.org/document/9586167},
publisher = {ACM},
month = jul,
year = {2021},
}

Downloads

2112_Hakert_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2975

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi] [Bibtex & Downloads]

ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi]

Abstract
Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments; however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligners themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferred between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44% and 21-49%, respectively.

Bibtex

@Article{hameed_tetc21,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
journal = {IEEE Transactions on Emerging Topics in Computing (IEEE TETC)},
title = {{ALPHA}: A Novel Algorithm-Hardware Co-design for Accelerating {DNA} Seed Location Filtering},
pages = {12 pp.},
abstract = {Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments, however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligner themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54\%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44\% and 21-49\%, respectively.},
month = jun,
year = {2021},
doi = {10.1109/TETC.2021.3093840},
issn = {2168-6750},
}

Downloads

2107_hameed_TETC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3116

2020
Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi] [Bibtex & Downloads]

Polyhedral Compilation for Racetrack Memories

Reference

Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi]

Abstract
Traditional memory hierarchy designs, primarily based on SRAM and DRAM, are becoming increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent of nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While initial techniques exist to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85% fewer shifts (average 41%), improving both performance and energy consumption by an average of 17.9% and 39.8%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.
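The linear-latency behavior described above can be made concrete with a small cost model. The sketch below assumes a single track with a single access port, so that each access costs as many shifts as the distance between the current and the next address; it then compares two traversal orders of an array stored in row-major order. The model and all names are illustrative assumptions, not the cost model or code generator used in the paper.

```python
def shift_count(addresses, start=0):
    """Total shifts on one track with one port (assumed model):
    each access moves the port by |next - current| positions."""
    shifts, position = 0, start
    for address in addresses:
        shifts += abs(address - position)
        position = address
    return shifts


def row_major(rows, cols):
    """Addresses visited when the inner loop runs over columns."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def column_major(rows, cols):
    """Addresses visited when the inner loop runs over rows."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


# Same 4x4 array, same work, different loop order.
print(shift_count(row_major(4, 4)))     # 15 shifts: consecutive locations
print(shift_count(column_major(4, 4)))  # 81 shifts: strided accesses
```

Interchanging the loops changes nothing about the computed result but reduces the shift count by more than 5x in this toy case, which is the kind of effect a loop-transforming compiler can exploit.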

Bibtex

@Article{khan_cases20,
author = {Asif Ali Khan and Hauke Mewes and Tobias Grosser and Torsten Hoefler and Jeronimo Castrillon},
title = {Polyhedral Compilation for Racetrack Memories},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20)},
year = {2020},
series = {CASES '20},
month = oct,
doi = {10.1109/TCAD.2020.3012266},
url = {https://ieeexplore.ieee.org/document/9216560},
volume={39},
number={11},
pages={3968--3980},
issn = {1937-4151},
abstract = {Traditional memory hierarchy designs, primarily based on SRAM and DRAM, become increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent to nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While there are first techniques to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85\% fewer shifts (average 41\%), improving both performance and energy consumption by an average of 17.9\% and 39.8\%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.},
booktitle = {Proceedings of the 2020 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
location = {Virtual conference},
numpages = {12},
publisher = {IEEE Press},
}

Downloads

2009_Khan_CASES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2833

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914–1927, Oct 2020. [doi] [Bibtex & Downloads]

Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914–1927, Oct 2020. [doi]

Abstract
In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving a high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7% compared to a state-of-the-art direct-mapped LLC design and by 7.2% compared to an existing associative LLC design.

Bibtex

@Article{hameed_tc20,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling},
journal = {IEEE Transactions on Computers},
year = {2020},
month = oct,
abstract = {In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving a high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7\% compared to a state-of-the-art direct-mapped LLC design and by 7.2\% compared to an existing associative LLC design.},
doi = {10.1109/TC.2020.3029615},
url = {https://ieeexplore.ieee.org/document/9220805},
issn = {0018-9340},
numpages = {14},
volume={70},
number={11},
pages={1914--1927},
}

Downloads

2010_Hameed_TC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2880

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi] [Bibtex & Downloads]

Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories

Reference

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry, fluid dynamics, and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73%, respectively, compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.

Bibtex

@Article{khan_tecs20,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2020},
month = sep,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {6},
issn = {1539-9087},
url = {https://doi.org/10.1145/3396235},
doi = {10.1145/3396235},
articleno = {44},
numpages = {26},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32\% and 73\% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80\%.},
}

Downloads

2009_Khan_TECS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2649

Generalized Data Placement Strategies for Racetrack Memories

Reference

Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.

Bibtex

@InProceedings{khan_date20,
author = {Asif Ali Khan and Andr{\'e}s Goens and Fazal Hameed and Jeronimo Castrillon},
title = {Generalized Data Placement Strategies for Racetrack Memories},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
publisher = {IEEE},
location = {Grenoble, France},
month = mar,
isbn = {978-3-9819263-4-7},
pages = {1502--1507},
doi = {10.23919/DATE48585.2020.9116245},
url = {https://ieeexplore.ieee.org/document/9116245},

abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the amount of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46\% and 55\% respectively compared to the state-of-the-art.},
}

Downloads

2003_Khan_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2547

Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade

Reference

Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi]

Bibtex

@Article{khan_pieee20,
author = {Robin Bl{\"a}sing and Asif Ali Khan and Panagiotis Ch. Filippou and Chirag Garg and Fazal Hameed and Jeronimo Castrillon and Stuart S. P. Parkin},
title = {Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade},
journal = {Proceedings of the IEEE},
year = {2020},
month = mar,
volume = {108},
number = {8},
pages = {1303--1321},
doi = {10.1109/JPROC.2020.2975719},
url = {https://ieeexplore.ieee.org/document/9045991},
}

Downloads

2003_Khan_JPROC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2599


2019
ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0

Reference

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi]

Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.

Bibtex

@Article{khan_taco19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart S. P. Parkin and Jeronimo Castrillon},
title = {ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2019},
volume = {16},
number = {4},
month = dec,
year = {2019},
issn = {1544-3566},
pages = {56:1--56:23},
articleno = {56},
numpages = {23},
url = {http://doi.acm.org/10.1145/3372489},
doi = {10.1145/3372489},
acmid = {3372489},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5\%, outperforming the state of the art by up to 16.1\%.},
}

Downloads

1912_Khan_TACO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2289

SHRIMP: Efficient Instruction Delivery with Domain Wall Memory

Reference

Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi]

Bibtex

@InProceedings{multanen_islped19,
author = {Joonas Multanen and Asif Ali Khan and Pekka J{\"a}{\"a}skel{\"a}inen and Fazal Hameed and Jeronimo Castrillon},
title = {{SHRIMP}: Efficient Instruction Delivery with Domain Wall Memory},
booktitle = {Proceedings of the International Symposium on Low Power Electronics and Design},
year = {2019},
month = jul,
series = {ISLPED '19},
location = {Lausanne, Switzerland},
pages = {6pp},
numpages = {6},
publisher = {ACM},
address = {New York, NY, USA},
doi = {10.1109/ISLPED.2019.8824954},
url = {https://ieeexplore.ieee.org/document/8824954},
}

Downloads

1907_Multanen_ISLPED [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2452

Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads

Reference

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.

Bibtex

@InProceedings{kahn_lctes19,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads},
booktitle = {Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES)},
series = {LCTES 2019},
pages = {5--18},
numpages = {14},
isbn = {978-1-4503-6724-0},
doi = {10.1145/3316482.3326351},
url = {http://doi.acm.org/10.1145/3316482.3326351},
acmid = {3326351},
year = {2019},
month = jun,
location = {Phoenix, AZ, USA},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24\% and 74\% respectively compared to an iso-capacity SRAM.},
}

Downloads

1906_Khan_LCTES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2419

RTSim: A Cycle-accurate Simulator for Racetrack Memories

Reference

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019. [doi]

Bibtex

@Article{khan_ieeecal19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart Parkin and Jeronimo Castrillon},
title = {{RTS}im: A Cycle-accurate Simulator for Racetrack Memories},
journal = {IEEE Computer Architecture Letters},
year = {2019},
volume = {18},
number = {1},
pages = {43--46},
month = jan,
doi = {10.1109/LCA.2019.2899306},
issn = {1556-6056},
publisher = {IEEE},
url = {https://ieeexplore.ieee.org/document/8642352}
}

Downloads

1902_khan_IEEECAL [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2288


2018
Performance and Energy Efficient Design of STT-RAM Last-Level-Cache

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018. [doi]

Abstract
Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75% and improve the system performance by 6.5%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82% and improves system performance by 6.8%.

Bibtex

@Article{hameed_tvlsi18,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Performance and Energy Efficient Design of STT-RAM Last-Level-Cache},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2018},
volume = {26},
number = {6},
pages = {1059--1072},
month = jun,
abstract = {Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75\% and improve the system performance by 6.5\%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82\% and improves system performance by 6.8\%.},
doi = {10.1109/TVLSI.2018.2804938},
issn = {1063-8210},
numpages = {14},
url = {http://ieeexplore.ieee.org/document/8307465/}
}

Downloads

1803_Hameed_TVLSI [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2099

NVMain Extension for Multi-Level Cache Systems

Reference

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]

Bibtex

@InProceedings{khan_rapido18,
author = {Asif Ali Khan and Fazal Hameed and Jeronimo Castrillon},
title = {NVMain Extension for Multi-Level Cache Systems},
booktitle = {Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {RAPIDO '18},
year = {2018},
month = jan,
pages = {7:1--7:6},
articleno = {7},
numpages = {6},
url = {http://doi.acm.org/10.1145/3180665.3180672},
doi = {10.1145/3180665.3180672},
acmid = {3180672},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
isbn = {978-1-4503-6417-1},
}

Downloads

1801_Khan_RAPIDO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2098

