Prof. Jeronimo Castrillon

Prof. Dr.-Ing. Jeronimo Castrillon

E-mail: jeronimo.castrillon@tu-dresden.de

Phone: +49 (0)351 463 42716

Visitor's Address:
Chair for Compiler Construction
Helmholtzstrasse 18
3rd floor, Room BAR III68
01069 Dresden
Germany

Curriculum Vitae

Jerónimo Castrillón received the Electronics Engineering degree with honors from the Pontificia Bolivariana University in Colombia in 2004, the master's degree from the ALaRI Institute, University of Lugano, Switzerland, in 2006, and the Ph.D. degree (Dr.-Ing.) in Electrical Engineering and Information Technology with honors from RWTH Aachen University in Germany in 2013. From early 2009 to April 2013, Dr. Castrillón was the chief engineer of the Chair for Software for Systems on Silicon at RWTH Aachen University, where he had been a member of the research staff since late 2006. From April 2013 to April 2014, Dr. Castrillón was senior scientific staff at the same institution.

In June 2014, Dr. Castrillón joined the Department of Computer Science of TU Dresden as professor for compiler construction in the context of the German excellence cluster “Center for Advancing Electronics Dresden” (cfaed). His research interests lie in methodologies, languages, tools and algorithms for programming complex computing systems. He is also affiliated with the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (ScaDS.AI), the 6G-life Hub, and the Barkhausen Institut.

Prof. Castrillón has several international publications and has served as program chair and technical program committee member for international conferences and workshops (e.g., LCTES, CASES, DAC, DATE, CODES-ISSS, CGO, Computing Frontiers, FPL, ICCS and MCSoC) as well as a reviewer for ACM and IEEE journals, among others. Prof. Castrillón is the recipient of numerous awards, including the Swiss Excellence Government Scholarship in 2005 and the Intel Doctoral Award in 2012. In 2014 he co-founded Silexica GmbH, a company that provides programming tools for embedded multicore architectures, now part of AMD/Xilinx.

Publications

2026
Tauseef Ahmed, Tao Sun, Jeronimo Castrillon, Kanishkan Vadivel, Guangzhi Tang, "Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition" (to appear), In Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN), Jun 2026.

Bibtex

@InProceedings{ahmed_ijcnn26,
author = {Tauseef Ahmed and Tao Sun and Jeronimo Castrillon and Kanishkan Vadivel and Guangzhi Tang},
booktitle = {2026 International Joint Conference on Neural Networks (IJCNN)},
title = {Spiking and Event-driven Neuromorphic Mamba Models for Efficient Speech Recognition},
location = {Maastricht, The Netherlands},
organization = {IEEE},
month = jun,
year = {2026},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3883

Jeronimo Castrillon, Michael Niemier, "Design Space Explorations of Novel Hardware Solutions: from Atoms to Applications", In Workshop on EcoCompute: Building Sustainable Scientific Computing Practices Through Academia-Industry Collaboration (CECAM) (invited talk), May 2026.

Abstract
This talk will showcase efforts to evaluate technology-driven computer architectures in application-level contexts. A particular emphasis will be made on (a) quantifying the potential efficacy of emerging (random access as well as intelligent) memory solutions for at scale workloads and (b) memory technologies that (at present) are of most interest to major semiconductor companies and are thereby more likely to become commercially available for end users (e.g., Intel, Samsung, TSMC per targeted research efforts in the Semiconductor Research Corporation’s JUMP 2.0 SUPREME research center). We will describe both bottom-up and top-down evaluation methodologies that may be employed to quantify figures of merit such as read and write latency, write energy, endurance, etc. – all of which will either directly or indirectly inform computational sustainability. We will also highlight recent work from our collaborators in the compilers community that employs multilevel intermediate representations (MLIR) to preserve program structure to generate ideal HLL to accelerator mappings (e.g., mapping Euclidean distance calculations to technology-enabled associative memory hardware that can perform said computations in-situ, thereby eliminating data transfer overhead). While recent work does aim to minimize write events owing to energy and endurance concerns of non-volatile memories, our goal is to specifically quantify array-level figures of merit for all emerging memories, thereby allowing the compiler to identify/down select to viable hardware solutions for a given workload. We will highlight recent efforts in this regard, as well as how it may create a feedback loop to materials science and device research to develop technology with maximal application-level impact.

Bibtex

@Misc{castrillon_cecam26,
author = {Jeronimo Castrillon and Michael Niemier},
date = {2026-05},
title = {Design Space Explorations of Novel Hardware Solutions: from Atoms to Applications},
howpublished = {Workshop on EcoCompute: Building Sustainable Scientific Computing Practices Through Academia-Industry Collaboration (CECAM) (invited talk)},
location = {Lugano, Switzerland},
abstract = {This talk will showcase efforts to evaluate technology-driven computer architectures in application-level contexts. A particular emphasis will be made on (a) quantifying the potential efficacy of emerging (random access as well as intelligent) memory solutions for at scale workloads and (b) memory technologies that (at present) are of most interest to major semiconductor companies and are thereby more likely to become commercially available for end users (e.g., Intel, Samsung, TSMC per targeted research efforts in the Semiconductor Research Corporation’s JUMP 2.0 SUPREME research center). We will describe both bottom-up and top-down evaluation methodologies that may be employed to quantify figures of merit such as read and write latency, write energy, endurance, etc. – all of which will either directly or indirectly inform computational sustainability. We will also highlight recent work from our collaborators in the compilers community that employs multilevel intermediate representations (MLIR) to preserve program structure to generate ideal HLL to accelerator mappings (e.g., mapping Euclidean distance calculations to technology-enabled associative memory hardware that can perform said computations in-situ, thereby eliminating data transfer overhead). While recent work does aim to minimize write events owing to energy and endurance concerns of non-volatile memories, our goal is to specifically quantify array-level figures of merit for all emerging memories, thereby allowing the compiler to identify/down select to viable hardware solutions for a given workload. We will highlight recent efforts in this regard, as well as how it may create a feedback loop to materials science and device research to develop technology with maximal application-level impact.},
month = may,
year = {2026},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3887

Tassilo Tanneberger, Erling R. Jellum, Edward A. Lee, Jeronimo Castrillon, "Macros as Abstractions: Simplifying Code Generation for Lingua Franca", Proceedings of the Workshop on Reactive Cyber-Physical Systems: Design, Simulation, and Coordination (ReCPS), Apr 2026.

Bibtex

@InProceedings{tanneberger_recps26,
author = {Tassilo Tanneberger and Erling R. Jellum and Edward A. Lee and Jeronimo Castrillon},
booktitle = {Proceedings of the Workshop on Reactive Cyber-Physical Systems: Design, Simulation, and Coordination (ReCPS)},
title = {Macros as Abstractions: Simplifying Code Generation for Lingua Franca},
location = {Verona, Italy},
month = apr,
year = {2026},
}

Downloads

2604_Tanneberger_ReCPS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3884

Jeronimo Castrillon, "Top-Down Analysis via Integrated Compilers Frameworks", In Workshop on Rapid Design Space Explorations of Novel Hardware Solutions: from Atoms to Applications, co-located with the Design, Automation and Test in Europe Conference (DATE) (invited talk), Apr 2026.

Abstract
Fuelled by exciting advances in materials and devices, in-memory computing (IMC) architectures represent a promising avenue to transcend the energy-delay bottlenecks of classical Von Neumann systems. While manual designs have demonstrated orders of magnitude improvements in efficiency, the lack of unified software stacks limits their general adoption and design-space exploration. In this talk, we discuss how high-level compiler frameworks can become enablers for top-down design and for the exploration of the vast parameter space of IMC architectures. We report on current efforts to build an integrated compiler framework based on the MLIR infrastructure. By leveraging a multi-level dialect approach, our framework abstracts away individual technology constraints to foster cross-layer re-use. Concretely, we present optimizing flows tailored for diverse IMC primitives—including cross-bars, content-addressable memories (CAMs), and bulk-wise logic operations. We argue that such integrated automation is key to navigating the increasingly heterogeneous landscape of emerging accelerators and bringing their benefits to a broader range of applications.

Bibtex

@Misc{castrillon_date2026-w04,
author = {Castrillon, Jeronimo},
howpublished = {Workshop on Rapid Design Space Explorations of Novel Hardware Solutions: from Atoms to Applications, co-located with the Design, Automation and Test in Europe Conference (DATE) (invited talk)},
month = apr,
title = {Top-Down Analysis via Integrated Compilers Frameworks},
year = {2026},
abstract = {Fuelled by exciting advances in materials and devices, in-memory computing (IMC) architectures represent a promising avenue to transcend the energy-delay bottlenecks of classical Von Neumann systems. While manual designs have demonstrated orders of magnitude improvements in efficiency, the lack of unified software stacks limits their general adoption and design-space exploration. In this talk, we discuss how high-level compiler frameworks can become enablers for top-down design and for the exploration of the vast parameter space of IMC architectures. We report on current efforts to build an integrated compiler framework based on the MLIR infrastructure. By leveraging a multi-level dialect approach, our framework abstracts away individual technology constraints to foster cross-layer re-use. Concretely, we present optimizing flows tailored for diverse IMC primitives—including cross-bars, content-addressable memories (CAMs), and bulk-wise logic operations. We argue that such integrated automation is key to navigating the increasingly heterogeneous landscape of emerging accelerators and bringing their benefits to a broader range of applications.},
date = {2026-04},
location = {Verona, Italy},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3886

Jeronimo Castrillon, "Energy Efficiency for edge/cloud optimization", In EMEC Workshop on Energy and Material Efficiency in Cloud-Edge continuum, co-located with the Design, Automation and Test in Europe Conference (DATE) (invited talk), Apr 2026.

Abstract
This talk presents the MYRTUS Design and Programming Environment (DPE), an interoperable framework for modeling and deploying applications across the computing continuum. The environment features an MLIR-based infrastructure to support diverse hardware targets while allowing the seamless import of third-party code. We report on current support for FPGA and CGRA targets. A key aspect of the DPE is to support adaptability in the computing continuum by exploring operating points at design time, enabling a runtime engine to dynamically adapt configurations for energy efficiency. Additionally, the talk covers reactive programming, including enclaves and quasi-static scheduling to bridge the gap between high-level models and low-level executable schedules, ensuring efficient and deterministic execution in the context of cyber-physical systems.

Bibtex

@Misc{castrillon_date2026-w05,
author = {Castrillon, Jeronimo},
howpublished = {EMEC Workshop on Energy and Material Efficiency in Cloud-Edge continuum, co-located with the Design, Automation and Test in Europe Conference (DATE) (invited talk)},
month = apr,
title = {Energy Efficiency for edge/cloud optimization},
year = {2026},
abstract = {This talk presents the MYRTUS Design and Programming Environment (DPE), an interoperable framework for modeling and deploying applications across the computing continuum. The environment features an MLIR-based infrastructure to support diverse hardware targets while allowing the seamless import of third-party code. We report on current support for FPGA and CGRA targets. A key aspect of the DPE is to support adaptability in the computing continuum by exploring operating points at design time, enabling a runtime engine to dynamically adapt configurations for energy efficiency. Additionally, the talk covers reactive programming, including enclaves and quasi-static scheduling to bridge the gap between high-level models and low-level executable schedules, ensuring efficient and deterministic execution in the context of cyber-physical systems. },
date = {2026-04},
location = {Verona, Italy},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3888

Mohamed Amine Khelassi, Felix Suchert, Abderaouf Amalou, Benjamin Lesage, Anika Christmann, Robin Hapka, Jeronimo Castrillon, Mihail Asavoae, Mathieu Jan, Claire Pagetti, Selma Saidi, "Interferences within a certifiable design methodology for high-performance multi-core platforms", In Proceedings of the 14th European Congress Embedded Real Time Systems (ERTS 2026), Feb 2026.

Bibtex

@InProceedings{khelassi_erts26,
author = {Mohamed Amine Khelassi and Felix Suchert and Abderaouf Amalou and Benjamin Lesage and Anika Christmann and Robin Hapka and Jeronimo Castrillon and Mihail Asavoae and Mathieu Jan and Claire Pagetti and Selma Saidi},
booktitle = {14th European Congress Embedded Real Time Systems (ERTS 2026)},
title = {Interferences within a certifiable design methodology for high-performance multi-core platforms},
location = {Toulouse, France},
url = {https://cea.hal.science/cea-05504739/file/InterMCore.pdf},
month = feb,
year = {2026},
}

Downloads

2602_Khelassi_ERTS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3871

João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones, "Count2Multiply: Reliable In-Memory High-Radix Counting", Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA 2026), IEEE Computer Society, pp. 1–15, Los Alamitos, CA, USA, Feb 2026.

Bibtex

@InProceedings{delima_hpca26,
author = {Jo{\~a}o Paulo C. de Lima and Benjamin F. Morris III and Asif Ali Khan and Jeronimo Castrillon and Alex K. Jones},
booktitle = {Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA 2026)},
title = {Count2Multiply: Reliable In-Memory High-Radix Counting},
doi = {10.1109/HPCA68181.2026.11408436},
location = {Sydney, Australia},
organization = {IEEE},
pages = {1--15},
publisher = {IEEE Computer Society},
url = {https://ieeexplore.ieee.org/document/11408436},
address = {Los Alamitos, CA, USA},
month = feb,
year = {2026},
}

Downloads

2602_Lima_HPCA [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3861

Anderson Faustino da Silva, Sérgio Queiroz de Medeiros, Marcelo Borges Nogueira, Jeronimo Castrillon, Fernando Magno Quintão Pereira, "On the Precision of Dynamic Program Fingerprints based on Performance Counters", Proceedings of the 24th IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2026), Association for Computing Machinery, pp. 507–519, New York, NY, USA, Feb 2026.

Bibtex

@InProceedings{dasilva_cgo26,
author = {Anderson Faustino da Silva and S\'{e}rgio Queiroz de Medeiros and Marcelo Borges Nogueira and Jeronimo Castrillon and Fernando Magno Quint\~{a}o Pereira},
booktitle = {Proceedings of the 24th IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2026)},
title = {On the Precision of Dynamic Program Fingerprints based on Performance Counters},
doi = {10.1109/CGO68049.2026.11395222},
location = {Sydney, Australia},
pages = {507--519},
publisher = {Association for Computing Machinery},
series = {CGO '26},
url = {https://ieeexplore.ieee.org/abstract/document/11395222},
address = {New York, NY, USA},
month = feb,
year = {2026},
}

Downloads

2602_daSilva_CGO [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3860

Dimitrios Prousalis, Ioannis Messaris, Khaleelulla K. Nazeer, João Paulo Cardoso de Lima, Ahmet Samil Demirkol, Vasileios Ntinas, Hamid Farzaneh, Alon Ascoli, Jeronimo Castrillon, Ronald Tetzlaff, "6G computing for sensing: universal memcomputing using memristor cellular neural networks", Chapter in 6G-life (eds. Frank H.P. Fitzek, Holger Boche, Wolfgang Kellerer, Patrick Seeling), Academic Press, pp. 353–376, Feb 2026.

Abstract
As 6G networks enable real-time data acquisition from millions of embedded sensors, the challenge of efficiently processing vast multi-modal datasets becomes paramount. This chapter explores how memcomputing, specifically through Memristor Cellular Neural Networks (M-CellNNs), can address these challenges by diverging from conventional compute-centric models. By leveraging volatile and non-volatile memristors, M-CellNNs can achieve high-speed, energy-efficient data processing directly at the sensor level, addressing challenges related to execution time, data privacy, and compatibility. We demonstrate the multitasking and memcomputing capabilities of M-CellNNs for simultaneous image processing, while emphasizing the need for novel software frameworks and mapping strategies to facilitate seamless integration of these advanced computing architectures. This discussion highlights M-CellNNs as a promising approach for scalable, robust, real-time data processing in 6G applications, with the potential to improve performance, accuracy, and energy efficiency.

Bibtex

@InCollection{prousalis_6GBook26,
author = {Dimitrios Prousalis and Ioannis Messaris and Khaleelulla K. Nazeer and João Paulo {Cardoso de Lima} and Ahmet Samil Demirkol and Vasileios Ntinas and Hamid Farzaneh and Alon Ascoli and Jeronimo Castrillon and Ronald Tetzlaff},
booktitle = {6G-life},
title = {6G computing for sensing: universal memcomputing using memristor cellular neural networks},
doi = {10.1016/B978-0-44-327410-7.00029-6},
editor = {Frank H.P. Fitzek and Holger Boche and Wolfgang Kellerer and Patrick Seeling},
isbn = {978-0-443-27410-7},
pages = {353--376},
publisher = {Academic Press},
url = {https://www.sciencedirect.com/science/article/pii/B9780443274107000296},
abstract = {As 6G networks enable real-time data acquisition from millions of embedded sensors, the challenge of efficiently processing vast multi-modal datasets becomes paramount. This chapter explores how memcomputing, specifically through Memristor Cellular Neural Networks (M-CellNNs), can address these challenges by diverging from conventional compute-centric models. By leveraging volatile and non-volatile memristors, M-CellNNs can achieve high-speed, energy-efficient data processing directly at the sensor level, addressing challenges related to execution time, data privacy, and compatibility. We demonstrate the multitasking and memcomputing capabilities of M-CellNNs for simultaneous image processing, while emphasizing the need for novel software frameworks and mapping strategies to facilitate seamless integration of these advanced computing architectures. This discussion highlights M-CellNNs as a promising approach for scalable, robust, real-time data processing in 6G applications, with the potential to improve performance, accuracy, and energy efficiency.},
month = feb,
year = {2026},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3877

Fidan Mehmeti, Franz Biersack, Marco Liess, Matthias Nickel, Tung V. Doan, Robert Khasanov, Jeronimo Castrillon, Osel Lhamo, Thomas Wild, Andreas Herkersdorf, Wolfgang Kellerer, Frank H.P. Fitzek, Giang T. Nguyen, Diana Göhringer, "6G network design and operations", Chapter in 6G-life (eds. Frank H.P. Fitzek, Holger Boche, Wolfgang Kellerer, Patrick Seeling), Academic Press, pp. 135–149, Feb 2026.

Abstract
This chapter presents various enhancements for 6G networks, focusing on critical challenges, such as low latency, high throughput, and resource efficiency. Achieving these objectives relies on emerging networking technologies, encompassing network function virtualization, software-defined network, and in-network computing. The chapter begins by reviewing contributions related to the placement of existing components in mobile core networks. The flexibility of 6G networks enables the introduction of in-network computing, facilitating computation within mobile networks. Subsequently, the chapter examines key contributions to in-network computing in mobile networks, covering topics such as adaptive compute platforms, application mapping, and mobile edge cloud.

Bibtex

@InCollection{mehmeti_6GBook26,
author = {Fidan Mehmeti and Franz Biersack and Marco Liess and Matthias Nickel and Tung V. Doan and Robert Khasanov and Jeronimo Castrillon and Osel Lhamo and Thomas Wild and Andreas Herkersdorf and Wolfgang Kellerer and Frank H.P. Fitzek and Giang T. Nguyen and Diana Göhringer},
booktitle = {6G-life},
title = {6G network design and operations},
doi = {10.1016/B978-0-44-327410-7.00020-X},
editor = {Frank H.P. Fitzek and Holger Boche and Wolfgang Kellerer and Patrick Seeling},
isbn = {978-0-443-27410-7},
pages = {135--149},
publisher = {Academic Press},
url = {https://www.sciencedirect.com/science/article/pii/B978044327410700020X},
abstract = {This chapter presents various enhancements for 6G networks, focusing on critical challenges, such as low latency, high throughput, and resource efficiency. Achieving these objectives relies on emerging networking technologies, encompassing network function virtualization, software-defined network, and in-network computing. The chapter begins by reviewing contributions related to the placement of existing components in mobile core networks. The flexibility of 6G networks enables the introduction of in-network computing, facilitating computation within mobile networks. Subsequently, the chapter examines key contributions to in-network computing in mobile networks, covering topics such as adaptive compute platforms, application mapping, and mobile edge cloud.},
month = feb,
year = {2026},
}

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3878

Hector A. Gonzalez, Javier Acevedo, Khaleelulla K. Nazeer, Clément Fournier, Abdul Rehman Aslam, Jiaxin Huang, Matthias A. Lohrmann, Robert A. Tietze, Christian Eichhorn, Stefan Gumhold, Sami Haddadin, Hamid Sadeghian, Reinhard Heckel, Frank H.P. Fitzek, Jeronimo Castrillon, Christian Mayr, "Artificial intelligence in 6G ecosystem", Chapter in 6G-life (eds. Frank H.P. Fitzek, Holger Boche, Wolfgang Kellerer, Patrick Seeling), Academic Press, pp. 205–227, Feb 2026.

Abstract
The future technical standard of sixth-generation (6G) technology for wireless communications has accelerated the arrival of interconnected autonomous systems and other sensing devices in a wide range of industrial zones, such as smart factories, smart farms, and cognitive cities, among others. The imminent digitalization of these ecosystems has created highly dynamic environments that demand real-time decisions, making it difficult for humans to keep up with all their details. These dynamic scenarios require planning and execution that is more precise and faster than the speed at which data is acquired. The use of Artificial Intelligence (AI) offers high potential to enable the monitoring and assessment of multi-modal sensor data at a superhuman level, leading to faster decisions with better precision, which reduces undesired automated behavior, while enabling new forms of interaction. This chapter describes techniques, software frameworks, compilation flows, and hardware infrastructure for achieving large-scale, energy-efficient, trustworthy, real-time, and distributed AI in the newly developed era of 6G ecosystems, which produce vast amounts of data. The chapter also describes an economic perspective on the challenges in achieving this vision.

Bibtex

@InCollection{gonzalez_6GBook26,
author = {Hector A. Gonzalez and Javier Acevedo and Khaleelulla K. Nazeer and Clément Fournier and Abdul Rehman Aslam and Jiaxin Huang and Matthias A. Lohrmann and Robert A. Tietze and Christian Eichhorn and Stefan Gumhold and Sami Haddadin and Hamid Sadeghian and Reinhard Heckel and Frank H.P. Fitzek and Jeronimo Castrillon and Christian Mayr},
booktitle = {6G-life},
title = {Artificial intelligence in 6G ecosystem},
doi = {10.1016/B978-0-44-327410-7.00024-7},
editor = {Frank H.P. Fitzek and Holger Boche and Wolfgang Kellerer and Patrick Seeling},
isbn = {978-0-443-27410-7},
pages = {205--227},
publisher = {Academic Press},
url = {https://www.sciencedirect.com/science/article/pii/B9780443274107000247},
abstract = {The future technical standard of sixth-generation (6G) technology for wireless communications has accelerated the arrival of interconnected autonomous systems and other sensing devices in a wide range of industrial zones, such as smart factories, smart farms, and cognitive cities, among others. The imminent digitalization of these ecosystems has created highly dynamic environments that demand real-time decisions, making it difficult for humans to keep up with all their details. These dynamic scenarios require planning and execution that is more precise and faster than the speed at which data is acquired. The use of Artificial Intelligence (AI) offers high potential to enable the monitoring and assessment of multi-modal sensor data at a superhuman level, leading to faster decisions with better precision, which reduces undesired automated behavior, while enabling new forms of interaction. This chapter describes techniques, software frameworks, compilation flows, and hardware infrastructure for achieving large-scale, energy-efficient, trustworthy, real-time, and distributed AI in the newly developed era of 6G ecosystems, which produce vast amounts of data. The chapter also describes an economic perspective on the challenges in achieving this vision.},
month = feb,
year = {2026},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3879

Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesteter, João Paulo Cardoso de Lima, Jeronimo Castrillon, "Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices", In Proceeding: 8th Workshop on Accelerated Machine Learning (AccML), co-located with 21st International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 8pp, Jan 2026. [Bibtex & Downloads]

Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

Reference

Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesteter, João Paulo Cardoso de Lima, Jeronimo Castrillon, "Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices", In Proceeding: 8th Workshop on Accelerated Machine Learning (AccML), co-located with 21st International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 8pp, Jan 2026.

Bibtex

@InProceedings{ruiz_y_mesa_accml26,
author = {Alejandro Ruiz y Mesa and Guilherme Korol and Moritz Riesteter and João Paulo Cardoso de Lima and Jeronimo Castrillon},
booktitle = {8th Workshop on Accelerated Machine Learning (AccML), co-located with 21st International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices},
location = {Kraków, Poland},
pages = {8pp},
month = jan,
year = {2026},
url = {https://accml.dcs.gla.ac.uk/papers/2026/8th_AccML_paper_8.pdf},
}

Downloads

2601_Ruiz_AccML [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3873

Jeronimo Castrillon, Jana Giceva, Yu Hua, Kimberly Keeton, Akhil Shekar, Kevin Skadron, Tianzheng Wang, Huanchen Zhang, "Declarative Memory Services", In Proceeding: Conference on Innovative Data Systems Research (CIDR), Jan 2026. [Bibtex & Downloads]

Declarative Memory Services

Reference

Jeronimo Castrillon, Jana Giceva, Yu Hua, Kimberly Keeton, Akhil Shekar, Kevin Skadron, Tianzheng Wang, Huanchen Zhang, "Declarative Memory Services", In Proceeding: Conference on Innovative Data Systems Research (CIDR), Jan 2026.

Bibtex

@InProceedings{wang_cidr26,
author = {Jeronimo Castrillon and Jana Giceva and Yu Hua and Kimberly Keeton and Akhil Shekar and Kevin Skadron and Tianzheng Wang and Huanchen Zhang},
booktitle = {Conference on Innovative Data Systems Research (CIDR)},
title = {Declarative Memory Services},
location = {Chaminade, USA},
month = jan,
year = {2026},
url = {https://www.vldb.org/cidrdb/papers/2026/p21-castrillon.pdf}
}

Downloads

2601_Wang_CIDR [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3864

Jiahong Bi, Lars Schütze, Jeronimo Castrillon, "MING: An Automated CNN-to-Edge MLIR HLS framework", arXiv, 2026. [doi] [Bibtex & Downloads]

MING: An Automated CNN-to-Edge MLIR HLS framework

Reference

Jiahong Bi, Lars Schütze, Jeronimo Castrillon, "MING: An Automated CNN-to-Edge MLIR HLS framework", arXiv, 2026. [doi]

Bibtex

@misc{https://doi.org/10.48550/arxiv.2602.11966,
doi = {10.48550/ARXIV.2602.11966},
url = {https://arxiv.org/abs/2602.11966},
author = {Bi, Jiahong and Schütze, Lars and Castrillon, Jeronimo},
keywords = {Hardware Architecture (cs.AR), FOS: Computer and information sciences},
title = {MING: An Automated CNN-to-Edge MLIR HLS framework},
publisher = {arXiv},
year = {2026},
copyright = {Creative Commons Attribution 4.0 International}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3874


2025
Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "HARP: Energy-Aware and Adaptive Management of Heterogeneous Processors", Proceedings 26th ACM/IFIP International Middleware Conference (Middleware'25), Association for Computing Machinery, pp. 270–284, New York, NY, USA, Dec 2025. (Best Paper Award (Honorable Mention)) [doi] [Bibtex & Downloads]

HARP: Energy-Aware and Adaptive Management of Heterogeneous Processors

Reference

Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "HARP: Energy-Aware and Adaptive Management of Heterogeneous Processors", Proceedings 26th ACM/IFIP International Middleware Conference (Middleware'25), Association for Computing Machinery, pp. 270–284, New York, NY, USA, Dec 2025. (Best Paper Award (Honorable Mention)) [doi]

Abstract
Energy efficiency has become a key concern in modern computing. Major processor vendors now offer single-ISA heterogeneous processors that combine powerful and energy-efficient cores, such as Arm's big.LITTLE CPUs, Apple's M-series chips, and Intel P/E systems. However, today's OS schedulers, relying on simple cost-based thread allocation strategies, fail to fully exploit their potential. This paper presents HARP, a Linux-integrated resource-management framework for heterogeneous processors. HARP leverages application behavior through online monitoring or application descriptions and introduces a lightweight interface for two-way communication between applications and the resource manager. Through this interface, HARP learns application characteristics to guide allocation decisions, which are then relayed back to the applications so they can adapt accordingly. HARP supports various programming models, from OpenMP and Intel TBB to custom models with adaptivity features, significantly improving performance and energy efficiency, particularly in multi-application scenarios. On two representative heterogeneous systems, HARP reduces the average execution time by 12 % and the energy consumption by 28 % compared to existing methods. Overall, HARP marks a crucial step toward energy-efficient computing across diverse architectures.

Bibtex

@InProceedings{khasanov_middleware25,
author = {Till Smejkal and Robert Khasanov and Jeronimo Castrillon and Hermann H{\"a}rtig},
booktitle = {Proceedings 26th ACM/IFIP International Middleware Conference (Middleware'25)},
title = {{HARP}: Energy-Aware and Adaptive Management of Heterogeneous Processors},

doi = {10.1145/3721462.3770774},
isbn = {9798400715549},
location = {Vanderbilt University, Nashville, TN, USA},
pages = {270--284},
publisher = {Association for Computing Machinery},
series = {Middleware '25},
url = {https://doi.org/10.1145/3721462.3770774},
abstract = {Energy efficiency has become a key concern in modern computing. Major processor vendors now offer single-ISA heterogeneous processors that combine powerful and energy-efficient cores, such as Arm's big.LITTLE CPUs, Apple's M-series chips, and Intel P/E systems. However, today's OS schedulers, relying on simple cost-based thread allocation strategies, fail to fully exploit their potential. This paper presents HARP, a Linux-integrated resource-management framework for heterogeneous processors. HARP leverages application behavior through online monitoring or application descriptions and introduces a lightweight interface for two-way communication between applications and the resource manager. Through this interface, HARP learns application characteristics to guide allocation decisions, which are then relayed back to the applications so they can adapt accordingly. HARP supports various programming models, from OpenMP and Intel TBB to custom models with adaptivity features, significantly improving performance and energy efficiency, particularly in multi-application scenarios. On two representative heterogeneous systems, HARP reduces the average execution time by 12 \% and the energy consumption by 28 \% compared to existing methods. Overall, HARP marks a crucial step toward energy-efficient computing across diverse architectures.},
address = {New York, NY, USA},
month = dec,
numpages = {15},
year = {2025},
}

Downloads

2512_Khasanov_MWARE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3858

Julian Robledo, Jeronimo Castrillon, "Automating timing enclaves for reactive programs in Lingua Franca", Proceedings of the IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-25), pp. 9–16, Dec 2025. [doi] [Bibtex & Downloads]

Automating timing enclaves for reactive programs in Lingua Franca

Reference

Julian Robledo, Jeronimo Castrillon, "Automating timing enclaves for reactive programs in Lingua Franca", Proceedings of the IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-25), pp. 9–16, Dec 2025. [doi]

Bibtex

@InProceedings{robledo_mcsoc25,
author = {Julian Robledo and Jeronimo Castrillon},
booktitle = {Proceedings of the IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-25)},
title = {Automating timing enclaves for reactive programs in Lingua Franca},
doi = {10.1109/MCSoC67473.2025.00013},
location = {Singapore},
pages = {9--16},
url = {https://ieeexplore.ieee.org/document/11310870},
month = dec,
year = {2025},
}

Downloads

2512_Robledo_MCSOC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3849

João Paulo Cardoso De Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan, "Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs" (to appear), Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC), IEEE, Oct 2025. [Bibtex & Downloads]

Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

Reference

João Paulo Cardoso De Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan, "Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs" (to appear), Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC), IEEE, Oct 2025.

Bibtex

@InProceedings{delima_ccmcc25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Marc Dietrich and Jeronimo Castrillon and Asif Ali Khan},
booktitle = {Proceedings of the IEEE Cross-disciplinary Conference on Memory-Centric Computing (CCMCC)},
title = {Efficient In-Memory Acceleration of Sparse Block Diagonal {LLM}s},
location = {Dresden, Germany},
publisher = {IEEE},
month = oct,
numpages = {6},
year = {2025},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3852

Jeronimo Castrillon, Chadlia Jerad, Edward A. Lee, Claire Pagetti, Shaokai Jerry Lin, "Tradeoffs in Reactive Systems Design (Dagstuhl Seminar 25091)", In Dagstuhl Reports (Castrillon, Jeronimo and Jerad, Chadlia and Lee, Edward A. and Pagetti, Claire and Lin, Shaokai Jerry), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 15, no. 2, pp. 126–157, Dagstuhl, Germany, Oct 2025. [doi] [Bibtex & Downloads]

Tradeoffs in Reactive Systems Design (Dagstuhl Seminar 25091)

Reference

Jeronimo Castrillon, Chadlia Jerad, Edward A. Lee, Claire Pagetti, Shaokai Jerry Lin, "Tradeoffs in Reactive Systems Design (Dagstuhl Seminar 25091)", In Dagstuhl Reports (Castrillon, Jeronimo and Jerad, Chadlia and Lee, Edward A. and Pagetti, Claire and Lin, Shaokai Jerry), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 15, no. 2, pp. 126–157, Dagstuhl, Germany, Oct 2025. [doi]

Bibtex

@article{castrillon_DagRep25091,
author = {Castrillon, Jeronimo and Jerad, Chadlia and Lee, Edward A. and Pagetti, Claire and Lin, Shaokai Jerry},
title = {Tradeoffs in Reactive Systems Design (Dagstuhl Seminar 25091)},
pages = {126--157},
journal = {Dagstuhl Reports},
ISSN = {2192-5283},
year = {2025},
month = oct,
volume = {15},
number = {2},
editor = {Castrillon, Jeronimo and Jerad, Chadlia and Lee, Edward A. and Pagetti, Claire and Lin, Shaokai Jerry},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum für Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagRep.15.2.126},
URN = {urn:nbn:de:0030-drops-230878},
doi = {10.4230/DagRep.15.2.126},
}

Downloads

2510_Castrillon_DagRep25091 [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3857

Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Combining Early Exit and Selective Prediction for Convolutional Neural Networks", In IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2025), IEEE, Oct 2025. [doi] [Bibtex & Downloads]

Combining Early Exit and Selective Prediction for Convolutional Neural Networks

Reference

Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Combining Early Exit and Selective Prediction for Convolutional Neural Networks", In IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2025), IEEE, Oct 2025. [doi]

Bibtex

@Article{bouraoui_ieeeesl-tcrs25,
author = {Hasna Bouraoui and Chadlia Jerad and Jeronimo Castrillon},
title = {Combining Early Exit and Selective Prediction for Convolutional Neural Networks},
journal = {IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2025)},
month = oct,
numpages = {4},
publisher = {IEEE},
year = {2025},
doi = {10.1109/LES.2025.3595439},
url = {https://ieeexplore.ieee.org/document/11111706},
}

Downloads

2510_Bouraoui_TCRS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3850

Asif Ali Khan, Hadjer Benmeziane, Hamid Farzaneh, João Paulo C. de Lima, William Simon, Yiyu Shi, Zheyu Yan, Abu Sebastian, X. Sharon Hu, Jeronimo Castrillon, Corey Lammie, "Tutorial: Hardware-Aware Compilation and Simulation for In-Memory Computing", Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'25), Association for Computing Machinery, pp. 31–32, New York, NY, USA, Oct 2025. [doi] [Bibtex & Downloads]

Tutorial: Hardware-Aware Compilation and Simulation for In-Memory Computing

Reference

Asif Ali Khan, Hadjer Benmeziane, Hamid Farzaneh, João Paulo C. de Lima, William Simon, Yiyu Shi, Zheyu Yan, Abu Sebastian, X. Sharon Hu, Jeronimo Castrillon, Corey Lammie, "Tutorial: Hardware-Aware Compilation and Simulation for In-Memory Computing", Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'25), Association for Computing Machinery, pp. 31–32, New York, NY, USA, Oct 2025. [doi]

Abstract
This brief presents an overview of recent tools and research efforts aimed at enhancing the programmability and reliability of In-Memory Computing (IMC)-based systems. We discuss hardware-aware training techniques that improve model resilience to analog device imperfections, and explore mapping strategies that balance accuracy and performance for heterogeneous IMC-based accelerators. Additionally, we examine a compiler framework that abstracts hardware complexities and enables seamless integration of these accelerators into existing deployment pipelines. By combining these approaches with advanced simulation tools, we propose an end-to-end workflow that facilitates the practical deployment and optimization of IMC technologies across diverse memory types and architectural designs.

Bibtex

@InProceedings{khan_cases-tutotial25,
author = {Khan, Asif Ali and Benmeziane, Hadjer and Farzaneh, Hamid and de Lima, João Paulo C. and Simon, William and Shi, Yiyu and Yan, Zheyu and Sebastian, Abu and Hu, X. Sharon and Castrillon, Jeronimo and Lammie, Corey},
booktitle = {Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'25)},
title = {Tutorial: Hardware-Aware Compilation and Simulation for In-Memory Computing},
doi = {10.1145/3742872.3758333},
isbn = {9798400719912},
location = {Taipei International Convention Center (TICC), Taipei, Taiwan},
pages = {31--32},
publisher = {Association for Computing Machinery},
series = {CASES '25},
url = {https://doi.org/10.1145/3742872.3758333},
abstract = {This brief presents an overview of recent tools and research efforts aimed at enhancing the programmability and reliability of In-Memory Computing (IMC)-based systems. We discuss hardware-aware training techniques that improve model resilience to analog device imperfections, and explore mapping strategies that balance accuracy and performance for heterogeneous IMC-based accelerators. Additionally, we examine a compiler framework that abstracts hardware complexities and enables seamless integration of these accelerators into existing deployment pipelines. By combining these approaches with advanced simulation tools, we propose an end-to-end workflow that facilitates the practical deployment and optimization of IMC technologies across diverse memory types and architectural designs.},
address = {New York, NY, USA},
numpages = {2},
year = {2025},
month = oct,
}

Downloads

2510_Khan_CASES-Tutorial [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3868

Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Yimo Xu, Guangyu Feng, Magnus Mæhlum, Jian-Jia Chen, Martin Schoeberl, Linh Thi Xuan Phan, Jeronimo Castrillon, Sanjit A. Seshia, Edward A. Lee, "Quasi-Static Scheduling for Deterministic Timed Concurrent Models on Multi-Core Hardware", In ACM Transactions on Embedded Computing Systems (TECS). Special issue, International Conference on Embedded Software (EMSOFT’25), Association for Computing Machinery, New York, NY, USA, Sep 2025. [doi] [Bibtex & Downloads]

Quasi-Static Scheduling for Deterministic Timed Concurrent Models on Multi-Core Hardware

Reference

Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Yimo Xu, Guangyu Feng, Magnus Mæhlum, Jian-Jia Chen, Martin Schoeberl, Linh Thi Xuan Phan, Jeronimo Castrillon, Sanjit A. Seshia, Edward A. Lee, "Quasi-Static Scheduling for Deterministic Timed Concurrent Models on Multi-Core Hardware", In ACM Transactions on Embedded Computing Systems (TECS). Special issue, International Conference on Embedded Software (EMSOFT’25), Association for Computing Machinery, New York, NY, USA, Sep 2025. [doi]

Abstract
To design performant, expressive, and reliable cyber-physical systems (CPSs), researchers extensively perform quasi-static scheduling for concurrent models of computation (MoCs) on multi-core hardware. However, these quasi-static scheduling approaches are developed independently for their corresponding MoCs, despite commonality in the approaches. To help generalize the use of quasi-static scheduling to new and emerging MoCs, this paper proposes a unified approach for a class of deterministic timed concurrent models (DTCMs), including prominent models such as synchronous dataflow (SDF), Boolean-controlled dataflow (BDF), scenario-aware dataflow (SADF), and Logical Execution Time (LET). In contrast to scheduling techniques tailored exclusively to specific MoCs, our unified approach leverages a common intermediate formalism called state space finite automata (SSFA), bridging the gap between high-level MoCs and executable schedules. Once identified as DTCMs, new MoCs can directly adopt SSFA-based scheduling, significantly easing adoption. We show that quasi-static schedules facilitated by SSFA are provably free from timing anomalies and enable straightforward worst-case makespan analysis. We demonstrate the approach using the reactor model—an emerging discrete-event MoC—programmed using the Lingua Franca (LF) language. Experiments show that quasi-statically scheduled LF programs exhibit lower runtime overhead compared to the dynamically scheduled LF programs, and that the analyzable worst-case makespans enable compile-time deadline checking.

Bibtex

@Article{lin_emsoft25,
author = {Lin, Shaokai and Jellum, Erling and Theile, Mirco and Tanneberger, Tassilo and Sun, Binqi and Jerad, Chadlia and Xu, Yimo and Feng, Guangyu and M\ae{}hlum, Magnus and Chen, Jian-Jia and Schoeberl, Martin and Phan, Linh Thi Xuan and Castrillon, Jeronimo and Seshia, Sanjit A. and Lee, Edward A.},
title = {Quasi-Static Scheduling for Deterministic Timed Concurrent Models on Multi-Core Hardware},
doi = {10.1145/3762653},
issn = {1539-9087},
url = {https://doi.org/10.1145/3762653},
abstract = {To design performant, expressive, and reliable cyber-physical systems (CPSs), researchers extensively perform quasi-static scheduling for concurrent models of computation (MoCs) on multi-core hardware. However, these quasi-static scheduling approaches are developed independently for their corresponding MoCs, despite commonality in the approaches. To help generalize the use of quasi-static scheduling to new and emerging MoCs, this paper proposes a unified approach for a class of deterministic timed concurrent models (DTCMs), including prominent models such as synchronous dataflow (SDF), Boolean-controlled dataflow (BDF), scenario-aware dataflow (SADF), and Logical Execution Time (LET). In contrast to scheduling techniques tailored exclusively to specific MoCs, our unified approach leverages a common intermediate formalism called state space finite automata (SSFA), bridging the gap between high-level MoCs and executable schedules. Once identified as DTCMs, new MoCs can directly adopt SSFA-based scheduling, significantly easing adoption. We show that quasi-static schedules facilitated by SSFA are provably free from timing anomalies and enable straightforward worst-case makespan analysis. We demonstrate the approach using the reactor model—an emerging discrete-event MoC—programmed using the Lingua Franca (LF) language. Experiments show that quasi-statically scheduled LF programs exhibit lower runtime overhead compared to the dynamically scheduled LF programs, and that the analyzable worst-case makespans enable compile-time deadline checking.},
address = {New York, NY, USA},
journal = {ACM Transactions on Embedded Computing Systems (TECS). Special issue, International Conference on Embedded Software (EMSOFT’25)},
location = {Taipei, Taiwan},
month = sep,
publisher = {Association for Computing Machinery},
year = {2025},
}

Downloads

2510_Lin_EMSOFT [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3848

Hamid Farzaneh, Asif Ali Khan, Jeronimo Castrillon, "CoMoNM: A Cost Modeling Framework for Compute-Near-Memory Systems", Aug 2025. [Bibtex & Downloads]

CoMoNM: A Cost Modeling Framework for Compute-Near-Memory Systems

Reference

Hamid Farzaneh, Asif Ali Khan, Jeronimo Castrillon, "CoMoNM: A Cost Modeling Framework for Compute-Near-Memory Systems", Aug 2025.

Bibtex

@Misc{farzaneh_comonm-arxiv25,
author = {Hamid Farzaneh and Asif Ali Khan and Jeronimo Castrillon},
title = {{CoMoNM}: A Cost Modeling Framework for Compute-Near-Memory Systems},
eprint = {2508.11451},
url = {https://arxiv.org/abs/2508.11451},
archiveprefix = {arXiv},
primaryclass = {cs.ET},
year = {2025},
month = aug,
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3854

Xiaobo Sharon Hu, Ming-Yen Lee, Mengyuan Li, João Paulo Cardoso de Lima, Liu Liu, Zhenhua Zhu, Jeronimo Castrillon, Michael Niemier, Yu Wang, "Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies", In IEEE Design & Test, Special Issue on the 20 years of the IEEE CEDA, IEEE, vol. 42, no. 6, pp. 75–86, Aug 2025. [doi] [Bibtex & Downloads]

Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies

Reference

Xiaobo Sharon Hu, Ming-Yen Lee, Mengyuan Li, João Paulo Cardoso de Lima, Liu Liu, Zhenhua Zhu, Jeronimo Castrillon, Michael Niemier, Yu Wang, "Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies", In IEEE Design & Test, Special Issue on the 20 years of the IEEE CEDA, IEEE, vol. 42, no. 6, pp. 75–86, Aug 2025. [doi]

Abstract
Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools, including modeling and simulation, data partitioning and mapping, and operation scheduling, play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.

Bibtex

@Article{hu_dnt25,
author = {Xiaobo Sharon Hu and Ming-Yen Lee and Mengyuan Li and João Paulo Cardoso de Lima and Liu Liu and Zhenhua Zhu and Jeronimo Castrillon and Michael Niemier and Yu Wang},
journal = {IEEE Design \& Test, Special Issue on the 20 years of the IEEE CEDA},
volume = {42},
number = {6},
pages = {75--86},
title = {Cross-Layer Design and Design Automation for In-Memory Computing based on Non-Volatile Memory Technologies},
doi = {10.1109/MDAT.2025.3603495},
url = {https://ieeexplore.ieee.org/document/11142851},
month = aug,
numpages = {11},
publisher = {IEEE},
year = {2025},
abstract = {Data transfer between processors and memory remains a critical bottleneck in improving application performance on traditional computing hardware, particularly for data-intensive workloads such as machine learning, bioinformatics, and security applications. In-memory computing (IMC), a paradigm where a substantial portion of data processing occurs directly within memory, has emerged as a promising solution to mitigate this bottleneck. The advancement of emerging non-volatile memory (NVM) technologies has further accelerated the development of IMC hardware fabrics. However, harnessing the full potential of IMC requires a cross-layer design approach that spans memory technologies, circuits, architectures, and systems. Essential cross-layer tools, including modeling and simulation, data partitioning and mapping, and operation scheduling, play a pivotal role in designing efficient IMC-based hardware. This article reviews key advancements in simulation and design tools for IMC fabrics, with a focus on NVM-based crossbar arrays and content-addressable memories, while highlighting the necessity of cross-layer collaboration. Additionally, we discuss current challenges and emerging opportunities in the field.},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3853

Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "TQHD: Thermometer Encoding Based Quantization for Hyperdimensional Computing", In Proceeding: 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, vol. 1, pp. 1–6, Los Alamitos, CA, USA, Jul 2025. [doi] [Bibtex & Downloads]

TQHD: Thermometer Encoding Based Quantization for Hyperdimensional Computing

Reference

Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "TQHD: Thermometer Encoding Based Quantization for Hyperdimensional Computing", In Proceeding: 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, vol. 1, pp. 1–6, Los Alamitos, CA, USA, Jul 2025. [doi]

Abstract
Hyperdimensional computing (HDC) is an emerging brain-inspired machine learning framework built upon unique properties of high-dimensional vectors. The vectors can contain floating-point (FP) or binary values, offering tradeoffs in terms of accuracy and computational cost. Previous works have proposed quantization methods to convert FP models into binary ones to improve performance. Unfortunately, these approaches not only incur an accuracy loss but also sacrifice valuable properties of HDC, such as low training time or robustness to noise. To overcome these limitations, we propose TQHD, a quantization method that transforms FP vectors into thermometer-encoded binary vectors. TQHD reduces the accuracy loss inflicted by quantization by 3.4 pp in complex scenarios compared to the state-of-the-art.

Bibtex

@InProceedings{vieira_isvlsi25,
author = {Caio Vieira and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)},
title = {TQHD: Thermometer Encoding Based Quantization for Hyperdimensional Computing},
location = {Kalamata, Greece},
organization = {IEEE},
pages = {1--6},
publisher = {IEEE Computer Society},
doi = {10.1109/ISVLSI65124.2025.11130206},
url = {https://ieeexplore.ieee.org/document/11130206},
volume = {1},
abstract = {Hyperdimensional computing (HDC) is an emerging brain-inspired machine learning framework built upon unique properties of high-dimensional vectors. The vectors can contain floating-point (FP) or binary values, offering tradeoffs in terms of accuracy and computational cost. Previous works have proposed quantization methods to convert FP models into binary ones to improve performance. Unfortunately, these approaches not only incur an accuracy loss but also sacrifice valuable properties of HDC, such as low training time or robustness to noise. To overcome these limitations, we propose TQHD, a quantization method that transforms FP vectors into thermometer-encoded binary vectors. TQHD reduces the accuracy loss inflicted by quantization by 3.4 pp in complex scenarios compared to the state-of-the-art.},
address = {Los Alamitos, CA, USA},
month = jul,
year = {2025},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3837

Anderson Faustino da Silva, Hamid Farzaneh, Joao Paulo Cardoso De Lima, Asif Ali Khan, Jeronimo Castrillon, "LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems", Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, Berlin, Heidelberg, Jul 2025. (Best paper award candidate) [Bibtex & Downloads]

LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems

Reference

Anderson Faustino da Silva, Hamid Farzaneh, Joao Paulo Cardoso De Lima, Asif Ali Khan, Jeronimo Castrillon, "LearnCNM2Predict: Transfer Learning-based Performance Model for CNM Systems", Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, Berlin, Heidelberg, Jul 2025. (Best paper award candidate)

Abstract
Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.

Bibtex

@InProceedings{dasilva_samos25,
author = {Anderson Faustino da Silva and Hamid Farzaneh and Joao Paulo Cardoso De Lima and Asif Ali Khan and Jeronimo Castrillon},
booktitle = {Proceedings of the 25th IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS)},
date = {2025-07},
title = {{LearnCNM2Predict}: Transfer Learning-based Performance Model for CNM Systems},

location = {Samos, Greece},
organization = {IEEE},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
month = jul,
numpages = {17},
year = {2025},
abstract = {Compute-near-memory (CNM) architectures have emerged as a promising solution to address the von Neumann bottleneck by relocating computation closer to memory and utilizing dedicated logic near memory arrays or banks. Despite their early stage of development, these architectures have demonstrated significant performance improvements over traditional CPU and GPU systems in various application domains. CNM architectures tend to excel in memory-bound workloads that exhibit high levels of data-level parallelism. However, identifying which kernels can take advantage of CNM execution poses a considerable challenge for software developers. This paper introduces a transfer learning approach for predicting performance on CNM systems. Our method harnesses knowledge from previously analyzed applications to enhance prediction accuracy for new, unseen applications, thereby reducing the necessity for extensive training data for each application. We have developed a feature extraction framework that captures CNM-specific computation and memory access patterns, which are crucial for determining performance. Experimental results demonstrate that our transfer learning model achieves high prediction accuracy across diverse application domains, showcasing robust generalization even in scenarios with limited training data.},
}

Downloads

2507_daSilva_SAMOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3839

João Paulo Cardoso De Lima, Mehran Shoushtari Moghadam, Sercan Aygun, Jeronimo Castrillon, M. Hassan Najafi, Asif Ali Khan, "All-in-memory Stochastic Computing using ReRAM", Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25), Association for Computing Machinery, pp. 50–56, New York, NY, USA, Jun 2025. [doi] [Bibtex & Downloads]

All-in-memory Stochastic Computing using ReRAM

Abstract
As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39X, 2.16X) and energy consumption (1.15X, 2.8X) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5% across multiple SBS lengths and image processing tasks.

Bibtex

@InProceedings{delima_dac25,
author = {Jo{\~a}o Paulo Cardoso De Lima and Mehran Shoushtari Moghadam and Sercan Aygun and Jeronimo Castrillon and M. Hassan Najafi and Asif Ali Khan},
booktitle = {Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC'25)},
title = {All-in-memory Stochastic Computing using {ReRAM}},
doi = {10.1109/DAC63849.2025.11132096},
isbn = {9798331503048},
location = {San Francisco, California},
pages = {50--56},
publisher = {Association for Computing Machinery},
series = {DAC '25},
url = {https://doi.org/10.1109/DAC63849.2025.11132096},
abstract = {As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80\% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39X, 2.16X) and energy consumption (1.15X, 2.8X) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5\% across multiple SBS lengths and image processing tasks.},
address = {New York, NY, USA},
articleno = {5},
month = jun,
numpages = {6},
year = {2025},
}

Downloads

2506_deLima_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3818

Guilherme Korol, Antonio Carlos Schneider Beck, Jeronimo Castrillon, "Leveraging Stochastic Depth Training for Adaptive Inference", May 2025. [Bibtex & Downloads]

Leveraging Stochastic Depth Training for Adaptive Inference

Bibtex

@misc{korol_sdarxiv25,
author = {Guilherme Korol and Antonio Carlos Schneider Beck and Jeronimo Castrillon},
title = {Leveraging Stochastic Depth Training for Adaptive Inference},
eprint = {2505.17626},
url = {https://arxiv.org/abs/2505.17626},
archiveprefix = {arXiv},
primaryclass = {cs.LG},
projects = {scads.ai, myrtus},
year = {2025},
month = may,
}

Downloads

2505_Korol_SDArXiv [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3838

Jeronimo Castrillon, "Compiler Support for Ferroelectric Compute-in-Memory Solutions (and beyond)", In Workshop on Cross-stack Explorations of Ferroelectric-based Logic and Memory Solutions for At-Scale Compute Workloads, co-located with the international conference on Design, Automation and Test in Europe Conference (DATE) (invited talk), Apr 2025. [Bibtex & Downloads]

Compiler Support for Ferroelectric Compute-in-Memory Solutions (and beyond)

Abstract
Compute-in-Memory (CIM) is a promising non-von Neumann computing paradigm that promises unprecedented improvements in performance and energy efficiency. Moving past manual designs, automation will be key to unleash the potential of CIM for multiple application domains and to accelerate cross-layer design cycles. This talk reports on an ongoing effort to build a high-level compiler infrastructure for different CIM approaches, built with MLIR to abstract from individual technologies to foster re-use. This includes abstractions and optimization flows for logic-in-memory, content-addressable memories, arithmetic operations in crossbars, and near-memory architectures. We also report on recent results retargeting the compiler for novel ferroelectric cells, exploring different memory modalities.

Bibtex

@Misc{castrillon_date2025,
author = {Castrillon, Jeronimo},
date = {2025-04},
title = {Compiler Support for Ferroelectric Compute-in-Memory Solutions (and beyond)},
howpublished = {Workshop on Cross-stack Explorations of Ferroelectric-based Logic and Memory Solutions for At-Scale Compute Workloads, co-located with the international conference on Design, Automation and Test in Europe Conference (DATE) (invited talk)},
location = {Lyon, France},
abstract = {Compute-in-Memory (CIM) is a promising non-von Neumann computing paradigm that promises unprecedented improvements in performance and energy efficiency. Moving past manual designs, automation will be key to unleash the potential of CIM for multiple application domains and to accelerate cross-layer design cycles. This talk reports on an ongoing effort to build a high-level compiler infrastructure for different CIM approaches, built with MLIR to abstract from individual technologies to foster re-use. This includes abstractions and optimization flows for logic-in-memory, content-addressable memories, arithmetic operations in crossbars, and near-memory architectures. We also report on recent results retargeting the compiler for novel ferroelectric cells, exploring different memory modalities.},
month = apr,
year = {2025},
}

Downloads

250401_DATE-W06-Castrillon [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3823

Anderson Faustino da Silva, Jeronimo Castrillon, Fernando Magno Quintão Pereira, "A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers", Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, pp. 13–24, New York, NY, USA, Mar 2025. [doi] [Bibtex & Downloads]

A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers

Abstract
Classifying programs based on their tasks is essential in fields such as plagiarism detection, malware analysis, and software auditing. Traditionally, two classification approaches exist: static classifiers analyze program syntax, while dynamic classifiers observe their execution. Although dynamic analysis is regarded as more precise, it is often considered impractical due to high overhead, leading the research community to largely dismiss it. In this paper, we revisit this perception by comparing static and dynamic analyses using the same classification representation: opcode histograms. We show that dynamic histograms—generated from instructions actually executed—are only marginally (4-5%) more accurate than static histograms in non-adversarial settings. However, if an adversary is allowed to obfuscate programs, the accuracy of the dynamic classifier is twice higher than the static one, due to its ability to avoid observing dead-code. Obtaining dynamic histograms with a state-of-the-art Valgrind-based tool incurs an 85x slowdown; however, once we account for the time to produce the representations for static analysis of executables, the overall slowdown reduces to 4x: a result significantly lower than previously reported in the literature.

Bibtex

@InProceedings{dasilva_cc25,
author = {Anderson Faustino da Silva and Jeronimo Castrillon and Fernando Magno Quint\~{a}o Pereira},
booktitle = {Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025)},
title = {A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers},
doi = {10.1145/3708493.3712680},
isbn = {9798400714078},
location = {Las Vegas, NV, USA},
pages = {13--24},
publisher = {Association for Computing Machinery},
series = {CC 2025},
url = {https://doi.org/10.1145/3708493.3712680},
abstract = {Classifying programs based on their tasks is essential in fields such as plagiarism detection, malware analysis, and software auditing. Traditionally, two classification approaches exist: static classifiers analyze program syntax, while dynamic classifiers observe their execution. Although dynamic analysis is regarded as more precise, it is often considered impractical due to high overhead, leading the research community to largely dismiss it. In this paper, we revisit this perception by comparing static and dynamic analyses using the same classification representation: opcode histograms. We show that dynamic histograms---generated from instructions actually executed---are only marginally (4-5\%) more accurate than static histograms in non-adversarial settings. However, if an adversary is allowed to obfuscate programs, the accuracy of the dynamic classifier is twice higher than the static one, due to its ability to avoid observing dead-code. Obtaining dynamic histograms with a state-of-the-art Valgrind-based tool incurs an 85x slowdown; however, once we account for the time to produce the representations for static analysis of executables, the overall slowdown reduces to 4x: a result significantly lower than previously reported in the literature.},
address = {New York, NY, USA},
month = mar,
numpages = {11},
year = {2025},
}

Downloads

2503_daSilva_CC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3805

Francesca Palumbo, Francesco Ratto, Claudio Rubattu, Maria Katiuscia Zedda, Tiziana Fanni, Veena Rao, Bart Driessen, Jeronimo Castrillon, "Key Enabling Technologies for Cognitive Computing Continuum – MYRTUS Project Perspective" (to appear), Proceedings of the 2025 Design, Automation and Test in Europe Conference (DATE), 6pp, Mar 2025. [doi] [Bibtex & Downloads]

Key Enabling Technologies for Cognitive Computing Continuum – MYRTUS Project Perspective

Bibtex

@InProceedings{palumbo_date25,
author = {Palumbo, Francesca and Ratto, Francesco and Rubattu, Claudio and Zedda, Maria Katiuscia and Fanni, Tiziana and Rao, Veena and Driessen, Bart and Castrillon, Jeronimo},
booktitle = {Proceedings of the 2025 Design, Automation and Test in Europe Conference (DATE)},
title = {Key Enabling Technologies for Cognitive Computing Continuum -- MYRTUS Project Perspective},
doi = {10.5281/zenodo.14609859},
location = {Lyon, France},
pages = {6pp},
series = {DATE'25},
url = {https://doi.org/10.5281/zenodo.14609859},
month = mar,
year = {2025},
}

Downloads

2503_Palumbo_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3809

Alexander Brauckmann, Anderson Faustino da Silva, Gabriel Synnaeve, Michael F. P. O'Boyle, Jeronimo Castrillon, Hugh Leather, "DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses", Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025), Association for Computing Machinery, pp. 92–103, New York, NY, USA, Mar 2025. [doi] [Bibtex & Downloads]

DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses

Abstract
Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.

Bibtex

@InProceedings{brauckmann_cc25,
author = {Alexander Brauckmann and Anderson Faustino da Silva and Gabriel Synnaeve and Michael F. P. O'Boyle and Jeronimo Castrillon and Hugh Leather},
booktitle = {Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction (CC 2025)},
title = {DFA-Net: A Compiler-Specific Neural Architecture for Robust Generalization in Data Flow Analyses},
doi = {10.1145/3708493.3712687},
isbn = {9798400714078},
location = {Las Vegas, NV, USA},
pages = {92--103},
publisher = {Association for Computing Machinery},
series = {CC 2025},
url = {https://doi.org/10.1145/3708493.3712687},
abstract = {Data flow analysis is fundamental to modern program optimization and verification, serving as a critical foundation for compiler transformations. As machine learning increasingly drives compiler tasks, the need for models that can implicitly understand and correctly reason about data flow properties becomes crucial for maintaining soundness. State-of-the-art machine learning methods, especially graph neural networks (GNNs), face challenges in generalizing beyond training scenarios due to their limited ability to perform large propagations. We present DFA-Net, a neural network architecture tailored for compilers that systematically generalizes. It emulates the reasoning process of compilers, facilitating the generalization of data flow analyses from simple to complex programs. The architecture decomposes data flow analyses into specialized neural networks for initialization, transfer, and meet operations, explicitly incorporating compiler-specific knowledge into the model design. We evaluate DFA-Net on a data flow analysis benchmark from related work and demonstrate that our compiler-specific neural architecture can learn and systematically generalize on this task. DFA-Net demonstrates superior performance over traditional GNNs in data flow analysis, achieving F1 scores of 0.761 versus 0.009 for data dependencies and 0.989 versus 0.196 for dominators at high complexity levels, while maintaining perfect scores for liveness and reachability analyses where GNNs struggle significantly.},
address = {New York, NY, USA},
month = mar,
numpages = {11},
year = {2025},
}

Downloads

2503_Brauckmann_CC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3804

Asif Ali Khan, Hamid Farzaneh, Karl F. A. Friebel, Clement Fournier, Lorenzo Chelini, Jeronimo Castrillon, "CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Volume 4, Association for Computing Machinery, pp. 31–46, Mar 2025. [doi] [Bibtex & Downloads]

CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms

Abstract
The rise of data-intensive applications exposed the limitations of conventional processor-centric von-Neumann architectures that struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms achieving orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems is still a serious challenge. Their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), a first end-to-end compilation flow that leverages the hierarchical abstractions to generalize over different CIM and CNM devices and enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level in the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for a real CNM system (UPMEM) and the memristors-based CIM accelerators. We show that Cinnamon, supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.

Bibtex

@InProceedings{khan_asplos25,
author = {Khan, Asif Ali and Farzaneh, Hamid and Friebel, Karl F. A. and Fournier, Clement and Chelini, Lorenzo and Castrillon, Jeronimo},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'25), Volume 4},
title = {CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms},
doi = {10.1145/3622781.3674189},
isbn = {9798400703911},
location = {Rotterdam, The Netherlands},
pages = {31--46},
publisher = {Association for Computing Machinery},
series = {ASPLOS '25},
url = {https://dl.acm.org/doi/pdf/10.1145/3622781.3674189},
abstract = {The rise of data-intensive applications exposed the limitations of conventional processor-centric von-Neumann architectures that struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms achieving orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems is still a serious challenge. Their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), a first end-to-end compilation flow that leverages the hierarchical abstractions to generalize over different CIM and CNM devices and enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level in the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for a real CNM system (UPMEM) and the memristors-based CIM accelerators. We show that Cinnamon, supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.},
month = mar,
numpages = {16},
year = {2025},
}

Downloads

2504_Khan_CINM_ASPLOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3766

Yun-Chih Chen, Tristan Seidl, Nils Hölscher, Christian Hakert, Minh Duy Truong, Jian-Jia Chen, João Paulo C. de Lima, Asif Ali Khan, Jeronimo Castrillon, Ali Nezhadi, Lokesh Siddhu, Hassan Nassar, Mahta Mayahinia, Mehdi Baradaran Tahoori, Jörg Henkel, Nils Wilbert, Stefan Wildermann, Jürgen Teich, "Modeling and Simulating Emerging Memory Technologies: A Tutorial", Feb 2025. [Bibtex & Downloads]

Modeling and Simulating Emerging Memory Technologies: A Tutorial

Bibtex

@Misc{chen2025_sppsim,
author = {Yun-Chih Chen and Tristan Seidl and Nils Hölscher and Christian Hakert and Minh Duy Truong and Jian-Jia Chen and João Paulo C. de Lima and Asif Ali Khan and Jeronimo Castrillon and Ali Nezhadi and Lokesh Siddhu and Hassan Nassar and Mahta Mayahinia and Mehdi Baradaran Tahoori and Jörg Henkel and Nils Wilbert and Stefan Wildermann and Jürgen Teich},
title = {Modeling and Simulating Emerging Memory Technologies: A Tutorial},
eprint = {2502.10167},
url = {https://arxiv.org/abs/2502.10167},
archiveprefix = {arXiv},
primaryclass = {cs.AR},
year = {2025},
month = feb,
}

Downloads

2502_Chen_SPPSim [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3815


2024
Shaokai Lin, Tassilo Tanneberger, Jiahong Bi, Guangyu Feng, Ruomu Xu, Julian Robledo, Robert Khasanov, Jeronimo Castrillon, "Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems", In IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2024), IEEE, Oct 2024. [doi] [Bibtex & Downloads]

Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems

Bibtex

@Article{lin_ieeeesl-tcrs24,
author = {Shaokai Lin and Tassilo Tanneberger and Jiahong Bi and Guangyu Feng and Ruomu Xu and Julian Robledo and Robert Khasanov and Jeronimo Castrillon},
title = {Navigating Time and Energy Trade-offs in Reactive Heterogeneous Systems},
journal = {IEEE Embedded Systems Letters, special issue on Time-Centric Reactive Software (TCRS, ESWeek 2024)},
month = oct,
numpages = {4},
publisher = {IEEE},
year = {2024},
doi = {10.1109/LES.2024.3469278},
url = {https://ieeexplore.ieee.org/document/10702523},
}

Downloads

2410_Lin_TCRS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3791

Jiahong Bi, Guilherme Korol, Jeronimo Castrillon, "Leveraging the MLIR infrastructure for the computing continuum", In Proceedings: CPS Workshop 2024, Sep 2024. [Bibtex & Downloads]

Leveraging the MLIR infrastructure for the computing continuum

Abstract
With an ever-increasing number of connected devices (e.g., IoT), cloud computing faces efficiency challenges due to complex infrastructure, high communication costs, and privacy. Fog and edge computing enable computing closer to data sources, offering alternatives to the limitations of relying exclusively on the cloud. When combined with high-performance cloud platforms, fog, and edge devices form a computing continuum. However, the continuum challenges designers who need to compile and deploy on distributed and heterogeneous devices and optimize for a diverse set of non-functional requirements. To ease the usage and ensure the full potential of the continuum, a Design and Programming Environment (DPE) that is interoperable, reusable, portable, and cross-layer is needed. In this context, the Multi-Level Intermediate Representation (MLIR) becomes vital since it provides an extensible and reusable compiler infrastructure. The project development of a continuum-oriented DPE leveraging the MLIR infrastructure is discussed in this paper as a work in progress.

Bibtex

@InProceedings{bi_cps24,
author = {Jiahong Bi and Guilherme Korol and Jeronimo Castrillon},
booktitle = {CPS Workshop 2024},
title = {Leveraging the MLIR infrastructure for the computing continuum},
location = {Alghero, Italy},
abstract = {With an ever-increasing number of connected devices (e.g., IoT), cloud computing faces efficiency challenges due to complex infrastructure, high communication costs, and privacy. Fog and edge computing enable computing closer to data sources, offering alternatives to the limitations of relying exclusively on the cloud. When combined with high-performance cloud platforms, fog, and edge devices form a computing continuum. However, the continuum challenges designers who need to compile and deploy on distributed and heterogeneous devices and optimize for a diverse set of non-functional requirements. To ease the usage and ensure the full potential of the continuum, a Design and Programming Environment (DPE) that is interoperable, reusable, portable, and cross-layer is needed. In this context, the Multi-Level Intermediate Representation (MLIR) becomes vital since it provides an extensible and reusable compiler infrastructure. The project development of a continuum-oriented DPE leveraging the MLIR infrastructure is discussed in this paper as a work in progress.},
month = sep,
numpages = {8},
year = {2024},
}

Downloads

2409_BI_CPSW [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3788

Julian Robledo, Christian Menard, Erling Jellum, Edward A. Lee, Jeronimo Castrillon, "Timing enclaves for performance in Lingua Franca", In Proceedings: 2024 Forum for Specification and Design Languages (FDL), pp. 1-9, Sep 2024. [doi] [Bibtex & Downloads]

Timing enclaves for performance in Lingua Franca

Bibtex

@InProceedings{robledo_fdl24,
author = {Julian Robledo and Christian Menard and Erling Jellum and Edward A. Lee and Jeronimo Castrillon},
booktitle = {2024 Forum for Specification and Design Languages (FDL)},
title = {Timing enclaves for performance in Lingua Franca},
location = {Stockholm, Sweden},
month = sep,
year = {2024},
doi = {10.1109/FDL63219.2024.10673834},
pages = {1--9},
url = {https://ieeexplore.ieee.org/document/10673834},
}

Downloads

2409_Robledo_FDL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3767

Jeronimo Castrillon, "High-level programming abstractions and compilation for near and in-memory computing", In Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk), Aug 2024. (Video Presentation) [Bibtex & Downloads]

High-level programming abstractions and compilation for near and in-memory computing

Reference

Jeronimo Castrillon, "High-level programming abstractions and compilation for near and in-memory computing", In Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk), Aug 2024. (Video Presentation)

Bibtex

@Misc{castrillon_abumpimp2024,
author = {Castrillon, Jeronimo},
date = {2024-08},
title = {High-level programming abstractions and compilation for near and in-memory computing},
howpublished = {Minisymposium on Applications and Benefits of UPMEM commercial Massively Parallel Processing-In-Memory Platform (ABUMPIMP 2024) @ Euro-Par 2024 (invited talk)},
location = {Madrid, Spain},
month = aug,
year = {2024},
}

Downloads

240826_Castrillon_ABUMPIMP [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3784

Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Ruomu Xu, Guangyu Feng, Christian Menard, Marten Lohstroh, Jeronimo Castrillon, Sanjit Seshia, Edward Lee, "PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency", Jun 2024. [Bibtex & Downloads]

PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency

Reference

Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Ruomu Xu, Guangyu Feng, Christian Menard, Marten Lohstroh, Jeronimo Castrillon, Sanjit Seshia, Edward Lee, "PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency", Jun 2024.

Bibtex

@Misc{lin_pretvm24,
author = {Shaokai Lin and Erling Jellum and Mirco Theile and Tassilo Tanneberger and Binqi Sun and Chadlia Jerad and Ruomu Xu and Guangyu Feng and Christian Menard and Marten Lohstroh and Jeronimo Castrillon and Sanjit Seshia and Edward Lee},
title = {PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency},
eprint = {2406.06253},
url = {https://arxiv.org/abs/2406.06253},
archiveprefix = {arXiv},
month = jun,
primaryclass = {eess.SY},
year = {2024},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3765

Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors", Jun 2024. [Bibtex & Downloads]

E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors

Reference

Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors", Jun 2024.

Bibtex

@Misc{khasanov_emapper24,
author = {Till Smejkal and Robert Khasanov and Jeronimo Castrillon and Hermann Härtig},
title = {{E-Mapper}: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors},
eprint = {2406.18980},
url = {https://arxiv.org/abs/2406.18980},
archiveprefix = {arXiv},
primaryclass = {cs.OS},
year = {2024},
month = jun,
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3769

Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi] [Bibtex & Downloads]

SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs

Reference

Hamid Farzaneh, João Paulo Cardoso De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, Jeronimo Castrillon, "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs", Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24), Association for Computing Machinery, New York, NY, USA, Jun 2024. [doi]

Abstract
Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10× and 4.6×, respectively.

Bibtex

@InProceedings{farzaneh_dac24,
author = {Hamid Farzaneh and Jo{\~a}o Paulo Cardoso De Lima and Ali Nezhadi Khelejani and Asif Ali Khan and Mahta Mayahinia and Mehdi Tahoori and Jeronimo Castrillon},
booktitle = {Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC'24)},
title = {{SHERLOCK}: Scheduling Efficient and Reliable Bulk Bitwise Operations in {NVMs}},
location = {San Francisco, California},
series = {DAC '24},
month = jun,
year = {2024},
isbn = {9798400706011},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3649329.3658485},
doi = {10.1145/3649329.3658485},
abstract = {Bulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing. The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional system architectures, mainly due to data movement between the processing and memory subsystems. Non-volatile memories (NVMs), such as RRAM, PCM and STT-MRAM, facilitate conducting bulk-bitwise logic operations in-memory (CIM). Efficient mapping of complex applications to these CIM-capable NVMs is non-trivial and can even lead to slowdowns. This paper presents Sherlock, a novel mapping and scheduling method for efficient execution of bulk bitwise operations in NVMs. Sherlock collaboratively optimizes for performance and energy consumption and outperforms the state-of-the-art by 10\texttimes{} and 4.6\texttimes{}, respectively.},
articleno = {293},
numpages = {6},
}

Downloads

2406_Farzaneh_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3726

Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]

C4CAM: A Compiler for CAM-based In-memory Accelerators

Reference

Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon, "C4CAM: A Compiler for CAM-based In-memory Accelerators", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3, Association for Computing Machinery, pp. 164–177, New York, NY, USA, May 2024. [doi]

Abstract
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.

Bibtex

@InProceedings{farzaneh_asplos24,
author = {Hamid Farzaneh and João Paulo Cardoso de Lima and Mengyuan Li and Asif Ali Khan and Xiaobo Sharon Hu and Jeronimo Castrillon},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'24), Volume 3},
title = {C4CAM: A Compiler for CAM-based In-memory Accelerators},
doi = {10.1145/3620666.3651386},
isbn = {9798400703867},
location = {La Jolla, CA, USA},
pages = {164--177},
publisher = {Association for Computing Machinery},
series = {ASPLOS '24},
url = {https://arxiv.org/abs/2309.06418},
abstract = {Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.},
address = {New York, NY, USA},
month = may,
numpages = {14},
year = {2024},
}

Downloads

2405_Farzaneh_ASPLOS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3738

Stephanie Soldavini, Felix Suchert, Serena Curzel, Michele Fiorito, Karl Friedrich Alexander Friebel, Fabrizio Ferrandi, Radim Cmar, Jeronimo Castrillon, Christian Pilato, "Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes", Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract), 1pp, May 2024. [Bibtex & Downloads]

Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes

Reference

Stephanie Soldavini, Felix Suchert, Serena Curzel, Michele Fiorito, Karl Friedrich Alexander Friebel, Fabrizio Ferrandi, Radim Cmar, Jeronimo Castrillon, Christian Pilato, "Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes", Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract), 1pp, May 2024.

Bibtex

@InProceedings{suchert_fccm24,
author = {Stephanie Soldavini and Felix Suchert and Serena Curzel and Michele Fiorito and Karl Friedrich Alexander Friebel and Fabrizio Ferrandi and Radim Cmar and Jeronimo Castrillon and Christian Pilato},
booktitle = {Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract)},
title = {Etna: {MLIR}-Based System-Level Design and Optimization for Transparent Application Execution on {CPU}-{FPGA} Nodes},
location = {Orlando, FL, USA},
pages = {1pp},
series = {FCCM'24},
month = may,
year = {2024},
}

Downloads

2405_Suchert_FCCM [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3752

Francesca Palumbo, Maria Katiuscia Zedda, Tiziana Fanni, Alessandra Bagnato, Luca Castello, Jeronimo Castrillon, Roberto Del Ponte, Yansha Deng, Bart Driessen, Mauro Fadda, Tristan Halna du Fretay, Julio de Oliveira Filho, Veena Rao, Francesco Regazzoni, Alfonso Rodríguez, Melanie Schranz, Giulia Sedda, "MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems", Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24), Association for Computing Machinery (ACM), pp. 101–106, New York, NY, USA, May 2024. [doi] [Bibtex & Downloads]

MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems

Reference

Francesca Palumbo, Maria Katiuscia Zedda, Tiziana Fanni, Alessandra Bagnato, Luca Castello, Jeronimo Castrillon, Roberto Del Ponte, Yansha Deng, Bart Driessen, Mauro Fadda, Tristan Halna du Fretay, Julio de Oliveira Filho, Veena Rao, Francesco Regazzoni, Alfonso Rodríguez, Melanie Schranz, Giulia Sedda, "MYRTUS: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems", Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24), Association for Computing Machinery (ACM), pp. 101–106, New York, NY, USA, May 2024. [doi]

Abstract
The MYRTUS Horizon Europe Project embraces the principles of the EUCloudEdgeIOT Initiative and integrates edge, fog and cloud computing platforms, leveraging a cognitive engine based on swarm intelligence and federated learning to orchestrate collaborative distributed and decentralised components. Components are augmented with interface contracts covering both functional and non-functional properties.

Bibtex

@InProceedings{palumbo_cf24,
author = {Francesca Palumbo and Maria Katiuscia Zedda and Tiziana Fanni and Alessandra Bagnato and Luca Castello and Jeronimo Castrillon and Roberto Del Ponte and Yansha Deng and Bart Driessen and Mauro Fadda and Tristan Halna du Fretay and Julio de Oliveira Filho and Veena Rao and Francesco Regazzoni and Alfonso Rodríguez and Melanie Schranz and Giulia Sedda},
booktitle = {Proceedings of the 21st ACM International Conference on Computing Frontiers (CF'24)},
title = {{MYRTUS}: Multi-layer 360 dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems},
location = {Ischia, Italy},
pages = {101--106},
numpages = {6},
publisher = {Association for Computing Machinery (ACM)},
series = {CF '24 Companion},
abstract = {The MYRTUS Horizon Europe Project embraces the principles of the EUCloudEdgeIOT Initiative and integrates edge, fog and cloud computing platforms, leveraging a cognitive engine based on swarm intelligence and federated learning to orchestrate collaborative distributed and decentralised components. Components are augmented with interface contracts covering both functional and non-functional properties.},
address = {New York, NY, USA},
month = may,
year = {2024},
doi = {10.1145/3637543.3654618},
isbn = {979-8-4007-0492-5},
}

Downloads

2405_Palumbo_CF [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3730

Jeronimo Castrillon, "Automatic optimization for heterogeneous in-memory computing", In Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk), Mar 2024. [Bibtex & Downloads]

Automatic optimization for heterogeneous in-memory computing

Reference

Jeronimo Castrillon, "Automatic optimization for heterogeneous in-memory computing", In Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk), Mar 2024.

Abstract
Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts on building an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on cross-bars, on content addressable memories and bulk-wise logic operations. We believe this kind of automation to be key to more quickly navigate the heterogeneous landscape of in-memory accelerators and to bring the benefits of emerging architectures to a broader range of applications.

Bibtex

@Misc{castrillon_date2024,
author = {Castrillon, Jeronimo},
date = {2024-03},
title = {Automatic optimization for heterogeneous in-memory computing},
howpublished = {Focus Session, Design, Automation and Test in Europe Conference (DATE) (invited talk)},
location = {Valencia, Spain},
abstract = {Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts on building an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on cross-bars, on content addressable memories and bulk-wise logic operations. We believe this kind of automation to be key to more quickly navigate the heterogeneous landscape of in-memory accelerators and to bring the benefits of emerging architectures to a broader range of applications.},
month = mar,
year = {2024},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3735

João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024. [Bibtex & Downloads]

Full-Stack Optimization for CAM-Only DNN Inference

Reference

João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon, "Full-Stack Optimization for CAM-Only DNN Inference", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Mar 2024.

Abstract
The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.

Bibtex

@InProceedings{delima_date24,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Luigi Carro and Jeronimo Castrillon},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Full-Stack Optimization for CAM-Only DNN Inference},
location = {Valencia, Spain},
pages = {1--6},
publisher = {IEEE},
series = {DATE'24},
url = {https://ieeexplore.ieee.org/document/10546805},
abstract = {The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. This is because, even in CIM systems, data movement and processing still require considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing the arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.},
month = mar,
year = {2024},
}

Downloads

2403_deLima_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3701

Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024. [Bibtex & Downloads]

Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers

Reference

Michael Niemier, Zephan Enciso, Mohammad Mehdi Sharifi, X. Sharon Hu, Ian O'Connor, Alexander Graening, Ravit Sharma, Puneet Gupta, Jeronimo Castrillon, João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Nashrah Afroze, Asif Islam Khan, Julien Ryckaert, "Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1–10, Mar 2024.

Bibtex

@InProceedings{niemier_date24,
author = {Michael Niemier and Zephan Enciso and Mohammad Mehdi Sharifi and X. Sharon Hu and Ian O'Connor and Alexander Graening and Ravit Sharma and Puneet Gupta and Jeronimo Castrillon and João Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Nashrah Afroze and Asif Islam Khan and Julien Ryckaert},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {Smoothing Disruption Across the Stack: Tales of Memory, Heterogeneity, and Compilers},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546772},
pages = {1--10},
publisher = {IEEE},
series = {DATE'24},
month = mar,
year = {2024},
}

Downloads

2403_Niemier_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3715

Christian Pilato, Subhadeep Banik, Jakub Beránek, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Radim Cmar, Serena Curzel, Fabrizio Ferrandi, Karl F. A. Friebel, Antonella Galizia, Matteo Grasso, Paulo Silva, Jan Martinovic, Gianluca Palermo, Michele Paolino, Andrea Parodi, Antonio Parodi, Fabio Pintus, Raphael Polig, David Poulet, Francesco Regazzoni, Burkhard Ringlein, Roberto Rocco, Katerina Slaninova, Tom Slooff, Stephanie Soldavini, Felix Suchert, Mattia Tibaldi, Beat Weiss, Christoph Hagleitner, "A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), 6pp, Mar 2024. [Bibtex & Downloads]

A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach

Reference

Christian Pilato, Subhadeep Banik, Jakub Beránek, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Radim Cmar, Serena Curzel, Fabrizio Ferrandi, Karl F. A. Friebel, Antonella Galizia, Matteo Grasso, Paulo Silva, Jan Martinovic, Gianluca Palermo, Michele Paolino, Andrea Parodi, Antonio Parodi, Fabio Pintus, Raphael Polig, David Poulet, Francesco Regazzoni, Burkhard Ringlein, Roberto Rocco, Katerina Slaninova, Tom Slooff, Stephanie Soldavini, Felix Suchert, Mattia Tibaldi, Beat Weiss, Christoph Hagleitner, "A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach", Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE), 6pp, Mar 2024.

Bibtex

@InProceedings{pilato_date24,
author = {Christian Pilato and Subhadeep Banik and Jakub Beránek and Fabien Brocheton and Jeronimo Castrillon and Riccardo Cevasco and Radim Cmar and Serena Curzel and Fabrizio Ferrandi and Karl F. A. Friebel and Antonella Galizia and Matteo Grasso and Paulo Silva and Jan Martinovic and Gianluca Palermo and Michele Paolino and Andrea Parodi and Antonio Parodi and Fabio Pintus and Raphael Polig and David Poulet and Francesco Regazzoni and Burkhard Ringlein and Roberto Rocco and Katerina Slaninova and Tom Slooff and Stephanie Soldavini and Felix Suchert and Mattia Tibaldi and Beat Weiss and Christoph Hagleitner},
booktitle = {Proceedings of the 2024 Design, Automation and Test in Europe Conference (DATE)},
title = {A System Development Kit for Big Data Applications on {FPGA}-based Clusters: The {EVEREST} Approach},
location = {Valencia, Spain},
url = {https://ieeexplore.ieee.org/document/10546518},
pages = {6pp},
series = {DATE'24},
month = mar,
year = {2024},
}

Downloads

2403_Pilato_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3706

Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024. [Bibtex & Downloads]

The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview

Reference

Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon, "The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview", Jan 2024.

Bibtex

@Report{khan_cimlandscape_2024,
author = {Asif Ali Khan and João Paulo C. De Lima and Hamid Farzaneh and Jeronimo Castrillon},
title = {The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview},
eprint = {2401.14428},
url = {https://arxiv.org/abs/2401.14428},
archiveprefix = {arXiv},
month = jan,
primaryclass = {cs.AR},
year = {2024},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3716

Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi] [Bibtex & Downloads]

Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory

Reference

Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon, "Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, 4pp, Jan 2024. [doi]

Bibtex

@Article{khan_ieeecal24,
author = {Asif Ali Khan and Fazal Hameed and Taha Shahroodi and Alex K. Jones and Jeronimo Castrillon},
title = {Efficient Memory Layout for Pre-alignment Filtering of Long DNA Reads Using Racetrack Memory},
pages = {4pp},
journal = {IEEE Computer Architecture Letters},
month = jan,
publisher = {IEEE},
year = {2024},
doi = {10.1109/LCA.2024.3350701},
url = {https://ieeexplore.ieee.org/document/10409506},
}

Downloads

2401_Khan_IEEECAL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3714

Robert Khasanov, Marc Dietrich, Jeronimo Castrillon, "Flexible Spatio-Temporal Energy-Efficient Runtime Management", In Proceedings: 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24), pp. 777–784, Jan 2024. [doi] [Bibtex & Downloads]

Flexible Spatio-Temporal Energy-Efficient Runtime Management

Reference

Robert Khasanov, Marc Dietrich, Jeronimo Castrillon, "Flexible Spatio-Temporal Energy-Efficient Runtime Management", In Proceedings: 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24), pp. 777–784, Jan 2024. [doi]

Bibtex

@InProceedings{khasanov_aspdac24,
author = {Robert Khasanov and Marc Dietrich and Jeronimo Castrillon},
booktitle = {29th Asia and South Pacific Design Automation Conference (ASP-DAC’24)},
title = {Flexible Spatio-Temporal Energy-Efficient Runtime Management},
location = {Incheon, South Korea},
organization = {IEEE},
pages = {777--784},
month = jan,
year = {2024},
url = {https://ieeexplore.ieee.org/document/10473885},
doi = {10.1109/ASP-DAC58780.2024.10473885},
}

Downloads

2401_Khasanov_ASPDAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3685

Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Hyperdimensional Computing Quantization with Thermometer Codes", In Proceedings: 6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 7pp, Jan 2024. [Bibtex & Downloads]

Hyperdimensional Computing Quantization with Thermometer Codes

Reference

Caio Vieira, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Hyperdimensional Computing Quantization with Thermometer Codes", In Proceedings: 6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), 7pp, Jan 2024.

Bibtex

@InProceedings{vieira_accml24,
author = {Caio Vieira and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {6th Workshop on Accelerated Machine Learning (AccML), co-located with 19th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Hyperdimensional Computing Quantization with Thermometer Codes},
location = {Munich, Germany},
pages = {7pp},
month = jan,
year = {2024},
url = {https://accml.dcs.gla.ac.uk/papers/2024/6th_AccML_paper_22.pdf},
}

Downloads

2401_Vieira_AccML [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3707


2023
HElium: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption

Reference

Mirko Günther, Lars Schütze, Kilian Becher, Thorsten Strufe, Jeronimo Castrillon, "HElium: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption", Dec 2023.

Bibtex

@online{schuetze_arxiv23,
author = {Mirko G{\"u}nther and Lars Sch{\"u}tze and Kilian Becher and Thorsten Strufe and Jeronimo Castrillon},
title = {{HElium}: A Language and Compiler for Fully Homomorphic Encryption with Support for Proxy Re-Encryption},
eprint = {2312.14250},
url = {http://arxiv.org/abs/2312.14250},
journal = {arXiv preprint arXiv:2312.14250},
month = dec,
year = {2023},
}

Downloads

2312_Schuetze_arXiv23 [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3712

Domain-specific programming methodologies for domain-specific and emerging computing systems

Reference

Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 8th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), co-located with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC23) (invited talk), Nov 2023.

Abstract
Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non-von-Neumann computer architectures. Innovations in programming abstractions and compilers are thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning, targeting modern reconfigurable hardware, emerging memory technologies and emerging in-memory computing.

Bibtex

@Misc{castrillon_espm2023,
author = {Castrillon, Jeronimo},
date = {2023-11},
title = {Domain-specific programming methodologies for domain-specific and emerging computing systems},
howpublished = {8th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), co-located with the International Conference on High Performance Computing, Networking, Storage and Analysis (SC23) (invited talk)},
location = {Denver, CO, USA},
abstract = {Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non-von-Neumann computer architectures. Innovations in programming abstractions and compilers are thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning, targeting modern reconfigurable hardware, emerging memory technologies and emerging in-memory computing.},
month = nov,
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3700

Management of Energy-Aware Processes in Heterogeneous Architectures

Reference

Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "Management of Energy-Aware Processes in Heterogeneous Architectures", In Poster at the 29th ACM Symposium on Operating Systems Principles (SOSP’23 Posters), Oct 2023.

Bibtex

@Misc{smejkal_poster-sosp23,
author = {Till Smejkal and Robert Khasanov and Jeronimo Castrillon and Hermann H{\"a}rtig},
title = {Management of Energy-Aware Processes in Heterogeneous Architectures},
howpublished = {Poster at the 29th ACM Symposium on Operating Systems Principles (SOSP’23 Posters)},
location = {Koblenz, Germany},
month = oct,
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3669

Provable Determinism for Software in Cyber-Physical Systems

Reference

Marcus Rossel, Shaokai Lin, Marten Lohstroh, Jeronimo Castrillon, Andrés Goens, "Provable Determinism for Software in Cyber-Physical Systems", Proceedings of the 15th International Conference on Verified Software: Theories, Tools, and Experiments, pp. 85–107, Oct 2023.

Bibtex

@InProceedings{rossel_vstte23,
author = {Marcus Rossel and Shaokai Lin and Marten Lohstroh and Jeronimo Castrillon and Andr{\'e}s Goens},
booktitle = {Proceedings of the 15th International Conference on Verified Software: Theories, Tools, and Experiments},
title = {Provable Determinism for Software in Cyber-Physical Systems},
doi = {10.1007/978-3-031-66064-1_6},
isbn = {978-3-031-66064-1},
organization = {Springer},
pages = {85--107},
publisher = {Springer Nature Switzerland},
url = {https://link.springer.com/chapter/10.1007/978-3-031-66064-1_6},
abstract = {In Cyber-Physical Systems (CPS), concurrently executing software components interact with each other and the physical environment to deliver functionality that is often safety-critical and time-sensitive. Verifying the correctness of the joint behavior of concurrent software components, however, is challenging. It is helpful to eliminate nondeterminism in the software, at the level of the programming model, and provide first-class programming constructs for expressing timed behavior. The Lingua Franca (LF) coordination language achieves this through the use of the Reactor model as its underlying model of computation. In this paper, we present the first formal operational semantics for the Reactor model, and prove its key properties of progress and determinism. The Reactor model and its associated proofs are fully mechanized in the Lean theorem prover. As an operational model, our semantics are close to the intuition for implementation and a helpful reference. The computational objects of the Reactor model are formalized in a modular fashion, which provides insights into the different structural properties of the model, and their effect on execution behavior.},
address = {Cham},
month = oct,
year = {2023},
}

Downloads

2310_Rossel_VSSTE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3668

Deterministic Coordination Across Multiple Timelines

Reference

Marten Lohstroh, Soroush Bateni, Christian Menard, Alexander Schulz-Rosengarten, Jeronimo Castrillon, Edward A. Lee, "Deterministic Coordination Across Multiple Timelines", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 23, no. 5, New York, NY, USA, Oct 2023. [doi]

Abstract
We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model that incorporates a multiplicity of timelines. This model is central to Lingua Franca (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called reactors, which are objects that react to and emit discrete events. Our temporal model differs from existing models like the logical execution time (LET) paradigm and synchronous languages in that it reflects that there are always at least two distinct timelines involved in a reactive system; a logical one and a physical one—and possibly multiple of each kind. This paper explains how the relationship between events across timelines facilitates reasoning about consistency and availability across components in Cyber-Physical Systems (CPS).

Bibtex

@Article{lohstroh_tecs23,
author = {Marten Lohstroh and Soroush Bateni and Christian Menard and Alexander Schulz-Rosengarten and Jeronimo Castrillon and Edward A. Lee},
title = {Deterministic Coordination Across Multiple Timelines},
doi = {10.1145/3615357},
issn = {1539-9087},
number = {5},
url = {https://doi.org/10.1145/3615357},
volume = {23},
abstract = {We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model that incorporates a multiplicity of timelines. This model is central to Lingua Franca (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called reactors, which are objects that react to and emit discrete events. Our temporal model differs from existing models like the logical execution time (LET) paradigm and synchronous languages in that it reflects that there are always at least two distinct timelines involved in a reactive system; a logical one and a physical one—and possibly multiple of each kind. This paper explains how the relationship between events across timelines facilitates reasoning about consistency and availability across components in Cyber-Physical Systems (CPS).},
address = {New York, NY, USA},
articleno = {77},
issue_date = {September 2024},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
month = oct,
numpages = {29},
publisher = {Association for Computing Machinery},
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3596

Next-generation compilers for emerging systems

Reference

Jeronimo Castrillon, "Next-generation compilers for emerging systems", In Workshop on Compilers, Deployment, and Tooling for Edge AI (CODAI’23), co-located with the ESWeek (keynote), Sep 2023.

Bibtex

@Misc{castrillon_codai2023,
author = {Castrillon, Jeronimo},
date = {2023-09},
title = {Next-generation compilers for emerging systems},
howpublished = {Workshop on Compilers, Deployment, and Tooling for Edge AI (CODAI’23), co-located with the ESWeek (keynote)},
location = {Hamburg, Germany},
month = sep,
url = {http://codai-workshop.com},
year = {2023},
}

Downloads

230921_CODAI-castrillon-compressed [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3664

Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications

Reference

Jörg Henkel, Lokesh Siddhu, Lars Bauer, Jürgen Teich, Stefan Wildermann, Mehdi Tahoori, Mahta Mayahinia, Jeronimo Castrillon, Asif Ali Khan, Hamid Farzaneh, João Paulo C. de Lima, Jian-Jia Chen, Christian Hakert, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, "Special Session – Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications", Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), pp. 11–20, Sep 2023. [doi]

Abstract
This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.

Bibtex

@InProceedings{henkel_cases23,
author = {J\"{o}rg Henkel and Lokesh Siddhu and Lars Bauer and J\"{u}rgen Teich and Stefan Wildermann and Mehdi Tahoori and Mahta Mayahinia and Jeronimo Castrillon and Asif Ali Khan and Hamid Farzaneh and Jo\~{a}o Paulo C. de Lima and Jian-Jia Chen and Christian Hakert and Kuan-Hsun Chen and Chia-Lin Yang and Hsiang-Yun Cheng},
booktitle = {Proceedings of the 2023 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
title = {Special Session -- Non-Volatile Memories: Challenges and Opportunities for Embedded System Architectures with Focus on Machine Learning Applications},
location = {Hamburg, Germany},
abstract = {This paper explores the challenges and opportunities of integrating non-volatile memories (NVMs) into embedded systems for machine learning. NVMs offer advantages such as increased memory density, lower power consumption, non-volatility, and compute-in-memory capabilities. The paper focuses on integrating NVMs into embedded systems, particularly in intermittent computing, where systems operate during periods of available energy. NVM technologies bring persistence closer to the CPU core, enabling efficient designs for energy-constrained scenarios. Next, computation in resistive NVMs is explored, highlighting its potential for accelerating machine learning algorithms. However, challenges related to reliability and device non-idealities need to be addressed. The paper also discusses memory-centric machine learning, leveraging NVMs to overcome the memory wall challenge. By optimizing memory layouts and utilizing probabilistic decision tree execution and neural network sparsity, NVM-based systems can improve cache behavior and reduce unnecessary computations. In conclusion, the paper emphasizes the need for further research and optimization for the widespread adoption of NVMs in embedded systems presenting relevant challenges, especially for machine learning applications.},
pages = {11--20},
url = {https://ieeexplore.ieee.org/abstract/document/10316216},
doi = {10.1145/3607889.3609088},
isbn = {9798400702907},
series = {CASES '23 Companion},
issn = {2643-1726},
month = sep,
numpages = {10},
year = {2023},
}

Downloads

2309_Henkel_CASES [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3654

High-Performance Deterministic Concurrency using Lingua Franca

Reference

Christian Menard, Marten Lohstroh, Soroush Bateni, Matthew Chorlian, Arthur Deng, Peter Donovan, Clément Fournier, Shaokai Lin, Felix Suchert, Tassilo Tanneberger, Hokeun Kim, Jeronimo Castrillon, Edward A. Lee, "High-Performance Deterministic Concurrency using Lingua Franca", In ACM Transactions on Architecture and Code Optimization (TACO), Association for Computing Machinery, New York, NY, USA, Aug 2023. [doi]

Abstract
Actor frameworks and similar reactive programming techniques are widely used for building concurrent systems. They promise to be efficient and scale well to a large number of cores or nodes in a distributed system. However, they also expose programmers to nondeterminism, which often makes implementations hard to understand, debug, and test. The recently proposed reactor model is a promising alternative that enables deterministic concurrency. In this paper, we present an efficient, parallel implementation of reactors and demonstrate that the determinacy of reactors does not imply a loss in performance. To show this, we evaluate Lingua Franca (LF), a reactor-oriented coordination language. LF equips mainstream programming languages with a deterministic concurrency model that automatically takes advantage of opportunities to exploit parallelism. Our implementation of the Savina benchmark suite demonstrates that, in terms of execution time, the runtime performance of LF programs even exceeds popular and highly optimized actor frameworks. We compare against Akka and CAF, which LF outperforms by 1.86x and 1.42x, respectively.

Bibtex

@Article{menard_taco23,
author = {Menard, Christian and Lohstroh, Marten and Bateni, Soroush and Chorlian, Matthew and Deng, Arthur and Donovan, Peter and Fournier, Cl{\'e}ment and Lin, Shaokai and Suchert, Felix and Tanneberger, Tassilo and Kim, Hokeun and Castrillon, Jeronimo and Lee, Edward A.},
title = {High-Performance Deterministic Concurrency using Lingua Franca},
doi = {10.1145/3617687},
issn = {1544-3566},
number = {4},
pages = {1--29},
url = {https://doi.org/10.1145/3617687},
volume = {20},
abstract = {Actor frameworks and similar reactive programming techniques are widely used for building concurrent systems. They promise to be efficient and scale well to a large number of cores or nodes in a distributed system. However, they also expose programmers to nondeterminism, which often makes implementations hard to understand, debug, and test. The recently proposed reactor model is a promising alternative that enables deterministic concurrency. In this paper, we present an efficient, parallel implementation of reactors and demonstrate that the determinacy of reactors does not imply a loss in performance. To show this, we evaluate Lingua Franca (LF), a reactor-oriented coordination language. LF equips mainstream programming languages with a deterministic concurrency model that automatically takes advantage of opportunities to exploit parallelism. Our implementation of the Savina benchmark suite demonstrates that, in terms of execution time, the runtime performance of LF programs even exceeds popular and highly optimized actor frameworks. We compare against Akka and CAF, which LF outperforms by 1.86x and 1.42x, respectively.},
address = {New York, NY, USA},
articleno = {48},
copyright = {Creative Commons Attribution 4.0 International},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
month = aug,
numpages = {29},
publisher = {Association for Computing Machinery},
year = {2023},
}

Downloads

2309_Menard_TACO [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3455

Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages

Reference

Lars Schütze, Jeronimo Castrillon, "Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages", Proceedings of the 15th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'23), Association for Computing Machinery, pp. 1–8, New York, NY, USA, Jul 2023. [doi]

Abstract
Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. Most approaches focus on optimizing language implementations neglecting the fact that the generated code is a verbose description of contextual roles in an object-oriented paradigm, which incurs an overhead. This paper takes a novel approach to reduce the semantic gap. We propose ObjectTeams/Truffle, to the best of our knowledge, the first virtual machine that optimizes the dispatch of contextual roles. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a speedup of up to 2.49× over the reference implementation ObjectTeams/Java and 1.2× over an optimized version ObjectTeams/Java using Dispatch Plans.

Bibtex

@InProceedings{schuetze_cop23,
author = {Sch\"{u}tze, Lars and Castrillon, Jeronimo},
booktitle = {Proceedings of the 15th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'23)},
title = {Towards Virtual Machine Support for Contextual Role-Oriented Programming Languages},
doi = {10.1145/3605154.3605851},
isbn = {9798400702440},
location = {Seattle, USA},
pages = {1--8},
publisher = {Association for Computing Machinery},
series = {COP '23},
url = {https://doi.org/10.1145/3605154.3605851},
abstract = {Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. Most approaches focus on optimizing language implementations neglecting the fact that the generated code is a verbose description of contextual roles in an object-oriented paradigm, which incurs an overhead. This paper takes a novel approach to reduce the semantic gap. We propose ObjectTeams/Truffle, to the best of our knowledge, the first virtual machine that optimizes the dispatch of contextual roles. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a speedup of up to 2.49\texttimes{} over the reference implementation ObjectTeams/Java and 1.2\texttimes{} over an optimized version ObjectTeams/Java using Dispatch Plans.},
address = {New York, NY, USA},
month = jul,
numpages = {8},
year = {2023},
}

Downloads

2307_Schuetze_COP [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3567

Efficient Associative Processing with RTM-TCAMs

Reference

João Paulo C. de Lima, Asif Ali Khan, Hamid Farzaneh, Jeronimo Castrillon, "Efficient Associative Processing with RTM-TCAMs", In Proceeding: 1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23), 2pp, Jul 2023.

Bibtex

@InProceedings{lima_imacaw23,
author = {Jo{\~a}o Paulo C. de Lima and Asif Ali Khan and Hamid Farzaneh and Jeronimo Castrillon},
booktitle = {1st in-Memory Architectures and Computing Applications Workshop (iMACAW), co-located with the 60th Design Automation Conference (DAC'23)},
title = {Efficient Associative Processing with RTM-TCAMs},
location = {San Francisco, CA, USA},
pages = {2pp},
month = jul,
year = {2023},
}

Downloads

2307_deLima_iMACAW [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3566

ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs

Reference

Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs", In Proceeding: 37th European Conference on Object-Oriented Programming (ECOOP 2023) (Ali, Karim and Salvaneschi, Guido), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 263, pp. 33:1–33:39, Dagstuhl, Germany, Jul 2023. (ECOOP 2023 Distinguished Artifact Award) [doi]

Abstract
SAT/SMT-solvers and model checkers automate formal verification of sequential programs. Formal reasoning about scalable concurrent programs is still manual and requires expert knowledge. But scalability is a fundamental requirement of current and future programs.

Sequential imperative programs compose statements, function/method calls and control flow constructs. Concurrent programming models provide constructs for concurrent composition. Concurrency abstractions such as threads and synchronization primitives such as locks compose the individual parts of a concurrent program that are meant to execute in parallel. We propose to rather compose the individual parts again using sequential composition and compile this sequential composition into a concurrent one. The developer can use existing tools to formally verify the sequential program while the translated concurrent program provides the dearly requested scalability.

Following this insight, we present ConDRust, a new programming model and compiler for Rust programs. The ConDRust compiler translates sequential composition into a concurrent composition based on threads and message-passing channels. During compilation, the compiler preserves the semantics of the sequential program along with much desired properties such as determinism.

Our evaluation shows that our ConDRust compiler generates concurrent deterministic code that can outperform even non-deterministic programs by up to a factor of three for irregular algorithms that are particularly hard to parallelize.

Bibtex

@InProceedings{suchert_ecoop23,
author = {Felix Suchert and Lisza Zeidler and Jeronimo Castrillon and Sebastian Ertel},
booktitle = {37th European Conference on Object-Oriented Programming (ECOOP 2023)},
title = {{ConDRust}: Scalable Deterministic Concurrency from Verifiable Rust Programs},
doi = {10.4230/LIPIcs.ECOOP.2023.33},
editor = {Ali, Karim and Salvaneschi, Guido},
isbn = {978-3-95977-281-5},
location = {Seattle, Washington, USA},
pages = {33:1--33:39},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
url = {https://drops.dagstuhl.de/opus/volltexte/2023/18226},
volume = {263},
abstract = {SAT/SMT-solvers and model checkers automate formal verification of sequential programs. Formal reasoning about scalable concurrent programs is still manual and requires expert knowledge. But scalability is a fundamental requirement of current and future programs.

Sequential imperative programs compose statements, function/method calls and control flow constructs. Concurrent programming models provide constructs for concurrent composition. Concurrency abstractions such as threads and synchronization primitives such as locks compose the individual parts of a concurrent program that are meant to execute in parallel. We propose to rather compose the individual parts again using sequential composition and compile this sequential composition into a concurrent one. The developer can use existing tools to formally verify the sequential program while the translated concurrent program provides the dearly requested scalability.

Following this insight, we present ConDRust, a new programming model and compiler for Rust programs. The ConDRust compiler translates sequential composition into a concurrent composition based on threads and message-passing channels. During compilation, the compiler preserves the semantics of the sequential program along with much desired properties such as determinism.

Our evaluation shows that our ConDRust compiler generates concurrent deterministic code that can outperform even non-deterministic programs by up to a factor of three for irregular algorithms that are particularly hard to parallelize.},
address = {Dagstuhl, Germany},
issn = {1868-8969},
month = jul,
urn = {urn:nbn:de:0030-drops-182263},
year = {2023},
}

Downloads

2307_Suchert_ECOOP [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3552

BASE2: An IR for Binary Numeral Types

Reference

Karl F. A. Friebel, Jiahong Bi, Jeronimo Castrillon, "BASE2: An IR for Binary Numeral Types", In Proceeding: 13th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2023), Association for Computing Machinery, pp. 19–26, New York, NY, USA, Jun 2023. [doi]

Abstract
Custom data types and arbitrary-precision arithmetic are often key for efficient hardware designs on Field Programmable Gate Array (FPGA) platforms. Current end-to-end flows incorporating quantization are not only domain-specific, but also tightly integrated and not repurposable. Abstractions for arbitrary-precision arithmetic are generally vendor-specific, and results are hardly portable across platforms. In this work, we present a new Intermediate Representation (IR), base2, to address the programmability issues of custom data types in reconfigurable hardware. We contextualize our proposal in the greater LLVM (llvm) ecosystem, where we show how existing abstractions can be simplified and unified. We implement base2 in Multi-Level Intermediate Representation (MLIR), which allows it to be used in a variety of existing and future target-agnostic front-ends. We demonstrate the power of our model by applying it to sample kernels and evaluating the accuracy of the result. For these samples, we achieve interoperability with an existing end-to-end High-Level Synthesis (HLS) flow.

Bibtex

@InProceedings{friebel_heart23,
author = {Karl F. A. Friebel and Jiahong Bi and Jeronimo Castrillon},
booktitle = {13th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2023)},
title = {{BASE2}: An {IR} for Binary Numeral Types},
doi = {10.1145/3597031.3597048},
isbn = {9798400700439},
location = {Kusatsu, Japan},
pages = {19--26},
publisher = {Association for Computing Machinery},
series = {HEART2023},
url = {https://doi.org/10.1145/3597031.3597048},
abstract = {Custom data types and arbitrary-precision arithmetic are often key for efficient hardware designs on Field Programmable Gate Array (FPGA) platforms. Current end-to-end flows incorporating quantization are not only domain-specific, but also tightly integrated and not repurposable. Abstractions for arbitrary-precision arithmetic are generally vendor-specific, and results are hardly portable across platforms. In this work, we present a new Intermediate Representation (IR), base2, to address the programmability issues of custom data types in reconfigurable hardware. We contextualize our proposal in the greater LLVM (llvm) ecosystem, where we show how existing abstractions can be simplified and unified. We implement base2 in Multi-Level Intermediate Representation (MLIR), which allows it to be used in a variety of existing and future target-agnostic front-ends. We demonstrate the power of our model by applying it to sample kernels and evaluating the accuracy of the result. For these samples, we achieve interoperability with an existing end-to-end High-Level Synthesis (HLS) flow.},
address = {New York, NY, USA},
month = jun,
numpages = {8},
year = {2023},
}

Downloads

2306_Friebel_HEART [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3554


Design Space Exploration for CNN Offloading to FPGAs at the Edge

Reference

Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Design Space Exploration for CNN Offloading to FPGAs at the Edge", In Proceeding: 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, Los Alamitos, CA, USA, Jun 2023. [doi]

Abstract
AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs' computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2x) and process inferences at higher user quality of experience (by up to 12.5%).

Bibtex

@InProceedings{korol_isvlsi23,
author = {Guilherme Korol and Michael Guilherme Jordan and Mateus Beck Rutzig and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)},
title = {Design Space Exploration for {CNN} Offloading to {FPGAs} at the Edge},
doi = {10.1109/ISVLSI59464.2023.10238644},
location = {Foz do Igua{\c{c}}u, Brazil},
organization = {IEEE},
publisher = {IEEE Computer Society},
url = {https://ieeexplore.ieee.org/abstract/document/10238644},
address = {Los Alamitos, CA, USA},
month = jun,
year = {2023},
abstract = {AI-based IoT applications relying on heavy-load deep learning algorithms like CNNs challenge IoT devices that are restricted in energy or processing capabilities. Edge computing offers an alternative by allowing the data to get offloaded to so-called edge servers with hardware more powerful than IoT devices and physically closer than the cloud. However, the increasing complexity of data and algorithms and diverse conditions make even powerful devices, such as those equipped with FPGAs, insufficient to cope with the current demands. In this case, optimizations in the algorithms, like pruning and early-exit, are mandatory to reduce the CNNs' computational burden and speed up inference processing. With that in mind, we propose ExpOL, which combines the pruning and early-exit CNN optimizations in a system-level FPGA-based IoT-Edge design space exploration. Based on a user-defined multi-target optimization, ExpOL delivers designs tailored to specific application environments and user needs. When evaluated against state-of-the-art FPGA-based accelerators (either local or offloaded), designs produced by ExpOL are more power-efficient (by up to 2x) and process inferences at higher user quality of experience (by up to 12.5\%).},
}

Downloads

2306_Korol_ISVLSI [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3553


Intermediate abstractions and optimizing compilers for adaptable HPC

Reference

Jeronimo Castrillon, "Intermediate abstractions and optimizing compilers for adaptable HPC", In 4th Workshop on LLVM Compiler and Tools for HPC (LLVM-CTH’23), co-located with the 38th ISC High Performance Conference (invited talk), May 2023.

Bibtex

@Misc{castrillon_isc_llcvm-cth2023,
author = {Castrillon, Jeronimo},
date = {2023-05},
title = {Intermediate abstractions and optimizing compilers for adaptable {HPC}},
howpublished = {4th Workshop on LLVM Compiler and Tools for HPC (LLVM-CTH’23), co-located with the 38th ISC High Performance Conference (invited talk)},
location = {Hamburg, Germany},
url = {https://hps.vi4io.org/events/2023/llvm},
month = may,
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3556


Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs

Reference

Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, Jeronimo Castrillon, Antonio Carlos Schneider Beck, "Pruning and Early-Exit Co-Optimization for CNN Acceleration on FPGAs", Proceedings of the 2023 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1-6, Apr 2023. [doi]

Abstract
The challenge of processing heavy-load ML tasks, particularly CNN-based ones at resource-constrained IoT devices, has encouraged the use of edge servers. The edge offers performance levels higher than the end devices and better latency and security levels than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and the current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurable FPGA capabilities so both can be cooperatively employed. AdaPEx first explores the trade-off between pruning and early-exit at design-time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At runtime, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator.

Bibtex

@InProceedings{korol_date23,
author = {Guilherme Korol and Michael Guilherme Jordan and Mateus Beck Rutzig and Jeronimo Castrillon and Antonio Carlos Schneider Beck},
booktitle = {Proceedings of the 2023 Design, Automation and Test in Europe Conference (DATE)},
title = {Pruning and Early-Exit Co-Optimization for {CNN} Acceleration on {FPGAs}},
location = {Antwerp, Belgium},
publisher = {IEEE},
series = {DATE'23},
month = apr,
year = {2023},
doi = {10.23919/DATE56975.2023.10137244},
pages = {1--6},
url = {https://ieeexplore.ieee.org/document/10137244},
abstract = {The challenge of processing heavy-load ML tasks, particularly CNN-based ones at resource-constrained IoT devices, has encouraged the use of edge servers. The edge offers performance levels higher than the end devices and better latency and security levels than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and the current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurable FPGA capabilities so both can be cooperatively employed. AdaPEx first explores the trade-off between pruning and early-exit at design-time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At runtime, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator.},
}

Downloads

2304_Korol_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3438


Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs

Reference

Carlos Escuin, Asif Ali Khan, Pablo Ibáñez-Marín, Teresa Monreal, Jeronimo Castrillon, Víctor Viñals-Yúfera, "Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid LLCs", In Proceeding: the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23), IEEE Computer Society, pp. 179–192, Los Alamitos, CA, USA, Mar 2023. [doi]

Abstract
Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs have been proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks. This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse block or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime. Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparable lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1% and 1.9% performance, the NVM lifetime can be further increased by 28% and 44%, respectively.
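The insertion rule stated in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware policy: the abstract does not say which criterion wins when the reuse class and the compressed size disagree, so the sketch assumes the reuse class takes precedence and the size is used only for blocks without a known reuse class. The names `Reuse`, `choose_partition`, and the `NONE` class are hypothetical.

```python
from enum import Enum


class Reuse(Enum):
    READ = "read-reuse"
    WRITE = "write-reuse"
    NONE = "unknown"  # hypothetical class for blocks without observed reuse


def choose_partition(reuse, compressed_size, threshold):
    """Return "NVM" or "SRAM" for an incoming LLC block."""
    # Assumption: the reuse class takes precedence over the compressed size.
    if reuse is Reuse.READ:
        return "NVM"   # read-reuse blocks cause few NVM writes
    if reuse is Reuse.WRITE:
        return "SRAM"  # write-reuse blocks would wear out NVM cells
    # Assumption: a block exactly at the threshold goes to SRAM.
    return "NVM" if compressed_size < threshold else "SRAM"


# Sizes and threshold in bytes, chosen arbitrarily for the example.
print(choose_partition(Reuse.NONE, compressed_size=24, threshold=32))
print(choose_partition(Reuse.WRITE, compressed_size=24, threshold=32))
```

In the paper the threshold is tuned at runtime with set-dueling; here it is a plain parameter, so lowering it steers fewer blocks into the NVM part and reduces the NVM write rate.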

Bibtex

@InProceedings{escuin_hpca23,
author = {Carlos Escuin and Asif Ali Khan and Pablo Ibáñez-Marín and Teresa Monreal and Jeronimo Castrillon and Víctor Viñals-Yúfera},
booktitle = {the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA'23)},
title = {Compression-Aware and Performance-Efficient Insertion Policies for Long-Lasting Hybrid {LLCs}},
organization = {IEEE},
pages = {179--192},
abstract = {Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs have been proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks. This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse block or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime. Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17x compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9\% while achieving a comparable lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1\% and 1.9\% performance, the NVM lifetime can be further increased by 28\% and 44\%, respectively.},
doi = {10.1109/HPCA56546.2023.10070968},
url = {https://doi.ieeecomputersociety.org/10.1109/HPCA56546.2023.10070968},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = mar,
year = {2023},
}

Downloads

2302_Escuin_HPCA [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3427


DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories

Reference

Asif Ali Khan, Sebastien Ollivier, Fazal Hameed, Jeronimo Castrillon, Alex K. Jones, "DownShift: Tuning Shift Reduction with Reliability for Racetrack Memories", In IEEE Transactions on Computers, IEEE, Mar 2023. [doi]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6%, and 70.8% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2%.
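To make the link between data placement and shift count concrete, the sketch below evaluates two placements of the same memory objects under a simple cost model. This is not the DownShift heuristic itself. It assumes a single track with one access port, where moving from position i to position j costs |i - j| shifts; the function name `count_shifts`, the trace, and both placements are made up for the example.

```python
def count_shifts(trace, placement, start=0):
    """Total shifts needed to serve an access trace on one track with one port.

    trace:     sequence of object names in access order
    placement: dict mapping each object name to its position on the track
    start:     position initially aligned with the access port
    """
    port = start
    total = 0
    for obj in trace:
        target = placement[obj]
        total += abs(target - port)  # shifts needed to align the target
        port = target
    return total


trace = ["a", "c", "a", "c", "b"]
naive = {"a": 0, "b": 1, "c": 2}  # objects placed in declaration order
tuned = {"a": 0, "c": 1, "b": 2}  # frequently alternating objects placed adjacently
print(count_shifts(trace, naive), count_shifts(trace, tuned))  # prints "7 4"
```

Placing objects that are accessed back to back next to each other lowers the shift count without any hardware change, which is the effect that placement strategies for racetrack memories exploit.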

Bibtex

@Article{khan_toc23,
author = {Asif Ali Khan and Sebastien Ollivier and Fazal Hameed and Jeronimo Castrillon and Alex K. Jones},
date = {2023-03},
journal = {IEEE Transactions on Computers},
doi = {10.1109/TC.2023.3257509},
title = {{DownShift}: Tuning Shift Reduction with Reliability for Racetrack Memories},
abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs, required for data access, incur performance penalties and can induce position errors. These factors can hinder their applicability in replacing low-latency, reliable on-chip memories. Intelligent placement of memory objects in RTMs can significantly reduce the number of shifts per memory access with little to no hardware overhead. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. Additionally, the impact of these shift optimization techniques on RTM reliability has been insufficiently investigated. We propose DownShift, a generalized data placement mechanism that improves upon prior approaches by taking into account (1) the timing and liveliness information of memory objects and (2) the underlying memory architecture, including required shifting fault tolerance. Thus, we also propose a collaboratively designed new shift alignment reliability technique called GROGU. GROGU leverages the reduced shift window made possible through DownShift allowing improved reliability, area, and energy compared to the state-of-the-art reliability approaches. DownShift reduces the number of shifts, runtime, and energy consumption by 3.24x, 47.6\%, and 70.8\% compared to the state-of-the-art. GROGU consumes 2.2x less area and 1.3x less energy while providing 16.8x improvement in shift fault tolerance compared to the leading reliability approach for a latency degradation of only 3.2\%.},
month = mar,
numpages = {15},
publisher = {IEEE},
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3524


Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache

Reference

Carlos Escuín Blasco, Asif Ali Khan, Pablo Enrique Ibáñez Marín, Teresa Monreal Arnal, Denis Navarro, José M Llaberia Griñó, Jeronimo Castrillon, Victor Viñals Yúfera, "Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache", In Proceeding: 14th Annual Non-Volatile Memories Workshop: University of California, San Diego, Mar 2023.

Bibtex

@InProceedings{escuin_nvmw23,
author = {Escuín Blasco, Carlos and Ali Khan, Asif and Ibáñez Marín, Pablo Enrique and Monreal Arnal, Teresa and Navarro, Denis and Llaberia Griñó, José M and Castrillon, Jeronimo and Viñals Yúfera, Victor},
booktitle = {14th Annual Non-Volatile Memories Workshop: University of California, San Diego},
title = {Leveraging data compression for performance-efficient and long-lasting {NVM}-based last-level cache},
organization = {University of California, San Diego (UCSD)},
url = {https://upcommons.upc.edu/bitstream/handle/2117/395690/nvmw2023-paper8-final_version_your_extended_abstract.pdf?sequence=1},
month = mar,
year = {2023},
}

Downloads

2303_Escuin_NVMW [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3698


Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

Reference

Stephanie Soldavini, Karl F. A. Friebel, Mattia Tibaldi, Gerald Hempel, Jeronimo Castrillon, Christian Pilato, "Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics", In ACM Transactions on Reconfigurable Technology and Systems (TRETS), Association for Computing Machinery, vol. 16, no. 2, New York, NY, USA, Mar 2023. [doi]

Abstract
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25× more energy efficient than expert-crafted Intel CPU implementations.

Bibtex

@Article{friebel_trets23,
author = {Stephanie Soldavini and Karl F. A. Friebel and Mattia Tibaldi and Gerald Hempel and Jeronimo Castrillon and Christian Pilato},
title = {Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics},
doi = {10.1145/3563553},
issn = {1936-7406},
number = {2},
url = {https://doi.org/10.1145/3563553},
volume = {16},
abstract = {Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25 \texttimes{} more energy efficient than expert-crafted Intel CPU implementations.},
address = {New York, NY, USA},
articleno = {21},
journal = {ACM Transactions on Reconfigurable Technology and Systems (TRETS)},
month = mar,
numpages = {34},
publisher = {Association for Computing Machinery},
year = {2023},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3372


Programming abstractions and optimizing compilers for energy-efficient computing

Reference

Jeronimo Castrillon, "Programming abstractions and optimizing compilers for energy-efficient computing", In 1st Workshop on NetZero Carbon Computing (NetZero'23), Co-located with the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) (invited talk), Feb 2023.

Abstract
The demise of scaling laws in micro-electronics has led to an era of innovation in software and hardware architectures aimed at improving the energy efficiency of computing systems. Albeit still highly relevant, software optimizations for mainstream systems, which make the bulk of today's computing systems, provide ever-decreasing returns in the range of single-digit percentages. This is why lots of attention has rightfully turned to domain-specific architectures and emerging technologies which promise improvements of one to several orders of magnitude. Software development for these novel systems is still characterized by low-level expert coding and brittle toolchains, preventing hardware innovations from reaching a broader impact. In this talk, we discuss ongoing efforts on providing high-level programming abstractions and optimizing compilers to automatically target emerging computing systems. We do this by looking at three ongoing projects, namely, (i) a collaborative HW-SW effort to reduce the energy footprint of baseband processing in upcoming cellular networks, (ii) end-to-end compilation for energy-efficient HPC simulations on state-of-the-art reconfigurable systems, and (iii) an extensible compilation framework to optimize for novel in-memory and near-memory computing systems. Finally, we are interested in discussing how these kinds of tools can be embedded in the larger picture of full life-cycle management.

Bibtex

@Misc{castrillon_netzero2023,
author = {Castrillon, Jeronimo},
date = {2023-02},
title = {Programming abstractions and optimizing compilers for energy-efficient computing},
howpublished = {1st Workshop on NetZero Carbon Computing (NetZero'23), Co-located with the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29) (invited talk)},
location = {Montreal, Canada},
url = {https://netzero.sysnet.ucsd.edu},
abstract = {The demise of scaling laws in micro-electronics has led to an era of innovation in software and hardware architectures aimed at improving the energy efficiency of computing systems. Albeit still highly relevant, software optimizations for mainstream systems, which make the bulk of today's computing systems, provide ever-decreasing returns in the range of single-digit percentages. This is why lots of attention has rightfully turned to domain-specific architectures and emerging technologies which promise improvements of one to several orders of magnitude. Software development for these novel systems is still characterized by low-level expert coding and brittle toolchains, preventing hardware innovations from reaching a broader impact. In this talk, we discuss ongoing efforts on providing high-level programming abstractions and optimizing compilers to automatically target emerging computing systems. We do this by looking at three ongoing projects, namely, (i) a collaborative HW-SW effort to reduce the energy footprint of baseband processing in upcoming cellular networks, (ii) end-to-end compilation for energy-efficient HPC simulations on state-of-the-art reconfigurable systems, and (iii) an extensible compilation framework to optimize for novel in-memory and near-memory computing systems. Finally, we are interested in discussing how these kinds of tools can be embedded in the larger picture of full life-cycle management.},
month = feb,
year = {2023},
}

Downloads

2302_castrillon_NetZero [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3519

Jeronimo Castrillon, Karol Desnos, Andrés Goens, Christian Menard, "Dataflow Models of Computation for Programming Heterogeneous Multicores", Springer Nature Singapore, pp. 1–40, Singapore, Jan 2023. [doi] [Bibtex & Downloads]

Dataflow Models of Computation for Programming Heterogeneous Multicores

Reference

Jeronimo Castrillon, Karol Desnos, Andrés Goens, Christian Menard, "Dataflow Models of Computation for Programming Heterogeneous Multicores", Springer Nature Singapore, pp. 1–40, Singapore, Jan 2023. [doi]

Abstract
The hardware complexity of modern integrated circuits keeps increasing at a steady pace. Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) integrate general-purpose processing elements, domain-specific processors, dedicated hardware accelerators, reconfigurable logic, as well as complex memory hierarchies and interconnect. While offering unprecedented computational power and energy efficiency, MPSoCs are notoriously difficult to program. This chapter presents Models of Computation (MoCs) as an appealing alternative to traditional programming methodologies to harness the full capacities of modern MPSoCs. By raising the level of abstraction, MoCs make it possible to specify complex systems with little knowledge of the target architecture. The properties of MoCs make it possible for tools to automatically generate efficient implementations for heterogeneous MPSoCs, relieving developers from time-consuming manual exploration. This chapter focuses on a specific MoC family called dataflow MoCs. Dataflow MoCs represent systems as graphs of computational entities and communication channels. This graph-based system specification enables intuitive description of parallelism and supports many analysis and optimization techniques for deriving safe and highly efficient implementations on MPSoCs.

Bibtex

@InBook{castrillon_hca22,
author = {Jeronimo Castrillon and Karol Desnos and Andrés Goens and Christian Menard},
booktitle = {Handbook of Computer Architecture},
date = {2023-01},
pages = {1--40},
title = {Dataflow Models of Computation for Programming Heterogeneous Multicores},
doi = {10.1007/978-981-15-6401-7_45-2},
isbn = {978-981-15-6401-7},
url = {https://doi.org/10.1007/978-981-15-6401-7_45-2},
editor = {Anupam Chattopadhyay and others},
publisher = {Springer Nature Singapore},
address = {Singapore},
abstract = {The hardware complexity of modern integrated circuits keeps increasing at a steady pace. Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) integrate general-purpose processing elements, domain-specific processors, dedicated hardware accelerators, reconfigurable logic, as well as complex memory hierarchies and interconnect. While offering unprecedented computational power and energy efficiency, MPSoCs are notoriously difficult to program. This chapter presents Models of Computation (MoCs) as an appealing alternative to traditional programming methodologies to harness the full capacities of modern MPSoCs. By raising the level of abstraction, MoCs make it possible to specify complex systems with little knowledge of the target architecture. The properties of MoCs make it possible for tools to automatically generate efficient implementations for heterogeneous MPSoCs, relieving developers from time-consuming manual exploration. This chapter focuses on a specific MoC family called dataflow MoCs. Dataflow MoCs represent systems as graphs of computational entities and communication channels. This graph-based system specification enables intuitive description of parallelism and supports many analysis and optimization techniques for deriving safe and highly efficient implementations on MPSoCs.},
month = jan,
year = {2023},
}

Downloads

2301_Castrillon_dataflow-programmig_preview-www [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3393

Karl F. A. Friebel, Asif Ali Khan, Lorenzo Chelini, Jeronimo Castrillon, "Modelling linear algebra kernels as polyhedral volume operations", In Proceeding: 13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2023. [Bibtex & Downloads]

Modelling linear algebra kernels as polyhedral volume operations

Reference

Karl F. A. Friebel, Asif Ali Khan, Lorenzo Chelini, Jeronimo Castrillon, "Modelling linear algebra kernels as polyhedral volume operations", In Proceeding: 13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2023.

Bibtex

@InProceedings{friebel_impact23,
author = {Karl F. A. Friebel and Asif Ali Khan and Lorenzo Chelini and Jeronimo Castrillon},
booktitle = {13th International Workshop on Polyhedral Compilation Techniques (IMPACT'23), co-located with 18th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Modelling linear algebra kernels as polyhedral volume operations},
location = {Toulouse, France},
url = {https://impact-workshop.org/papers/paper10.pdf},
month = jan,
year = {2023},
}

Downloads

2301_Friebel_IMPACT [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3453

Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)", In Dagstuhl Artifacts Series (Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 9, no. 2, pp. 16:1–16:3, Dagstuhl, Germany, 2023. [doi] [Bibtex & Downloads]

ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)

Reference

Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs (Artifact)", In Dagstuhl Artifacts Series (Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 9, no. 2, pp. 16:1–16:3, Dagstuhl, Germany, 2023. [doi]

Bibtex

@Article{suchert_et_al:DARTS.9.2.16,
author = {Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian},
title = {{ConDRust}: Scalable Deterministic Concurrency from Verifiable {Rust} Programs (Artifact)},
pages = {16:1--16:3},
journal = {Dagstuhl Artifacts Series},
ISSN = {2509-8195},
year = {2023},
volume = {9},
number = {2},
editor = {Suchert, Felix and Zeidler, Lisza and Castrillon, Jeronimo and Ertel, Sebastian},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2023/18256},
URN = {urn:nbn:de:0030-drops-182562},
doi = {10.4230/DARTS.9.2.16},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3593


2022
Fazal Hameed, Jeronimo Castrillon, "BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, pp. 5288–5298, Dec 2022. [doi] [Bibtex & Downloads]

BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture

Reference

Fazal Hameed, Jeronimo Castrillon, "BlendCache: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), vol. 41, pp. 5288–5298, Dec 2022. [doi]

Abstract
Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require a large number of leaky cells, incurring high energy consumption. To address this problem, this paper proposes an energy-efficient RTM design called BlendCache that efficiently stores the tags in the leakage-optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core system, BlendCache reduces the energy consumption by 20.8% and area by 15.2% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9% with 3% performance degradation.

Bibtex

@Article{hameed_tcad22,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {{BlendCache}: An Energy and Area Efficient Racetrack Last-Level-Cache Architecture},
doi = {10.1109/TCAD.2022.3161198},
issn = {0278-0070},
number = {12},
pages = {5288--5298},
url = {https://ieeexplore.ieee.org/document/9739802},
volume = {41},
abstract = {Racetrack memory (RTM) is a promising nonvolatile memory that provides multi-bit storage cells achieving a higher area and leakage energy efficiency compared to contemporary volatile and non-volatile memories. These features make RTM a potential candidate to be used as a Last-Level-Cache (LLC). One drawback of the multi-bit RTM cell is the serialized access to the stored data, resulting in a shift penalty to access a particular bit within the cell. This overhead is particularly critical for LLC tags, for which prior RTM designs place tags either in SRAM or in single-bit RTM cells. While this avoids shifting, these designs require a large number of leaky cells, incurring high energy consumption. To address this problem, this paper proposes an energy-efficient RTM design called BlendCache that efficiently stores the tags in the leakage-optimized multi-bit RTM cells. To reduce the RTM shift penalty of these cells, BlendCache exploits the spatial locality of programs by maximizing accesses to nearby locations in RTM. Employing 32-bit RTM cells for a single-core system, BlendCache reduces the energy consumption by 20.8\% and area by 15.2\% compared to the state-of-the-art while its impact on performance is negligible. For a 4-core system, the energy improvement translates to 35.9\% with 3\% performance degradation.},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD)},
month = dec,
project = {tracesymm,cfaed},
year = {2022},
}

Downloads

2204_Hameed_TCAD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3333

Julian Robledo, Jeronimo Castrillon, "Parameterizable mobile workloads for adaptable base station optimizations", Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22), pp. 381-386, Dec 2022. [doi] [Bibtex & Downloads]

Parameterizable mobile workloads for adaptable base station optimizations

Reference

Julian Robledo, Jeronimo Castrillon, "Parameterizable mobile workloads for adaptable base station optimizations", Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22), pp. 381-386, Dec 2022. [doi]

Abstract
Recent works on 5G baseband processing systems address the optimization of applications with different requirements of quality of service (QoS). The volume and heterogeneity of applications that have to be processed on a base station are growing and 5G introduces new use cases that push system designers towards more flexible and adaptable approaches. To investigate future network challenges of mobile communications, a good methodology for the generation of realistic workloads, that allows target optimizations of different traffic scenarios, is required. In this paper, we study the variation of real traffic data on multiple base stations and identify the main sources for the high variation of the 5G workloads. We propose a methodology for parameterizable workload generation for users with different QoS requirements that enables optimization techniques in baseband processing systems. We demonstrate the feasibility of our approach based on a virtual base station using a heterogeneous hardware model and various state-of-the-art mapping policies.

Bibtex

@InProceedings{robledo_mcsoc22,
author = {Julian Robledo and Jeronimo Castrillon},
booktitle = {Proceedings of the IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-22)},
title = {Parameterizable mobile workloads for adaptable base station optimizations},
doi = {10.1109/MCSoC57363.2022.00067},
url = {https://ieeexplore.ieee.org/document/10008473},
location = {Penang, Malaysia},
pages = {381--386},
abstract = {Recent works on 5G baseband processing systems address the optimization of applications with different requirements of quality of service (QoS). The volume and heterogeneity of applications that have to be processed on a base station are growing and 5G introduces new use cases that push system designers towards more flexible and adaptable approaches. To investigate future network challenges of mobile communications, a good methodology for the generation of realistic workloads, that allows target optimizations of different traffic scenarios, is required. In this paper, we study the variation of real traffic data on multiple base stations and identify the main sources for the high variation of the 5G workloads. We propose a methodology for parameterizable workload generation for users with different QoS requirements that enables optimization techniques in baseband processing systems. We demonstrate the feasibility of our approach based on a virtual base station using a heterogeneous hardware model and various state-of-the-art mapping policies.},
month = dec,
year = {2022},
}

Downloads

2212_Robledo_MCSOC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3429

Nesrine Khouzami, Jeronimo Castrillon, "Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations", Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC), Nov 2022. [Bibtex & Downloads]

Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations

Reference

Nesrine Khouzami, Jeronimo Castrillon, "Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations", Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC), Nov 2022.

Abstract
We present OpenPME (Open Particle-Mesh Environment), a Problem Solving Environment (PSE) which provides a Domain Specific Language (DSL) built atop a domain model general enough to write numerical simulations in scientific computing using particle-mesh abstractions. This helps to close the productivity gap in HPC applications and effectively lowers the programming barrier to enable the smooth implementation of scalable simulations. We also introduce a model-based autotuning approach for discretization methods in the OpenPME compiler. We evaluate the autotuner in two diffusion test cases and the results show that we consistently find configurations that outperform those found by state-of-the-art general-purpose autotuners.

Bibtex

@InProceedings{khouzami_sc_whpc22,
author = {Nesrine Khouzami and Jeronimo Castrillon},
title = {Problem Solving Environment and Compiler Optimizations for High-Performance Particle-Mesh Numerical Simulations},
location = {Dallas, Texas},
booktitle = {Supercomputing Conference (SC 2022) - Women in HPC Workshop (WHPC)},
abstract = {We present OpenPME (Open Particle-Mesh Environment), a Problem Solving Environment (PSE) which provides a Domain Specific Language (DSL) built atop a domain model general enough to write numerical simulations in scientific computing using particle-mesh abstractions. This helps to close the productivity gap in HPC applications and effectively lowers the programming barrier to enable the smooth implementation of scalable simulations. We also introduce a model-based autotuning approach for discretization methods in the OpenPME compiler. We evaluate the autotuner in two diffusion test cases and the results show that we consistently find configurations that outperform those found by state-of-the-art general-purpose autotuners.},
month = nov,
numpages = {3},
year = {2022},
}

Downloads

2211_Khouzami_WHPC-SC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3408

Felix Suchert, Jeronimo Castrillon, "STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks", In Proceeding: Proceedings of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22) (Gainaru, Ana and Zhang, Ce and Luo, Chunjie), Springer International Publishing, pp. 160–175, Cham, Nov 2022. [doi] [Bibtex & Downloads]

STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks

Reference

Felix Suchert, Jeronimo Castrillon, "STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks", In Proceeding: Proceedings of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22) (Gainaru, Ana and Zhang, Ce and Luo, Chunjie), Springer International Publishing, pp. 160–175, Cham, Nov 2022. [doi]

Abstract
Software Transactional Memory has been used as a synchronization mechanism that is easier to use and compose than locking ones. The mechanism's continued relevance in research and application design motivates considerations regarding safer implementations than existing C libraries. In this paper, we study the impact of the Rust programming language on STM performance and code quality. To facilitate the comparison, we manually translated the STAMP benchmark suite to Rust and also generated a version using a state-of-the-art C-to-Rust transpiler. We find that, while idiomatic implementations using safe Rust are generally slower than both C and transpiled code, they guarantee memory safety and improve code quality.

Bibtex

@InProceedings{suchert_bench22,
author = {Felix Suchert and Jeronimo Castrillon},
booktitle = {Proceedings of the BenchCouncil Transactions on Benchmarks, Standards and Evaluations (Bench22)},
title = {STAMP-Rust: Language and Performance Comparison to C on Transactional Benchmarks},
doi = {10.1007/978-3-031-31180-2_10},
editor = {Gainaru, Ana and Zhang, Ce and Luo, Chunjie},
isbn = {978-3-031-31180-2},
pages = {160--175},
publisher = {Springer International Publishing},
url = {https://link.springer.com/chapter/10.1007/978-3-031-31180-2_10},
abstract = {Software Transactional Memory has been used as a synchronization mechanism that is easier to use and compose than locking ones. The mechanism's continued relevance in research and application design motivates considerations regarding safer implementations than existing C libraries. In this paper, we study the impact of the Rust programming language on STM performance and code quality. To facilitate the comparison, we manually translated the STAMP benchmark suite to Rust and also generated a version using a state-of-the-art C-to-Rust transpiler. We find that, while idiomatic implementations using safe Rust are generally slower than both C and transpiled code, they guarantee memory safety and improve code quality.},
address = {Cham},
month = nov,
year = {2022},
}

Downloads

2211_Suchert_BENCH [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3392

Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "Shisha: Online scheduling of CNN pipelines on heterogeneous architectures", Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics, pp. 249–262, Nov 2022. [Bibtex & Downloads]

Shisha: Online scheduling of CNN pipelines on heterogeneous architectures

Reference

Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "Shisha: Online scheduling of CNN pipelines on heterogeneous architectures", Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics, pp. 249–262, Nov 2022.

Abstract
Many modern multicore processors integrate asymmetric core clusters. With the trend towards Multi-Chip-Modules (MCMs) and interposer-based packaging technologies, platforms will feature heterogeneity at the level of cores, memory subsystem and the interconnect. Due to their potential high memory throughput and energy efficient core modules, these platforms are prominent targets for emerging machine learning applications, such as Convolutional Neural Networks (CNNs). To exploit and adapt to the diversity of modern heterogeneous chips, CNNs need to be quickly optimized in terms of scheduling and workload distribution among computing resources. To address this we propose Shisha, an online approach to generate and schedule parallel CNN pipelines on heterogeneous MCM-based architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by 35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.

Bibtex

@InProceedings{soomro_ppam22,
author = {Pirah Noor Soomro and Mustafa Abduljabbar and Jeronimo Castrillon and Miquel Pericás},
booktitle = {Proceedings of the 14th International Conference on Parallel Processing and Applied Mathematics},
date = {2022-11},
title = {Shisha: Online scheduling of CNN pipelines on heterogeneous architectures},
location = {Gdansk, Poland},
abstract = {Many modern multicore processors integrate asymmetric core clusters. With the trend towards Multi-Chip-Modules (MCMs) and interposer-based packaging technologies, platforms will feature heterogeneity at the level of cores, memory subsystem and the interconnect. Due to their potential high memory throughput and energy efficient core modules, these platforms are prominent targets for emerging machine learning applications, such as Convolutional Neural Networks (CNNs). To exploit and adapt to the diversity of modern heterogeneous chips, CNNs need to be quickly optimized in terms of scheduling and workload distribution among computing resources. To address this we propose Shisha, an online approach to generate and schedule parallel CNN pipelines on heterogeneous MCM-based architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.},
month = nov,
year = {2022},
pages = {249--262},
url = {https://www.springerprofessional.de/shisha-online-scheduling-of-cnn-pipelines-on-heterogeneous-archi/25292292},
}

Downloads

2211_Soomro_PPAM [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3359

Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi] [Bibtex & Downloads]

DNA Pre-alignment Filter using Processing Near Racetrack Memory

Reference

Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", In IEEE Computer Architecture Letters, IEEE, pp. 1–4, Jul 2022. [doi]

Abstract
Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high background and refresh energy consumption as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM), an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68% and 52%, respectively, compared to the state-of-the-art DRAM-based architecture.

Bibtex

@Article{hameed_ieeecal22,
author = {Fazal Hameed and Asif Ali Khan and Sebastien Ollivier and Alex K. Jones and Jeronimo Castrillon},
date = {2022-08},
journal = {IEEE Computer Architecture Letters},
title = {DNA Pre-alignment Filter using Processing Near Racetrack Memory},
abstract = {Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high background and refresh energy consumption as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM), an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68\% and 52\%, respectively, compared to the state-of-the-art DRAM-based architecture.},
month = jul,
numpages = {4},
publisher = {IEEE},
year = {2022},
doi = {10.1109/LCA.2022.3194263},
pages = {1--4},
url = {https://ieeexplore.ieee.org/document/9841612},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3361

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1-14, Jul 2022. [doi] [Bibtex & Downloads]

ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees

Reference

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "ROLLED: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees", In IEEE Transactions on Computers, pp. 1-14, Jul 2022. [doi]

Bibtex

@Article{hakert_toc22,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
title = {{ROLLED}: Racetrack Memory Optimized Linear Layout and Efficient Decomposition of Decision Trees},
journal = {IEEE Transactions on Computers},
month = jul,
year = {2022},
doi = {10.1109/TC.2022.3197094},
pages = {1--14},
url = {https://ieeexplore.ieee.org/document/9851943},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3365

Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote), Jun 2022. [Bibtex & Downloads]

Domain-specific programming methodologies for domain-specific and emerging computing systems

Reference

Jeronimo Castrillon, "Domain-specific programming methodologies for domain-specific and emerging computing systems", In 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote), Jun 2022.

Abstract
Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non-von-Neumann computer architectures. The golden age of computer architecture must thus be accompanied by a golden age of research in compilers and programming languages. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. Concretely, we will discuss abstractions for tensor expressions as a vehicle to optimize for modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing. The talk closes with an outlook on other emerging architectures and challenges for high-level compilation.

Bibtex

@Misc{castrillon_lctes2022,
author = {Castrillon, Jeronimo},
date = {2022-06},
title = {Domain-specific programming methodologies for domain-specific and emerging computing systems},
howpublished = {23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2022)(keynote)},
location = {San Diego, CA, USA},
abstract = {Programming heterogeneous computing systems is a daunting task which is becoming even more challenging with the advent of emerging, non-von-Neumann computer architectures. The golden age of computer architecture must thus be accompanied by a golden age of research in compilers and programming languages. This talk discusses domain-specific abstractions and languages as a promising avenue to hide the system complexity from non-expert programmers while passing richer information to compilers. Concretely, we will discuss abstractions for tensor expressions as a vehicle to optimize for modern reconfigurable hardware, for emerging memory technologies and for emerging in-memory computing. The talk closes with an outlook on other emerging architectures and challenges for high-level compilation.},
month = jun,
year = {2022},
}

Downloads

220614_castrillon_LCTES-keynote_compressed [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3347

Carlos Escuin, Asif Ali Khan, Pablo Ibañez, Teresa Monreal, Victor Viñals, Jeronimo Castrillon, "HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches", In Proceeding: DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 53–58, New York, NY, USA, Jun 2022. [doi] [Bibtex & Downloads]

HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches

Reference

Carlos Escuin, Asif Ali Khan, Pablo Ibañez, Teresa Monreal, Victor Viñals, Jeronimo Castrillon, "HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches", In Proceeding: DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 53–58, New York, NY, USA, Jun 2022. [doi]

Abstract
Recent years have seen a rising trend in the exploration of non-volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern-day applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, and of disabling a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim makes it possible to evaluate the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmark suite shows that HyCSim speeds up simulation by 24x compared to the cycle-accurate Gem5 simulator, with high fidelity.

Bibtex

@InProceedings{escuin_rapido22,
author = {Carlos Escuin and Asif Ali Khan and Pablo Iba{\~n}ez and Teresa Monreal and Victor Vi{\~n}als and Jeronimo Castrillon},
booktitle = {DroneSE and RAPIDO: System Engineering for Constrained Embedded Systems, co-located with the 17th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
date = {2022-06},
title = {HyCSim: A Rapid Design Space Exploration Tool for Emerging Hybrid Last-Level Caches},
doi = {10.1145/3522784.3522801},
isbn = {9781450395663},
location = {Budapest, Hungary},
pages = {53--58},
publisher = {Association for Computing Machinery},
series = {DroneSE and RAPIDO '22},
url = {https://doi.org/10.1145/3522784.3522801},
abstract = {Recent years have seen a rising trend in the exploration of non-volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions are proposed to meet the wide-ranging performance and energy requirements of modern-day applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, disabling of a cache region at byte or cache frame granularity for different fault maps. In addition, HyCSim allows evaluating the impact of various compression schemes on the overall performance (hit and miss rate) and the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmark suite shows that HyCSim accelerates the simulation time by 24x, compared to the cycle-accurate Gem5 simulator, with high fidelity.},
address = {New York, NY, USA},
month = jun,
numpages = {6},
year = {2022},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3327

Lars Schütze, Cornelius Kummer, Jeronimo Castrillon, "Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language", Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22), Association for Computing Machinery, pp. 27–34, New York, NY, USA, Jun 2022. [doi] [Bibtex & Downloads]

Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language

Reference

Lars Schütze, Cornelius Kummer, Jeronimo Castrillon, "Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language", Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22), Association for Computing Machinery, pp. 27–34, New York, NY, USA, Jun 2022. [doi]

Abstract
Adaptive programming models are increasingly important as context-dependent software conquers more domains. One such model is role-oriented programming where behavioral changes are implemented by objects playing and renouncing roles. As with other adaptive models, the overhead introduced by source code adaptations is a major showstopper for role-oriented programs. This is in part because the optimizations of object-oriented virtual machines (VMs) do not provide the same performance gains when applied to role-oriented programs. Recently, dispatch plans have been shown to enable optimizations beyond those in VMs, thereby improving the performance of role programs with low variability. This paper introduces guarded dispatch plans, an extension of dispatch plans with a context-aware guarding mechanism that allows reuse in high-variability scenarios. Fine-grained guards use run-time feedback to partially reuse dispatch plans across call sites when contexts are changing. We present an algorithm to construct and compose guarded dispatch plans and provide a reference implementation of the approach. We show that our approach is able to gracefully degrade into a default dispatch approach when variability increases. The implementation is evaluated with synthetic benchmarks capturing different characteristics. Compared to the state-of-the-art implementation in ObjectTeams we achieved a mean speedup of 3.3× in static cases, 3.0× at low variability and the same performance in highly dynamic cases.

Bibtex

@InProceedings{schuetze_cop22,
author = {Sch\"{u}tze, Lars and Kummer, Cornelius and Castrillon, Jeronimo},
booktitle = {Proceedings of the 14th ACM International Workshop on Context-Oriented Programming and Advanced Modularity (COP'22)},
title = {Guard the Cache: Dispatch Optimization in a Contextual Role-oriented Language},
doi = {10.1145/3570353.3570357},
isbn = {9781450399869},
pages = {27--34},
publisher = {Association for Computing Machinery},
series = {COP '22},
url = {https://doi.org/10.1145/3570353.3570357},
abstract = {Adaptive programming models are increasingly important as context-dependent software conquers more domains. One such model is role-oriented programming where behavioral changes are implemented by objects playing and renouncing roles. As with other adaptive models, the overhead introduced by source code adaptations is a major showstopper for role-oriented programs. This is in part because the optimizations of object-oriented virtual machines (VMs) do not provide the same performance gains when applied to role-oriented programs. Recently, dispatch plans have been shown to enable optimizations beyond those in VMs, thereby improving the performance of role programs with low variability. This paper introduces guarded dispatch plans, an extension of dispatch plans with a context-aware guarding mechanism that allows reuse in high-variability scenarios. Fine-grained guards use run-time feedback to partially reuse dispatch plans across call sites when contexts are changing. We present an algorithm to construct and compose guarded dispatch plans and provide a reference implementation of the approach. We show that our approach is able to gracefully degrade into a default dispatch approach when variability increases. The implementation is evaluated with synthetic benchmarks capturing different characteristics. Compared to the state-of-the-art implementation in ObjectTeams we achieved a mean speedup of 3.3 \texttimes{} in static cases, 3.0 \texttimes{} at low variability and the same performance in highly dynamic cases.},
address = {New York, NY, USA},
month = jun,
numpages = {8},
year = {2022},
}

Downloads

2206_schuetze_COP [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3341

Jeronimo Castrillon, "Language and compiler research for heterogeneous emerging computing systems", In SPCL_Bcast(COMM_WORLD) seminar series, SPCL ETH Zurich (invited talk), May 2022. [Bibtex & Downloads]

Language and compiler research for heterogeneous emerging computing systems

Reference

Jeronimo Castrillon, "Language and compiler research for heterogeneous emerging computing systems", In SPCL_Bcast(COMM_WORLD) seminar series, SPCL ETH Zurich (invited talk), May 2022.

Abstract
Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging, non Von-Neumann computer architectures. The so-called golden age of computer architecture thus must be accompanied by a, hopefully, golden age of research in compilers and programming languages. This talk discusses research along two fronts, namely, (1) on domain specific languages (DSLs) to hide complexity from non-expert programmers while passing richer information to compilers, and (2) on understanding the fundamental changes in emerging computing paradigms and their consequences for compilers. Concretely, we will talk about DSLs for physics simulations, compute-in-memory with emerging technologies, and current efforts in unifying intermediate representations with the MLIR compiler framework.

Bibtex

@Misc{castrillon_spcl2022,
author = {Castrillon, Jeronimo},
year = {2022},
month = may,
title = {Language and compiler research for heterogeneous emerging computing systems},
howpublished = {SPCL\_Bcast(COMM\_WORLD) seminar series, SPCL ETH Zurich (invited talk)},
location = {Virtual},
url = {https://www.youtube.com/watch?v=-NoRpUBlNrU},
abstract = {Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging, non Von-Neumann computer architectures. The so-called golden age of computer architecture thus must be accompanied by a, hopefully, golden age of research in compilers and programming languages. This talk discusses research along two fronts, namely, (1) on domain specific languages (DSLs) to hide complexity from non-expert programmers while passing richer information to compilers, and (2) on understanding the fundamental changes in emerging computing paradigms and their consequences for compilers. Concretely, we will talk about DSLs for physics simulations, compute-in-memory with emerging technologies, and current efforts in unifying intermediate representations with the MLIR compiler framework.},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3345

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Burkhard Ringlein, Michele Fiorito, Fabrizio Ferrandi, Donatella Sciuto, Christian Pilato, Stephanie Soldavini, "EVEREST: Intermediate report of the compilation framework", Technical report, EVEREST consortium, Apr 2022. [Bibtex & Downloads]

EVEREST: Intermediate report of the compilation framework

Reference

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Burkhard Ringlein, Michele Fiorito, Fabrizio Ferrandi, Donatella Sciuto, Christian Pilato, Stephanie Soldavini, "EVEREST: Intermediate report of the compilation framework", Technical report, EVEREST consortium, Apr 2022.

Bibtex

@Report{castrillon_everestD4.2_2022,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Burkhard Ringlein and Michele Fiorito and Fabrizio Ferrandi and Donatella Sciuto and Christian Pilato and Stephanie Soldavini},
institution = {EVEREST consortium},
title = {{EVEREST}: Intermediate report of the compilation framework},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/tqCXugYLqGQNChd},
month = apr,
year = {2022},
}

Downloads

2204_Castrillon-Everest-D4 [2]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3725

Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi] [Bibtex & Downloads]

Brain-inspired Cognition in Next Generation Racetrack Memories

Reference

Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones, "Brain-inspired Cognition in Next Generation Racetrack Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 21, no. 6, pp. 79:1–79:28, New York, NY, USA, Mar 2022. [doi]

Abstract
Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.
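
The bulk bit-wise operations mentioned in this abstract can be sketched in software as follows. This is only an illustration of generic binary HDC primitives under stated assumptions, not the paper's racetrack-memory design: the helper names (`random_hv`, `bind`, `bundle`, `hamming`), the 10,000-bit dimension, and the majority-vote bundling rule are choices made here.

```python
import random

DIM = 10_000  # assumed dimensionality; "thousands of dimensions" per the abstract


def random_hv(rng, dim=DIM):
    """Draw a random binary hypervector as a list of 0/1 bits."""
    return [rng.getrandbits(1) for _ in range(dim)]


def bind(a, b):
    """Bind two hypervectors with element-wise XOR (its own inverse)."""
    return [x ^ y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundle hypervectors by element-wise addition followed by a majority vote."""
    threshold = len(vectors) / 2
    return [1 if sum(bits) > threshold else 0 for bits in zip(*vectors)]


def hamming(a, b):
    """Count the positions where two hypervectors differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is reproducible
a, b, c = (random_hv(rng) for _ in range(3))

recovered = bind(bind(a, b), b)  # XOR-ing with b twice returns a
mixed = bundle([a, b, c])        # stays closer to a than an unrelated vector would be
print(hamming(recovered, a), hamming(mixed, a))
```

Each operation touches every bit position independently, which is why such workloads are attractive candidates for in-memory execution: the XOR and the per-position additions need no communication between positions.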

Bibtex

@Article{khan_tecs22,
author = {Asif Ali Khan and Sebastien Ollivier and Stephen Longofono and Gerald Hempel and Jeronimo Castrillon and Alex K. Jones},
title = {Brain-inspired Cognition in Next Generation Racetrack Memories},
abstract = {Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. In contrast, even partial implementations of an HDC framework inside memory can provide considerable performance/energy gains as demonstrated in prior work using memristors. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within memory. The proposed solution requires minimal additional CMOS circuitry by leveraging a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the CMOS circuitry overhead, an RTM nanowire-based counting mechanism is proposed. Using language recognition as the example workload, the proposed RTM HDC system reduces the energy consumption by 8.6x compared to the state-of-the-art in-memory implementation. Compared to dedicated hardware design realized with an FPGA, RTM-based HDC processing demonstrates 7.8x and 5.3x improvements in the overall runtime and energy consumption, respectively.},
address = {New York, NY, USA},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
month = mar,
numpages = {28},
publisher = {Association for Computing Machinery},
year = {2022},
doi = {10.1145/3524071},
issn = {1539-9087},
url = {https://doi.org/10.1145/3524071},
volume = {21},
number = {6},
articleno = {79},
pages = {79:1--79:28},
}

Downloads

2203_Khan_TECS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3329


2021
Nesrine Khouzami, Friedrich Michel, Pietro Incardona, Jeronimo Castrillon, Ivo F. Sbalzarini, "Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations", In Journal of Computational Science, vol. 57, pp. 1–11, Dec 2021. [doi] [Bibtex & Downloads]

Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations

Reference

Nesrine Khouzami, Friedrich Michel, Pietro Incardona, Jeronimo Castrillon, Ivo F. Sbalzarini, "Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations", In Journal of Computational Science, vol. 57, pp. 1–11, Dec 2021. [doi]

Abstract
We present an autotuning approach for compile-time optimization of numerical discretization methods in simulations of partial differential equations. Our approach is based on data-driven regression of performance models for numerical methods. We use these models at compile time to automatically determine the parameters (e.g., resolution, time step size, etc.) of numerical simulations of continuum spatio-temporal models in order to optimize the tradeoff between simulation accuracy and runtime. The resulting autotuner is developed for the compiler of a Domain-Specific Language (DSL) for numerical simulations. The abstractions in the DSL enable the compiler to automatically determine the performance models and know which discretization parameters to tune. We demonstrate that this high-level approach can explore a large space of possible simulations, with simulation runtimes spanning multiple orders of magnitude. We evaluate our approach in two test cases: the linear diffusion equation and the nonlinear Gray-Scott reaction–diffusion equation. The results show that our model-based autotuner consistently finds configurations that outperform those found by state-of-the-art general-purpose autotuners. Specifically, our autotuner yields simulations that are on average 4.2x faster than those found by the best generic exploration algorithms, while using 16x less tuning time. Compared to manual tuning by a group of researchers with varying levels of expertise, the autotuner was slower than the best users by not more than a factor of 2, whereas it was able to significantly outperform half of them.

Bibtex

@Article{khouzami_jocs21,
author = {Nesrine Khouzami and Friedrich Michel and Pietro Incardona and Jeronimo Castrillon and Ivo F. Sbalzarini},
date = {2021-12},
title = {Model-based Autotuning of Discretization Methods in Numerical Simulations of Partial Differential Equations},
doi = {10.1016/j.jocs.2021.101489},
issn = {1877-7503},
pages = {1--11},
url = {https://www.sciencedirect.com/science/article/pii/S1877750321001563},
volume = {57},
abstract = {We present an autotuning approach for compile-time optimization of numerical discretization methods in simulations of partial differential equations. Our approach is based on data-driven regression of performance models for numerical methods. We use these models at compile time to automatically determine the parameters (e.g., resolution, time step size, etc.) of numerical simulations of continuum spatio-temporal models in order to optimize the tradeoff between simulation accuracy and runtime. The resulting autotuner is developed for the compiler of a Domain-Specific Language (DSL) for numerical simulations. The abstractions in the DSL enable the compiler to automatically determine the performance models and know which discretization parameters to tune. We demonstrate that this high-level approach can explore a large space of possible simulations, with simulation runtimes spanning multiple orders of magnitude. We evaluate our approach in two test cases: the linear diffusion equation and the nonlinear Gray-Scott reaction–diffusion equation. The results show that our model-based autotuner consistently finds configurations that outperform those found by state-of-the-art general-purpose autotuners. Specifically, our autotuner yields simulations that are on average 4.2x faster than those found by the best generic exploration algorithms, while using 16x less tuning time. Compared to manual tuning by a group of researchers with varying levels of expertise, the autotuner was slower than the best users by not more than a factor of 2, whereas it was able to significantly outperform half of them.},
journal = {Journal of Computational Science},
month = dec,
numpages = {15},
project = {openpme},
year = {2021},
}

Downloads

2111_Khouzami_JOCS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3224

Jeronimo Castrillon, "Models of computation for energy-efficient time-aware distributed embedded systems", In IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote), Dec 2021. [Bibtex & Downloads]

Models of computation for energy-efficient time-aware distributed embedded systems

Reference

Jeronimo Castrillon, "Models of computation for energy-efficient time-aware distributed embedded systems", In IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote), Dec 2021.

Abstract
Programming heterogeneous and distributed manycores under timing constraints in cyber-physical systems is an extremely hard task. Managing reactive behaviour to outside stimuli, adapting to variable workloads and handling parallelism while ensuring correct and time-predictable execution are examples of key challenges in these kinds of systems. This talk discusses the role of formal models of computation to help architect programming methodologies, making it easier to manage the complexity and provide guarantees than with ad-hoc programming models. We will discuss dataflow-based programming for joint optimization of multiple adaptable applications while respecting real-time constraints. We will then introduce the reactor model which adds time semantics to dataflow to support time-determinism when needed and to account for reactive behavior. The benefits and overheads of this model-based approach to programming distributed embedded systems will be analysed with use cases from the automotive and 5G-communication domains.

Bibtex

@Misc{castrillon_mcsoc2021,
author = {Castrillon, Jeronimo},
title = {Models of computation for energy-efficient time-aware distributed embedded systems},
howpublished = {IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2021) (keynote)},
location = {Singapore},
url = {https://mcsoc-forum.org/m2021/wp-content/uploads/2021/12/211222_castrillon_MCSOC_compressed.pdf},
abstract = {Programming heterogeneous and distributed manycores under timing constraints in cyber-physical systems is an extremely hard task. Managing reactive behaviour to outside stimuli, adapting to variable workloads and handling parallelism while ensuring correct and time-predictable execution are examples of key challenges in these kinds of systems. This talk discusses the role of formal models of computation to help architect programming methodologies, making it easier to manage the complexity and provide guarantees than with ad-hoc programming models. We will discuss dataflow-based programming for joint optimization of multiple adaptable applications while respecting real-time constraints. We will then introduce the reactor model which adds time semantics to dataflow to support time-determinism when needed and to account for reactive behavior. The benefits and overheads of this model-based approach to programming distributed embedded systems will be analysed with use cases from the automotive and 5G-communication domains.},
month = dec,
year = {2021},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3254

Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, vol. 71, no. 9, pp. 2010–2021, Oct 2021. [doi] [Bibtex & Downloads]

Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory

Reference

Joonas Iisakki Multanen, Kari Hepola, Asif Ali Khan, Jeronimo Castrillon, Pekka Jääskeläinen, "Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory", In IEEE Transactions on Computers, vol. 71, no. 9, pp. 2010–2021, Oct 2021. [doi]

Bibtex

@Article{multanen_toc21,
author = {Joonas Iisakki Multanen and Kari Hepola and Asif Ali Khan and Jeronimo Castrillon and Pekka J{\"a}{\"a}skel{\"a}inen},
title = {Energy-Efficient Instruction Delivery in Embedded Systems with Domain Wall Memory},
doi = {10.1109/TC.2021.3117439},
number = {9},
pages = {2010--2021},
url = {https://ieeexplore.ieee.org/document/9557799},
volume = {71},
abstract = {As performance and energy-efficiency improvements from technology scaling are slowing down, new technologies are being researched in hopes of disrupting results. Domain wall memory (DWM) is an emerging non-volatile technology that promises extreme data density, fast access times and low power consumption. However, DWM access time depends on the memory location distance from access ports, requiring expensive shifting. This causes overheads on performance and energy consumption. In this article, we implement our previously proposed shift-reducing instruction memory placement (SHRIMP) on a RISC-V core in RTL, provide the first thorough evaluation of the control logic required for DWM and SHRIMP and evaluate the effects on system energy and energy-efficiency. SHRIMP reduces the number of shifts by 36\% on average compared to a linear placement in CHStone and Coremark benchmark suites when evaluated on the RISC-V processor system. The reduced shift amount leads to an average reduction of 14\% in cycle counts compared to the linear placement. When compared to an SRAM-based system, although increasing memory usage by 26\%, DWM with SHRIMP allows a 73\% reduction in memory energy and 42\% relative energy delay product. We estimate overall energy reductions of 14\%, 15\% and 19\% in three example embedded systems.},
journal = {IEEE Transactions on Computers},
month = oct,
year = {2021},
}

Downloads

2110_Multanen_TOC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3208

Karl F. A. Friebel, Stephanie Soldavini, Gerald Hempel, Christian Pilato, Jeronimo Castrillon, "From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics", Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) — FPGA for HPC Workshop, pp. 759–766, Sep 2021. [doi] [Bibtex & Downloads]

From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics

Reference

Karl F. A. Friebel, Stephanie Soldavini, Gerald Hempel, Christian Pilato, Jeronimo Castrillon, "From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics", Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) — FPGA for HPC Workshop, pp. 759–766, Sep 2021. [doi]

Bibtex

@InProceedings{friebel_fpga4hpc21,
author = {Karl F. A. Friebel and Stephanie Soldavini and Gerald Hempel and Christian Pilato and Jeronimo Castrillon},
booktitle = {Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER) --- FPGA for HPC Workshop},
title = {From Domain-Specific Languages to Memory-Optimized Accelerators for Fluid Dynamics},
doi = {10.1109/Cluster48925.2021.00112},
location = {Portland (virtual), OR, USA},
pages = {759--766},
url = {https://ieeexplore.ieee.org/document/9556064},
month = sep,
numpages = {8},
year = {2021},
}

Downloads

2109_Friebel_fpga4hpc [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3181

Robert Khasanov, Julian Robledo, Christian Menard, Andrés Goens, Jeronimo Castrillon, "Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks", In ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), Association for Computing Machinery, vol. 20, no. 5s, New York, NY, USA, Sep 2021. [doi] [Bibtex & Downloads]

Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks

Reference

Robert Khasanov, Julian Robledo, Christian Menard, Andrés Goens, Jeronimo Castrillon, "Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks", In ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES), Association for Computing Machinery, vol. 20, no. 5s, New York, NY, USA, Sep 2021. [doi]

Abstract
Advancing telecommunication standards continuously push for larger bandwidths, lower latencies, and faster data rates. The receiver baseband unit not only has to deal with a huge number of users expecting connectivity but also with a high workload heterogeneity. As a consequence of the required flexibility, baseband processing has seen a trend towards software implementations in cloud Radio Access Networks (cRANs). The flexibility gained from software implementation comes at the price of impoverished energy efficiency. This paper addresses the trade-off between flexibility and efficiency by proposing a domain-specific hybrid mapping algorithm. Hybrid mapping is an established approach from the model-based design of embedded systems that allows us to retain flexibility while targeting heterogeneous hardware. Depending on the current workload, the runtime system selects the most energy-efficient mapping configuration without violating timing constraints. We leverage the structure of baseband processing, and refine the scheduling methodology, to enable efficient mapping of 100s of tasks at the millisecond granularity, improving upon state-of-the-art hybrid approaches. We validate our approach on an Odroid XU4 and virtual platforms with application-specific accelerators on an open-source prototype. On different LTE workloads, our hybrid approach shows significant improvements both at design time and at runtime. At design-time, mappings of similar quality to those obtained by state-of-the-art methods are generated around four orders of magnitude faster. At runtime, multi-application schedules are computed 37.7% faster than the state-of-the-art without compromising on the quality.

Bibtex

@Article{khasanov_cases21,
author = {Robert Khasanov and Julian Robledo and Christian Menard and Andrés Goens and Jeronimo Castrillon},
title = {Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks},
doi = {10.1145/3476991},
issn = {1539-9087},
number = {5s},
url = {https://doi.org/10.1145/3476991},
volume = {20},
abstract = {Advancing telecommunication standards continuously push for larger bandwidths, lower latencies, and faster data rates. The receiver baseband unit not only has to deal with a huge number of users expecting connectivity but also with a high workload heterogeneity. As a consequence of the required flexibility, baseband processing has seen a trend towards software implementations in cloud Radio Access Networks (cRANs). The flexibility gained from software implementation comes at the price of impoverished energy efficiency. This paper addresses the trade-off between flexibility and efficiency by proposing a domain-specific hybrid mapping algorithm. Hybrid mapping is an established approach from the model-based design of embedded systems that allows us to retain flexibility while targeting heterogeneous hardware. Depending on the current workload, the runtime system selects the most energy-efficient mapping configuration without violating timing constraints. We leverage the structure of baseband processing, and refine the scheduling methodology, to enable efficient mapping of 100s of tasks at the millisecond granularity, improving upon state-of-the-art hybrid approaches. We validate our approach on an Odroid XU4 and virtual platforms with application-specific accelerators on an open-source prototype. On different LTE workloads, our hybrid approach shows significant improvements both at design time and at runtime. At design-time, mappings of similar quality to those obtained by state-of-the-art methods are generated around four orders of magnitude faster. At runtime, multi-application schedules are computed 37.7\% faster than the state-of-the-art without compromising on the quality.},
address = {New York, NY, USA},
articleno = {60},
issue_date = {October 2021},
journal = {ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
location = {Virtual conference},
month = sep,
numpages = {26},
publisher = {Association for Computing Machinery},
year = {2021},
}

Downloads

2110_Khasanov_CASES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3155

Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning", Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 17-29, Sep 2021. [doi] [Bibtex & Downloads]

PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning

Reference

Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning", Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 17-29, Sep 2021. [doi]

Abstract
The polyhedral model allows a structured way of defining semantics-preserving transformations to improve the performance of a large class of loops. Finding profitable points in this space is a hard problem which is usually approached by heuristics that generalize from domain-expert knowledge. Existing search space formulations in state-of-the-art heuristics depend on the shape of particular loops, making it hard to leverage generic and more powerful optimization techniques from the machine learning domain. In this paper, we propose a shape-agnostic formulation for the space of legal transformations in the polyhedral model as a Markov Decision Process (MDP). Instead of using transformations, the formulation is based on an abstract space of possible schedules. In this formulation, states model partial schedules, which are constructed by actions that are reusable across different loops. With a simple heuristic to traverse the space, we demonstrate that our formulation is powerful enough to match and outperform state-of-the-art heuristics. On the Polybench benchmark suite, we found the search space to contain transformations that lead to a speedup of 3.39x over LLVM O3, which is 1.34x better than the best transformations found in the search space of isl, and 1.83x better than the speedup achieved by the default heuristics of isl. Our generic MDP formulation enables future work to use reinforcement learning to learn optimization heuristics over a wide range of loops. This also contributes to the emerging field of machine learning in compilers, as it exposes a novel problem formulation that can push the limits of existing methods.

Bibtex

@InProceedings{brauckmann_pact21,
author = {Brauckmann, Alexander and Goens, Andrés and Castrillon, Jeronimo},
booktitle = {Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
title = {PolyGym: Polyhedral Optimizations as an Environment for Reinforcement Learning},
month = sep,
doi = {10.1109/PACT52795.2021.00009},
pages = {17--29},
url = {https://ieeexplore.ieee.org/document/9563041},
year = {2021},
abstract = {The polyhedral model allows a structured way of defining semantics-preserving transformations to improve the performance of a large class of loops. Finding profitable points in this space is a hard problem which is usually approached by heuristics that generalize from domain-expert knowledge. Existing search space formulations in state-of-the-art heuristics depend on the shape of particular loops, making it hard to leverage generic and more powerful optimization techniques from the machine learning domain. In this paper, we propose a shape-agnostic formulation for the space of legal transformations in the polyhedral model as a Markov Decision Process (MDP). Instead of using transformations, the formulation is based on an abstract space of possible schedules. In this formulation, states model partial schedules, which are constructed by actions that are reusable across different loops. With a simple heuristic to traverse the space, we demonstrate that our formulation is powerful enough to match and outperform state-of-the-art heuristics. On the Polybench benchmark suite, we found the search space to contain transformations that lead to a speedup of 3.39x over LLVM O3, which is 1.34x better than the best transformations found in the search space of isl, and 1.83x better than the speedup achieved by the default heuristics of isl. Our generic MDP formulation enables future work to use reinforcement learning to learn optimization heuristics over a wide range of loops. This also contributes to the emerging field of machine learning in compilers, as it exposes a novel problem formulation that can push the limits of existing methods.},
}

Downloads

2109_Brauckmann_PACT [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3182

Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674-1686, Aug 2021. [doi] [Bibtex & Downloads]

OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory

Reference

Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1674-1686, Aug 2021. [doi]

Bibtex

@Article{khan_tcad21,
author = {Adam Siemieniuk and Lorenzo Chelini and Asif Ali Khan and Jeronimo Castrillon and Andi Drebes and Henk Corporaal and Tobias Grosser and Martin Kong},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
title = {{OCC}: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory},
month = aug,
volume={41},
number={6},
pages={1674--1686},
numpages = {12 pp},
doi = {10.1109/TCAD.2021.3101464},
issn = {1937-4151},
url = {https://ieeexplore.ieee.org/document/9502921},
publisher = {IEEE Press},
year = {2021},
abstract = {Memristive devices promise an alternative approach toward non-Von Neumann architectures, where specific computational tasks are performed within the memory devices. In the machine learning (ML) domain, crossbar arrays of resistive devices have shown great promise for ML inference, as they allow for hardware acceleration of matrix multiplications. But, to enable widespread adoption of these novel architectures, it is critical to have an automatic compilation flow as opposed to relying on a manual mapping of specific kernels on the crossbar arrays. We demonstrate the programmability of memristor-based accelerators using the new compiler design principle of multilevel rewriting, where a hierarchy of abstractions lowers programs level-by-level and performs code transformations at the most suitable abstraction. In particular, we develop a prototype compiler, which progressively lowers a mathematical notation for tensor operations arising in ML workloads, to fixed-function memristor-based hardware blocks.},
}

Downloads

2107_Khan_TCAD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3157

Jeronimo Castrillon, "Domain specific languages to tame heterogeneous and emerging computing systems", In ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote), Jul 2021. [Bibtex & Downloads]

Domain specific languages to tame heterogeneous and emerging computing systems

Reference

Jeronimo Castrillon, "Domain specific languages to tame heterogeneous and emerging computing systems", In ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote), Jul 2021.

Abstract
Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging computer architectures. This complexity will make it harder to democratize high-performance computing, which already today highly relies on expert programmers to write efficient parallel code. This talk discusses domain specific languages (DSLs) as a promising avenue to tame heterogeneity for non-expert programmers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning. The talk closes with insights on how compilers can leverage the high-level semantics of DSLs to optimize for emerging memory technologies.

Bibtex

@Misc{castrillon_pasc21,
author = {Castrillon, Jeronimo},
title = {Domain specific languages to tame heterogeneous and emerging computing systems},
howpublished = {ACM SIGHPC conference Platform for Advanced Scientific Computing PASC'21 (keynote)},
location = {Geneva (virtual), Switzerland},
abstract = {Programming heterogeneous computing systems is still a daunting task that will become even more challenging with the advent of emerging computer architectures. This complexity will make it harder to democratize high-performance computing, which already today highly relies on expert programmers to write efficient parallel code. This talk discusses domain specific languages (DSLs) as a promising avenue to tame heterogeneity for non-expert programmers. The high-level semantics in DSLs improves productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big-data, physics simulations and machine learning. The talk closes with insights on how compilers can leverage the high-level semantics of DSLs to optimize for emerging memory technologies.},
month = jul,
year = {2021},
}

Downloads

210709_castrillon_PASC-sent [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3170

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi] [Bibtex & Downloads]

BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory

Reference

Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th Annual Design Automation Conference (DAC'21), ACM, pp. 1111–1116, Jul 2021. [doi]

Bibtex

@InProceedings{khan_dac21,
author = {Christian Hakert and Asif Ali Khan and Kuan-Hsun Chen and Fazal Hameed and Jeronimo Castrillon and Jian-Jia Chen},
booktitle = {Proceedings of the 58th Annual Design Automation Conference (DAC'21)},
title = {{BLO}wing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
doi = {10.1109/DAC18074.2021.9586167},
location = {San Francisco, California},
pages = {1111--1116},
series = {DAC '21},
url = {https://ieeexplore.ieee.org/document/9586167},
publisher = {ACM},
month = jul,
year = {2021},
}

Downloads

2112_Hakert_DAC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2975

Andrés Goens, Jeronimo Castrillon, "Embeddings of Task Mappings to Multicore Systems", Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, pp. 161–176, Berlin, Heidelberg, Jul 2021. [doi] [Bibtex & Downloads]

Embeddings of Task Mappings to Multicore Systems

Reference

Andrés Goens, Jeronimo Castrillon, "Embeddings of Task Mappings to Multicore Systems", Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS), Springer-Verlag, pp. 161–176, Berlin, Heidelberg, Jul 2021. [doi]

Abstract
The problem of finding good mappings is central to designing and executing applications efficiently in embedded systems. In heterogeneous multicores, which are ubiquitous today, this problem yields an intractably large design space of possible mappings. Most methods explore this space using heuristics, many of which implicitly use geometric notions in mappings. In this paper we explore the geometry of the mapping problem explicitly, for finding embeddings of the mapping space that capture its structure. This allows us to formulate new mapping strategies by leveraging the geometry of the mapping space, as well as improving existing heuristics that do so implicitly. We evaluate our approach on a novel mapping heuristic based on gradient descent, as well as multiple existing meta-heuristics. For complex architectures, our methods improved the results of established exploration meta-heuristics by about an order of magnitude on average.

Bibtex

@InProceedings{goens_samos21,
author = {Andrés Goens and Jeronimo Castrillon},
booktitle = {Proceedings of the 21st IEEE International Conference on Embedded Computer Systems: Architectures Modeling and Simulation (SAMOS)},
date = {2021-07},
title = {Embeddings of Task Mappings to Multicore Systems},
doi = {10.1007/978-3-031-04580-6_11},
isbn = {978-3-031-04579-0},
location = {Samos, Greece},
organization = {IEEE},
pages = {161--176},
publisher = {Springer-Verlag},
url = {https://doi.org/10.1007/978-3-031-04580-6_11},
abstract = {The problem of finding good mappings is central to designing and executing applications efficiently in embedded systems. In heterogeneous multicores, which are ubiquitous today, this problem yields an intractably large design space of possible mappings. Most methods explore this space using heuristics, many of which implicitly use geometric notions in mappings. In this paper we explore the geometry of the mapping problem explicitly, for finding embeddings of the mapping space that capture its structure. This allows us to formulate new mapping strategies by leveraging the geometry of the mapping space, as well as improving existing heuristics that do so implicitly. We evaluate our approach on a novel mapping heuristic based on gradient descent, as well as multiple existing meta-heuristics. For complex architectures, our methods improved the results of established exploration meta-heuristics by about an order of magnitude on average.},
address = {Berlin, Heidelberg},
month = jul,
numpages = {16},
year = {2021},
}

Downloads

2107_Goens_SAMOS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3071

Andrés Goens, Timo Nicolai, Jeronimo Castrillon, "mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1592-1605, Jul 2021. [doi] [Bibtex & Downloads]

mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies

Reference

Andrés Goens, Timo Nicolai, Jeronimo Castrillon, "mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), IEEE Press, vol. 41, no. 6, pp. 1592-1605, Jul 2021. [doi]

Bibtex

@Article{goens_tcad21,
author = {Andrés Goens and Timo Nicolai and Jeronimo Castrillon},
date = {2021-07},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
title = {mpsym: Improving Design-Space Exploration of Clustered Manycores with Arbitrary Topologies},
month = jul,
numpages = {14 pp},
volume={41},
number={6},
pages={1592--1605},
doi = {10.1109/TCAD.2021.3102512},
issn = {1937-4151},
url = {https://ieeexplore.ieee.org/document/9506807},
publisher = {IEEE Press},
year = {2021},
}

Downloads

2107_Goens_TCAD [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3171

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Burkhard Ringlein, Stephanie Soldavini, Christian Pilato, Mattia Tibaldi, Fabrizio Ferrandi, Stanislav Böhm, Francesco Regazzoni, Kartik Nayak, "EVEREST: Definition of the compilation framework", Technical report, EVEREST consortium, Jul 2021. [Bibtex & Downloads]

EVEREST: Definition of the compilation framework

Reference

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Burkhard Ringlein, Stephanie Soldavini, Christian Pilato, Mattia Tibaldi, Fabrizio Ferrandi, Stanislav Böhm, Francesco Regazzoni, Kartik Nayak, "EVEREST: Definition of the compilation framework", Technical report, EVEREST consortium, Jul 2021.

Bibtex

@Report{castrillon_everestD4.1_2021,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Gerald Hempel and Burkhard Ringlein and Stephanie Soldavini and Christian Pilato and Mattia Tibaldi and Fabrizio Ferrandi and Stanislav B{\"o}hm and Francesco Regazzoni and Kartik Nayak},
institution = {EVEREST consortium},
title = {{EVEREST}: Definition of the compilation framework},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/3lloP4p1ukGUdJx},
month = jul,
year = {2021},
}

Downloads

2107_Castrillon-Everest-D4 [1]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3724

Nesrine Khouzami, Lars Schütze, Pietro Incardona, Landfried Kraaz, Tina Subic, Jeronimo Castrillon, Ivo F. Sbalzarini, "The OpenPME Problem Solving Environment for Numerical Simulations", In Proceedings: International Conference on Computational Science (ICCS'21) (Paszynski, Maciej and Kranzlmüller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.), Springer International Publishing, pp. 614–627, Cham, Jun 2021. [doi] [Bibtex & Downloads]

The OpenPME Problem Solving Environment for Numerical Simulations

Reference

Nesrine Khouzami, Lars Schütze, Pietro Incardona, Landfried Kraaz, Tina Subic, Jeronimo Castrillon, Ivo F. Sbalzarini, "The OpenPME Problem Solving Environment for Numerical Simulations", In Proceedings: International Conference on Computational Science (ICCS'21) (Paszynski, Maciej and Kranzlmüller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.), Springer International Publishing, pp. 614–627, Cham, Jun 2021. [doi]

Abstract
We introduce OpenPME, the Open Particle-Mesh Environment, a problem solving environment that provides a Domain Specific Language (DSL) for numerical simulations in scientific computing. It is built atop a domain metamodel that is general enough to cover the main types of numerical simulations: simulations using particles, meshes, and hybrid combinations of particles and meshes. Using model-to-model transformations, OpenPME generates code against the state-of-the-art C++ parallel computing library OpenFPM. This effectively lowers the programming barrier and enables users to implement scalable simulation codes for high-performance computing (HPC) systems using high-level abstractions. Plenty of recent research has shown that higher-level abstractions and problem solving environments are well suited to alleviate low-level implementation overhead. We demonstrate this for OpenPME and its compiler on three different test cases—particle-based, mesh-based, and hybrid particle-mesh—showing up to 7-fold reduction in the number of lines of code compared to a direct OpenFPM implementation in C++.

Bibtex

@InProceedings{khouzami_iccs21,
author = {Nesrine Khouzami and Lars Sch{\"u}tze and Pietro Incardona and Landfried Kraaz and Tina Subic and Jeronimo Castrillon and Ivo F. Sbalzarini},
booktitle = {International Conference on Computational Science (ICCS'21)},
title = {The OpenPME Problem Solving Environment for Numerical Simulations},
doi = {10.1007/978-3-030-77961-0_49},
editor = {Paszynski, Maciej and Kranzlm{\"u}ller, Dieter and Krzhizhanovskaya, Valeria V. and Dongarra, Jack J. and Sloot, Peter M. A.},
isbn = {978-3-030-77961-0},
location = {Krakow (virtual), Poland},
organization = {Springer},
pages = {614--627},
publisher = {Springer International Publishing},
url = {https://link.springer.com/chapter/10.1007%2F978-3-030-77961-0_49},
abstract = {We introduce OpenPME, the Open Particle-Mesh Environment, a problem solving environment that provides a Domain Specific Language (DSL) for numerical simulations in scientific computing. It is built atop a domain metamodel that is general enough to cover the main types of numerical simulations: simulations using particles, meshes, and hybrid combinations of particles and meshes. Using model-to-model transformations, OpenPME generates code against the state-of-the-art C++ parallel computing library OpenFPM. This effectively lowers the programming barrier and enables users to implement scalable simulation codes for high-performance computing (HPC) systems using high-level abstractions. Plenty of recent research has shown that higher-level abstractions and problem solving environments are well suited to alleviate low-level implementation overhead. We demonstrate this for OpenPME and its compiler on three different test cases---particle-based, mesh-based, and hybrid particle-mesh---showing up to 7-fold reduction in the number of lines of code compared to a direct OpenFPM implementation in C++.},
address = {Cham},
month = jun,
numpages = {14},
year = {2021},
}

Downloads

2106_Khouzami_ICCS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3018

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi] [Bibtex & Downloads]

ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", In IEEE Transactions on Emerging Topics in Computing (IEEE TETC), 12 pp., Jun 2021. [doi]

Abstract
Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments; however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligners themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44% and 21-49%, respectively.

Bibtex

@Article{hameed_tetc21,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
journal = {IEEE Transactions on Emerging Topics in Computing (IEEE TETC)},
title = {{ALPHA}: A Novel Algorithm-Hardware Co-design for Accelerating {DNA} Seed Location Filtering},
pages = {12 pp.},
abstract = {Sequence alignment is a fundamental operation in genomic analysis where DNA fragments called reads are mapped to a long reference DNA sequence. There exist a number of (in)exact alignment algorithms with varying sensitivity for both local and global alignments; however, they are all computationally expensive. With the advent of high-throughput sequencing (HTS) technologies that generate a mammoth amount of data, there is increased pressure on improving the performance and capacity of the analysis algorithms in general and the mapping algorithms in particular. While many works focus on improving the performance of the aligners themselves, recently it has been demonstrated that restricting the mapping space for input reads and filtering out mapping positions that will result in a poor match can significantly improve the performance of the alignment operation. However, this is only true if it is guaranteed that the filtering operation can be performed significantly faster. Otherwise, it can easily outweigh the benefits of the aligner. To expedite this pre-alignment filtering, among others, the recently proposed GRIM-Filter uses highly-parallel processing-in-memory operations benefiting from light-weight computational units on the logic-in-memory layer. However, the significant amount of data transferring between the memory and logic-in-memory layers quickly becomes a performance and energy bottleneck for the memory subsystem and ultimately for the overall system. By analyzing input genomes, we found that there are unexpected data-reuse opportunities in the filtering operation. We propose an algorithm-hardware co-design that exploits the data-reuse in the seed location filtering operation and, compared to the GRIM-Filter, cuts the number of memory accesses by 22-54\%. This reduction in memory accesses improves the overall performance and energy consumption by 19-44\% and 21-49\%, respectively.},
month = jun,
year = {2021},
doi = {10.1109/TETC.2021.3093840},
issn = {2168-6750},
}

Downloads

2107_hameed_TETC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3116

Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "An online guided tuning approach to run CNN pipelines on edge devices", Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21), Association for Computing Machinery (ACM), pp. 45–53, New York, NY, USA, May 2021. [doi] [Bibtex & Downloads]

An online guided tuning approach to run CNN pipelines on edge devices

Reference

Pirah Noor Soomro, Mustafa Abduljabbar, Jeronimo Castrillon, Miquel Pericás, "An online guided tuning approach to run CNN pipelines on edge devices", Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21), Association for Computing Machinery (ACM), pp. 45–53, New York, NY, USA, May 2021. [doi]

Abstract
Modern edge and mobile devices are equipped with powerful computing resources. These are often organized as heterogeneous multi-cores, featuring performance-asymmetric core clusters. This raises the question on how to effectively execute the inference pass of convolutional neural networks (CNN) on such devices. Existing CNN implementations on edge devices leverage offline profiling data to determine a better schedule for CNN applications. This approach requires a time consuming phase of generating a performance profile for each type of representative kernel on various core configurations available on the device, coupled with a search space exploration. We propose an online tuning technique which utilizes compile time hints and online profiling data to generate high throughput CNN pipelines. We explore core heterogeneity and compatible core-layer configurations through an online guided search. Unlike exhaustive search, we adopt an evolutionary approach with a guided starting point in order to find the solution. We show that by pruning and navigating through the complex search space using compile time hints, 79% of the tested configurations turn out to be near-optimal candidates for a throughput maximizing pipeline on NVIDIA Jetson TX2 platform.

Bibtex

@InProceedings{soomro_cf21,
author = {Pirah Noor Soomro and Mustafa Abduljabbar and Jeronimo Castrillon and Miquel Peric{\'a}s},
booktitle = {Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21)},
title = {An online guided tuning approach to run CNN pipelines on edge devices},
doi = {10.1145/3457388.3458662},
isbn = {9781450384049},
location = {Virtual Event, Italy},
pages = {45--53},
publisher = {Association for Computing Machinery (ACM)},
series = {CF '21},
url = {https://doi.org/10.1145/3457388.3458662},
abstract = {Modern edge and mobile devices are equipped with powerful computing resources. These are often organized as heterogeneous multi-cores, featuring performance-asymmetric core clusters. This raises the question on how to effectively execute the inference pass of convolutional neural networks (CNN) on such devices. Existing CNN implementations on edge devices leverage offline profiling data to determine a better schedule for CNN applications. This approach requires a time consuming phase of generating a performance profile for each type of representative kernel on various core configurations available on the device, coupled with a search space exploration. We propose an online tuning technique which utilizes compile time hints and online profiling data to generate high throughput CNN pipelines. We explore core heterogeneity and compatible core-layer configurations through an online guided search. Unlike exhaustive search, we adopt an evolutionary approach with a guided starting point in order to find the solution. We show that by pruning and navigating through the complex search space using compile time hints, 79\% of the tested configurations turn out to be near-optimal candidates for a throughput maximizing pipeline on NVIDIA Jetson TX2 platform.},
address = {New York, NY, USA},
month = may,
numpages = {9},
year = {2021},
}

Downloads

2105_Soomro-CF [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3020

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Jan Martinovic, Stanislav Böhm, Martin Surkovsky, Michele Paolino, Fabrizio Ferrandi, Serena Curzel, Michele Fiorito, Christian Pilato, Stephanie Soldavini, Gianluca Palermo, Dionysios Diamantopoulos, "EVEREST: Definition of Language Requirements", Technical report, EVEREST consortium, Apr 2021. [Bibtex & Downloads]

EVEREST: Definition of Language Requirements

Reference

Jeronimo Castrillon, Felix Wittwer, Karl Friebel, Gerald Hempel, Jan Martinovic, Stanislav Böhm, Martin Surkovsky, Michele Paolino, Fabrizio Ferrandi, Serena Curzel, Michele Fiorito, Christian Pilato, Stephanie Soldavini, Gianluca Palermo, Dionysios Diamantopoulos, "EVEREST: Definition of Language Requirements", Technical report, EVEREST consortium, Apr 2021.

Bibtex

@Report{castrillon_everestD2.2_2021,
author = {Jeronimo Castrillon and Felix Wittwer and Karl Friebel and Gerald Hempel and Jan Martinovic and Stanislav B{\"o}hm and Martin Surkovsky and Michele Paolino and Fabrizio Ferrandi and Serena Curzel and Michele Fiorito and Christian Pilato and Stephanie Soldavini and Gianluca Palermo and Dionysios Diamantopoulos},
institution = {EVEREST consortium},
title = {{EVEREST}: Definition of Language Requirements},
type = {techreport},
url = {https://drive.switch.ch/index.php/s/ddn1yGnHavgzXpB},
month = apr,
year = {2021},
}

Downloads

2104_Castrillon-Everest-D2 [2]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3723

Weihua Sheng, Jeronimo Castrillon, Rainer Leupers, "Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms", Chapter in Multi-Processor System-on-Chip 2 (Liliana Andrade and Frédéric Rousseau), ISTE Ltd London and Wiley Hoboken, pp. 203–235, Mar 2021. [Bibtex & Downloads]

Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms

Reference

Weihua Sheng, Jeronimo Castrillon, Rainer Leupers, "Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms", Chapter in Multi-Processor System-on-Chip 2 (Liliana Andrade and Frédéric Rousseau), ISTE Ltd London and Wiley Hoboken, pp. 203–235, Mar 2021.

Abstract
This chapter addresses the challenges associated with compilation and optimization techniques for heterogeneous multi-core computing systems in the embedded industry. Wireless terminals and modems are typical examples of such systems, which demand high performance and energy efficiency at the same time. To fully exploit the computing power of those systems, the existing compiler technology for single processor systems does not suit the need and scale for multi-core architectures anymore. The authors have applied a systematic approach to tackle the problems of application modeling, source-to-source compilation, flexible compiler infrastructure construction and software distribution for multi-core architectures from a practical perspective. Several real-world multi-core platforms as well as system-level virtual platforms have been successfully used to demonstrate the achievable speed-ups and versatility of the compilation and optimization techniques developed in this work.

Bibtex

@InCollection{sheng_mpsoc21,
author = {Weihua Sheng and Jeronimo Castrillon and Rainer Leupers},
booktitle = {Multi-Processor System-on-Chip 2},
year = {2021},
title = {Software Compilation and Optimization Techniques for Heterogeneous Multi-core Platforms},
chapter = {10},
editor = {Liliana Andrade and Fr{\'e}d{\'e}ric Rousseau},
isbn = {978-17-894-5022-4},
pages = {203--235},
publisher = {ISTE Ltd London and Wiley Hoboken},
abstract = {This chapter addresses the challenges associated with compilation and optimization techniques for heterogeneous multi-core computing systems in the embedded industry. Wireless terminals and modems are typical examples of such systems, which demand high performance and energy efficiency at the same time. To fully exploit the computing power of those systems, the existing compiler technology for single processor systems does not suit the need and scale for multi-core architectures anymore. The authors have applied a systematic approach to tackle the problems of application modeling, source-to-source compilation, flexible compiler infrastructure construction and software distribution for multi-core architectures from a practical perspective. Several real-world multi-core platforms as well as system-level virtual platforms have been successfully used to demonstrate the achievable speed-ups and versatility of the compilation and optimization techniques developed in this work.},
month = mar,
url = {http://www.iste.co.uk/book.php?id=1739}
}

Downloads

2103_Sheng_MPSoC20-preprint [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=3003

Christian Pilato, Stanislav Bohm, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Vojtech Cima, Radim Cmar, Dionysios Diamantopoulos, Fabrizio Ferrandi, Jan Martinovic, Gianluca Palermo, Michele Paolino, Antonio Parodi, Lorenzo Pittaluga, Daniel Raho, Francesco Regazzoni, Katerina Slaninova, Christoph Hagleitner, "EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms", Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE), pp. 1320–1325, Feb 2021. [doi] [Bibtex & Downloads]

EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms

Reference

Christian Pilato, Stanislav Bohm, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Vojtech Cima, Radim Cmar, Dionysios Diamantopoulos, Fabrizio Ferrandi, Jan Martinovic, Gianluca Palermo, Michele Paolino, Antonio Parodi, Lorenzo Pittaluga, Daniel Raho, Francesco Regazzoni, Katerina Slaninova, Christoph Hagleitner, "EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms", Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE), pp. 1320–1325, Feb 2021. [doi]

Bibtex

@InProceedings{pilato_date21,
author = {Christian Pilato and Stanislav Bohm and Fabien Brocheton and Jeronimo Castrillon and Riccardo Cevasco and Vojtech Cima and Radim Cmar and Dionysios Diamantopoulos and Fabrizio Ferrandi and Jan Martinovic and Gianluca Palermo and Michele Paolino and Antonio Parodi and Lorenzo Pittaluga and Daniel Raho and Francesco Regazzoni and Katerina Slaninova and Christoph Hagleitner},
booktitle = {Proceedings of the 2021 Design, Automation and Test in Europe Conference (DATE)},
title = {{EVEREST}: A design environment for extreme-scale big data analytics on heterogeneous platforms},
location = {Virtual Conference},
series = {DATE'21},
month = feb,
year = {2021},
pages = {1320--1325},
url = {https://ieeexplore.ieee.org/document/9473940},
doi = {10.23919/DATE51398.2021.9473940},
}

Downloads

2102_Pilato_DATE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2922

Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Towards Adaptive multi-Alternative Process Network", Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (Bispo, João and Cherubin, Stefano and Flich, José), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 88, pp. 1:1–1:11, Dagstuhl, Germany, Jan 2021. [doi] [Bibtex & Downloads]

Towards Adaptive multi-Alternative Process Network

Reference

Hasna Bouraoui, Chadlia Jerad, Jeronimo Castrillon, "Towards Adaptive multi-Alternative Process Network", Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (Bispo, João and Cherubin, Stefano and Flich, José), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 88, pp. 1:1–1:11, Dagstuhl, Germany, Jan 2021. [doi]

Bibtex

@InProceedings{bouraoui_parma21,
author = {Hasna Bouraoui and Chadlia Jerad and Jeronimo Castrillon},
booktitle = {Proceedings of the 12th Workshop and 10th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'21), co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
title = {Towards Adaptive multi-Alternative Process Network},
doi = {10.4230/OASIcs.PARMA-DITAM.2021.1},
editor = {Bispo, Jo\~{a}o and Cherubin, Stefano and Flich, Jos\'{e}},
isbn = {978-3-95977-181-8},
location = {Budapest, Hungary},
pages = {1:1--1:11},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
series = {Open Access Series in Informatics (OASIcs)},
url = {https://drops.dagstuhl.de/opus/volltexte/2021/13637/pdf/OASIcs-PARMA-DITAM-2021-1.pdf},
volume = {88},
address = {Dagstuhl, Germany},
issn = {2190-6807},
month = jan,
numpages = {10},
urn = {urn:nbn:de:0030-drops-136378},
year = {2021},
}

Downloads

2101_Bouraoui_PARMA [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2945

Christian Menard, Andrés Goens, Gerald Hempel, Robert Khasanov, Julian Robledo, Felix Teweleitt, Jeronimo Castrillon, "Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores", Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 66–73, New York, NY, USA, Jan 2021. (Video Presentation) [doi] [Bibtex & Downloads]

Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores

Reference

Christian Menard, Andrés Goens, Gerald Hempel, Robert Khasanov, Julian Robledo, Felix Teweleitt, Jeronimo Castrillon, "Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores", Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 66–73, New York, NY, USA, Jan 2021. (Video Presentation) [doi]

Abstract
We present Mocasin, an open-source rapid prototyping framework for researching, implementing and validating new algorithms and solutions in the field of mapping software to heterogeneous multi-cores. In contrast to the many existing tools that often specialize for a particular use-case, Mocasin is an open, flexible and generic research environment that abstracts over the approaches taken by other tools. Mocasin is designed to support a wide range of models of computation and input formats, implements manifold mapping strategies and provides an adjustable high-level simulator for performance estimation. This infrastructure serves as a flexible vehicle for exploring new approaches and as a blueprint for building customized tools. We highlight the key design aspects of Mocasin that enable its flexibility and illustrate its capabilities in a case-study showing how Mocasin can be used for building a customized tool for researching runtime mapping strategies in an LTE uplink receiver.

Bibtex

@InProceedings{menard_rapido21,
author = {Christian Menard and Andr{\'e}s Goens and Gerald Hempel and Robert Khasanov and Julian Robledo and Felix Teweleitt and Jeronimo Castrillon},
title = {Mocasin---Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores},
booktitle = {Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2021},
address = {New York, NY, USA},
month = jan,
doi = {10.1145/3444950.3447285},
isbn = {9781450389525},
location = {Budapest, Hungary},
pages = {66--73},
publisher = {Association for Computing Machinery},
series = {DroneSE and RAPIDO '21},
url = {https://doi.org/10.1145/3444950.3447285},
abstract = {We present Mocasin, an open-source rapid prototyping framework for researching, implementing and validating new algorithms and solutions in the field of mapping software to heterogeneous multi-cores. In contrast to the many existing tools that often specialize for a particular use-case, Mocasin is an open, flexible and generic research environment that abstracts over the approaches taken by other tools. Mocasin is designed to support a wide range of models of computation and input formats, implements manifold mapping strategies and provides an adjustable high-level simulator for performance estimation. This infrastructure serves as a flexible vehicle for exploring new approaches and as a blueprint for building customized tools. We highlight the key design aspects of Mocasin that enable its flexibility and illustrate its capabilities in a case-study showing how Mocasin can be used for building a customized tool for researching runtime mapping strategies in an LTE uplink receiver.},
numpages = {8},
}

Downloads

2101_Menard_RAPIDO [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2944


2020
Lars Schütze, Jeronimo Castrillon, "Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages", Proceedings of the 17th International Conference on Managed Programming Languages & Runtimes (MPLR'20), Association for Computing Machinery, pp. 52–62, New York, NY, USA, Nov 2020. [doi] [Bibtex & Downloads]

Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages

Reference

Lars Schütze, Jeronimo Castrillon, "Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages", Proceedings of the 17th International Conference on Managed Programming Languages & Runtimes (MPLR'20), Association for Computing Machinery, pp. 52–62, New York, NY, USA, Nov 2020. [doi]

Abstract
Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. The cost of adaptivity is however a high runtime overhead stemming from executing compositions of behavior-modifying code. It has been shown that the overhead can be reduced by optimizing dispatch plans at runtime for static cases, but no method exists to reduce the overhead in cases with high variability. This paper presents a novel approach to implement polymorphic role dispatch, taking advantage of dependent types and using run-time information to effectively guard abstractions and enable reuse. The concept of polymorphic inline caches is extended to role invocations. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a geometric mean speedup of 4.0× (3.8× up to 4.5×) in the static case, and close to no overhead in the dynamic case over the current implementation of contextual roles in Object Teams.
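
As a rough illustration of the polymorphic inline cache idea mentioned in the abstract, the following is a minimal generic sketch in Python. It is not the paper's role-dispatch implementation and does not model Object Teams; the names `CallSite`, `invoke`, `Circle` and `Square` are hypothetical, and the cache is keyed only on the receiver's runtime type.

```python
class CallSite:
    """One call site with a small per-type cache of resolved methods."""

    def __init__(self, method_name, max_entries=4):
        self.method_name = method_name
        self.max_entries = max_entries
        self.cache = {}   # receiver type -> resolved function
        self.misses = 0   # number of slow-path lookups performed

    def invoke(self, receiver, *args):
        receiver_type = type(receiver)
        # Guard: the cached target is only reused for the same runtime type.
        target = self.cache.get(receiver_type)
        if target is None:
            # Slow path: full lookup, then remember the result if there is room.
            self.misses += 1
            target = getattr(receiver_type, self.method_name)
            if len(self.cache) < self.max_entries:
                self.cache[receiver_type] = target
        return target(receiver, *args)


class Circle:
    def area(self):
        return 3


class Square:
    def area(self):
        return 4


site = CallSite("area")
results = [site.invoke(shape) for shape in [Circle(), Square(), Circle(), Square()]]
```

In this sketch the call site sees two receiver types, so only the first call for each type takes the slow path; later calls are served from the cache.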

Bibtex

@InProceedings{schuetze_mplr20,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
booktitle = {Proceedings of the 17th International Conference on Managed Programming Languages \& Runtimes (MPLR'20)},
title = {Efficient Dispatch of Multi-Object Polymorphic Call Sites in Contextual Role-Oriented Programming Languages},
location = {Virtual, UK},
pages = {52--62},
numpages = {11},
series = {MPLR'20},
abstract = {Adaptive software becomes more and more important as computing is increasingly context-dependent. Runtime adaptability can be achieved by dynamically selecting and applying context-specific code. Role-oriented programming has been proposed as a paradigm to enable runtime adaptive software by design. Roles change the objects’ behavior at runtime, thus adapting the software to a given context. The cost of adaptivity is however a high runtime overhead stemming from executing compositions of behavior-modifying code. It has been shown that the overhead can be reduced by optimizing dispatch plans at runtime for static cases, but no method exists to reduce the overhead in cases with high variability. This paper presents a novel approach to implement polymorphic role dispatch, taking advantage of dependent types and using run-time information to effectively guard abstractions and enable reuse. The concept of polymorphic inline caches is extended to role invocations. We evaluate the implementation with a benchmark for role-oriented programming languages achieving a geometric mean speedup of 4.0$\times$ (3.8$\times$ up to 4.5$\times$) in the static case, and close to no overhead in the dynamic case over the current implementation of contextual roles in Object Teams.},
year = {2020},
month = nov,
isbn = {9781450388535},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3426182.3426186},
doi = {10.1145/3426182.3426186},
}

Downloads

2010_Schuetze_MPLR [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2874

Robert Wittig, Andrés Goens, Christian Menard, Emil Matus, Gerhard P. Fettweis, Jeronimo Castrillon, "Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach", Proceedings of the 27th International Conference on Telecommunications (ICT), pp. 1–5, Oct 2020. [doi] [Bibtex & Downloads]

Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach

Reference

Robert Wittig, Andrés Goens, Christian Menard, Emil Matus, Gerhard P. Fettweis, Jeronimo Castrillon, "Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach", Proceedings of the 27th International Conference on Telecommunications (ICT), pp. 1–5, Oct 2020. [doi]

Abstract
In the era of 5G and beyond, adaptive workloads and the need for energy efficiency drive are becoming increasingly vital. Changes in parameters of the physical layer algorithm can cascade throughout the algorithm, requiring additional changes to keep a correct functionality within the timing bounds. These factors drive the process of designing systems for mobile communication towards reconfigurability. In this paper we analyze the trade-offs involved in changing algorithmic parameters and show how reconfigurable systems can be used to produce energy-efficient systems. We argue that we ought to resort to formal models to tame this reconfigurability and examine where existing formal models fall short.

Bibtex

@InProceedings{goens_ict20,
author = {Robert Wittig and Andr{\'e}s Goens and Christian Menard and Emil Matus and Gerhard P. Fettweis and Jeronimo Castrillon},
booktitle = {Proceedings of the 27th International Conference on Telecommunications (ICT)},
title = {Modem Design in the Era of 5G and Beyond: The Need for a Formal Approach},
location = {Virtual. Bali, Indonesia},
month = oct,
abstract = {In the era of 5G and beyond, adaptive workloads and the need for energy efficiency drive are becoming increasingly vital. Changes in parameters of the physical layer algorithm can cascade throughout the algorithm, requiring additional changes to keep a correct functionality within the timing bounds. These factors drive the process of designing systems for mobile communication towards reconfigurability. In this paper we analyze the trade-offs involved in changing algorithmic parameters and show how reconfigurable systems can be used to produce energy-efficient systems. We argue that we ought to resort to formal models to tame this reconfigurability and examine where existing formal models fall short.},
year = {2020},
pages = {1--5},
doi = {10.1109/ICT49546.2020.9239539},
url = {https://ieeexplore.ieee.org/document/9239539},
}

Downloads

2010_Wittig_ICT [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2648

Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi] [Bibtex & Downloads]

Polyhedral Compilation for Racetrack Memories

Reference

Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20), IEEE Press, vol. 39, no. 11, pp. 3968–3980, Oct 2020. [doi]

Abstract
Traditional memory hierarchy designs, primarily based on SRAM and DRAM, become increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent to nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While there are first techniques to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85% fewer shifts (average 41%), improving both performance and energy consumption by an average of 17.9% and 39.8%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.
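
The linear-latency behavior described above can be illustrated with a toy cost model. This is a minimal sketch, not the paper's model or its Polly extension: it assumes a single track with a single access port, counts one shift per position moved, and uses the hypothetical helper name `count_shifts`.

```python
def count_shifts(accesses, start=0):
    """Total shifts needed to serve a sequence of positions on one track."""
    shifts = 0
    position = start
    for target in accesses:
        # The cost of an access grows linearly with the distance to the previous one.
        shifts += abs(target - position)
        position = target
    return shifts


# The same eight locations, visited in two different orders.
strided = [0, 4, 1, 5, 2, 6, 3, 7]
sequential = sorted(strided)

strided_cost = count_shifts(strided)
sequential_cost = count_shifts(sequential)
print(strided_cost, sequential_cost)
```

Under these assumptions, visiting consecutive locations needs far fewer shifts than the strided order, which is the effect an access-reordering optimization aims to exploit.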

Bibtex

@Article{khan_cases20,
author = {Asif Ali Khan and Hauke Mewes and Tobias Grosser and Torsten Hoefler and Jeronimo Castrillon},
title = {Polyhedral Compilation for Racetrack Memories},
journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Special issue on Compilers, Architecture, and Synthesis of Embedded Systems (CASES'20)},
year = {2020},
series = {CASES '20},
month = oct,
doi = {10.1109/TCAD.2020.3012266},
url = {https://ieeexplore.ieee.org/document/9216560},
volume = {39},
number = {11},
pages = {3968--3980},
issn = {1937-4151},
abstract = {Traditional memory hierarchy designs, primarily based on SRAM and DRAM, become increasingly unsuitable to meet the performance, energy, bandwidth and area requirements of modern embedded and high-performance computer systems. Racetrack Memory (RTM), an emerging non-volatile memory technology, promises to meet these conflicting demands by offering simultaneously high speed, higher density, and non-volatility. RTM provides these efficiency gains by not providing immediate access to all storage locations, but by instead storing data sequentially in the equivalent to nanoscale tapes called tracks. Before any data can be accessed, explicit shift operations must be issued that cost energy and increase access latency. The result is a fundamental change in memory performance behavior: the address distance between subsequent memory accesses now has a linear effect on memory performance. While there are first techniques to optimize programs for linear-latency memories such as RTM, existing automatic solutions treat only scalar memory accesses. This work presents the first automatic compilation framework that optimizes static loop programs over arrays for linear-latency memories. We extend the polyhedral compilation framework Polly to generate code that maximizes accesses to the same or consecutive locations, thereby minimizing the number of shifts. Our experimental results show that the optimized code incurs up to 85\% fewer shifts (average 41\%), improving both performance and energy consumption by an average of 17.9\% and 39.8\%, respectively. Our results show that automatic techniques make it possible to effectively program linear-latency memory architectures such as RTM.},
booktitle = {Proceedings of the 2020 International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES)},
location = {Virtual conference},
numpages = {12},
publisher = {IEEE Press},
}

Downloads

2009_Khan_CASES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2833

Jeronimo Castrillon, "The role of domain-specific languages for cyber-physical systems", In Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk), Oct 2020. [Bibtex & Downloads]

The role of domain-specific languages for cyber-physical systems

Reference

Jeronimo Castrillon, "The role of domain-specific languages for cyber-physical systems", In Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk), Oct 2020.

Abstract
Embedded and cyber-physical systems (CPS) are heterogeneous interconnected computing systems with an ever increasing complexity. As CPSs become more widespread, the developer community widens, exposing the complexity to mainstream programmers. In this talk, I will talk about domain-specific languages (DSLs) as a promising avenue to handle complexity without compromising on efficiency. The talk will provide background on programming languages and go over sample DSLs from different communities. An in-depth example will serve to grasp the power that lies in DSLs for efficiency, correctness and ease to target complex emerging systems.

Bibtex

@Misc{castrillon_ensi20,
author = {Castrillon, Jeronimo},
title = {The role of domain-specific languages for cyber-physical systems},
year = {2020},
howpublished = {Seminar series: Design and Programming Cyber-Physical Systems and IoT Applications (invited talk)},
location = {Tunis, Tunisia (Virtual)},
month = oct,
abstract = {Embedded and cyber-physical systems (CPS) are heterogeneous interconnected computing systems with an ever increasing complexity. As CPSs become more widespread, the developer community widens, exposing the complexity to mainstream programmers. In this talk, I will talk about domain-specific languages (DSLs) as a promising avenue to handle complexity without compromising on efficiency. The talk will provide background on programming languages and go over sample DSLs from different communities. An in-depth example will serve to grasp the power that lies in DSLs for efficiency, correctness and ease to target complex emerging systems. },
url = {https://sites.google.com/ensi-uma.tn/seminar-series-on-cps-n-iot/home}
}

Downloads

201006_castrill_ENSI-compressed [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2905

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914–1927, Oct 2020. [doi] [Bibtex & Downloads]

Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling

Reference

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling", In IEEE Transactions on Computers, vol. 70, no. 11, pp. 1914–1927, Oct 2020. [doi]

Abstract
In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7% compared to a state-of-the-art direct-mapped LLC design and by 7.2% compared to an existing associative LLC design.

Bibtex

@Article{hameed_tc20,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Improving the Performance of Block-based DRAM Caches via Tag-Data Decoupling},
journal = {IEEE Transactions on Computers},
year = {2020},
month = oct,
abstract = {In-package DRAM-based Last-Level-Caches (LLCs) that cache data in small chunks (i.e., blocks) are promising for improving system performance due to their efficient main memory bandwidth utilization. However, in these high-capacity DRAM caches, managing metadata (i.e., tags) at low cost is challenging. Storing the tags in SRAM has the advantage of quick tag access but is impractical due to a large area overhead. Storing the tags in DRAM reduces the area overhead but incurs tag serialization latency for an associative LLC design, which is inevitable for achieving high cache hit rate. To address the area and latency overhead problem, we propose a block-based DRAM LLC design that decouples tag and data into two regions in DRAM. Our design stores the tags in a latency-optimized DRAM region as the tags are accessed more often than the data. In contrast, we optimize the data region for area efficiency and map spatially-adjacent cache blocks to the same DRAM row to exploit spatial locality. Our design mitigates the tag serialization latency of existing associative DRAM LLCs via selective in-DRAM tag comparison, which overlaps the latency of tag and data accesses. This efficiently enables LLC bypassing via a novel DRAM Absence Table (DAT) that not only provides fast LLC miss detection but also reduces in-package bandwidth requirements. Our evaluation using SPEC2006 benchmarks shows that our tag-data decoupled LLC improves system performance by 11.7\% compared to a state-of-the-art direct-mapped LLC design and by 7.2\% compared to an existing associative LLC design.},
doi = {10.1109/TC.2020.3029615},
url = {https://ieeexplore.ieee.org/document/9220805},
issn = {0018-9340},
numpages = {14},
volume = {70},
number = {11},
pages = {1914--1927},
}

Downloads

2010_Hameed_TC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2880

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi] [Bibtex & Downloads]

Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories

Reference

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020. [doi]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32% and 73% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80%.

Bibtex

@Article{khan_tecs20,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2020},
month = sep,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {6},
issn = {1539-9087},
url = {https://doi.org/10.1145/3396235},
doi = {10.1145/3396235},
articleno = {44},
numpages = {26},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip/off-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory such as memory access order, data mapping and the choice of a suitable memory access granularity are employed to reduce the contention in the off-chip memory. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 32\% and 73\% respectively compared to an iso-capacity SRAM. The overall DRAM dynamic energy consumption improvements due to memory optimizations amount to 80\%.},
}

Downloads

2009_Khan_TECS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2649

Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers", In Proceedings: 2020 Forum for Specification and Design Languages (FDL), pp. 1-4, Sep 2020. [doi] [Bibtex & Downloads]

ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers

Reference

Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon, "ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers", In Proceedings: 2020 Forum for Specification and Design Languages (FDL), pp. 1-4, Sep 2020. [doi]

Abstract
Deep Learning methods have not only been shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.

Bibtex

@InProceedings{brauckmann_fdl20,
author = {Alexander Brauckmann and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers},
booktitle = {2020 Forum for Specification and Design Languages (FDL)},
year = {2020},
location = {Kiel, Germany},
month = sep,
pages={1--4},
doi={10.1109/FDL50818.2020.9232946},
url = {https://ieeexplore.ieee.org/document/9232946},
abstract = {Deep Learning methods have not only been shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.},
}

Downloads

2009_Brauckmann_FDL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2842

Marten Lohstroh, Christian Menard, Alexander Schulz-Rosengarten, Matthew Weber, Jeronimo Castrillon, Edward A. Lee, "A Language for Deterministic Coordination Across Multiple Timelines", In Proceedings: 2020 Forum for Specification and Design Languages (FDL), pp. 1-8, Sep 2020. (Best paper award candidate) [doi] [Bibtex & Downloads]

A Language for Deterministic Coordination Across Multiple Timelines

Reference

Marten Lohstroh, Christian Menard, Alexander Schulz-Rosengarten, Matthew Weber, Jeronimo Castrillon, Edward A. Lee, "A Language for Deterministic Coordination Across Multiple Timelines", In Proceedings: 2020 Forum for Specification and Design Languages (FDL), pp. 1-8, Sep 2020. (Best paper award candidate) [doi]

Abstract
We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model which incorporates a multiplicity of timelines. This model is central to LINGUA FRANCA (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called Reactors, which are objects that react to and emit discrete events. What sets LF apart from other languages that treat time as a first-class citizen is that it confronts the issue that in any reactive system there are at least two distinct timelines involved: a logical one and a physical one, and possibly multiple of each kind. LF provides a mechanism for relating events across timelines, and guarantees deterministic program behavior under quantifiable assumptions.

Bibtex

@InProceedings{lohstroh_fdl20,
author = {Marten Lohstroh and Christian Menard and Alexander Schulz-Rosengarten and Matthew Weber and Jeronimo Castrillon and Edward A. Lee},
title = {A Language for Deterministic Coordination Across Multiple Timelines},
booktitle = {2020 Forum for Specification and Design Languages (FDL)},
year = {2020},
location = {Kiel, Germany},
month = sep,
abstract = {We discuss a novel approach for constructing deterministic reactive systems that revolves around a temporal model which incorporates a multiplicity of timelines. This model is central to LINGUA FRANCA (LF), a polyglot coordination language and compiler toolchain we are developing for the definition and composition of concurrent components called Reactors, which are objects that react to and emit discrete events. What sets LF apart from other languages that treat time as a first-class citizen is that it confronts the issue that in any reactive system there are at least two distinct timelines involved: a logical one and a physical one, and possibly multiple of each kind. LF provides a mechanism for relating events across timelines, and guarantees deterministic program behavior under quantifiable assumptions.},
pages={1--8},
doi={10.1109/FDL50818.2020.9232939},
url = {https://ieeexplore.ieee.org/document/9232939},

}

Downloads

2009_Lohstroh_FDL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2843

Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, Éder F. Zulian, "The gem5 Simulator: Version 20.0+", In arXiv preprint arXiv:2007.03152, Jul 2020. [Bibtex & Downloads]

The gem5 Simulator: Version 20.0+

Reference

Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, Éder F. Zulian, "The gem5 Simulator: Version 20.0+", In arXiv preprint arXiv:2007.03152, Jul 2020.

Abstract
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.

Bibtex

@article{lowe-power_gem5_2020,
author={Jason Lowe-Power and Abdul Mutaal Ahmad and Ayaz Akram and Mohammad Alian and Rico Amslinger and Matteo Andreozzi and Adri{\`a} Armejach and Nils Asmussen and Brad Beckmann and Srikant Bharadwaj and Gabe Black and Gedare Bloom and Bobby R. Bruce and Daniel Rodrigues Carvalho and Jeronimo Castrillon and Lizhong Chen and Nicolas Derumigny and Stephan Diestelhorst and Wendy Elsasser and Carlos Escuin and Marjan Fariborz and Amin Farmahini-Farahani and Pouya Fotouhi and Ryan Gambord and Jayneel Gandhi and Dibakar Gope and Thomas Grass and Anthony Gutierrez and Bagus Hanindhito and Andreas Hansson and Swapnil Haria and Austin Harris and Timothy Hayes and Adrian Herrera and Matthew Horsnell and Syed Ali Raza Jafri and Radhika Jagtap and Hanhwi Jang and Reiley Jeyapaul and Timothy M. Jones and Matthias Jung and Subash Kannoth and Hamidreza Khaleghzadeh and Yuetsu Kodama and Tushar Krishna and Tommaso Marinelli and Christian Menard and Andrea Mondelli and Miquel Moreto and Tiago M{\"u}ck and Omar Naji and Krishnendra Nathella and Hoa Nguyen and Nikos Nikoleris and Lena E. Olson and Marc Orr and Binh Pham and Pablo Prieto and Trivikram Reddy and Alec Roelke and Mahyar Samani and Andreas Sandberg and Javier Setoain and Boris Shingarov and Matthew D. Sinclair and Tuan Ta and Rahul Thakur and Giacomo Travaglini and Michael Upton and Nilay Vaish and Ilias Vougioukas and William Wang and Zhengrong Wang and Norbert Wehn and Christian Weis and David A. Wood and Hongil Yoon and {\'E}der F. Zulian},
title = {The gem5 Simulator: Version 20.0+},
journal = {arXiv preprint arXiv:2007.03152},
url = {https://arxiv.org/abs/2007.03152},
year = {2020},
month = jul,
abstract = {The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.},
}

Downloads

2007_Lowe-Power-Gem5 [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2841

Christian Menard, Andrés Goens, Marten Lohstroh, Jeronimo Castrillon, "Achieving Determinism in Adaptive AUTOSAR", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 822–827, Mar 2020. (Best paper award candidate A-Track, Video Presentation) [doi] [Bibtex & Downloads]

Achieving Determinism in Adaptive AUTOSAR

Reference

Christian Menard, Andrés Goens, Marten Lohstroh, Jeronimo Castrillon, "Achieving Determinism in Adaptive AUTOSAR", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 822–827, Mar 2020. (Best paper award candidate A-Track, Video Presentation) [doi]

Abstract
AUTOSAR AP is an emerging industry standard that tackles the challenges of modern automotive software design, but does not provide adequate mechanisms to enforce deterministic execution. This poses profound challenges to testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application that is provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.

Bibtex

@InProceedings{menard_date20,
author = {Christian Menard and Andr{\'e}s Goens and Marten Lohstroh and Jeronimo Castrillon},
title = {Achieving Determinism in Adaptive AUTOSAR},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
month = mar,
publisher = {IEEE},
location = {Grenoble, France},

abstract = {AUTOSAR AP is an emerging industry standard that tackles the challenges of modern automotive software design, but does not provide adequate mechanisms to enforce deterministic execution. This poses profound challenges to testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application that is provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.},
isbn = {978-3-9819263-4-7},
pages = {822--827},
doi = {10.23919/DATE48585.2020.9116430},
url = {https://ieeexplore.ieee.org/abstract/document/9116430},
}

Downloads

2003_Menard_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2545

Robert Khasanov, Jeronimo Castrillon, "Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 909–914, Mar 2020. (Best paper award candidate E-Track, Video Presentation) [doi] [Bibtex & Downloads]

Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping

Reference

Robert Khasanov, Jeronimo Castrillon, "Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 909–914, Mar 2020. (Best paper award candidate E-Track, Video Presentation) [doi]

Abstract
Modern embedded computing platforms consist of a high amount of heterogeneous resources, which allows executing multiple applications on a single device. The number of running applications on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varied degree of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.

Bibtex

@InProceedings{khasanov_date20,
author = {Robert Khasanov and Jeronimo Castrillon},
title = {Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},

month = mar,
publisher = {IEEE},
location = {Grenoble, France},
isbn = {978-3-9819263-4-7},
pages = {909--914},
doi = {10.23919/DATE48585.2020.9116381},
url = {https://ieeexplore.ieee.org/document/9116381},
abstract = {Modern embedded computing platforms consist of a high amount of heterogeneous resources, which allows executing multiple applications on a single device. The number of running applications on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varied degree of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13\% while incurring an order of magnitude less scheduling overhead.},
}

Downloads

2003_Khasanov_DATE [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2546

Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi] [Bibtex & Downloads]

Generalized Data Placement Strategies for Racetrack Memories

Reference

Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020. (Video Presentation) [doi]

Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the number of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46% and 55% respectively compared to the state-of-the-art.

Bibtex

@InProceedings{khan_date20,
author = {Asif Ali Khan and Andr{\'e}s Goens and Fazal Hameed and Jeronimo Castrillon},
title = {Generalized Data Placement Strategies for Racetrack Memories},
booktitle = {Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE)},
year = {2020},
series = {DATE '20},
publisher = {IEEE},
location = {Grenoble, France},
month = mar,
isbn = {978-3-9819263-4-7},
pages = {1502--1507},
doi = {10.23919/DATE48585.2020.9116245},
url = {https://ieeexplore.ieee.org/document/9116245},

abstract = {Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the number of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3x, 46\% and 55\% respectively compared to the state-of-the-art.},
}

Downloads

2003_Khan_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2547

Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi] [Bibtex & Downloads]

Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade

Reference

Robin Bläsing, Asif Ali Khan, Panagiotis Ch. Filippou, Chirag Garg, Fazal Hameed, Jeronimo Castrillon, Stuart S. P. Parkin, "Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade", In Proceedings of the IEEE, vol. 108, no. 8, pp. 1303-1321, Mar 2020. [doi]

Bibtex

@Article{khan_pieee20,
author = {Robin Bl{\"a}sing and Asif Ali Khan and Panagiotis Ch. Filippou and Chirag Garg and Fazal Hameed and Jeronimo Castrillon and Stuart S. P. Parkin},
title = {Magnetic Racetrack Memory: From Physics to the Cusp of Applications within a Decade},
journal = {Proceedings of the IEEE},
year = {2020},
month = mar,
volume={108},
number={8},
pages={1303--1321},
doi = {10.1109/JPROC.2020.2975719},
url = {https://ieeexplore.ieee.org/document/9045991},
}

Downloads

2003_Khan_JPROC [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2599

Marten Lohstroh, Íñigo Íncer Romero, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, Alberto Sangiovanni-Vincentelli, "Reactors: A Deterministic Model for Composable Reactive Systems", Cyber Physical Systems. Model-Based Design – Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019) (Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid), Springer International Publishing, pp. 59–85, Cham, Feb 2020. [doi] [Bibtex & Downloads]

Reactors: A Deterministic Model for Composable Reactive Systems

Reference

Marten Lohstroh, Íñigo Íncer Romero, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, Alberto Sangiovanni-Vincentelli, "Reactors: A Deterministic Model for Composable Reactive Systems", Cyber Physical Systems. Model-Based Design – Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019) (Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid), Springer International Publishing, pp. 59–85, Cham, Feb 2020. [doi]

Abstract
This paper describes a component-based concurrent model of computation for reactive systems. The components in this model, featuring ports and hierarchy, are called reactors. The model leverages a semantic notion of time, an event scheduler, and a synchronous-reactive style of communication to achieve determinism. Reactors enable a programming model that ensures determinism, unless explicitly abandoned by the programmer. We show how the coordination of reactors can safely and transparently exploit parallelism, both in shared-memory and distributed systems.

Bibtex

@InProceedings{Lohstroh_cyphy19,
author = {Marten Lohstroh and {\'I}{\~n}igo {\'I}ncer Romero and Andr\'{e}s Goens and Patricia Derler and Jeronimo Castrillon and Edward A. Lee and Alberto Sangiovanni-Vincentelli},
title = {Reactors: A Deterministic Model for Composable Reactive Systems},
editor= {Chamberlain, Roger and Edin Grimheden, Martin and Taha, Walid},
booktitle={Cyber Physical Systems. Model-Based Design -- Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019)},
year = {2020},
location = {New York City, NY, USA},
month = feb,
publisher={Springer International Publishing},
address={Cham},
pages={59--85},
abstract={This paper describes a component-based concurrent model of computation for reactive systems. The components in this model, featuring ports and hierarchy, are called reactors. The model leverages a semantic notion of time, an event scheduler, and a synchronous-reactive style of communication to achieve determinism. Reactors enable a programming model that ensures determinism, unless explicitly abandoned by the programmer. We show how the coordination of reactors can safely and transparently exploit parallelism, both in shared-memory and distributed systems.},
isbn={978-3-030-41131-2},
url = {https://link.springer.com/chapter/10.1007/978-3-030-41131-2_4},
doi = {10.1007/978-3-030-41131-2_4},
numpages = {27},
}

Downloads

1910_Lohstroh_CyPhy [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2532

Alexander Brauckmann, Andrés Goens, Sebastian Ertel, Jeronimo Castrillon, "Compiler-Based Graph Representations for Deep Learning Models of Code", Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020), Association for Computing Machinery, pp. 201–211, New York, NY, USA, Feb 2020. [doi] [Bibtex & Downloads]

Compiler-Based Graph Representations for Deep Learning Models of Code

Reference

Alexander Brauckmann, Andrés Goens, Sebastian Ertel, Jeronimo Castrillon, "Compiler-Based Graph Representations for Deep Learning Models of Code", Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020), Association for Computing Machinery, pp. 201–211, New York, NY, USA, Feb 2020. [doi]

Bibtex

@InProceedings{brauckmann_cc20,
author = {Alexander Brauckmann and Andr\'{e}s Goens and Sebastian Ertel and Jeronimo Castrillon},
title = {Compiler-Based Graph Representations for Deep Learning Models of Code},
booktitle = {Proceedings of the 29th ACM SIGPLAN International Conference on Compiler Construction (CC 2020)},
year = {2020},
isbn = {9781450371209},
url = {https://doi.org/10.1145/3377555.3377894},
doi = {10.1145/3377555.3377894},
series = {CC 2020},
pages = {201--211},
numpages = {11},
publisher = {Association for Computing Machinery},
location = {San Diego, CA, USA},
month = feb,
address = {New York, NY, USA},
keywords = {conf},
}

Downloads

2002_Brauckmann_CC [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2561


2019

ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0

Reference

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019. [doi]

Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.
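The cost that such data placement aims to reduce can be illustrated with a small sketch. It assumes a single track with one access port, where moving from one offset to another costs the absolute distance in shifts. This cost model, the function name `shift_cost`, and the example trace are assumptions for illustration and are not taken from the paper.

```python
def shift_cost(trace, placement):
    """Total shifts for an access trace, assuming one track and one access port.

    placement maps each variable to its offset on the track; moving the
    port from offset a to offset b is assumed to cost |a - b| shifts.
    """
    offsets = [placement[v] for v in trace]
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


# Hypothetical access trace over four variables.
trace = ["a", "c", "a", "c", "b", "d", "b", "d"]

# Placement in declaration order versus a placement that keeps
# consecutively accessed variables next to each other.
naive = {"a": 0, "b": 1, "c": 2, "d": 3}
grouped = {"a": 0, "c": 1, "b": 2, "d": 3}

naive_cost = shift_cost(trace, naive)
grouped_cost = shift_cost(trace, grouped)
```

Under these assumptions the grouped placement needs 7 shifts where the declaration-order placement needs 13, which is the kind of gap that placement heuristics try to exploit.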

Bibtex

@Article{khan_taco19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart S. P. Parkin and Jeronimo Castrillon},
title = {ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2019},
volume = {16},
number = {4},
month = dec,
year = {2019},
issn = {1544-3566},
pages = {56:1--56:23},
articleno = {56},
numpages = {23},
url = {http://doi.acm.org/10.1145/3372489},
doi = {10.1145/3372489},
acmid = {3372489},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This paper presents data placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5\%, outperforming the state of the art by up to 16.1\%.},
}

Downloads

1912_Khan_TACO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2289


A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement

Reference

Fazal Hameed, Jeronimo Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 27, no. 10, pp. 2375-2386, Oct 2019. [doi]

Abstract
High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78% and improves the average performance by 13% compared to a recently proposed STT-RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24% and 17.1% respectively.

Bibtex

@Article{hameed_tvlsi19,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {A Novel Hybrid {DRAM}/{STT-RAM} {L}ast-{L}evel-{C}ache Architecture for Performance, Energy and Endurance Enhancement},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2019},
month = oct,
abstract = {High capacity L4 architectures as Last-Level-Cache (LLC) have been recently introduced between L3-SRAM and off-chip memory. These LLC architectures have either employed DRAM or Spin-Transfer-Torque (STT-RAM) memory technologies. It is a known fact that DRAM LLCs feature a higher energy consumption while STT-RAM LLCs feature a lower write endurance compared to their counterparts. This paper proposes an efficient hybrid DRAM/STT-RAM LLC architecture that exploits the best characteristics offered by the individual memory technologies while mitigating their drawbacks. More precisely, we introduce a novel mechanism for the storage and management of the hybrid LLC tags, and a proactive L3-SRAM writeback policy that combines multiple dirty blocks that are mapped to the same LLC row. Our hybrid architecture reduces LLC interference by having less writeback accesses and row fetches. The endurance is improved by reducing the number of STT-RAM block writes. We show that our LLC architecture reduces the total number of STT-RAM block writes by 78\% and improves the average performance by 13\% compared to a recently proposed STT-RAM LLC. Compared to the state-of-the-art DRAM LLC, we report an average energy and performance improvement of 24\% and 17.1\% respectively.},
volume = {27},
number = {10},
pages = {2375--2386},
numpages = {12},
doi={10.1109/TVLSI.2019.2918385},
url = {https://ieeexplore.ieee.org/document/8734763},
}

Downloads

1905_Hameed_TVLSI [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2454


Efficient Late Binding of Dynamic Function Compositions

Reference

Lars Schütze, Jeronimo Castrillon, "Efficient Late Binding of Dynamic Function Compositions", Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering, ACM, pp. 141–151, New York, NY, USA, Oct 2019. [doi]

Bibtex

@InProceedings{schuetze_sle19,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
title = {Efficient Late Binding of Dynamic Function Compositions},
booktitle = {Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering},
year = {2019},
series = {SLE 2019},
address = {New York, NY, USA},
month = oct,
publisher = {ACM},
keywords = {conf},
location = {Athens, Greece},
isbn = {978-1-4503-6981-7},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3357766.3359543},
doi = {10.1145/3357766.3359543},
acmid = {3359543},
}

Downloads

1910_Schuetze_SLE [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2491


RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers

Reference

Tobias Reiher, Alexander Senier, Jeronimo Castrillon, Thorsten Strufe, "RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers", In Proceeding: International Conference on Formal Aspects of Component Software (Arbab, Farhad and Jongmans, Sung-Shik), Springer International Publishing, pp. 170–190, Cham, Oct 2019. [doi]

Abstract
Various vulnerabilities have been found in message parsers of protocol implementations in the past. Even highly sensitive software components like TLS libraries are affected regularly. Resulting issues range from denial-of-service attacks to the extraction of sensitive information. The complexity of protocols and imprecise specifications in natural language are the core reasons for subtle bugs in implementations, which are hard to find. The lack of precise specifications impedes formal verification.

Bibtex

@InProceedings{reiher_facs19,
author = {Tobias Reiher and Alexander Senier and Jeronimo Castrillon and Thorsten Strufe},
title = {RecordFlux: Formal Message Specification and Generation of Verifiable Binary Parsers},
booktitle = {International Conference on Formal Aspects of Component Software},
year = {2019},
editor = {Arbab, Farhad and Jongmans, Sung-Shik},
organization = {Springer},
publisher = {Springer International Publishing},
location = {Amsterdam, The Netherlands},
address = {Cham},
month = oct,
pages = {170--190},
numpages = {21},
abstract = {Various vulnerabilities have been found in message parsers of protocol implementations in the past. Even highly sensitive software components like TLS libraries are affected regularly. Resulting issues range from denial-of-service attacks to the extraction of sensitive information. The complexity of protocols and imprecise specifications in natural language are the core reasons for subtle bugs in implementations, which are hard to find. The lack of precise specifications impedes formal verification.},
isbn={978-3-030-40914-2},
doi = {10.1007/978-3-030-40914-2_9},
url = {https://link.springer.com/chapter/10.1007/978-3-030-40914-2_9},
}

Downloads

1910_Reiher_FACS [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2531


Embedded manycore programming: From auto-parallelization to domain specific languages

Reference

Jeronimo Castrillon, "Embedded manycore programming: From auto-parallelization to domain specific languages", In IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019) (keynote), Oct 2019.

Abstract
Programming manycores remains a daunting task, especially in the presence of the heterogeneity and application constraints typical in the embedded domain. This talk reviews efforts to cope with this complexity from the last 10+ years of research. It starts with the challenges faced by auto-parallelizing compilers, discussing how far they have made it since the start of the multi-core era. The talk also reviews explicit parallel programming and associated programming methodologies, with focus on recent advances that aim at increasing the adaptivity and robustness of dataflow applications. The talk then advocates for even higher-level programming abstractions in the form of domain specific languages, particularly important to deal with the increased complexity brought by emerging computing paradigms.

Bibtex

@Misc{castrillon_mcsoc2019,
author = {Castrillon, Jeronimo},
title = {Embedded manycore programming: From auto-parallelization to domain specific languages},
howpublished = {IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019) (keynote)},
month = oct,
year = {2019},
abstract = {Programming manycores remains a daunting task, especially in the presence of the heterogeneity and application constraints typical in the embedded domain. This talk reviews efforts to cope with this complexity from the last 10+ years of research. It starts with the challenges faced by auto-parallelizing compilers, discussing how far they have made it since the start of the multi-core era. The talk also reviews explicit parallel programming and associated programming methodologies, with focus on recent advances that aim at increasing the adaptivity and robustness of dataflow applications. The talk then advocates for even higher-level programming abstractions in the form of domain specific languages, particularly important to deal with the increased complexity brought by emerging computing paradigms.},
url = {http://mcsoc-forum.org/m2019/wp-content/uploads/2019/10/191002_castrillon_mcsoc-opt.pdf},
keywords = {invitedtalk},
location = {Singapore},
}

Downloads

191002_castrillon_mcsoc-opt [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2534


Dataflow and higher level abstractions for parallel programming

Reference

Jeronimo Castrillon, "Dataflow and higher level abstractions for parallel programming", In CPS Summer School 2019: Designing Cyber-Physical Systems - From concepts to implementation (keynote), Sep 2019.

Abstract
Computing systems continue to increase in complexity, today including multiple cores, complex memory hierarchies and domain-specific accelerators, and soon with components built with emerging hardware technologies. This complexity calls for advances in a variety of domains, like programming and modeling languages, models of hardware, system simulators, design exploration methodologies and hardware architectures. From the standpoint of programming languages and compilers, this lecture discusses the challenges in mainstream sequential programming to motivate higher-level abstractions. It then provides an introduction to dataflow programming methodologies as a promising solution for embedded applications. We will review the fundamentals of dataflow models of computation, basic programming methodologies and look at current research to account for the adaptivity that new applications require, especially in the context of cyber physical systems. The lecture closes with an outlook on higher level programming abstractions and challenges posed by emerging computing architectures.

Bibtex

@Misc{castrillon_cpss19,
author = {Castrillon, Jeronimo},
title = {Dataflow and higher level abstractions for parallel programming},
howpublished = {CPS Summer School 2019: {Designing Cyber-Physical Systems - From concepts to implementation (keynote)}},
month = sep,
year = {2019},
abstract = {Computing systems continue to increase in complexity, today including multiple cores, complex memory hierarchies and domain-specific accelerators, and soon with components built with emerging hardware technologies. This complexity calls for advances in a variety of domains, like programming and modeling languages, models of hardware, system simulators, design exploration methodologies and hardware architectures. From the standpoint of programming languages and compilers, this lecture discusses the challenges in mainstream sequential programming to motivate higher-level abstractions. It then provides an introduction to dataflow programming methodologies as a promising solution for embedded applications. We will review the fundamentals of dataflow models of computation, basic programming methodologies and look at current research to account for the adaptivity that new applications require, especially in the context of cyber physical systems. The lecture closes with an outlook on higher level programming abstractions and challenges posed by emerging computing architectures.},
keywords = {invitedtalk},
location = {Alghero, Sardinia, Italy},
project = {cfaed, haec},
url = {http://www.cpsschool.eu/dataflow-and-higher-level-abstractions-for-parallel-programming/}
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2533


STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism

Reference

Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism", Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell, ACM, pp. 146–161, New York, NY, USA, Aug 2019. [doi]

Abstract
Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: How can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution.
Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations.
In this paper, we propose a programming model that enables the composition of stateful computations and still exposes opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach using monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions and our programming model resolves short-comings of the stream-based programming model for current state-of-the-art big data processing systems.

Bibtex

@InProceedings{ertel_haskell19,
author = {Ertel, Sebastian and Adam, Justus and Rink, Norman A. and Goens, Andr{\'e}s and Castrillon, Jeronimo},
title = {{STCLang}: State Thread Composition as a Foundation for Monadic Dataflow Parallelism},
booktitle = {Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell},
year = {2019},
series = {Haskell 2019},
pages = {146--161},
address = {New York, NY, USA},
month = aug,
publisher = {ACM},
abstract = {Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: How can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution.
Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations.
In this paper, we propose a programming model that enables the composition of stateful computations and still exposes opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach using monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions and our programming model resolves short-comings of the stream-based programming model for current state-of-the-art big data processing systems.},
acmid = {3342600},
doi = {10.1145/3331545.3342600},
isbn = {978-1-4503-6813-1},
keywords = {conf},
location = {Berlin, Germany},
numpages = {16},
url = {http://doi.acm.org/10.1145/3331545.3342600}
}

Downloads

1908_Ertel_Haskell [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2476


SHRIMP: Efficient Instruction Delivery with Domain Wall Memory

Reference

Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6pp, New York, NY, USA, Jul 2019. [doi]

Bibtex

@InProceedings{multanen_islped19,
author = {Joonas Multanen and Asif Ali Khan and Pekka J{\"a}{\"a}skel{\"a}inen and Fazal Hameed and Jeronimo Castrillon},
title = {{SHRIMP}: Efficient Instruction Delivery with Domain Wall Memory},
booktitle = {Proceedings of the International Symposium on Low Power Electronics and Design},
year = {2019},
month = jul,
series = {ISLPED '19},
location = {Lausanne, Switzerland},
pages = {6pp},
numpages = {6},
publisher = {ACM},
address = {New York, NY, USA},
doi={10.1109/ISLPED.2019.8824954},
url = {https://ieeexplore.ieee.org/document/8824954},
}

Downloads

1907_Multanen_ISLPED [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2452


On Compact Mappings for Multicore Systems

Reference

Andrés Goens, Christian Menard, Jeronimo Castrillon, "On Compact Mappings for Multicore Systems", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS) (D. Pnevmatikatos and M. Pelcat and M. Jung), Springer, Cham, vol. 11733, pp. 325–335, Jul 2019. [doi]

Bibtex

@InProceedings{goens_samos19,
author = {Andr{\'e}s Goens and Christian Menard and Jeronimo Castrillon},
title = {On Compact Mappings for Multicore Systems},
booktitle = {Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS)},
year = {2019},
editor = {D. Pnevmatikatos and M. Pelcat and M. Jung},
volume = {11733},
pages = {325--335},
month = jul,
organization = {IEEE},
publisher = {Springer},
address = {Cham},
doi = {10.1007/978-3-030-27562-4_23},
isbn = {978-3-030-27561-7},
location = {Pythagorion, Greece},
numpages = {11},
url = {https://link.springer.com/chapter/10.1007/978-3-030-27562-4_23}
}

Downloads

1907_Goens_SAMOS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2456


Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads

Reference

Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019. [doi]

Abstract
Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24% and 74% respectively compared to an iso-capacity SRAM.
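How loop order interacts with data layout can be sketched for the simplest contraction, a matrix product C[i][j] += A[i][k] * B[k][j]. The sketch only tracks accesses to B, assumes that B is stored row-major on a single track with one access port, and charges the absolute offset distance between consecutive accesses as shifts. This cost model and the helper names are assumptions for illustration, not the method evaluated in the paper.

```python
def offsets_of_b(n, order):
    """Row-major offsets of B[k][j] touched by C[i][j] += A[i][k] * B[k][j].

    order is "ijk" (k is the innermost loop) or "ikj" (j is the innermost loop).
    """
    trace = []
    for i in range(n):
        for x in range(n):
            for y in range(n):
                # Map the two inner loop variables to (k, j) for this order.
                k, j = (y, x) if order == "ijk" else (x, y)
                trace.append(k * n + j)
    return trace


def shifts(trace):
    """Shifts for a trace, assuming one track and |a - b| shifts per move."""
    return sum(abs(b - a) for a, b in zip(trace, trace[1:]))


n = 4
ijk = shifts(offsets_of_b(n, "ijk"))
ikj = shifts(offsets_of_b(n, "ikj"))
```

For n = 4 the "ikj" order walks B sequentially and needs 105 shifts under these assumptions, while the "ijk" order strides through B column by column and needs 369.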

Bibtex

@InProceedings{khan_lctes19,
author = {Asif Ali Khan and Norman A. Rink and Fazal Hameed and Jeronimo Castrillon},
title = {Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads},
booktitle = {Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES)},
series = {LCTES 2019},
pages = {5--18},
numpages = {14},
isbn = {978-1-4503-6724-0},
doi = {10.1145/3316482.3326351},
url = {http://doi.acm.org/10.1145/3316482.3326351},
acmid = {3326351},
year = {2019},
month = jun,
location = {Phoenix, AZ, USA},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Tensor contraction is a fundamental operation in many algorithms with a plethora of applications ranging from quantum chemistry over fluid dynamics and image processing to machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial, in order to meet energy constraints. This work aims at investigating strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as the loop access order and data layout transformations paired with architectural optimizations such as prefetching and preshifting are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve the SPM performance and energy consumption by 24\% and 74\% respectively compared to an iso-capacity SRAM.},
}

Downloads

1906_Khan_LCTES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2419


TeIL: a type-safe imperative Tensor Intermediate Language

Reference

Norman A. Rink, Jeronimo Castrillon, "TeIL: a type-safe imperative Tensor Intermediate Language", Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, pp. 57–68, New York, NY, USA, Jun 2019. [doi]

Abstract
Each of the popular tensor frameworks from the machine learning domain comes with its own language for expressing tensor kernels. Since these tensor languages lack precise specifications, it is impossible to understand and reason about tensor kernels that exhibit unexpected behaviour. In this paper, we give examples of such kernels.
The tensor languages are superficially similar to the well-known functional array languages, for which formal definitions often exist. However, the tensor languages are inherently imperative. In this paper we present TeIL, an imperative tensor intermediate language with precise formal semantics. For the popular tensor languages, TeIL can serve as a common ground on the basis of which precise reasoning about kernels becomes possible. Based on TeIL's formal semantics we develop a type-safety result in the Coq proof assistant.

Bibtex

@InProceedings{rink_array19,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {{TeIL}: a type-safe imperative {Tensor Intermediate Language}},
booktitle = {Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY)},
year = {2019},
series = {ARRAY 2019},
pages = {57--68},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
doi = {10.1145/3315454.3329959},
url = {http://doi.acm.org/10.1145/3315454.3329959},
acmid = {3329959},
isbn = {978-1-4503-6717-2},
location = {Phoenix, AZ, USA},
numpages = {12},
abstract = {Each of the popular tensor frameworks from the machine learning domain comes with its own language for expressing tensor kernels. Since these tensor languages lack precise specifications, it is impossible to understand and reason about tensor kernels that exhibit unexpected behaviour. In this paper, we give examples of such kernels.
The tensor languages are superficially similar to the well-known functional array languages, for which formal definitions often exist. However, the tensor languages are inherently imperative. In this paper we present TeIL, an imperative tensor intermediate language with precise formal semantics. For the popular tensor languages, TeIL can serve as a common ground on the basis of which precise reasoning about kernels becomes possible. Based on TeIL's formal semantics we develop a type-safety result in the Coq proof assistant.},
}

Downloads

1906_Rink_Array [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2449

Andrés Goens, Alexander Brauckmann, Sebastian Ertel, Chris Cummins, Hugh Leather, Jeronimo Castrillon, "A Case Study on Machine Learning for Synthesizing Benchmarks", Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), ACM, pp. 38–46, New York, NY, USA, Jun 2019.

A Case Study on Machine Learning for Synthesizing Benchmarks

Abstract
Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.

Bibtex

@InProceedings{goens_mapl19,
author = {Andr{\'{e}}s Goens and Alexander Brauckmann and Sebastian Ertel and Chris Cummins and Hugh Leather and Jeronimo Castrillon},
title = {A Case Study on Machine Learning for Synthesizing Benchmarks},
booktitle = {Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL)},
year = {2019},
series = {MAPL 2019},
doi = {10.1145/3315508.3329976},
url = {http://doi.acm.org/10.1145/3315508.3329976},
acmid = {3329976},
isbn = {978-1-4503-6719-6},
pages = {38--46},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
keywords = {conf},
location = {Phoenix, AZ, USA},
numpages = {9},
abstract = {Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.},
}

Downloads

1906_Goens_MAPL [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2450

Jeronimo Castrillon, "SoC programming in the era of the Internet of Things, machine learning and emerging technologies", In Groupement De Recherche SOC2: System On Chip, Systèmes embarqués et Objets Connectés (keynote), Jun 2019.

SoC programming in the era of the Internet of Things, machine learning and emerging technologies

Abstract
The design of a system on chip has traditionally been one of the most complex tasks in computing
systems. Designers have to deal with stringent application constraints under a low-power budget while
reducing non-recurring engineering costs. Modeling languages, cost models of hardware, system
simulators, design exploration methodologies and the like have made it possible to cope with this high
complexity. Today, three recent trends represent a non-trivial complexity increase and thus a challenge
for SoC designers and programmers, namely, 1) additional system dynamics in the context of the
Internet of Things, 2) the ubiquity of machine learning workloads, and 3) the added complexity brought
by specialization and emerging technologies. This talk discusses how models and higher-level
programming abstractions can be leveraged to cope with these trends. A dataflow programming
methodology is extended to account for dynamic execution scenarios at runtime. A tensor abstraction,
common in machine learning, is introduced that eases programming and design tasks. Finally, the
talk shows how the tensor abstraction is useful to efficiently map tensor computations to SoCs with
non-volatile racetrack scratch-pad memory.

Bibtex

@Misc{castrillon_gdrsoc2019,
author = {Castrillon, Jeronimo},
title = {{SoC} programming in the era of the {Internet of Things}, machine learning and emerging technologies},
howpublished = {Groupement De Recherche SOC2: System On Chip, Syst{\`e}mes embarqu{\'e}s et Objets Connect{\'e}s (keynote)},
month = jun,
year = {2019},
abstract = {The design of a system on chip has traditionally been one of the most complex tasks in computing
systems. Designers have to deal with stringent application constraints under a low-power budget while
reducing non-recurring engineering costs. Modeling languages, cost models of hardware, system
simulators, design exploration methodologies and the like have made it possible to cope with this high
complexity. Today, three recent trends represent a non-trivial complexity increase and thus a challenge
for SoC designers and programmers, namely, 1) additional system dynamics in the context of the
Internet of Things, 2) the ubiquity of machine learning workloads, and 3) the added complexity brought
by specialization and emerging technologies. This talk discusses how models and higher-level
programming abstractions can be leveraged to cope with these trends. A dataflow programming
methodology is extended to account for dynamic execution scenarios at runtime. A tensor abstraction,
common in machine learning, is introduced that eases programming and design tasks. Finally, the
talk shows how the tensor abstraction is useful to efficiently map tensor computations to SoCs with
non-volatile racetrack scratch-pad memory.},
location = {Montpellier, France},
url = {http://www.gdr-soc.cnrs.fr/programme-colloque-2019/}
}

Downloads

190620_castrillon_gdrsoc2-lowres [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2475

Sebastian Ertel, Justus Adam, Norman A. Rink, Andrés Goens, Jeronimo Castrillon, "Category-Theoretic Foundations of “STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism”", In CoRR, vol. abs/1906.12098, Jun 2019.

Category-Theoretic Foundations of “STCLang: State Thread Composition as a Foundation for Monadic Dataflow Parallelism”

Bibtex

@Article{ertel_haskellsup19,
author = {Sebastian Ertel and Justus Adam and Norman A. Rink and Andr{\'{e}}s Goens and Jeronimo Castrillon},
title = {Category-Theoretic Foundations of ``{STCLang}: State Thread Composition as a Foundation for Monadic Dataflow Parallelism''},
journal = {CoRR},
year = {2019},
volume = {abs/1906.12098},
month = jun,
archiveprefix = {arXiv},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1906-12098},
eprint = {1906.12098},
url = {http://arxiv.org/abs/1906.12098}
}

Downloads

1906_Ertel_Haskellsupp [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2486

Jeronimo Castrillon, "Programming abstractions: When domain-specific goes mainstream", In 28th Workshop of the Gesellschaft für Informatik, interest group Parallele Algorithmen, Rechenstrukturen und Systemsoftware (PARS'19) (invited talk), Mar 2019.

Programming abstractions: When domain-specific goes mainstream

Abstract
We have seen several inflection points in computing in this century: from single to multi-core,
from homogeneous to heterogeneous, and in the near future to fundamentally new computing
paradigms with emerging technologies. With domain-specific hardware becoming mainstream,
and programming reaching out to professions outside computer science, higher-level
programming abstractions, domain-specific languages (DSLs) and tools are badly needed. This
talk provides examples of such programming abstractions as a basis for discussion. It first
discusses how dataflow programming models from the embedded domain can be leveraged in
more general purpose setups. It then presents DSLs for particle-based simulations and for tensor
expressions. The latter is one example of multiple tensor DSLs available today, spawned
by the recent machine learning boom. The talk closes with examples of emerging technologies
and a brief discussion about how they may impact our current assumptions.

Bibtex

@Misc{castrillon_pars2019,
author = {Castrillon, Jeronimo},
title = {Programming abstractions: When domain-specific goes mainstream},
howpublished = {28th Workshop of the Gesellschaft f{\"u}r Informatik, interest group Parallele Algorithmen, Rechenstrukturen und Systemsoftware (PARS'19) (invited talk)},
month = mar,
year = {2019},
abstract = {We have seen several inflection points in computing in this century: from single to multi-core,
from homogeneous to heterogeneous, and in the near future to fundamentally new computing
paradigms with emerging technologies. With domain-specific hardware becoming mainstream,
and programming reaching out to professions outside computer science, higher-level
programming abstractions, domain-specific languages (DSLs) and tools are badly needed. This
talk provides examples of such programming abstractions as a basis for discussion. It first
discusses how dataflow programming models from the embedded domain can be leveraged in
more general purpose setups. It then presents DSLs for particle-based simulations and for tensor
expressions. The latter is one example of multiple tensor DSLs available today, spawned
by the recent machine learning boom. The talk closes with examples of emerging technologies
and a brief discussion about how they may impact our current assumptions.},
location = {Berlin, Germany},
url = {https://fg-pars.gi.de/veranstaltung/pars-workshop-2019/}
}

Downloads

190321_castrill_PARS-sent [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2455

Gerhard Fettweis, Meik Dörpinghaus, Jeronimo Castrillon, Akash Kumar, Christel Baier, Karlheinz Bock, Frank Ellinger, Andreas Fery, Frank H. P. Fitzek, Hermann Härtig, Kambiz Jamshidi, Thomas Kissinger, Wolfgang Lehner, Michael Mertig, Wolfgang E. Nagel, Giang T. Nguyen, Dirk Plettemeier, Michael Schröter, Thorsten Strufe, "Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing", In Proceedings of the IEEE, vol. 107, no. 1, pp. 204–231, Jan 2019.

Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing

Bibtex

@Article{fettweis_ieeeproc19,
author = {Gerhard Fettweis and Meik D{\"o}rpinghaus and Jeronimo Castrillon and Akash Kumar and Christel Baier and Karlheinz Bock and Frank Ellinger and Andreas Fery and Frank H. P. Fitzek and Hermann H{\"a}rtig and Kambiz Jamshidi and Thomas Kissinger and Wolfgang Lehner and Michael Mertig and Wolfgang E. Nagel and Giang T. Nguyen and Dirk Plettemeier and Michael Schr{\"o}ter and Thorsten Strufe},
title = {Architecture and Advanced Electronics Pathways Toward Highly Adaptive Energy-Efficient Computing},
journal = {Proceedings of the IEEE},
year = {2019},
volume = {107},
number = {1},
pages = {204--231},
month = jan,
doi = {10.1109/JPROC.2018.2874895},
issn = {0018-9219},
url = {https://ieeexplore.ieee.org/document/8565890}
}

Downloads

1812_Fettweis_IEEEProc [PDF]

Related Paths
HAEC

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2246

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories", In IEEE Computer Architecture Letters, IEEE, vol. 18, no. 1, pp. 43–46, Jan 2019.

RTSim: A Cycle-accurate Simulator for Racetrack Memories

Bibtex

@Article{khan_ieeecal19,
author = {Asif Ali Khan and Fazal Hameed and Robin Bl{\"a}sing and Stuart Parkin and Jeronimo Castrillon},
title = {{RTS}im: A Cycle-accurate Simulator for Racetrack Memories},
journal = {IEEE Computer Architecture Letters},
year = {2019},
volume = {18},
number = {1},
pages = {43--46},
month = jan,
doi = {10.1109/LCA.2019.2899306},
issn = {1556-6056},
publisher = {IEEE},
url = {https://ieeexplore.ieee.org/document/8642352}
}

Downloads

1902_khan_IEEECAL [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2288

Hasna Bouraoui, Jeronimo Castrillon, Chadlia Jerad, "Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications", Proceedings of the 10th Workshop and 8th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'19), co-located with 14th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 4:1–4:6, New York, NY, USA, Jan 2019.

Comparing Dataflow and OpenMP Programming for Speaker Recognition Applications

Bibtex

@InProceedings{bouraoui_parma19,
author = {Hasna Bouraoui and Jeronimo Castrillon and Chadlia Jerad},
title = {Comparing Dataflow and {OpenMP} Programming for Speaker Recognition Applications},
booktitle = {Proceedings of the 10th Workshop and 8th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'19), co-located with 14th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2019},
series = {PARMA-DITAM 2019},
pages = {4:1--4:6},
articleno = {4},
numpages = {6},
address = {New York, NY, USA},
month = jan,
publisher = {ACM},
isbn = {978-1-4503-6321-1},
url = {http://doi.acm.org/10.1145/3310411.3310417},
doi = {10.1145/3310411.3310417},
acmid = {3310417},
location = {Valencia, Spain}
}

Downloads

1901_Bouraoui_PARMA [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2249


2018
Adilla Susungi, Norman A. Rink, Albert Cohen, Jeronimo Castrillon, Claude Tadonki, "Meta-programming for Cross-Domain Tensor Optimizations", Proceedings of 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'18), ACM, pp. 79–92, New York, NY, USA, Nov 2018.

Meta-programming for Cross-Domain Tensor Optimizations

Bibtex

@InProceedings{rink_gpce18,
author = {Adilla Susungi and Norman A. Rink and Albert Cohen and Jeronimo Castrillon and Claude Tadonki},
title = {Meta-programming for Cross-Domain Tensor Optimizations},
booktitle = {Proceedings of 17th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'18)},
year = {2018},
series = {GPCE 2018},
pages = {79--92},
numpages = {14},
address = {New York, NY, USA},
month = nov,
publisher = {ACM},
keywords = {conf},
location = {Boston, MA, USA},
isbn = {978-1-4503-6045-6},
url = {http://doi.acm.org/10.1145/3278122.3278131},
doi = {10.1145/3278122.3278131},
acmid = {3278131},
}

Downloads

1811_Rink_GPCE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2206

Jeronimo Castrillon, "Parallel programming methodologies for manycores", In NeXtream Solution Seminar & Silexica Technology Workshop (invited talk), Oct 2018.

Parallel programming methodologies for manycores

Bibtex

@Misc{castrillon_neXtream2018,
author = {Castrillon, Jeronimo},
title = {Parallel programming methodologies for manycores},
howpublished = {NeXtream Solution Seminar \& Silexica Technology Workshop (invited talk)},
month = oct,
year = {2018},
keywords = {invitedtalk},
location = {Tokyo, Japan},
url = {https://nextream.bz/nss/2018/?page_id=29}
}

Downloads

181017_castrill_slx-tokyo_sent-2 [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2241

Rainer Leupers, Miguel A. Aguilar, Jeronimo Castrillon, Weihua Sheng, "Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems", Chapter in Handbook of Signal Processing Systems (3rd Edition) (Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, Jarmo Takala, eds.), Springer New York, pp. 1021–1062, Sep 2018.

Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems

Abstract
The increasing demands of modern embedded systems, such as high-performance and energy-efficiency, have motivated the use of heterogeneous multi-core platforms enabled by Multiprocessor System-on-Chips (MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity to achieve a high productivity. An MPSoC compiler is a tool-chain to tackle the problems of application modeling, platform description, software parallelization, software distribution and code generation for an efficient usage of the target platform. This chapter discusses various aspects of compilers for heterogeneous embedded multi-core systems, using the well-established single-core C compiler technology as a baseline for comparison. After a brief introduction to the MPSoC compiler technology, the important ingredients of the compilation process are explained in detail. Finally, a number of case studies from academia and industry are presented to illustrate the concepts discussed in this chapter.

Bibtex

@InCollection{leupers18_spschapter,
author = {Leupers, Rainer and Aguilar, Miguel A. and Castrillon, Jeronimo and Sheng, Weihua},
title = {Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems},
booktitle = {Handbook of Signal Processing Systems (3rd Edition)},
publisher = {Springer New York},
year = {2018},
month = sep,
editor = {Bhattacharyya, Shuvra S. and Deprettere, Ed F. and Leupers, Rainer and Takala, Jarmo},
pages = {1021--1062},
abstract = {The increasing demands of modern embedded systems, such as high-performance and energy-efficiency, have motivated the use of heterogeneous multi-core platforms enabled by Multiprocessor System-on-Chips (MPSoCs). To fully exploit the power of these platforms, new tools are needed to address the increasing software complexity to achieve a high productivity. An MPSoC compiler is a tool-chain to tackle the problems of application modeling, platform description, software parallelization, software distribution and code generation for an efficient usage of the target platform. This chapter discusses various aspects of compilers for heterogeneous embedded multi-core systems, using the well-established single-core C compiler technology as a baseline for comparison. After a brief introduction to the MPSoC compiler technology, the important ingredients of the compilation process are explained in detail. Finally, a number of case studies from academia and industry are presented to illustrate the concepts discussed in this chapter.},
doi = {10.1007/978-3-319-91734-4_28},
isbn = {978-3-319-91733-7},
url = {https://link.springer.com/chapter/10.1007/978-3-319-91734-4_28},
}

Downloads

1809_Leupers_SPSBookChapter [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1784

Andrés Goens, Christian Menard, Jeronimo Castrillon, "On the Representation of Mappings to Multicores", Proceedings of the IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-18), pp. 184–191, Vietnam National University, Hanoi, Vietnam, Sep 2018.

On the Representation of Mappings to Multicores

Abstract
Application requirements for embedded systems are growing rapidly, as is the complexity of systems designed to execute them. A common abstraction used to tame this growing complexity is that of a mapping, which assigns parts of an application to different hardware resources. Modern flows need to explore an intractably large design space of mappings, and be able to quickly find near-optimal mappings for different objectives, sometimes at runtime. With systems featuring thousands of cores in the near horizon, we need methods to make this exploration step truly scalable. In this paper we argue that the mathematical representation of a mapping is central to achieve this. We present different representations and how these could be applied to different contexts and objectives, like complex design-space exploration meta-heuristics or efficient runtime systems.

Bibtex

@InProceedings{goen_mcsoc18,
author = {Andr{\'{e}}s Goens and Christian Menard and Jeronimo Castrillon},
title = {On the Representation of Mappings to Multicores},
booktitle = {Proceedings of the IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-18)},
year = {2018},
address = {Vietnam National University, Hanoi, Vietnam},
month = sep,
pages = {184--191},
doi = {10.1109/MCSoC2018.2018.00039},
url = {https://ieeexplore.ieee.org/document/8540232},
isbn = {978-1-5386-6689-0},
abstract = {Application requirements for embedded systems are growing rapidly, as is the complexity of systems designed to execute them. A common abstraction used to tame this growing complexity is that of a mapping, which assigns parts of an application to different hardware resources. Modern flows need to explore an intractably large design space of mappings, and be able to quickly find near-optimal mappings for different objectives, sometimes at runtime. With systems featuring thousands of cores in the near horizon, we need methods to make this exploration step truly scalable. In this paper we argue that the mathematical representation of a mapping is central to achieve this. We present different representations and how these could be applied to different contexts and objectives, like complex design-space exploration meta-heuristics or efficient runtime systems.},
}

Downloads

1809_Goens_MCSoC [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2152

Jeronimo Castrillon, Matthias Lieber, Sascha Klüppelholz, Marcus Völp, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andrés Goens, Sebastian Haas, Dirk Habich, Hermann Härtig, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Akash Kumar, Wolfgang Lehner, Linda Leuschner, Siqi Ling, Steffen Märcker, Christian Menard, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, Sascha Wunderlich, "A Hardware/Software Stack for Heterogeneous Systems", In IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 243–259, Jul 2018.

A Hardware/Software Stack for Heterogeneous Systems

Abstract
Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.

Bibtex

@Article{castrillon_tmscs17,
author = {Jeronimo Castrillon and Matthias Lieber and Sascha Kl{\"u}ppelholz and Marcus V{\"o}lp and Nils Asmussen and Uwe Assmann and Franz Baader and Christel Baier and Gerhard Fettweis and Jochen Fr{\"o}hlich and Andr{\'{e}}s Goens and Sebastian Haas and Dirk Habich and Hermann H{\"a}rtig and Mattis Hasler and Immo Huismann and Tomas Karnagel and Sven Karol and Akash Kumar and Wolfgang Lehner and Linda Leuschner and Siqi Ling and Steffen M{\"a}rcker and Christian Menard and Johannes Mey and Wolfgang Nagel and Benedikt N{\"o}then and Rafael Pe{\~n}aloza and Michael Raitza and J{\"o}rg Stiller and Annett Ungeth{\"u}m and Axel Voigt and Sascha Wunderlich},
title = {A Hardware/Software Stack for Heterogeneous Systems},
journal = {IEEE Transactions on Multi-Scale Computing Systems},
year = {2018},
month = jul,
volume = {4},
number = {3},
pages = {243--259},
abstract = {Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.},
doi = {10.1109/TMSCS.2017.2771750},
issn = {2332-7766},
url = {http://ieeexplore.ieee.org/document/8103042/}
}

Downloads

1711_Castrillon_TMSCS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1597

Fazal Hameed, Asif Ali Khan, Jeronimo Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache", In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.

Performance and Energy Efficient Design of STT-RAM Last-Level-Cache

Abstract
Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75% and improve the system performance by 6.5%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82% and improves system performance by 6.8%.

Bibtex

@Article{hameed_tvlsi18,
author = {Fazal Hameed and Asif Ali Khan and Jeronimo Castrillon},
title = {Performance and Energy Efficient Design of STT-RAM Last-Level-Cache},
journal = {IEEE Transactions on Very Large Scale Integration Systems (TVLSI)},
year = {2018},
volume = {26},
number = {6},
pages = {1059--1072},
month = jun,
abstract = {Recent research has proposed having a die-stacked last-level cache (LLC) to overcome the memory wall. Lately, spin-transfer-torque random access memory (STT-RAM) caches have received attention, since they provide improved energy efficiency compared with DRAM caches. However, recently proposed STT-RAM cache architectures unnecessarily dissipate energy by fetching unneeded cache lines (CLs) into the row buffer (RB). In this paper, we propose a selective read policy for the STT-RAM which fetches those CLs into the RB that are likely to be reused. In addition, we propose a tags-update policy that reduces the number of STT-RAM writebacks. This reduces the number of reads/writes and thereby decreases the energy consumption. To reduce the latency penalty of our selective read policy, we propose the following performance optimizations: 1) an RB tags-bypass policy that reduces STT-RAM access latency; 2) an LLC data cache that stores the CLs that are likely to be used in the near future; 3) an address organization scheme that simultaneously reduces LLC access latency and miss rate; and 4) a tags-to-column mapping policy that improves access parallelism. For evaluation, we implement our proposed architecture in the Zesto simulator and run different combinations of SPEC2006 benchmarks on an eight-core system. We compare our approach with a recently proposed STT-RAM LLC with subarray parallelism support and show that our synergistic policies reduce the average LLC dynamic energy consumption by 75\% and improve the system performance by 6.5\%. Compared with the state-of-the-art DRAM LLC with subarray parallelism, our architecture reduces the LLC dynamic energy consumption by 82\% and improves system performance by 6.8\%.},
doi = {10.1109/TVLSI.2018.2804938},
issn = {1063-8210},
numpages = {14},
url = {http://ieeexplore.ieee.org/document/8307465/}
}

Downloads

1803_Hameed_TVLSI [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2099

Jeronimo Castrillon, "Heterogeneous Post-CMOS Technologies Meet Software", In Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk), Jun 2018.

Heterogeneous Post-CMOS Technologies Meet Software

Reference

Jeronimo Castrillon, "Heterogeneous Post-CMOS Technologies Meet Software", In Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk), Jun 2018.

Bibtex

@Misc{castrillon2018ISC,
author = {Castrillon, Jeronimo},
title = {Heterogeneous Post-CMOS Technologies Meet Software},
howpublished = {Post Moore Interconnects Workshop, ISC High Performance 2018 (invited talk)},
month = jun,
year = {2018},
keywords = {invitedtalk},
location = {Frankfurt, Germany},
url = {https://beyondcmos.ornl.gov/2018/agenda.html}
}

Downloads

180628_castrillon_isc-postmoore_send [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2151

Jeronimo Castrillon, "Parallel programming: Current and future systems", In 50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk), May 2018.

Parallel programming: Current and future systems

Reference

Jeronimo Castrillon, "Parallel programming: Current and future systems", In 50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk), May 2018.

Bibtex

@Misc{castrillon2018UdeA,
author = {Castrillon, Jeronimo},
title = {Parallel programming: Current and future systems},
howpublished = {50-year Celebration: Department of Electronics, Universidad de Antioquia, in the context of the IEEE Colombian Conference on Communications and Computing (COLCOM'18) (invited talk)},
month = may,
year = {2018},
location = {Universidad de Antioquia, Medell{\'i}n, Colombia}
}

Downloads

180516_castrillon_50years_EE_UdeA [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2137

Sven Karol, Tobias Nett, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Domain-Specific Language and Editor for Parallel Particle Methods", In ACM Transactions on Mathematical Software (TOMS), ACM, vol. 44, no. 3, pp. 34:1–34:32, New York, NY, USA, Mar 2018. [doi]

A Domain-Specific Language and Editor for Parallel Particle Methods

Reference

Sven Karol, Tobias Nett, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Domain-Specific Language and Editor for Parallel Particle Methods", In ACM Transactions on Mathematical Software (TOMS), ACM, vol. 44, no. 3, pp. 34:1–34:32, New York, NY, USA, Mar 2018. [doi]

Abstract
Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing DSLs is not easy, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL comfortable for programmers. Some of these features –e.g., syntax highlighting, type inference, error reporting– are easily provided by language workbenches, which combine language engineering techniques and tools in a common ecosystem. In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL and development environment for numerical simulations based on particle methods and hybrid particle-mesh methods. PPME uses the Meta Programming System (MPS), a projectional language workbench. PPME is the successor of the Parallel Particle-Mesh Language, a Fortran-based DSL that uses conventional implementation strategies. We analyze and compare both languages and demonstrate how the programmer’s experience is improved using static analyses and projectional editing, i.e., code-structure editing, constrained by syntax, as opposed to free-text editing. We present an explicit domain model for particle abstractions and the first formal type system for particle methods.

Bibtex

@Article{karol_toms18,
author = {Karol, Sven and Nett, Tobias and Castrillon, Jeronimo and Sbalzarini, Ivo F.},
title = {A Domain-Specific Language and Editor for Parallel Particle Methods},
journal = {ACM Transactions on Mathematical Software (TOMS)},
issue_date = {March 2018},
volume = {44},
number = {3},
month = mar,
year = {2018},
issn = {0098-3500},
pages = {34:1--34:32},
articleno = {34},
numpages = {32},
url = {http://doi.acm.org/10.1145/3175659},
doi = {10.1145/3175659},
acmid = {3175659},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {
Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing DSLs is not easy, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL comfortable for programmers. Some of these features --e.g., syntax highlighting, type inference, error reporting-- are easily provided by language workbenches, which combine language engineering techniques and tools in a common ecosystem. In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL and development environment for numerical simulations based on particle methods and hybrid particle-mesh methods. PPME uses the Meta Programming System (MPS), a projectional language workbench. PPME is the successor of the Parallel Particle-Mesh Language, a Fortran-based DSL that uses conventional implementation strategies. We analyze and compare both languages and demonstrate how the programmer’s experience is improved using static analyses and projectional editing, i.e., code-structure editing, constrained by syntax, as opposed to free-text editing. We present an explicit domain model for particle abstractions and the first formal type system for particle methods.},
}

Downloads

1709_Karol_TOMS-arxiv [PDF]

Related Paths
Biological Systems Path, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1422

Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018.

STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement

Reference

Fazal Hameed, Jeronimo Castrillon, "STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement", Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018), Mar 2018.

Bibtex

@InProceedings{hameed_nvmw18,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {STT-RAM Aware Last-Level-Cache Policies for Simultaneous Energy and Performance Improvement},
booktitle = {Proceedings of the 9th Annual Non-Volatile Memories Workshop (NVMW 2018)},
year = {2018},
month = mar,
location = {San Diego, CA, USA},
numpages = {2}
}

Downloads

1803_Hameed_NVMW [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1795

Sebastian Ertel, Andrés Goens, Justus Adam, Jeronimo Castrillon, "Compiling for Concise Code and Efficient I/O", Proceedings of the 27th International Conference on Compiler Construction (CC 2018), ACM, pp. 104–115, New York, NY, USA, Feb 2018. [doi]

Compiling for Concise Code and Efficient I/O

Reference

Sebastian Ertel, Andrés Goens, Justus Adam, Jeronimo Castrillon, "Compiling for Concise Code and Efficient I/O", Proceedings of the 27th International Conference on Compiler Construction (CC 2018), ACM, pp. 104–115, New York, NY, USA, Feb 2018. [doi]

Abstract
Large infrastructures of Internet companies, such as Facebook and Twitter, are composed of several layers of micro-services. While this modularity provides scalability to the system, the I/O associated with each service request strongly impacts its performance. In this context, writing concise programs which execute I/O efficiently is especially challenging. In this paper, we introduce Ÿauhau, a novel compile-time solution. Ÿauhau reduces the number of I/O calls through rewrites on a simple expression language. To execute I/O concurrently, it lowers the expression language to a dataflow representation. Our approach can be used alongside an existing programming language, permitting the use of legacy code. We describe an implementation in the JVM and use it to evaluate our approach. Experiments show that Ÿauhau can significantly improve I/O, both in terms of the number of I/O calls and concurrent execution. Ÿauhau outperforms state-of-the-art approaches with similar goals.

Bibtex

@InProceedings{ertel_cc18,
author = {Sebastian Ertel and Andr\'{e}s Goens and Justus Adam and Jeronimo Castrillon},
title = {Compiling for Concise Code and Efficient I/O},
booktitle = {Proceedings of the 27th International Conference on Compiler Construction (CC 2018)},
series = {CC 2018},
year = {2018},
month = feb,
location = {Vienna, Austria},
publisher = {ACM},
numpages = {12},
pages = {104--115},
doi = {10.1145/3178372.3179505},
url = {https://dl.acm.org/citation.cfm?id=3179505},
acmid = {3179505},
address = {New York, NY, USA},
abstract = {Large infrastructures of Internet companies, such as Facebook and Twitter, are composed of several layers of micro-services. While this modularity provides scalability to the system, the I/O associated with each service request strongly impacts its performance. In this context, writing concise programs which execute I/O efficiently is especially challenging. In this paper, we introduce Ÿauhau, a novel compile-time solution. Ÿauhau reduces the number of I/O calls through rewrites on a simple expression language. To execute I/O concurrently, it lowers the expression language to a dataflow representation. Our approach can be used alongside an existing programming language, permitting the use of legacy code. We describe an implementation in the JVM and use it to evaluate our approach. Experiments show that Ÿauhau can significantly improve I/O, both in terms of the number of I/O calls and concurrent execution. Ÿauhau outperforms state-of-the-art approaches with similar goals.},
}

Downloads

cc-2018-slides [PDF]

1802_Ertel_CC [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1793

Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Supporting Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), ACM, pp. 41–50, New York, NY, USA, Feb 2018. [doi]

Supporting Fine-grained Dataflow Parallelism in Big Data Systems

Reference

Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Supporting Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), ACM, pp. 41–50, New York, NY, USA, Feb 2018. [doi]

Abstract
Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in data parallel fashion. It has been recently reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into a higher throughput. This is particularly the case for applications that have inherent sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause for these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicit parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.

Bibtex

@InProceedings{ertel_pmam18,
author = {Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {Supporting Fine-grained Dataflow Parallelism in Big Data Systems},
booktitle = {Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM)},
year = {2018},
series = {PMAM'18},
address = {New York, NY, USA},
month = feb,
publisher = {ACM},
doi = {10.1145/3178442.3178447},
isbn = {978-1-4503-5645-9},
location = {Vienna, Austria},
pages = {41--50},
numpages = {10},
acmid = {3178447},
url = {http://doi.acm.org/10.1145/3178442.3178447},
abstract = {Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in data parallel fashion. It has been recently reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into a higher throughput. This is particularly the case for applications that have inherent sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause for these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicit parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.},
}

Downloads

pmam-2018-slides [PDF]

1802_Ertel_PMAM [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1794

Norman A. Rink, Immo Huismann, Adilla Susungi, Jeronimo Castrillon, Jörg Stiller, Jochen Fröhlich, Claude Tadonki, "CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics", Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018), ACM, pp. 5:1–5:10, New York, NY, USA, Feb 2018. [doi]

CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics

Reference

Norman A. Rink, Immo Huismann, Adilla Susungi, Jeronimo Castrillon, Jörg Stiller, Jochen Fröhlich, Claude Tadonki, "CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics", Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018), ACM, pp. 5:1–5:10, New York, NY, USA, Feb 2018. [doi]

Abstract
Numerical simulations continue to enable fast and enormous progress in science and engineering. Writing efficient numerical codes is a difficult challenge that encompasses a variety of tasks from designing the right algorithms to exploiting the full potential of a platform's architecture. Domain-specific languages (DSLs) can ease these tasks by offering the right abstractions for expressing numerical problems. With the aid of domain knowledge, efficient code can then be generated automatically from abstract expressions. In this work, we present the CFDlang DSL for expressing tensor operations that constitute the performance-critical code sections in a class of real numerical applications from fluid dynamics. We demonstrate that CFDlang can be used to generate code automatically that performs as well, if not better, than carefully hand-optimized code.

Bibtex

@InProceedings{rink_rwdsl18,
author = {Norman A. Rink and Immo Huismann and Adilla Susungi and Jeronimo Castrillon and J{\"o}rg Stiller and Jochen Fr{\"o}hlich and Claude Tadonki},
title = {CFDlang: High-level Code Generation for High-order Methods in Fluid Dynamics},
booktitle = {Proceedings of the 3rd International Workshop on Real World Domain Specific Languages (RWDSL 2018)},
year = {2018},
series = {RWDSL2018},
pages = {5:1--5:10},
address = {New York, NY, USA},
month = feb,
publisher = {ACM},
abstract = {Numerical simulations continue to enable fast and enormous progress in science and engineering. Writing efficient numerical codes is a difficult challenge that encompasses a variety of tasks from designing the right algorithms to exploiting the full potential of a platform's architecture. Domain-specific languages (DSLs) can ease these tasks by offering the right abstractions for expressing numerical problems. With the aid of domain knowledge, efficient code can then be generated automatically from abstract expressions. In this work, we present the CFDlang DSL for expressing tensor operations that constitute the performance-critical code sections in a class of real numerical applications from fluid dynamics. We demonstrate that CFDlang can be used to generate code automatically that performs as well, if not better, than carefully hand-optimized code.},
acmid = {3183900},
articleno = {5},
doi = {10.1145/3183895.3183900},
isbn = {978-1-4503-6355-6},
location = {Vienna, Austria},
numpages = {10},
url = {http://doi.acm.org/10.1145/3183895.3183900}
}

Downloads

1802_Rink_RWDSL [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2074

Hermann Härtig, Nils Asmussen, Jeronimo Castrillon, Adam Lackorzynski, Michael Roitzsch, Carsten Weinhold, Akash Kumar, "Extremely Heterogeneous Systems – Not Just For Niches", In Extreme Heterogeneity Workshop, Feb 2018.

Extremely Heterogeneous Systems – Not Just For Niches

Reference

Hermann Härtig, Nils Asmussen, Jeronimo Castrillon, Adam Lackorzynski, Michael Roitzsch, Carsten Weinhold, Akash Kumar, "Extremely Heterogeneous Systems – Not Just For Niches", In Extreme Heterogeneity Workshop, Feb 2018.

Bibtex

@InProceedings{haertig_ehw18,
author = {Hermann H{\"a}rtig and Nils Asmussen and Jeronimo Castrillon and Adam Lackorzynski and Michael Roitzsch and Carsten Weinhold and Akash Kumar},
title = {Extremely Heterogeneous Systems -- Not Just For Niches},
booktitle = {Extreme Heterogeneity Workshop},
year = {2018},
month = feb,
note = {(Workshop took place over remote conferencing)},
location = {Gaithersburg, MD, USA}
}

Downloads

1802_Haertig_EHW [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2097

Robert Khasanov, Andrés Goens, Jeronimo Castrillon, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap", Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 20–25, New York, NY, USA, Jan 2018. [doi]

Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap

Reference

Robert Khasanov, Andrés Goens, Jeronimo Castrillon, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap", Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 20–25, New York, NY, USA, Jan 2018. [doi]

Abstract
Modern embedded systems are rapidly increasing their complexity, both in terms of numbers of cores, as well as heterogeneity. To generate efficient code for these systems, it is common to leverage formal models of computation. Among these, the dataflow model of Kahn Process Networks (KPN) is widespread because it is expressive but guarantees a deterministic execution. However, the KPN model is ill-suited to expose data-level parallelism, since this has to be made explicit in the process network. This is aggravated by the fact that its most common execution model, Kahn-MacQueen, poses restrictive conditions on the scheduling of data-parallel processes, leading to an inefficient execution. In this paper we present a novel extension to the KPN model and a relaxed execution strategy that addresses this problem, while keeping the deterministic KPN semantics. It improves run-time adaptivity in a malleable way and provides implicit parallelism. We evaluate our approach on two architectures, improving the performance of a benchmark by up to 25.6% on an Intel chip with hyper-threading, and by up to 78.0% on a heterogeneous embedded ARM big.LITTLE architecture.

Bibtex

@InProceedings{khasanov_parma18,
author = {Robert Khasanov and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap},
booktitle = {Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {PARMA-DITAM '18},
isbn = {978-1-4503-6444-7},
pages = {20--25},
year = {2018},
month = jan,
numpages = {6},
url = {http://doi.acm.org/10.1145/3183767.3183790},
doi = {10.1145/3183767.3183790},
acmid = {3183790},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
abstract = {Modern embedded systems are rapidly increasing their complexity, both in terms of numbers of cores, as well as heterogeneity. To generate efficient code for these systems, it is common to leverage formal models of computation. Among these, the dataflow model of Kahn Process Networks (KPN) is widespread because it is expressive but guarantees a deterministic execution. However, the KPN model is ill-suited to expose data-level parallelism, since this has to be made explicit in the process network. This is aggravated by the fact that its most common execution model, Kahn-MacQueen, poses restrictive conditions on the scheduling of data-parallel processes, leading to an inefficient execution. In this paper we present a novel extension to the KPN model and a relaxed execution strategy that addresses this problem, while keeping the deterministic KPN semantics. It improves run-time adaptivity in a malleable way and provides implicit parallelism. We evaluate our approach on two architectures, improving the performance of a benchmark by up to 25.6\% on an Intel chip with hyper-threading, and by up to 78.0\% on a heterogeneous embedded ARM big.LITTLE architecture.},
}

Downloads

1801_Khasanov_PARMA-DITAM [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1781

Andrés Goens, Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers", Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2018.

Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers

Reference

Andrés Goens, Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers", Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Jan 2018.

Bibtex

@InProceedings{goens_multiprog18,
author = {Andr{\'e}s Goens and Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {Level Graphs: Generating Benchmarks for Concurrency Optimizations in Compilers},
booktitle = {Proceedings of the 11th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG'2018), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
year = {2018},
url = {http://research.ac.upc.edu/multiprog/multiprog2018/papers/MULTIPROG-2018_Goens.pdf},
month = jan,
location = {Manchester, United Kingdom}
}

Downloads

1801_Goens_MULTIRPOG [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1782

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]

NVMain Extension for Multi-Level Cache Systems

Reference

Asif Ali Khan, Fazal Hameed, Jeronimo Castrillon, "NVMain Extension for Multi-Level Cache Systems", Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 7:1–7:6, New York, NY, USA, Jan 2018. [doi]

Bibtex

@InProceedings{khan_rapido18,
author = {Asif Ali Khan and Fazal Hameed and Jeronimo Castrillon},
title = {NVMain Extension for Multi-Level Cache Systems},
booktitle = {Proceedings of the 10th RAPIDO Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)},
series = {RAPIDO '18},
year = {2018},
month = jan,
pages = {7:1--7:6},
articleno = {7},
numpages = {6},
url = {http://doi.acm.org/10.1145/3180665.3180672},
doi = {10.1145/3180665.3180672},
acmid = {3180672},
publisher = {ACM},
address = {New York, NY, USA},
location = {Manchester, United Kingdom},
isbn = {978-1-4503-6417-1},
}

Downloads

1801_Khan_RAPIDO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=2098


2017
Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi]

Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache

Reference

Fazal Hameed, Christian Menard, Jeronimo Castrillon, "Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache", Proceedings of the International Symposium on Memory Systems (MemSys'17), ACM, pp. 141–151, New York, NY, USA, Oct 2017. [doi]

Bibtex

@InProceedings{hameed_memsys17,
author = {Fazal Hameed and Christian Menard and Jeronimo Castrillon},
title = {Efficient STT-RAM Last-Level-Cache Architecture to replace DRAM Cache},
booktitle = {Proceedings of the International Symposium on Memory Systems (MemSys'17)},
series = {MEMSYS '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5335-9},
location = {Alexandria, Virginia},
pages = {141--151},
numpages = {11},
url = {http://doi.acm.org/10.1145/3132402.3132414},
doi = {10.1145/3132402.3132414},
acmid = {3132414},
publisher = {ACM},
address = {New York, NY, USA},
}

Downloads

1710_Hameed_Memsys [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1476

Miguel Angel Aguilar, Abhishek Aggarwal, Awaid Shaheen, Rainer Leupers, Gerd Ascheid, Jeronimo Castrillon, Liam Fitzpatrick, "Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress", Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), ACM, pp. 14:1–14:2, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]

Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress

Reference

Miguel Angel Aguilar, Abhishek Aggarwal, Awaid Shaheen, Rainer Leupers, Gerd Ascheid, Jeronimo Castrillon, Liam Fitzpatrick, "Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress", Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), ACM, pp. 14:1–14:2, New York, NY, USA, Oct 2017. [doi]

Bibtex

@inproceedings{aguilar_cases17,
author = {Aguilar, Miguel Angel and Aggarwal, Abhishek and Shaheen, Awaid and Leupers, Rainer and Ascheid, Gerd and Castrillon, Jeronimo and Fitzpatrick, Liam},
title = {Multi-grained Performance Estimation for MPSoC Compilers: Work-in-progress},
booktitle = {Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES)},
series = {CASES '17},
year = {2017},
month = oct,
isbn = {978-1-4503-5184-3},
location = {Seoul, Republic of Korea},
pages = {14:1--14:2},
articleno = {14},
numpages = {2},
url = {http://doi.acm.org/10.1145/3125501.3125521},
doi = {10.1145/3125501.3125521},
acmid = {3125521},
publisher = {ACM},
address = {New York, NY, USA},
}

Downloads

1710_Aguilar_CASES [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1498

Adilla Susungi, Norman A. Rink, Jeronimo Castrillon, Immo Huismann, Albert Cohen, Claude Tadonki, Jörg Stiller, Jochen Fröhlich, "Towards Compositional and Generative Tensor Optimizations", Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17), ACM, pp. 169–175, New York, NY, USA, Oct 2017. [doi] [Bibtex & Downloads]

Towards Compositional and Generative Tensor Optimizations

Reference

Adilla Susungi, Norman A. Rink, Jeronimo Castrillon, Immo Huismann, Albert Cohen, Claude Tadonki, Jörg Stiller, Jochen Fröhlich, "Towards Compositional and Generative Tensor Optimizations", Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17), ACM, pp. 169–175, New York, NY, USA, Oct 2017. [doi]

Bibtex

@InProceedings{rink_gpce17,
author = {Adilla Susungi and Norman A. Rink and Jeronimo Castrillon and Immo Huismann and Albert Cohen and Claude Tadonki and J{\"o}rg Stiller and Jochen Fr{\"o}hlich},
title = {Towards Compositional and Generative Tensor Optimizations},
booktitle = {Proceedings of 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE'17)},
series = {GPCE 2017},
year = {2017},
pages = {169--175},
month = oct,
isbn = {978-1-4503-5524-7},
location = {Vancouver, BC, Canada},
numpages = {7},
url = {http://doi.acm.org/10.1145/3136040.3136050},
doi = {10.1145/3136040.3136050},
acmid = {3136050},
publisher = {ACM},
address = {New York, NY, USA},
}

Downloads

1710_Rink_GPCE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1574

Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017) (Lawrence Rauchwerger), Springer, Cham, pp. 281–282, Oct 2017. [doi] [Bibtex & Downloads]

POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems

Reference

Sebastian Ertel, Justus Adam, Jeronimo Castrillon, "POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems", Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017) (Lawrence Rauchwerger), Springer, Cham, pp. 281–282, Oct 2017. [doi]

Bibtex

@InProceedings{ertel_lcpc17,
author = {Sebastian Ertel and Justus Adam and Jeronimo Castrillon},
title = {POSTER: Towards Fine-grained Dataflow Parallelism in Big Data Systems},
booktitle = {Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017)},
year = {2017},
editor = {Lawrence Rauchwerger},
publisher = {Springer},
address = {Cham},
location = {Texas A{\&}M University, College Station, Texas},
month = oct,
isbn = {978-3-030-35224-0},
pages = {281--282},
doi = {10.1007/978-3-030-35225-7},
url = {https://link.springer.com/book/10.1007%2F978-3-030-35225-7},
}

Downloads

1710_Ertel_LCPC [PDF]

Related Paths
HAEC

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1584

Sven Karol, Tobias Nett, Pietro Incardona, Nesrine Khouzami, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Language and Development Environment for Parallel Particle Methods", Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017 (P. Wriggers and M. Bischoff and E. Oñate and D.R.J. Owen and T. Zohdi), Sep 2017. [Bibtex & Downloads]

A Language and Development Environment for Parallel Particle Methods

Reference

Sven Karol, Tobias Nett, Pietro Incardona, Nesrine Khouzami, Jeronimo Castrillon, Ivo F. Sbalzarini, "A Language and Development Environment for Parallel Particle Methods", Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017 (P. Wriggers and M. Bischoff and E. Oñate and D.R.J. Owen and T. Zohdi), Sep 2017.

Bibtex

@InProceedings{karol_particles17,
author = {Sven Karol and Tobias Nett and Pietro Incardona and Nesrine Khouzami and Jeronimo Castrillon and Ivo F. Sbalzarini},
title = {A Language and Development Environment for Parallel Particle Methods},
booktitle = {Proceedings of the 5th International Conference on Particle-based Methods. Fundamentals and Applications PARTICLES 2017},
year = {2017},
editor = {P. Wriggers and M. Bischoff and E. O{\~n}ate and D.R.J. Owen and T. Zohdi},
url = {https://www.semanticscholar.org/paper/A-Language-and-Development-Environment-for-Paralle-Karol-Nett/2b79bd3836aeb8e2fb2a2b5d9949f9efb1bdfab7?tab=abstract},
month = sep,
}

Downloads

1709_Karol_particles [PDF]

Related Paths
Biological Systems Path, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1472

Jeronimo Castrillon, Tei-Wei Kuo, Heike E. Riel, Matthias Lieber, "Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)", In Dagstuhl Reports (Jerónimo Castrillón-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, vol. 7, no. 2, pp. 1–22, Dagstuhl, Germany, Aug 2017. [doi] [Bibtex & Downloads]

Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)

Reference

Jeronimo Castrillon, Tei-Wei Kuo, Heike E. Riel, Matthias Lieber, "Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)", In Dagstuhl Reports (Jerónimo Castrillón-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, vol. 7, no. 2, pp. 1–22, Dagstuhl, Germany, Aug 2017. [doi]

Bibtex

@Article{castrillnmazo_et_al:DR:2017:7349,
author = {Jeronimo Castrillon and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber},
title = {Wildly Heterogeneous Post-CMOS Technologies Meet Software (Dagstuhl Seminar 17061)},
journal = {Dagstuhl Reports},
year = {2017},
volume = {7},
number = {2},
month = aug,
pages = {1--22},
address = {Dagstuhl, Germany},
annote = {Keywords: 3D integration, compilers, emerging post-CMOS circuit materials and technologies, hardware/software co-design, heterogeneous hardware, nanoelectronics},
doi = {10.4230/DagRep.7.2.1},
editor = {Jer{\'o}nimo Castrill{\'o}n-Mazo and Tei-Wei Kuo and Heike E. Riel and Matthias Lieber},
issn = {2192-5283},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
url = {http://drops.dagstuhl.de/opus/volltexte/2017/7349},
urn = {urn:nbn:de:0030-drops-73499}
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1575

Andrés Goens, Sergio Siccha, Jeronimo Castrillon, "Symmetry in Software Synthesis", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 14, no. 2, pp. 20:1–20:26, New York, NY, USA, Jul 2017. [doi] [Bibtex & Downloads]

Symmetry in Software Synthesis

Reference

Andrés Goens, Sergio Siccha, Jeronimo Castrillon, "Symmetry in Software Synthesis", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 14, no. 2, pp. 20:1–20:26, New York, NY, USA, Jul 2017. [doi]

Abstract
With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this paper, we present a formal framework that can determine the inherent symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-size benchmarks, our approach consistently yields reductions of the overall execution time of algorithms, accelerating them by a factor up to 10 in our experiments, or improving the quality of the results.
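The search-space reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's framework: it assumes the simplest possible case, a fully homogeneous platform in which every permutation of cores is a symmetry, and the helper names `canonical` and `unique_mappings` are hypothetical. Two task-to-core mappings are treated as equivalent when one is obtained from the other by renaming cores, and only one representative per equivalence class is kept.

```python
from itertools import product


def canonical(mapping):
    """Relabel cores in order of first appearance.

    Assumption: all cores are interchangeable (full symmetric group),
    so mappings that differ only by a renaming of cores are equivalent.
    """
    relabel = {}
    result = []
    for core in mapping:
        if core not in relabel:
            relabel[core] = len(relabel)
        result.append(relabel[core])
    return tuple(result)


def unique_mappings(num_tasks, num_cores):
    """Return one canonical representative per class of equivalent mappings."""
    all_mappings = product(range(num_cores), repeat=num_tasks)
    return {canonical(m) for m in all_mappings}


if __name__ == "__main__":
    tasks, cores = 4, 4
    reduced = unique_mappings(tasks, cores)
    print(f"{cores ** tasks} raw mappings, {len(reduced)} after symmetry reduction")
```

For four tasks on four interchangeable cores, the 256 raw mappings collapse to 15 representatives. Heterogeneous cores or a network-on-chip topology admit fewer symmetries than this toy model, which is the situation the abstract says the formal framework addresses.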

Bibtex

@article{goens_taco17symmetry,
author = {Goens, Andr{\'e}s and Siccha, Sergio and Castrillon, Jeronimo},
title = {Symmetry in Software Synthesis},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {July 2017},
volume = {14},
number = {2},
month = jul,
year = {2017},
issn = {1544-3566},
pages = {20:1--20:26},
articleno = {20},
numpages = {26},
url = {http://doi.acm.org/10.1145/3095747},
doi = {10.1145/3095747},
acmid = {3095747},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Scalability, automation, clusters, design-space exploration, group theory, heterogeneous, inverse-semigroups, mapping, metaheuristics, network-on-chip, symmetry},
eprint = "arXiv:1704.06623",
abstract = {With the surge of multi- and manycores, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this paper, we present a formal framework that can determine the inherent symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-size benchmarks, our approach consistently yields reductions of the overall execution time of algorithms, accelerating them by a factor up to 10 in our experiments, or improving the quality of the results.}
}

Downloads

1704_Goens_TACO-arxiv [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1453

Christian Menard, Matthias Jung, Jeronimo Castrillon, Norbert Wehn, "System Simulation with gem5 and SystemC: The Keystone for Full Interoperability", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS), pp. 62–69, Jul 2017. [doi] [Bibtex & Downloads]

System Simulation with gem5 and SystemC: The Keystone for Full Interoperability

Reference

Christian Menard, Matthias Jung, Jeronimo Castrillon, Norbert Wehn, "System Simulation with gem5 and SystemC: The Keystone for Full Interoperability", Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS), pp. 62–69, Jul 2017. [doi]

Abstract
SystemC TLM based virtual prototypes have become the main tool in industry and research for concurrent hardware and software development, as well as hardware design space exploration. However, there exists a lack of accurate, free, changeable and realistic SystemC models of modern CPUs. Therefore, many researchers use the cycle accurate open source system simulator gem5, which has been developed in parallel to the SystemC standard. In this paper we present a coupling of gem5 with SystemC that offers full interoperability between both simulation frameworks, and therefore enables a huge set of possibilities for system level design space exploration. Furthermore, we show that the coupling itself only induces a relatively small overhead to the total execution time of the simulation.

Bibtex

@InProceedings{menard_samos17,
author = {Christian Menard and Matthias Jung and Jeronimo Castrillon and Norbert Wehn},
title = {System Simulation with gem5 and SystemC: The Keystone for Full Interoperability},
booktitle = {Proceedings of the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS)},
year = {2017},
month = jul,
location = {Pythagorion, Greece},
pages = {62--69},
organization = {IEEE},
doi = {10.1109/SAMOS.2017.8344612},
url = {https://ieeexplore.ieee.org/document/8344612/},
isbn = {978-1-5386-3437-0},
abstract = {SystemC TLM based virtual prototypes have become the main tool in industry and research for concurrent hardware and software development, as well as hardware design space exploration. However, there exists a lack of accurate, free, changeable and realistic SystemC models of modern CPUs. Therefore, many researchers use the cycle accurate open source system simulator gem5, which has been developed in parallel to the SystemC standard. In this paper we present a coupling of gem5 with SystemC that offers full interoperability between both simulation frameworks, and therefore enables a huge set of possibilities for system level design space exploration. Furthermore, we show that the coupling itself only induces a relatively small overhead to the total execution time of the simulation.},
}

Downloads

1707_Menard_SAMOS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1475

Andrés Goens, Robert Khasanov, Marcus Hähnel, Till Smejkal, Hermann Härtig, Jeronimo Castrillon, "TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17), ACM, pp. 11–20, New York, NY, USA, Jun 2017. [doi] [Bibtex & Downloads]

TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings

Reference

Andrés Goens, Robert Khasanov, Marcus Hähnel, Till Smejkal, Hermann Härtig, Jeronimo Castrillon, "TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17), ACM, pp. 11–20, New York, NY, USA, Jun 2017. [doi]

Bibtex

@InProceedings{goens_scopes17,
author = {Andr\'{e}s Goens and Robert Khasanov and Marcus H{\"a}hnel and Till Smejkal and Hermann H{\"a}rtig and Jeronimo Castrillon},
title = {TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings},
booktitle = {Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES'17)},
year = {2017},
month = jun,
series = {SCOPES '17},
isbn = {978-1-4503-5039-6},
location = {Sankt Goar, Germany},
pages = {11--20},
numpages = {10},
url = {http://doi.acm.org/10.1145/3078659.3078663},
doi = {10.1145/3078659.3078663},
acmid = {3078663},
publisher = {ACM},
address = {New York, NY, USA}
}

Downloads

1706_Goens_SCOPES [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1451

Gerald Hempel, Andrés Goens, Josefine Asmus, Jeronimo Castrillon, Ivo F. Sbalzarini, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), ACM, pp. 21–30, New York, NY, USA, Jun 2017. [doi] [Bibtex & Downloads]

Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering

Reference

Gerald Hempel, Andrés Goens, Josefine Asmus, Jeronimo Castrillon, Ivo F. Sbalzarini, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), ACM, pp. 21–30, New York, NY, USA, Jun 2017. [doi]

Bibtex

@InProceedings{hempel_scopes17,
author = {Gerald Hempel and Andr\'{e}s Goens and Josefine Asmus and Jeronimo Castrillon and Ivo F. Sbalzarini},
title = {Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering},
booktitle = {Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17)},
year = {2017},
series = {SCOPES '17},
pages = {21--30},
address = {New York, NY, USA},
month = jun,
publisher = {ACM},
acmid = {3078667},
doi = {10.1145/3078659.3078667},
isbn = {978-1-4503-5039-6},
location = {Sankt Goar, Germany},
numpages = {10},
url = {http://doi.acm.org/10.1145/3078659.3078667}
}

Downloads

1706_Hempel_SCOPES [PDF]

Related Paths
Biological Systems Path, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1452

Johanna Sepúlveda, Vania Marangozova-Martin, Jeronimo Castrillon, "Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface", Elsevier, Jun 2017. [doi] [Bibtex & Downloads]

Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface

Reference

Johanna Sepúlveda, Vania Marangozova-Martin, Jeronimo Castrillon, "Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface", Elsevier, Jun 2017. [doi]

Bibtex

@Article{sepulveda_alchemy17_preface,
author = {Sep{\'u}lveda, Johanna and Marangozova-Martin, Vania and Castrillon, Jeronimo},
title = {Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface},
year = {2017},
month = jun,
doi = {10.1016/j.procs.2017.05.276},
url = {http://www.sciencedirect.com/science/article/pii/S1877050917309286},
publisher = {Elsevier}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1479

Norman A. Rink, Jeronimo Castrillon, "Extending a Compiler Backend for Complete Memory Error Detection", In Proceeding: Lecture Notes in Informatics: Automotive - Safety & Security 2017 (Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Plödereder), pp. 61–74, May 2017. (Best paper award) [Bibtex & Downloads]

Extending a Compiler Backend for Complete Memory Error Detection

Reference

Norman A. Rink, Jeronimo Castrillon, "Extending a Compiler Backend for Complete Memory Error Detection", In Proceeding: Lecture Notes in Informatics: Automotive - Safety & Security 2017 (Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Plödereder), pp. 61–74, May 2017. (Best paper award)

Abstract
Technological advances drive hardware to ever smaller feature sizes, causing devices to become more vulnerable to faults. Applications can be protected against errors resulting from faults by adding error detection and recovery measures in software. This is popularly achieved by applying automatic program transformations. However, transformations applied to intermediate program representations are fundamentally incapable of protecting against vulnerabilities that are introduced during compilation. In particular, the compiler backend may introduce additional memory accesses. This report presents an extended compiler backend that protects these accesses against faults in the memory system. It is demonstrated that this enables the detection of all single bit flips in memory. On a subset of SPEC CINT2006 the runtime overhead caused by the extended backend amounts to 1.50x for the 32-bit processor architecture i386, and 1.13x for the 64-bit architecture x86_64.

Bibtex

@InProceedings{rink_automotive17,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Extending a Compiler Backend for Complete Memory Error Detection},
booktitle = {Lecture Notes in Informatics: Automotive - Safety \& Security 2017},
editor = {Peter Dencker and Herbert Klenk and Hubert Kelle and Erhard Pl{\"o}dereder},
year = {2017},
pages = {61--74},
month = may,
abstract = {Technological advances drive hardware to ever smaller feature sizes, causing devices to become more vulnerable to faults. Applications can be protected against errors resulting from faults by adding error detection and recovery measures in software. This is popularly achieved by applying automatic program transformations. However, transformations applied to intermediate program representations are fundamentally incapable of protecting against vulnerabilities that are introduced during compilation. In particular, the compiler backend may introduce additional memory accesses. This report presents an extended compiler backend that protects these accesses against faults in the memory system. It is demonstrated that this enables the detection of all single bit flips in memory. On a subset of SPEC CINT2006 the runtime overhead caused by the extended backend amounts to 1.50x for the 32-bit processor architecture i386, and 1.13x for the 64-bit architecture x86\_64.},
isbn = {978-3-88579-663-3},
issn = {1617-5468},
url = {https://dl.gi.de/bitstream/handle/20.500.12116/147/paper04.pdf?sequence=1&isAllowed=y},

}

Downloads

1705_rink_automotive [PDF]

Related Paths
Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1322

Markus Haehnel, Frehiwot Melak Arega, Waltenegus Dargie, Robert Khasanov, Jeronimo Castrillon, "Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures", Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 432–437, May 2017. [doi] [Bibtex & Downloads]

Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures

Reference

Markus Haehnel, Frehiwot Melak Arega, Waltenegus Dargie, Robert Khasanov, Jeronimo Castrillon, "Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures", Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 432–437, May 2017. [doi]

Bibtex

@InProceedings{khasanov_dcperf17,
author = {Markus Haehnel and Frehiwot Melak Arega and Waltenegus Dargie and Robert Khasanov and Jeronimo Castrillon},
title = {Application Interference Analysis: Towards Energy-efficient Workload Management on Heterogeneous Micro-Server Architectures},
booktitle = {Proceedings of the 7th International Workshop on Big Data in Cloud Performance (DCPerf'17), IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)},
year = {2017},
month = may,
pages = {432--437},
doi = {10.1109/INFCOMW.2017.8116415},
url = {http://ieeexplore.ieee.org/document/8116415/},
location = {Atlanta, USA}
}

Downloads

1705_Khasanov_DCPerf [PDF]

Related Paths
HAEC

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1385

Norman A. Rink, Jeronimo Castrillon, "Trading Fault Tolerance for Performance in AN Encoding", Proceedings of the ACM International Conference on Computing Frontiers (CF'17), ACM, pp. 183–190, New York, NY, USA, May 2017. [doi] [Bibtex & Downloads]

Trading Fault Tolerance for Performance in AN Encoding

Reference

Norman A. Rink, Jeronimo Castrillon, "Trading Fault Tolerance for Performance in AN Encoding", Proceedings of the ACM International Conference on Computing Frontiers (CF'17), ACM, pp. 183–190, New York, NY, USA, May 2017. [doi]

Bibtex

@InProceedings{rink_cf17,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Trading Fault Tolerance for Performance in {AN} Encoding},
booktitle = {Proceedings of the ACM International Conference on Computing Frontiers (CF'17)},
year = {2017},
isbn = {978-1-4503-4487-6},
location = {Siena, Italy},
pages = {183--190},
numpages = {8},
url = {http://doi.acm.org/10.1145/3075564.3075565},
doi = {10.1145/3075564.3075565},
acmid = {3075565},
publisher = {ACM},
address = {New York, NY, USA},
month = may,
}

Downloads

1705_Rink_cf [PDF]

Related Paths
Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1421

Rainer Leupers, Miguel Angel Aguilar, Juan Fernando Eusse, Jeronimo Castrillon, Weihua Sheng, "MAPS: A Software Development Environment for Embedded Multicore Applications", Springer Netherlands, pp. 1–33, Dordrecht, Apr 2017. [doi] [Bibtex & Downloads]

MAPS: A Software Development Environment for Embedded Multicore Applications

Reference

Rainer Leupers, Miguel Angel Aguilar, Juan Fernando Eusse, Jeronimo Castrillon, Weihua Sheng, "MAPS: A Software Development Environment for Embedded Multicore Applications", Springer Netherlands, pp. 1–33, Dordrecht, Apr 2017. [doi]

Abstract
The use of heterogeneous Multi-Processor System-on-Chip (MPSoC) is a widely accepted solution to address the increasing demands on high performance and energy efficiency for modern embedded devices. To enable the full potential of these platforms, new tools are needed to tackle the programming complexity of MPSoCs, while allowing for high productivity. This chapter discusses the MPSoC Application Programming Studio (MAPS), a framework that provides facilities for expressing parallelism and tool flows for parallelization, mapping/scheduling, and code generation for heterogeneous MPSoCs. Two case studies of the use of MAPS in commercial environments are presented. This chapter closes by discussing early experiences of transferring the MAPS technology into Silexica GmbH, a start-up company that provides multi-core programming tools.

Bibtex

@InBook{leupers_hhcd17,
title = {MAPS: A Software Development Environment for Embedded Multicore Applications},
author = {Rainer Leupers and Miguel Angel Aguilar and Juan Fernando Eusse and Jeronimo Castrillon and Weihua Sheng},
editor = {Soonhoi Ha and J{\"u}rgen Teich},
publisher = {Springer Netherlands},
year = {2017},
address = {Dordrecht},
month = apr,
booktitle = {Handbook of Hardware/Software Codesign},
doi = {10.1007/978-94-017-7358-4_2-1},
isbn = {978-94-017-7358-4},
url = {http://dx.doi.org/10.1007/978-94-017-7358-4_2-1},
pages = {1--33},
abstract = {The use of heterogeneous Multi-Processor System-on-Chip (MPSoC) is a widely accepted solution to address the increasing demands on high performance and energy efficiency for modern embedded devices. To enable the full potential of these platforms, new tools are needed to tackle the programming complexity of MPSoCs, while allowing for high productivity. This chapter discusses the MPSoC Application Programming Studio (MAPS), a framework that provides facilities for expressing parallelism and tool flows for parallelization, mapping/scheduling, and code generation for heterogeneous MPSoCs. Two case studies of the use of MAPS in commercial environments are presented. This chapter closes by discussing early experiences of transferring the MAPS technology into Silexica GmbH, a start-up company that provides multi-core programming tools.},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1181

Lars Schütze, Jeronimo Castrillon, "Analyzing State-of-the-Art Role-based Programming Languages", Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17), ACM, pp. 9:1–9:6, New York, NY, USA, Apr 2017. [doi] [Bibtex & Downloads]

Analyzing State-of-the-Art Role-based Programming Languages

Reference

Lars Schütze, Jeronimo Castrillon, "Analyzing State-of-the-Art Role-based Programming Languages", Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17), ACM, pp. 9:1–9:6, New York, NY, USA, Apr 2017. [doi]

Bibtex

@InProceedings{schuetze_lassy17,
author = {Lars Sch{\"u}tze and Jeronimo Castrillon},
title = {Analyzing State-of-the-Art Role-based Programming Languages},
booktitle = {Proceedings of the First International Conference on the Art, Science and Engineering of Programming (Programming'17)},
series = {Programming '17},
year = {2017},
month = apr,
isbn = {978-1-4503-4836-2},
location = {Brussels, Belgium},
pages = {9:1--9:6},
articleno = {9},
numpages = {6},
url = {http://doi.acm.org/10.1145/3079368.3079386},
doi = {10.1145/3079368.3079386},
acmid = {3079386},
publisher = {ACM},
address = {New York, NY, USA},
}

Downloads

1704_Schuetze_lassy [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1388

Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi] [Bibtex & Downloads]

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Reference

Fazal Hameed, Jeronimo Castrillon, "Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization", Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE), EDA Consortium, pp. 362–367, Mar 2017. [doi]

Abstract
State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag-Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2% and 11.4% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62%.

Bibtex

@InProceedings{hameed_date17,
author = {Fazal Hameed and Jeronimo Castrillon},
title = {Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization},
booktitle = {Proceedings of the 2017 Design, Automation and Test in Europe conference (DATE)},
year = {2017},
series = {DATE '17},
pages = {362--367},
month = mar,
publisher = {EDA Consortium},
abstract = {State-of-the-art DRAM cache employs a small Tag-Cache and its performance is dependent upon two important parameters namely bank-level-parallelism and Tag-Cache hit rate. These parameters depend upon the row buffer organization. Recently, it has been shown that a small row buffer organization delivers better performance via improved bank-level-parallelism than the traditional large row buffer organization along with energy benefits. However, small row buffers do not fully exploit the temporal locality of tag accesses, leading to reduced Tag-Cache hit rates. As a result, the DRAM cache needs to be re-designed for small row buffer organization to achieve additional performance benefits. In this paper, we propose a novel tag-store mechanism that improves the Tag-Cache hit rate by 70\% compared to existing DRAM tag-store mechanisms employing small row buffer organization. In addition, we enhance the DRAM cache controller with novel policies that take into account the locality characteristics of cache accesses. We evaluate our novel tag-store mechanism and controller policies in an 8-core system running the SPEC2006 benchmark and compare their performance and energy consumption against recent proposals. Our architecture improves the average performance by 21.2\% and 11.4\% respectively compared to large and small row buffer organizations via simultaneously improving both parameters. Compared to DRAM cache with large row buffer organization, we report an energy improvement of 62\%.},
isbn = {978-3-9815370-8-6},
doi={10.23919/DATE.2017.7927017},
url = {http://ieeexplore.ieee.org/document/7927017/},
location = {Lausanne, Switzerland}
}

Downloads

1703_Hameed_DATE [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1254

Norman A. Rink, Jeronimo Castrillon, "flexMEDiC: flexible Memory Error Detection by Combined data encoding and duplication", Proceedings of the 2nd International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with DATE 2017, pp. 15–22, Mar 2017.

Abstract
Errors in memory are known to be a major cause of system failures. Moreover, it has recently been found that single-error correcting, double-error detecting (SECDED) codes, which are widely used in ECC memory modules, are incapable of handling large fractions of errors that occur in practice. This calls for more powerful error detection measures. However, the higher the number of bit flips that can still be detected as an error, the larger the memory overhead. Cost considerations and the varying needs for reliability of different applications may not always warrant laying down extra hardware to accommodate overheads. Software-implemented error detection offers a flexible alternative. In this work we propose the software-implemented flexMEDiC scheme for detecting errors in the memory system, including main memory, on-chip caches, and load-store queues. It is shown that single and double bit flips are detected by flexMEDiC, and evidence is given that suggests that up to five bit flips within a single data word can still be detected as errors. The average runtime overhead incurred by flexMEDiC is 1.55x.
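
The abstract does not spell out the encoding, so the following is only a minimal sketch of how an encoded copy combined with a plain duplicate can expose bit flips. It assumes an arithmetic AN code with a hypothetical constant `A`; the names `store`, `check` and `flip` are illustrative and are not taken from flexMEDiC.

```python
A = 58659  # hypothetical encoding constant, chosen only for illustration


def store(value: int) -> tuple[int, int]:
    """Keep two copies of a value: an AN-encoded one and a plain duplicate."""
    return value * A, value


def check(word: tuple[int, int]) -> bool:
    """Return True only if both copies still agree.

    A valid encoded copy is a multiple of A, and decoding it must
    reproduce the duplicate.
    """
    encoded, duplicate = word
    return encoded % A == 0 and encoded // A == duplicate


def flip(x: int, bit: int) -> int:
    """Simulate a single bit flip in a stored integer."""
    return x ^ (1 << bit)


word = store(42)
corrupted_encoded = (flip(word[0], 3), word[1])    # flip hits the encoded copy
corrupted_duplicate = (word[0], flip(word[1], 0))  # flip hits the duplicate
```

In this sketch `check(word)` accepts the intact pair and rejects both corrupted pairs. How many simultaneous flips a real scheme detects depends on the encoding actually used, which this sketch does not model.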

Bibtex

@InProceedings{rees:2017,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {{flexMEDiC}: flexible {M}emory {E}rror {D}etection by Combined data encoding and duplication},
booktitle = {Proceedings of the 2nd International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with DATE 2017},
year = {2017},
month = mar,
pages = {15--22},
abstract = {Errors in memory are known to be a major cause of system failures. Moreover, it has recently been found that single-error correcting, double-error detecting (SECDED) codes, which are widely used in ECC memory modules, are incapable of handling large fractions of errors that occur in practice. This calls for more powerful error detection measures. However, the higher the number of bit flips that can still be detected as an error, the larger the memory overhead. Cost considerations and the varying needs for reliability of different applications may not always warrant laying down extra hardware to accommodate overheads. Software-implemented error detection offers a flexible alternative. In this work we propose the software-implemented flexMEDiC scheme for detecting errors in the memory system, including main memory, on-chip caches, and load-store queues. It is shown that single and double bit flips are detected by flexMEDiC, and evidence is given that suggests that up to five bit flips within a single data word can still be detected as errors. The average runtime overhead incurred by flexMEDiC is 1.55x.},
}

Downloads

1703_Rink_REES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1321

Jeronimo Castrillon, "Programming for adaptive and energy-efficient computing", In International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (keynote), Mar 2017.

Bibtex

@Misc{castrillon2017hp3c,
author = {Castrillon, Jeronimo},
title = {Programming for adaptive and energy-efficient computing},
howpublished = {International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (keynote)},
month = mar,
year = {2017},
location = {Kuala Lumpur, Malaysia}
}

Downloads

170323_castrill_hp3c [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1473

Andrés Goens, Jeronimo Castrillon, "Optimizing for Data-Parallelism in Kahn Process Networks", In ACM SRC at International Symposium on Code Generation and Optimization (CGO), Feb 2017.

Bibtex

@inproceedings{goens17cgo,
author = {Andr\'{e}s Goens and Jeronimo Castrillon},
title = {Optimizing for Data-Parallelism in Kahn Process Networks},
year = {2017},
month = feb,
booktitle= {ACM SRC at International Symposium on Code Generation and Optimization (CGO)},
location = {Austin, TX, USA},
}

Downloads

1701_Goens_SRCCGO [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1544

Jeronimo Castrillon, "On Mapping to Multi/Manycores", In 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017.

Bibtex

@Misc{castrillon2017multiprog,
author = {Castrillon, Jeronimo},
title = {On Mapping to Multi/Manycores},
howpublished = {10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk)},
month = jan,
year = {2017},
location = {Stockholm, Sweden}
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1319

Jeronimo Castrillon, "Flexible and Scalable Dataflow Programming for Manycores", In Tutorial for heterogeneous multicore design automation: current and future, held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk), Jan 2017.

Bibtex

@Misc{castrillon2017hipeactut,
author = {Castrillon, Jeronimo},
title = {Flexible and Scalable Dataflow Programming for Manycores},
howpublished = {Tutorial for heterogeneous multicore design automation: current and future, held in conjunction with the 12th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) (invited talk)},
month = jan,
year = {2017},
location = {Stockholm, Sweden}
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1320


2016
Norman A. Rink, Jeronimo Castrillon, "Comprehensive Backend Support for Local Memory Fault Tolerance", Technical report, Technische Universität Dresden, pp. 11, Dec 2016.

Bibtex

@TechReport{rink_techrep16,
author = {Norman A. Rink and Jeronimo Castrillon},
title = {Comprehensive Backend Support for Local Memory Fault Tolerance},
institution = {Technische Universit{\"a}t Dresden},
year = {2016},
month = dec,
issn = {1430-211X},
pages = {11},
url = {https://cfaed.tu-dresden.de/files/user/nrink/tech-report-ro.pdf}
}

Downloads

tech-report-ro [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1308

Marcus Völp, Sascha Klüppelholz, Jeronimo Castrillon, Hermann Härtig, Nils Asmussen, Uwe Assmann, Franz Baader, Christel Baier, Gerhard Fettweis, Jochen Fröhlich, Andres Goens, Sebastian Haas, Dirk Habich, Mattis Hasler, Immo Huismann, Tomas Karnagel, Sven Karol, Wolfgang Lehner, Linda Leuschner, Matthias Lieber, Siqi Ling, Steffen Märcker, Johannes Mey, Wolfgang Nagel, Benedikt Nöthen, Rafael Peñaloza, Michael Raitza, Jörg Stiller, Annett Ungethüm, Axel Voigt, "The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware", Proceedings of the 1st International Workshop on Post-Moore's Era Supercomputing (PMES), Co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, USA, Nov 2016.

Abstract
Future systems based on post-CMOS technologies will be wildly heterogeneous, with properties largely unknown today. This paper presents our design of a new hardware/software stack to address the challenge of preparing software development for such systems. It combines well-understood technologies from different areas, e.g., network-on-chips, capability operating systems, flexible programming models and model checking. We describe our approach and provide details on key technologies.

Bibtex

@InProceedings{voelp16_pmes,
author = {Marcus V{\"o}lp and Sascha Kl{\"u}ppelholz and Jeronimo Castrillon and Hermann H{\"a}rtig and Nils Asmussen and Uwe Assmann and Franz Baader and Christel Baier and Gerhard Fettweis and Jochen Fr{\"o}hlich and Andres Goens and Sebastian Haas and Dirk Habich and Mattis Hasler and Immo Huismann and Tomas Karnagel and Sven Karol and Wolfgang Lehner and Linda Leuschner and Matthias Lieber and Siqi Ling and Steffen M{\"a}rcker and Johannes Mey and Wolfgang Nagel and Benedikt N{\"o}then and Rafael Pe{\~n}aloza and Michael Raitza and J{\"o}rg Stiller and Annett Ungeth{\"u}m and Axel Voigt},
title = {The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware},
booktitle = {Proceedings of the 1st International Workshop on Post-Moore's Era Supercomputing (PMES), Co-located with The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16)},
year = {2016},
address = {Salt Lake City, USA},
month = nov,
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1611_Voelp_PMES.pdf},
abstract = {Future systems based on post-CMOS technologies will be wildly heterogeneous, with properties largely unknown today. This paper presents our design of a new hardware/software stack to address the challenge of preparing software development for such systems. It combines well-understood technologies from different areas, e.g., network-on-chips, capability operating systems, flexible programming models and model checking. We describe our approach and provide details on key technologies.},
}

Downloads

1611_Voelp_PMES [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=923

Christian Menard, Andrés Goens, Jeronimo Castrillon, "High-Level NoC Model for MPSoC Compilers", Proceedings of the IEEE Nordic Circuits and Systems Conference (NORCAS'16), pp. 1–6, Copenhagen, Denmark, Nov 2016.

Bibtex

@InProceedings{menard_norcas16,
author = {Christian Menard and Andr\'{e}s Goens and Jeronimo Castrillon},
title = {High-Level NoC Model for MPSoC Compilers},
booktitle = {Proceedings of the IEEE Nordic Circuits and Systems Conference (NORCAS'16)},
year = {2016},
pages={1--6},
doi = {10.1109/NORCHIP.2016.7792876},
series = {NORCAS},
address = {Copenhagen, Denmark},
month = nov,
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1611_Menard_NORCAS.pdf}
}

Downloads

1611_Menard_NORCAS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1180

Andres Goens, Robert Khasanov, Jeronimo Castrillon, Simon Polstra, Andy Pimentel, "Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study", Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16), pp. 281–288, Ecole Centrale de Lyon, Lyon, France, Sep 2016.

Abstract
Software abstractions are crucial to effectively program heterogeneous Multi-Processor Systems on Chip (MPSoCs). Prime examples of such abstractions are Kahn Process Networks (KPNs) and execution traces. When modeling computation as a KPN, one of the key challenges is to obtain a good mapping, i.e., an assignment of logical computation and communication to physical resources. In this paper we compare two system-level frameworks for solving the mapping problem: Sesame and MAPS. These frameworks, while superficially similar, embody different approaches. Sesame, motivated by modeling and design-space exploration, uses evolutionary algorithms for mapping. MAPS, being a compiler framework, uses simple and fast heuristics instead. In this work we highlight the value of common abstractions, such as KPNs and traces, as a vehicle to enable comparisons between large independent frameworks. These types of comparisons are fundamental for advancing research in the area. At the same time, we illustrate how the lack of formalized models at the hardware level is an obstacle to achieving fair comparisons. Additionally, using a set of applications from the embedded systems domain, we observe that genetic algorithms tend to outperform heuristics by a factor between 1x and 5x, with notable exceptions. This performance comes at the cost of a longer computation time, between 0 and 2 orders of magnitude in our experiments.
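
To make the trade-off between a fast heuristic and a costlier search concrete, here is a minimal sketch of the mapping problem under strong simplifications. It ignores communication entirely, the workloads and processing-element (PE) speeds are hypothetical, the greedy rule is not the heuristic used in MAPS, and exhaustive search merely stands in for an expensive search such as the evolutionary algorithms the abstract attributes to Sesame.

```python
from itertools import product


def makespan(mapping, work, speed):
    """Finish time of the busiest PE for a given process-to-PE mapping."""
    load = {pe: 0.0 for pe in speed}
    for proc, pe in mapping.items():
        load[pe] += work[proc] / speed[pe]
    return max(load.values())


def greedy_mapping(work, speed):
    """Fast heuristic: place heaviest processes first on the PE that finishes earliest."""
    load = {pe: 0.0 for pe in speed}
    mapping = {}
    for proc in sorted(work, key=work.get, reverse=True):
        pe = min(speed, key=lambda p: load[p] + work[proc] / speed[p])
        load[pe] += work[proc] / speed[pe]
        mapping[proc] = pe
    return mapping


def exhaustive_mapping(work, speed):
    """Slow reference: evaluate every assignment of processes to PEs."""
    procs = list(work)
    candidates = (dict(zip(procs, pes)) for pes in product(speed, repeat=len(procs)))
    return min(candidates, key=lambda m: makespan(m, work, speed))


work = {"src": 4.0, "fft": 9.0, "filter": 6.0, "sink": 3.0}  # hypothetical workloads
speed = {"risc": 1.0, "dsp": 2.0}                            # hypothetical PE speeds
fast = greedy_mapping(work, speed)
best = exhaustive_mapping(work, speed)
```

The exhaustive variant can never be worse than the greedy one on this cost model, but its search space grows as the number of PEs raised to the number of processes.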

Bibtex

@InProceedings{goen_mcsoc16,
author= {Andres Goens and Robert Khasanov and Jeronimo Castrillon and Simon Polstra and Andy Pimentel},
title= {Why Comparing System-level {MPSoC} Mapping Approaches is Difficult: a Case Study},
booktitle= {Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16)},
year= {2016},
address= {Ecole Centrale de Lyon, Lyon, France},
month= sep,
pages = {281--288},
doi = {10.1109/MCSoC.2016.48},
abstract = {Software abstractions are crucial to effectively program heterogeneous Multi-Processor Systems on Chip (MPSoCs). Prime examples of such abstractions are Kahn Process Networks (KPNs) and execution traces. When modeling computation as a KPN, one of the key challenges is to obtain a good mapping, i.e., an assignment of logical computation and communication to physical resources. In this paper we compare two system-level frameworks for solving the mapping problem: Sesame and MAPS. These frameworks, while superficially similar, embody different approaches. Sesame, motivated by modeling and design-space exploration, uses evolutionary algorithms for mapping. MAPS, being a compiler framework, uses simple and fast heuristics instead. In this work we highlight the value of common abstractions, such as KPNs and traces, as a vehicle to enable comparisons between large independent frameworks. These types of comparisons are fundamental for advancing research in the area. At the same time, we illustrate how the lack of formalized models at the hardware level is an obstacle to achieving fair comparisons. Additionally, using a set of applications from the embedded systems domain, we observe that genetic algorithms tend to outperform heuristics by a factor between 1x and 5x, with notable exceptions. This performance comes at the cost of a longer computation time, between 0 and 2 orders of magnitude in our experiments.},
days= {21},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/1609_Goens_MCSoC.pdf}
}

Downloads

1609_Goens_MCSoC [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=806

Benjamin Schiller, Clemens Deusser, Jeronimo Castrillon, Thorsten Strufe, "Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis", In Journal of Applied Network Science, vol. 1, no. 9, pp. 1–22, Sep 2016.

Bibtex

@Article{schiller16_jans,
author = {Benjamin Schiller and Clemens Deusser and Jeronimo Castrillon and Thorsten Strufe},
title = {Compile- and Run-time Approaches for the Selection of Efficient Data Structures for Dynamic Graph Analysis},
journal = {Journal of Applied Network Science},
year = {2016},
volume = {1},
number = {9},
pages = {1--22},
month = sep,
doi = {10.1007/s41109-016-0011-2},
url= {http://dynamic-networks.org/publications/papers/papers/gds-dynamic.pdf}
}

Downloads

1607_Schiller_JANS [PDF]

Related Paths
HAEC, Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=855

Jeronimo Castrillon, "Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems", In IEEE 5G Dresden Summit (invited talk), Sep 2016.

Bibtex

@Misc{castrillon20165gsummit,
author = {Castrillon, Jeronimo},
title = {Compiling for Deeply Embedded and Heterogeneous Signal Processing Systems},
howpublished = {IEEE 5G Dresden Summit (invited talk)},
month = sep,
year = {2016},
location = {Dresden, Germany},
url= {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/160929_castrillon_5G-summit.pdf}
}

Downloads

160929_castrillon_5G-summit [PDF]

Permalink

https://cfaed.tu-dresden.de/publications?pubId=1186

Andrés Goens, Jeronimo Castrillon, Maximilian Odendahl, Rainer Leupers, "An Optimal Allocation of Memory Buffers for Complex Multicore Platforms", In Journal of Systems Architecture, Elsevier, vol. 66-67, pp. 69–83, May 2016.

Abstract
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.
In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows performing an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed us to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.
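
The shape of the buffer-allocation problem can be illustrated with a toy instance. The sketch below is an assumption-laden stand-in: it enumerates all allocations by brute force instead of using mixed-integer linear programming, it models only memory capacity and a per-access latency, and it ignores the throughput, topology and bandwidth constraints the abstract mentions. All names and numbers are hypothetical.

```python
from itertools import product


def allocate(buffers, memories):
    """Pick the cheapest feasible buffer-to-memory allocation.

    buffers:  {name: (size, number_of_accesses)}
    memories: {name: (capacity, latency_per_access)}
    Returns (allocation, cost), or (None, inf) if nothing fits.
    """
    names = list(buffers)
    best, best_cost = None, float("inf")
    for choice in product(memories, repeat=len(names)):
        # Capacity constraint: the buffers placed in a memory must fit into it.
        used = {m: 0 for m in memories}
        for b, m in zip(names, choice):
            used[m] += buffers[b][0]
        if any(used[m] > memories[m][0] for m in memories):
            continue
        # Objective: total access latency over all buffers.
        cost = sum(buffers[b][1] * memories[m][1] for b, m in zip(names, choice))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost


buffers = {"a": (4, 10), "b": (4, 1)}           # hypothetical sizes and access counts
memories = {"fast": (4, 1), "slow": (8, 5)}     # hypothetical capacities and latencies
allocation, cost = allocate(buffers, memories)
```

Here only one buffer fits into the fast memory, so the frequently accessed buffer `a` is placed there and `b` goes to the slow memory. Enumeration is exponential in the number of buffers, which is why instances with around 1000 buffers call for a solver-based formulation instead.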

Bibtex

@Article{goens_jsa16,
Title={An Optimal Allocation of Memory Buffers for Complex Multicore Platforms},
Author={Goens, Andr\'{e}s and Castrillon, Jeronimo and Odendahl, Maximilian and Leupers, Rainer},
Journal={Journal of Systems Architecture},
volume={66-67},
pages={69--83},
doi={10.1016/j.sysarc.2016.05.002},
publisher={Elsevier},
Year={2016},
month=may,
abstract={In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.
In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows performing an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed us to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.}
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=684

Jeronimo Castrillon, "Programming Heterogeneous Embedded Systems for IoT", In Workshop get-togethers toward a sustainable collaboration in IoT (invited talk), Apr 2016.

Bibtex

@Misc{castrillon2016tunis,
author={Castrillon, Jeronimo},
title={Programming Heterogeneous Embedded Systems for IoT},
howpublished={Workshop get-togethers toward a sustainable collaboration in IoT (invited talk)},
month=apr,
year={2016},
location={Tunis, Tunisia},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/160418_castrillon_dataflow4IoT.pdf}
}

Downloads

160418_castrillon_dataflow4IoT [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=779

Sven Karol, Norman A. Rink, Bálint Gyapjas, Jeronimo Castrillon, "Fault Tolerance with Aspects: a Feasibility Study", Proceedings of the 15th International Conference on Modularity, ACM, pp. 66–69, New York, NY, USA, Mar 2016.

Bibtex

@inproceedings{karol2016faulttolerance,
author={Karol, Sven and Rink, Norman A. and Gyapjas, B\'{a}lint and Castrillon, Jeronimo},
title={Fault Tolerance with Aspects: a Feasibility Study},
booktitle={Proceedings of the 15th International Conference on Modularity},
series={MODULARITY 2016},
year={2016},
pages={66--69},
address={New York, NY, USA},
month=mar,
publisher={ACM},
doi={10.1145/2889443.2889453},
isbn={978-1-4503-3995-7},
location={M{\'a}laga, Spain},
}

Downloads

1603_Karol_Modularity_preprint [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=651


2015
Andrés Goens, Jeronimo Castrillon, "Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs", In System Level Design from HW/SW to Memory for Embedded Systems. IESS 2015. IFIP Advances in Information and Communication Technology, vol 523 (Götz, Marcelo and Schirner, Gunar and Wehrmeister, Marco Aurélio and Al Faruque, Mohammad Abdullah and Rettberg, Achim), Springer International Publishing, pp. 116–127, Foz do Iguaçu, Brazil, Nov 2015.

Abstract
Current approaches for mapping Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) applications rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior. We address this problem by leveraging inherent mathematical structures of the dataflow models and the hardware architectures. On the side of the dataflow models, we rely on the monoid structure of histories and traces. This structure helps us formalize the behavior of multiple executions of a given dynamic application. By defining metrics we have a formal framework for comparing the executions. On the side of the hardware, we take advantage of symmetries in the architecture to reduce the search space for the mapping problem. We evaluate our implementation on execution variations of a randomly-generated KPN application and on a low-variation JPEG encoder benchmark. Using the described methods we show that trace differences are not sufficient for characterizing performance losses. Additionally, using platform symmetries we manage to reduce the design space in the experiments by two orders of magnitude.
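
How platform symmetries shrink the mapping search space can be seen in a small, idealized case. The sketch below assumes that all processing elements (PEs) are identical and fully interchangeable, which is a much stronger symmetry than a real architecture usually has; the function names are illustrative and not taken from the paper.

```python
from itertools import product


def canonical(mapping):
    """Relabel PEs in order of first use.

    Mappings that differ only by a permutation of interchangeable PEs
    end up with the same representative.
    """
    relabel = {}
    return tuple(relabel.setdefault(pe, len(relabel)) for pe in mapping)


def all_mappings(num_processes, num_pes):
    """Every assignment of processes (tuple positions) to PE indices."""
    return list(product(range(num_pes), repeat=num_processes))


mappings = all_mappings(4, 3)                       # 3**4 raw mappings
representatives = {canonical(m) for m in mappings}  # one per symmetry class
```

For four processes on three interchangeable PEs, the 81 raw mappings collapse to 14 representatives, so only those need to be evaluated. The reduction grows quickly with the number of symmetric PEs.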

Bibtex

@InProceedings{goens_iess15,
author = {Goens, Andr\'{e}s and Castrillon, Jeronimo},
title = {Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs},
booktitle = {System Level Design from HW/SW to Memory for Embedded Systems. IESS 2015. IFIP Advances in Information and Communication Technology, vol 523},
year = {2015},
editor = {G{\"o}tz, Marcelo and Schirner, Gunar and Wehrmeister, Marco Aur{\'e}lio and Al Faruque, Mohammad Abdullah and Rettberg, Achim},
pages = {116--127},
address = {Foz do Igua{\c{c}}u, Brazil},
month = nov,
publisher = {Springer International Publishing},
doi = {10.1007/978-3-319-90023-0_10},
url = {https://link.springer.com/chapter/10.1007%2F978-3-319-90023-0_10},
isbn={978-3-319-90023-0},
abstract = {Current approaches for mapping Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) applications rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior. We address this problem by leveraging inherent mathematical structures of the dataflow models and the hardware architectures. On the side of the dataflow models, we rely on the monoid structure of histories and traces. This structure helps us formalize the behavior of multiple executions of a given dynamic application. By defining metrics we have a formal framework for comparing the executions. On the side of the hardware, we take advantage of symmetries in the architecture to reduce the search space for the mapping problem. We evaluate our implementation on execution variations of a randomly-generated KPN application and on a low-variation JPEG encoder benchmark. Using the described methods we show that trace differences are not sufficient for characterizing performance losses. Additionally, using platform symmetries we manage to reduce the design space in the experiments by two orders of magnitude.},
}

Downloads

1511_Goens_IESS [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=463

Benjamin Schiller, Jeronimo Castrillon, Thorsten Strufe, "Efficient data structures for dynamic graph analysis", Proceedings of the 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (Lisa O'Conner), IEEE Computer Society, pp. 497–504, Bangkok, Thailand, Nov 2015. [doi] [Bibtex & Downloads]

Efficient data structures for dynamic graph analysis

Reference

Benjamin Schiller, Jeronimo Castrillon, Thorsten Strufe, "Efficient data structures for dynamic graph analysis", Proceedings of the 11th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (Lisa O'Conner), IEEE Computer Society, pp. 497–504, Bangkok, Thailand, Nov 2015. [doi]

Bibtex

@InProceedings{schiller_sitis15,
Title={Efficient data structures for dynamic graph analysis},
Author={Schiller, Benjamin and Castrillon, Jeronimo and Strufe, Thorsten},
Booktitle={Proceedings of the 11th International Conference on Signal-Image Technology \& Internet-Based Systems (SITIS)},
Year={2015},
Address={Bangkok, Thailand},
Editor={Lisa O'Conner},
Month=nov,
Publisher={IEEE Computer Society},
Series={SITIS 2015},
pages={497--504},
doi={10.1109/SITIS.2015.94}
}

Downloads

1511_Schiller_SITIS [PDF]

Related Paths
Orchestration Path, Resilience Path, HAEC

Permalink

https://cfaed.tu-dresden.de/publications?pubId=465

Norman A. Rink, Jeronimo Castrillon, "Improving Code Generation for Software-based Error Detection", Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015, pp. 16–30, Amsterdam, The Netherlands, Oct 2015. ([link]) [Bibtex & Downloads]

Improving Code Generation for Software-based Error Detection

Reference

Norman A. Rink, Jeronimo Castrillon, "Improving Code Generation for Software-based Error Detection", Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015, pp. 16–30, Amsterdam, The Netherlands, Oct 2015. ([link])

Bibtex

@InProceedings{rink_ress15,
Title={Improving Code Generation for Software-based Error Detection},
Author={Rink, Norman A. and Castrillon, Jeronimo},
Booktitle={Proceedings of the 1st International Workshop on Resiliency in Embedded Electronic Systems (REES), co-located with ESWEEK 2015},
Year={2015},
Series={REES 2015},
Address={Amsterdam, The Netherlands},
Month=oct,
Pages={16--30},

}

Downloads

1510_Rink_REES [PDF]

Related Paths
Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=464

Jeronimo Castrillon, "Analysis and software synthesis of KPN applications", In Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk), Oct 2015. ([link]) [Bibtex & Downloads]

Analysis and software synthesis of KPN applications

Reference

Jeronimo Castrillon, "Analysis and software synthesis of KPN applications", In Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk), Oct 2015. ([link])

Abstract
Programming models based on dataflow or process
networks are a good match for streaming
applications, common in the signal processing,
multimedia and automotive domains. In such models,
parallelism is expressed explicitly which makes
them well-suited for programming parallel
machines. Since today's applications are no
longer static, expressive programming models are
needed, such as those based on Kahn Process
Networks (KPNs). In these models, tasks cannot be
handled as black boxes, but have to be analyzed,
profiled and traced to characterize their
behavior. This is especially important in the case
of heterogeneous platforms with many processors of
multiple different types. This presentation
describes a tool flow to handle KPN applications
and gives insights into mapping algorithms for
heterogeneous platforms.

Bibtex

@Misc{castrillon15_dreams,
Title={Analysis and software synthesis of KPN applications},
Author={Jeronimo Castrillon},
HowPublished={Design of Robotics and Embedded systems, Analysis, and Modeling Seminar (DREAMS) (invited talk)},
Month=oct,
Year={2015},
Day={22},
Location={Berkeley, CA, USA},

Abstract={Programming models based on dataflow or process
networks are a good match for streaming
applications, common in the signal processing,
multimedia and automotive domains. In such models,
parallelism is expressed explicitly which makes
them well-suited for programming parallel
machines. Since today's applications are no
longer static, expressive programming models are
needed, such as those based on Kahn Process
Networks (KPNs). In these models, tasks cannot be
handled as black boxes, but have to be analyzed,
profiled and traced to characterize their
behavior. This is especially important in the case
of heterogeneous platforms with many processors of
multiple different types. This presentation
describes a tool flow to handle KPN applications
and gives insights into mapping algorithms for
heterogeneous platforms.},
url={https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/151022_castrillon_dreams.pdf},
}

Downloads

151022_castrillon_dreams [PDF]

Related Paths
Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=466

Jeronimo Castrillon, "Dataflow programming for heterogeneous computing systems", In Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15), Oct 2015. ([link]) [Bibtex & Downloads]

Dataflow programming for heterogeneous computing systems

Reference

Jeronimo Castrillon, "Dataflow programming for heterogeneous computing systems", In Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15), Oct 2015. ([link])

Abstract
This tutorial talk starts by introducing new types of heterogeneous systems and their challenges for hardware/software programming stacks. These systems are currently being investigated in the context of the German cluster of excellence cfaed – "Center for Advancing Electronics Dresden". We will then look at dataflow modeling concepts, with emphasis on the dynamic models that are needed to express today's changing workloads. Finally, the talk will introduce methods and algorithms for mapping sets of applications modeled in this way to heterogeneous systems.

Bibtex

@Misc{castrillon15_pacttut,
Title={Dataflow programming for heterogeneous computing systems},
Author={Jeronimo Castrillon},
HowPublished={Tutorial Algorithmic Specification, Tools and Algorithms for Programming Heterogeneous Platforms. Co-located with the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT'15)},
Month=oct,
Year={2015},
Abstract={This tutorial talk starts by introducing new types of heterogeneous systems and their challenges for hardware/software programming stacks. These systems are currently being investigated in the context of the German cluster of excellence cfaed – ``Center for Advancing Electronics Dresden''. We will then look at dataflow modeling concepts, with emphasis on the dynamic models that are needed to express today's changing workloads. Finally, the talk will introduce methods and algorithms for mapping sets of applications modeled in this way to heterogeneous systems.},
Day={18},

}

Downloads

151018_castrillon_dataflow_pacttut [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=468

Markus Vogt, Gerald Hempel, Jeronimo Castrillon, Christian Hochberger, "GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs", Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP), Sep 2015. ([link]) [Bibtex & Downloads]

GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs

Reference

Markus Vogt, Gerald Hempel, Jeronimo Castrillon, Christian Hochberger, "GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs", Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP), Sep 2015. ([link])

Abstract
In recent years, architectures combining a reconfigurable fabric and a general-purpose processor on a single chip have become increasingly popular. Such hybrid architectures allow extending embedded software with application-specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers in handling the complexity of the required process of hardware/software (HW/SW) partitioning is an important issue. Current methods are often restricted, either to bare-metal systems, to subsets of mainstream programming languages, or require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite.

Bibtex

@InProceedings{vogt15,
Title={GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs},
Author={Vogt, Markus and Hempel, Gerald and Castrillon, Jeronimo and Hochberger, Christian},
Booktitle={Proceedings of the Second International Workshop on FPGAs for Software Programmers (FSP)},
Year={2015},
Month=sep,
Series={FSP 2015},
archivePrefix={arXiv},
arxivId={1509.00025},
eprint={1509.00025},

abstract={In recent years, architectures combining a reconfigurable fabric and a general-purpose processor on a single chip have become increasingly popular. Such hybrid architectures allow extending embedded software with application-specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers in handling the complexity of the required process of hardware/software (HW/SW) partitioning is an important issue. Current methods are often restricted, either to bare-metal systems, to subsets of mainstream programming languages, or require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite.},
}

Downloads

1509_Vogt_FSP [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=454

Jeronimo Castrillon, "Orchestration: Turning material breakthroughs into application performance", In Dresden Microelectronics Academy, (invited talk), Sep 2015. [Bibtex & Downloads]

Orchestration: Turning material breakthroughs into application performance

Reference

Jeronimo Castrillon, "Orchestration: Turning material breakthroughs into application performance", In Dresden Microelectronics Academy, (invited talk), Sep 2015.

Bibtex

@Misc{castrillon2015dma,
Title={Orchestration: Turning material breakthroughs into application performance},
Author={Castrillon, Jeronimo},
HowPublished={Dresden Microelectronics Academy, (invited talk)},
Month=sep,
Year={2015},
Location={Dresden, Germany},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150918_castrillon_dma.pdf}
}

Downloads

150918_castrillon_dma [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=462

Norman A. Rink, Dmitrii Kuvaiskii, Jeronimo Castrillon, Christof Fetzer, "Compiling for Resilience: the Performance Gap", Chapter in Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015) (Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer), IOS Press, vol. 27, pp. 721–730, Edinburgh, Scotland, Sep 2015. [doi] [Bibtex & Downloads]

Compiling for Resilience: the Performance Gap

Reference

Norman A. Rink, Dmitrii Kuvaiskii, Jeronimo Castrillon, Christof Fetzer, "Compiling for Resilience: the Performance Gap", Chapter in Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015) (Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer), IOS Press, vol. 27, pp. 721–730, Edinburgh, Scotland, Sep 2015. [doi]

Abstract
In order to perform reliable computations on unreliable hardware, software-based protection mechanisms have been proposed. In this paper we present a compiler infrastructure for software-based code hardening based on encoding. We analyze the trade-off between performance and fault coverage. We look at different code generation strategies that improve the performance of hardened programs by up to 2x while incurring little fault coverage degradation.

Bibtex

@InCollection{rink_erpp2015,
author={Rink, Norman A. and Kuvaiskii, Dmitrii and Castrillon, Jeronimo and Fetzer, Christof},
title={Compiling for Resilience: the Performance Gap},
booktitle={Parallel Computing: On the Road to Exascale (ParCo 2015). Extended from Proceedings of the Mini-Symposium on Energy and Resilience in Parallel Programming (ERPP 2015)},
publisher={IOS Press},
year={2015},
editor={Gerhard R. Joubert and Hugh Leather and Mark Parsons and Frans Peters and Mark Sawyer},
volume={27},
series={ParCo 2015},
pages={721--730},
address={Edinburgh, Scotland},
month=sep,
abstract={In order to perform reliable computations on unreliable hardware, software-based protection mechanisms have been proposed. In this paper we present a compiler infrastructure for software-based code hardening based on encoding. We analyze the trade-off between performance and fault coverage. We look at different code generation strategies that improve the performance of hardened programs by up to 2x while incurring little fault coverage degradation.},
doi={10.3233/978-1-61499-621-7-721},
}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path, Resilience Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=782

Gerald Hempel, Markus Vogt, Jeronimo Castrillon, Christian Hochberger, "Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs", Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session (Grosspietsch, Erwin and Klöckner, Konrad), SEA-Publications: SEA-SR-44, Funchal, Madeira (Portugal), August 2015. [Bibtex & Downloads]

Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs

Reference

Gerald Hempel, Markus Vogt, Jeronimo Castrillon, Christian Hochberger, "Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs", Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session (Grosspietsch, Erwin and Klöckner, Konrad), SEA-Publications: SEA-SR-44, Funchal, Madeira (Portugal), August 2015.

Bibtex

@InProceedings{hempeldsd15,
Title={Software-Backed Caching and Virtual Addressing for Generated Accelerators in SoC FPGAs},
Author={Hempel, Gerald and Vogt, Markus and Castrillon, Jeronimo and Hochberger, Christian},
Booktitle={Proceedings of 41st EUROMICRO Conference on Software Engineering and Advanced Applications - Work in Progress Session},
Year={2015},
Address={Funchal, Madeira (Portugal)},
Editor={Grosspietsch, Erwin and Kl{\"o}ckner, Konrad},
Month=aug,
Publisher={SEA-Publications: SEA-SR-44},
Series={DSD/SEAA 2015},
ISBN={978-3-902457-44-8}
}

Downloads

1508_Hempel_DSD [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=452

Sven Karol, Pietro Incardona, Yaser Afshar, Ivo Sbalzarini, Jeronimo Castrillon, "Towards a Next-Generation Parallel Particle-Mesh Language", Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI), pp. 15–18, Jul 2015. ([link]) [Bibtex & Downloads]

Towards a Next-Generation Parallel Particle-Mesh Language

Reference

Sven Karol, Pietro Incardona, Yaser Afshar, Ivo Sbalzarini, Jeronimo Castrillon, "Towards a Next-Generation Parallel Particle-Mesh Language", Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI), pp. 15–18, Jul 2015. ([link])

Bibtex

@InProceedings{karol15,
Title={Towards a Next-Generation Parallel Particle-Mesh Language},
Author={Karol, Sven and Incardona, Pietro and Afshar, Yaser and Sbalzarini, Ivo and Castrillon, Jeronimo},
Booktitle={Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI)},
series={DSLDI'15},
Year={2015},
Month=jul,
pages={15--18},

}

Downloads

1507_Karol_PPML [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=453

Diana Göhringer, Michael Hübner, Jeronimo Castrillon, Cristina Silvano, "ViPES 2015-Preface", Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 347–347, Jul 2015. [doi] [Bibtex & Downloads]

ViPES 2015-Preface

Reference

Diana Göhringer, Michael Hübner, Jeronimo Castrillon, Cristina Silvano, "ViPES 2015-Preface", Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 347–347, Jul 2015. [doi]

Bibtex

@InProceedings{gohringer2015vipes,
author={G{\"o}hringer, Diana and H{\"u}bner, Michael and Castrillon, Jeronimo and Silvano, Cristina},
title={ViPES 2015-Preface},
booktitle={Proceedings of the 15th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)},
year={2015},
pages={347--347},
organization={IEEE},
month=jul,
doi={10.1109/SAMOS.2015.7363696},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=783

Jeronimo Castrillon, "Portable Libraries and Programming Environments", In HiPEAC Computing Systems Week, (invited talk), May 2015. [Bibtex & Downloads]

Portable Libraries and Programming Environments

Reference

Jeronimo Castrillon, "Portable Libraries and Programming Environments", In HiPEAC Computing Systems Week, (invited talk), May 2015.

Bibtex

@Misc{castrillon2015csw,
Title={Portable Libraries and Programming Environments},
Author={Castrillon, Jeronimo},
HowPublished={HiPEAC Computing Systems Week, (invited talk)},
Year={2015},
Month=may,
Location={Oslo, Norway},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150505_castrillon_csw.pdf}
}

Downloads

150505_castrillon_csw [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=449

Jeronimo Castrillon, Lothar Thiele, Lars Schorr, Weihua Sheng, Ben Juurlink, Mauricio Alvarez-Mesa, Angela Pohl, Ralph Jessenberger, Victor Reyes, Rainer Leupers, "Multi/Many-core Programming: Where Are We Standing?", Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), EDA Consortium, pp. 1708–1717, San Jose, CA, USA, Mar 2015. ([link]) [Bibtex & Downloads]

Multi/Many-core Programming: Where Are We Standing?

Reference

Jeronimo Castrillon, Lothar Thiele, Lars Schorr, Weihua Sheng, Ben Juurlink, Mauricio Alvarez-Mesa, Angela Pohl, Ralph Jessenberger, Victor Reyes, Rainer Leupers, "Multi/Many-core Programming: Where Are We Standing?", Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), EDA Consortium, pp. 1708–1717, San Jose, CA, USA, Mar 2015. ([link])

Bibtex

@inproceedings{Castrillon:2015,
author={Castrillon, Jeronimo and Thiele, Lothar and Schorr, Lars and Sheng, Weihua and Juurlink, Ben and Alvarez-Mesa, Mauricio and Pohl, Angela and Jessenberger, Ralph and Reyes, Victor and Leupers, Rainer},
title={Multi/Many-core Programming: Where Are We Standing?},
booktitle={Proceedings of the 2015 Design, Automation \& Test in Europe Conference \& Exhibition (DATE)},
series={DATE '15},
year={2015},
Month=mar,
location={Grenoble, France},
pages={1708--1717},
numpages={10},
acmid={2757208},
publisher={EDA Consortium},
address={San Jose, CA, USA},
Bdsk-url-1={http://dl.acm.org/citation.cfm?id=2757012.2757208},

}

Downloads

No Downloads available for this publication

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=178

Jeronimo Castrillon, "Tools and dataflow-based programming models for heterogeneous MPSoCs", In Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk), Jan 2015. [Bibtex & Downloads]

Tools and dataflow-based programming models for heterogeneous MPSoCs

Reference

Jeronimo Castrillon, "Tools and dataflow-based programming models for heterogeneous MPSoCs", In Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk), Jan 2015.

Bibtex

@Misc{castrillon2015pegpum,
Title={Tools and dataflow-based programming models for heterogeneous MPSoCs},
Author={Castrillon, Jeronimo},
HowPublished={Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM'15) in conjunction with the HiPEAC Conference (invited talk)},
Year={2015},
Month=jan,
Location={Amsterdam, The Netherlands},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150121_castrillon_pegpum.pdf}
}

Downloads

150121_castrillon_pegpum [PDF]

Related Paths
HAEC, Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=450

Jeronimo Castrillon, "Simulation and Estimation for MPSoC Programming Tools", In Proceeding: Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote), Jan 2015. [Bibtex & Downloads]

Simulation and Estimation for MPSoC Programming Tools

Reference

Jeronimo Castrillon, "Simulation and Estimation for MPSoC Programming Tools", In Proceeding: Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote), Jan 2015.

Bibtex

@InProceedings{castrillon2015rapido,
Title={Simulation and Estimation for MPSoC Programming Tools},
Author={Castrillon, Jeronimo},
Booktitle={Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'15), in conjunction with the HiPEAC Conference (keynote)},
Year={2015},
Month=jan,
Location={Amsterdam, The Netherlands},
url = {https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/150121_castrillon_rapido.pdf}
}

Downloads

150121_castrillon_rapido [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=451


2014
Jeronimo Castrillon, "Compiler Flow for Processors and Systems", In Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk), Nov 2014. [Bibtex & Downloads]

Compiler Flow for Processors and Systems

Reference

Jeronimo Castrillon, "Compiler Flow for Processors and Systems", In Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk), Nov 2014.

Bibtex

@Misc{castrillon2015tunis,
Title={Compiler Flow for Processors and Systems},
Author={Castrillon, Jeronimo},
HowPublished={Winter School on Design, Programming and Applications of Multi Processor System on Chip (invited talk)},
Year={2014},
Month=nov,
Location={Tunis, Tunisia},
url={https://cfaed.tu-dresden.de/files/user/jcastrillon/publications/141127_castrillon_compilers.pdf}
}

Downloads

141127_castrillon_compilers [PDF]

Related Paths
Orchestration Path

Permalink

https://cfaed.tu-dresden.de/publications?pubId=447

Jeronimo Castrillon, Rainer Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap", Springer, pp. 258, 2014. ([link]) [Bibtex & Downloads]

Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap

Reference

Jeronimo Castrillon, Rainer Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap", Springer, pp. 258, 2014. ([link])

Bibtex

@Book{castrillon14_springer,
Title={Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap},
Author={Castrillon, Jeronimo and Leupers, Rainer},
Publisher={Springer},
Year={2014},
ISBN={978-3-319-00675-8},

Pages={258}
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=446

Diandian Zhang, Jeronimo Castrillon, Stefan Schürmans, Gerd Ascheid, Rainer Leupers, Bart Vanthournout, "System-Level Analysis of MPSoCs with a Hardware Scheduler", Hershey: IGI Global, pp. 335–367, 2014. [doi] [Bibtex & Downloads]

System-Level Analysis of MPSoCs with a Hardware Scheduler

Reference

Diandian Zhang, Jeronimo Castrillon, Stefan Schürmans, Gerd Ascheid, Rainer Leupers, Bart Vanthournout, "System-Level Analysis of MPSoCs with a Hardware Scheduler", Hershey: IGI Global, pp. 335–367, 2014. [doi]

Bibtex

@InBook{zhang2014_inbook,
Title={System-Level Analysis of MPSoCs with a Hardware Scheduler},
Author={Diandian Zhang and Jeronimo Castrillon and Stefan Sch{\"u}rmans and Gerd Ascheid and Rainer Leupers and Bart Vanthournout},
Chapter={Advancing Embedded Systems and Real-Time Communications with Emerging Technologies},
Editor={Seppo Virtanen},
Pages={335--367},
Publisher={Hershey: IGI Global},
doi={10.4018/978-1-4666-6034-2.ch014},
Year={2014},
}

Downloads

No Downloads available for this publication

Permalink

https://cfaed.tu-dresden.de/publications?pubId=448

