Publications – Dr. Kostas Siozios

Selected Publications

— Approximate Computing
— Accelerate Computing (FPGA and GPU)
— Reconfigurable Architectures, Platforms and Tools
— 3D Integration
— Fault-Tolerance and Reliability Improvement
— CAD Algorithms
— Platform Virtualization
— Network-on-Chip (NoC)
— Energy Harvesting and Power Optimization
— Internet-of-Things (IoT) & CyberPhysical Systems (CPS)

“IoT for Smart Grids: Design Challenges and Paradigms”, Editors: K. Siozios, D. Anagnostos, D. Soudris and E. Kosmatopoulos, Springer International Publishing, 2019

This book explains the fundamentals of control theory for Internet of Things (IoT) systems and smart grids and its applications. It discusses the challenges imposed by large-scale systems, and describes the current and future trends and challenges in decision-making for IoT in detail, showing the ongoing industrial and academic research in the field of smart grid domain applications. It presents step-by-step design guidelines for the modeling, design, customisation and calibration of IoT systems applied to smart grids, in which the challenges increase with each system’s increasing complexity. It also provides solutions and detailed examples to demonstrate how to use the techniques to overcome these challenges, as well as other problems related to decision-making for successful implementation. Further, it analyses the features of decision-making, such as low complexity and fault tolerance, and uses open-source and publicly available software tools to show readers how they can design, implement and customise their own system control instantiations. This book is a valuable resource for power engineers and researchers, as it addresses the analysis and design of flexible decision-making mechanisms for smart grids. It is also of interest to students on courses related to control of large-scale systems, since it covers the use of state-of-the-art technology with examples and solutions in every chapter. Last but not least, it offers practical advice for professionals working with smart grids.

Link for the Book

“Designing 2D and 3D Network-on-Chip Architectures”, Editors: K. Tatas, K. Siozios, A. Jantsch and D. Soudris, Springer Publishing, 2014

This book covers key concepts in the design of 2D and 3D Network-on-Chip interconnect. It highlights design challenges and discusses fundamentals of NoC technology, including architectures, algorithms and tools. Coverage focuses on topology exploration for both 2D and 3D NoCs, routing algorithms, NoC router design, NoC-based system integration, verification and testing, and NoC reliability. Case studies are used to illuminate new design methodologies.

Link for the Book

“Cyber-Physical Systems: Decision Making Mechanisms and Applications”, Editors: K. Siozios, D. Soudris and E. Kosmatopoulos, River Publishers Series in Circuits and Systems, ISBN: 9788793609099, Nov. 2017

As systems continue to evolve they rely less on human decision-making and more on computational intelligence. This trend, in conjunction with the available technologies for providing advanced sensing, measurement, process control, and communication, leads towards the new field of the CyberPhysical System (CPS). CyberPhysical systems are expected to play a major role in the design and development of future engineering platforms with new capabilities that far exceed today’s levels of autonomy, functionality and usability. Although these systems exhibit remarkable characteristics, their design and implementation are challenging, as numerous (heterogeneous) components and services have to be appropriately modeled and simulated together. The problem of designing efficient CPS becomes far more challenging when the target system must also meet real-time constraints.

CyberPhysical Systems: Decision Making Mechanisms and Applications describes essential theory, recent research and large-scale use cases that address urgent challenges in CPS architectures. In particular, it includes chapters on:

Decision making for large scale CPS
Modeling of CPS with emphasis on the control mechanisms
Hardware/software implementation of the control mechanisms
Fault-tolerant and reliability issues for the control mechanisms
CyberPhysical use cases that incorporate challenging decision making

Link for the Book

“Προγραμματίζοντας στη Γλώσσα C με τη βοήθεια λυμένων παραδειγμάτων” (Programming in the C Language with the Help of Solved Examples), Author: Kostas Siozios

This book is addressed both to those who are just beginning their journey in programming and to those who already have some experience with other programming languages and want to go deeper into ANSI C. Through a combination of theoretical analysis of the statements covered in each section, clear examples of their use, and practical exercises, the reader has the opportunity to become familiar with the basic principles of the C language and to develop their skills gradually. Each chapter has been carefully designed to advance the reader's knowledge step by step, explaining complex concepts in a way that is accessible and easy to follow. The approach taken in the book is practical and analytical: it encourages active learning through exercises and challenges. The goal is for the reader to become capable of analyzing problems, designing solutions, and implementing their ideas effectively through programming.

Link for the Book

Approximate Computing

V. Mrazek, A. Kokkinis, P. Papanikolaou, Z. Vasicek, K. Siozios, G. Tzimpragos, M. Tahoori and G. Zervakis, “Evolutionary Approximation of Ternary Neurons for On-sensor Printed Neural Networks”, ACM/IEEE International Conference on Computer-Aided Design (ICCAD), Oct. 2024, New Jersey, USA (download local copy from here).

Printed electronics offer ultra-low manufacturing costs and the potential for on-demand fabrication of flexible hardware. However, significant intrinsic constraints stemming from their large feature sizes and low integration density pose design challenges that hinder their practicality. In this work, we conduct a holistic exploration of printed neural network accelerators, starting from the analog-to-digital interface – a major area and power sink for sensor processing applications – and extending to networks of ternary neurons and their implementation. We propose bespoke ternary neural networks using approximate popcount and popcount-compare units, developed through a multi-phase evolutionary optimization approach and interfaced with sensors via customizable analog-to-binary converters. Our evaluation results show that the presented designs outperform the state of the art, achieving at least 6x improvement in area and 19x in power. To our knowledge, they represent the first open-source digital printed neural network classifiers capable of operating with existing printed energy harvesters.
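To make the idea of a popcount-compare ternary neuron concrete, here is a minimal software sketch. It is not the authors' hardware design and contains no approximation; the function names, the firing rule and the `threshold` parameter are assumptions chosen for illustration. With binary inputs and weights restricted to -1, 0 and +1, the weighted sum reduces to comparing two popcounts.

```python
def popcount(bits):
    """Count the ones in an iterable of 0/1 values."""
    return sum(1 for b in bits if b)


def ternary_neuron(inputs, weights, threshold=0):
    """Return 1 when the ternary-weighted sum of binary inputs reaches threshold.

    inputs:  sequence of 0/1 values (for example, converter outputs)
    weights: sequence of -1, 0 or +1, one per input
    threshold: hypothetical firing threshold (an assumption of this sketch)
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Inputs attached to +1 and -1 weights are counted separately;
    # inputs with weight 0 are simply not connected.
    positive = popcount(x for x, w in zip(inputs, weights) if w == 1)
    negative = popcount(x for x, w in zip(inputs, weights) if w == -1)
    # Popcount-compare: no multiplier or signed adder is needed.
    return 1 if positive >= negative + threshold else 0
```

For inputs `[1, 0, 1, 1]` and weights `[1, -1, 0, -1]`, one active input has weight +1 and one has weight -1, so the neuron fires at threshold 0 but not at threshold 1.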

K. Balaskas, H. Khdr, M. Sikal, F. Kreß, K. Siozios, J. Becker, J. Henkel, “Heterogeneous Accelerator Design for Multi-DNN Workloads via Heuristic Optimization” in IEEE Embedded Systems Letters, 2024 (accepted for publication).

Emerging AI-enabled applications such as augmented and virtual reality (AR/VR) leverage multiple deep neural network (DNN) models for various sub-tasks such as object detection, image segmentation, eye-tracking, speech recognition, and so on. Because of the diversity of the sub-tasks, the layers within and across the DNN models are highly heterogeneous in operation and shape. Diverse layer operations and shapes are major challenges for a fixed dataflow accelerator (FDA) that employs a fixed dataflow strategy on a single DNN accelerator substrate since each layer prefers different dataflows (computation order and parallelization) and tile sizes. Reconfigurable DNN accelerators (RDAs) have been proposed to adapt their dataflows to diverse layers to address the challenge. However, the dataflow flexibility in RDAs is enabled at the cost of expensive hardware structures (switches, interconnects, controller, etc.) and requires per-layer reconfiguration, which introduces considerable energy costs. Alternatively, this work proposes a new class of accelerators, heterogeneous dataflow accelerators (HDAs), which deploy multiple accelerator substrates (i.e., sub-accelerators), each supporting a different dataflow. HDAs enable coarser-grained dataflow flexibility than RDAs with higher energy efficiency and lower area cost comparable to FDAs. To exploit such benefits, hardware resource partitioning across sub-accelerators and layer execution schedule need to be carefully optimized. Therefore, we also present Herald, a framework for co-optimizing hardware partitioning and layer scheduling. Using Herald on a suite of AR/VR and MLPerf workloads, we identify a promising HDA architecture, Maelstrom, which demonstrates 65.3% lower latency and 5.0% lower energy compared to the best fixed dataflow accelerators and 22.0% lower energy at the cost of 20.7% higher latency compared to a state-of-the-art reconfigurable DNN accelerator (RDA). 
The results suggest that HDA is an alternative class of Pareto-optimal accelerators to RDA with strength in energy, which can be a better choice than RDAs depending on the use cases.

F. Afentaki, G. Saglam, A. Kokkinis, K. Siozios, G. Zervakis and M.B. Tahoori, “Bespoke Approximation of Multiplication-Accumulation and Activation Targeting Printed Multilayer Perceptrons” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1-9, 2023, San Francisco, USA (download local copy from here).

Printed Electronics (PE) feature distinct and remarkable characteristics that make them a prominent technology for achieving true ubiquitous computing. This is particularly relevant in application domains that require conformal and ultra-low cost solutions, which have experienced limited penetration of computing until now. Unlike silicon-based technologies, PE offer unparalleled features such as non-recurring engineering costs, ultra-low manufacturing cost, and on-demand fabrication of conformal, flexible, non-toxic, and stretchable hardware. However, PE face certain limitations due to their large feature sizes, which impede the realization of complex circuits, such as machine learning classifiers. In this work, we address these limitations by leveraging the principles of Approximate Computing and Bespoke (fully-customized) design. We propose an automated framework for designing ultra-low power Multilayer Perceptron (MLP) classifiers which employs, for the first time, a holistic approach to approximate all functions of the MLP’s neurons: multiplication, accumulation, and activation. Through comprehensive evaluation across MLPs of varying size, our framework demonstrates the ability to enable battery-powered operation of even the most intricate MLP architecture examined, significantly surpassing the current state of the art.

K. Balaskas, A. Karatzas, C. Sad, K. Siozios, I. Anagnostopoulos, G. Zervakis, J. Henkel, “Hardware-Aware DNN Compression via Diverse Pruning and Mixed-Precision Quantization” in IEEE Transactions on Emerging Topics in Computing, vol. , no. 01, pp. 1-14, 5555. doi: 10.1109/TETC.2023.3346944 (download local copy from here).

Deep Neural Networks (DNNs) have shown significant advantages in a wide variety of domains. However, DNNs are becoming computationally intensive and energy hungry at an exponential pace, while at the same time, there is a vast demand for running sophisticated DNN-based services on resource-constrained embedded devices. In this paper, we target energy-efficient inference on embedded DNN accelerators. To that end, we propose an automated framework to compress DNNs in a hardware-aware manner by jointly employing pruning and quantization. We explore, for the first time, per-layer fine- and coarse-grained pruning, in the same DNN architecture, in addition to low bit-width mixed-precision quantization for weights and activations. Reinforcement Learning (RL) is used to explore the associated design space and identify the pruning-quantization configuration so that the energy consumption is minimized whilst the prediction accuracy loss is retained at acceptable levels. Using our novel composite RL agent we are able to extract energy-efficient solutions without requiring retraining and/or fine-tuning. Our extensive experimental evaluation over widely used DNNs and the CIFAR-10/100 and ImageNet datasets demonstrates that our framework achieves 39% average energy reduction for 1.7% average accuracy loss and significantly outperforms state-of-the-art approaches.
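The two compression primitives named in the abstract can be sketched on a flat list of weights as follows. This is an illustrative sketch only: it assumes fine-grained magnitude pruning and uniform symmetric quantization, and it does not reproduce the paper's RL agent, its coarse-grained pruning, or its per-layer configuration search. Both function names are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by increasing absolute value; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_symmetric(weights, bits):
    """Snap weights to a signed `bits`-bit uniform grid and map back to floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1      # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max_abs / levels          # width of one quantization step
    return [round(w / scale) * scale for w in weights]
```

A mixed-precision scheme would call `quantize_symmetric` with a different `bits` value per layer; choosing those values together with the sparsity levels is the search problem that the paper addresses with reinforcement learning.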

D. Danopoulos, G. Zervakis, K. Siozios, D. Soudris and J. Henkel, “AdaPT: Fast Emulation of Approximate DNN Accelerators in PyTorch”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 42, No. 6, pp. 2074-2078, June 2023 (download local copy from here).

Current state-of-the-art employs approximate multipliers to address the highly increased power demands of deep neural network (DNN) accelerators. However, evaluating the accuracy of approximate DNNs is cumbersome due to the lack of adequate support for approximate arithmetic in DNN frameworks. We address this inefficiency by presenting AdaPT, a fast emulation framework that extends PyTorch to support approximate inference as well as approximation-aware retraining. AdaPT can be seamlessly deployed and is compatible with most DNNs. We evaluate the framework on several DNN models and application fields, including CNNs, LSTMs, and GANs for a number of approximate multipliers with distinct bitwidth values. The results show substantial error recovery from approximate retraining and reduced inference time up to 53.9x with respect to the baseline approximate implementation.
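One common way to emulate an approximate multiplier in software is to precompute its output for every operand pair and replace each multiplication with a table lookup. The sketch below illustrates that general idea in plain Python; it is not AdaPT's API, and the truncating multiplier is a hypothetical stand-in for a real approximate circuit.

```python
def truncated_multiply(a, b, dropped_bits=1):
    """Hypothetical approximate multiplier: clear the low bits of both operands."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


def build_lut(bits=4, multiply=truncated_multiply):
    """Precompute the approximate product of every pair of unsigned operands."""
    size = 1 << bits
    return [[multiply(a, b) for b in range(size)] for a in range(size)]


def approx_dot(xs, ws, lut):
    """Dot product in which each multiplication is a table lookup."""
    return sum(lut[x][w] for x, w in zip(xs, ws))


lut = build_lut(bits=4)
exact = sum(x * w for x, w in zip([3, 4], [5, 6]))   # 3*5 + 4*6
approx = approx_dot([3, 4], [5, 6], lut)             # 2*4 + 4*6
```

Swapping in a different multiplier only requires rebuilding the table, which is why a lookup-based approach suits comparisons across several approximate designs and bit widths.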

K. Balaskas, F. Klemme, G. Zervakis, K. Siozios, H. Amrouch, J. Henkel, “Variability-Aware Approximate Circuit Synthesis via Genetic Optimization”, in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 10, pp. 4141-4153, Oct. 2022, doi: 10.1109/TCSI.2022.3183858 (download local copy from here).

One of the major barriers that CMOS devices face at nanometer scale is increasing parameter variation due to manufacturing imperfections. Process variations severely inhibit the reliable operation of circuits, as the operational frequency at the nominal process corner is insufficient to suppress timing violations across the entire variability spectrum. To avoid variability-induced timing errors, previous efforts impose pessimistic and performance-degrading timing guardbands atop the operating frequency. In this work, we employ approximate computing principles and propose a circuit-agnostic automated framework for generating variability-aware approximate circuits that eliminate process-induced timing guardbands. Variability effects are accurately portrayed with the creation of variation-aware standard cell libraries, fully compatible with standard EDA tools. The underlying transistors are fully calibrated against industrial measurements from Intel 14nm FinFET in which both electrical characteristics of transistors and variability effects are accurately captured. In this work, we explore the design space of approximate variability-aware designs to automatically generate circuits of reduced variability and increased performance without the need for timing guardbands. Experimental results show that by introducing negligible functional error of merely 5.3*10^-3, our variability-aware approximate circuits can be reliably operated under process variations without sacrificing the application performance.

K. Balaskas, G. Zervakis, K. Siozios, M. B. Tahoori and J. Henkel, “Approximate Decision Trees For Machine Learning Classification on Tiny Printed Circuits,” 23rd International Symposium on Quality Electronic Design (ISQED), 2022, pp. 1-6, DOI: 10.1109/ISQED54688.2022.9806213 (download local copy from here).

Although Printed Electronics (PE) cannot compete with silicon-based systems in conventional evaluation metrics, e.g., integration density, area and performance, PE offers attractive properties such as on-demand ultra-low-cost fabrication, flexibility and non-toxicity. As a result, it targets application domains that are untouchable by lithography-based silicon electronics and thus have not yet seen much proliferation of computing. However, despite the attractive characteristics of PE, the large feature sizes in PE prohibit the realization of complex printed circuits, such as Machine Learning (ML) classifiers. In this work, we exploit the hardware-friendly nature of Decision Trees for machine learning classification and leverage the hardware-efficiency of the approximate design in order to generate approximate ML classifiers that are suitable for tiny, ultra-resource constrained, and battery-powered printed applications.

K. Balaskas, G. Zervakis, H. Amrouch, J. Henkel and K. Siozios, “Automated Design Approximation to Overcome Circuit Aging”, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 68, No. 11, pp. 4710-4721, Nov. 2021 (download local copy from here).

Transistor aging phenomena manifest themselves as degradations in the main electrical characteristics of transistors. Over time, they result in a significant increase of cell propagation delay, leading to errors due to timing violations, since the operating frequency becomes unsustainable as the circuit ages. Conventional techniques employ timing guardbands to mitigate aging-induced delay increase, which leads to considerable performance losses from the beginning of the circuit’s lifetime. Leveraging the inherent error resilience of a vast number of application domains, approximate computing was recently introduced as an aging mitigation mechanism. In this work, we present the first automated framework for generating aging-aware approximate circuits. Our framework, by applying directed gate-level netlist approximation, induces a small functional error and recovers the delay degradation due to aging. As a result, our optimized circuits eliminate aging-induced timing errors. Experimental evaluation over a variety of arithmetic circuits and image processing benchmarks demonstrates that for an average error of merely 5×10^-3, our framework completely eliminates aging-induced timing guardbands. Compared to the respective baseline circuits without timing guardbands (i.e., iso-performance evaluation), the error of the circuits generated by our framework is 1208x smaller.

G. Zervakis, K. Koliogeorgi, D. Anagnostos, N. Zompakis and K. Siozios, “VADER: Voltage-Driven Netlist Pruning for Cross-Layer Approximate Arithmetic Circuits”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 27, no. 6, pp. 1460-1464, June 2019 (download local copy from here).

Leveraging the inherent error resilience of a large number of application domains, approximate computing is established as an efficient design alternative to improve their energy profile. In this brief, we design energy optimal cross-layer approximate arithmetic circuits by enabling the efficient application of voltage overscaling (VOS). Departing from the conventional approaches followed today, we introduce the voltage-driven functional approximation and present the VoltAge-Driven nEtlist pRuning (VADER) framework. VADER is an automated synthesis framework that can be seamlessly integrated in any hardware design flow and implements a voltage-driven gate-level netlist pruning. Experimental evaluation shows that VADER reduces the error of the VOS application by 52% on average and delivers on average designs with 34% higher energy savings compared to state-of-the-art approximate adders and multipliers.

K. Koliogeorgi, G. Zervakis, D. Anagnostos, N. Zompakis and K. Siozios, “Optimizing SVM Classifier Through Approximate and High Level Synthesis Techniques”, International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4, Thessaloniki, Greece, 2019 (download local copy from here).

Leveraging the inherent error resilience of a large number of application domains, approximate computing is established as an efficient design alternative to improve their performance. The Support Vector Machine (SVM) classifier is a widely adopted machine learning algorithm that exhibits high error resilience and requires real-time execution. In this paper, we propose a highly optimized approximate SVM FPGA accelerator, utilizing arrhythmia detection in ECG signals as a case study. The proposed methodology applies two algorithmic approximation techniques, i.e., precision scaling and loop perforation, implemented in a coordinated manner in High-Level Synthesis (HLS). As a second level of performance enhancement, an exploration of the built-in optimization techniques of the HLS tool, with respect to the applied approximation, is also performed. Experimental evaluation shows that the proposed approximate SVM classifier attains a 15× speedup, while maintaining an accuracy of 96.7%.
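The two approximation techniques named above can be illustrated on the decision function of a linear SVM. This is a software sketch under stated assumptions, not the paper's HLS accelerator: the linear kernel, the `skip` and `frac_bits` parameters, and the rescaling of the perforated sum are all choices made for illustration.

```python
def scale_precision(value, frac_bits):
    """Round value to a fixed-point grid with `frac_bits` fractional bits."""
    step = 2 ** frac_bits
    return round(value * step) / step


def svm_decision(x, w, bias, skip=1, frac_bits=None):
    """Linear SVM decision value with optional approximations.

    skip:      visit every `skip`-th feature (loop perforation); 1 is exact.
    frac_bits: if set, round operands to that many fractional bits
               (precision scaling).
    """
    visited = range(0, len(x), skip)
    total = 0.0
    for i in visited:                       # perforated loop
        xi, wi = x[i], w[i]
        if frac_bits is not None:
            xi = scale_precision(xi, frac_bits)
            wi = scale_precision(wi, frac_bits)
        total += xi * wi
    if len(visited):
        # Rescale so the partial sum estimates the sum over all features.
        total *= len(x) / len(visited)
    return total + bias
```

Skipping iterations shortens the loop, and narrower operands shrink the arithmetic units; both trade accuracy for speed, so the settings need to be validated against the classification accuracy the application can tolerate.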

N. Zompakis, D. Anagnostos, K. Koliogeorgi, G. Zervakis, and K. Siozios, “A Design Flow Framework for Fully-Connected Neural Networks Rapid Prototyping”, International Conference on Omni-Layer Intelligent Systems (COINS), pp. 44-49, Crete, Greece, 2019 (download local copy from here).

The current work deploys a framework for rapid prototyping of Fully-Connected Neural Networks (FCNs). The scope is to provide an automatic design flow that generates template-based VHDL code considering the accuracy, the resource utilization and the design complexity. More precisely, the deployed tool incorporates hardware optimizations in the implementation of the multiplications, the activation function and the definition of the fixed-point types, providing user-defined configurations through a GUI. The FCNs of two applications (Alexnet and Lenet) were implemented to evaluate our approach. The results are promising and demonstrate the design flexibility of our framework, which generates optimized code exceeding 10K lines for each hardware instance within a few hours, while keeping latency low (it does not exceed 400 cycles for our applications).

Accelerate Computing (FPGA and GPU)

A. Ferikoglou, A. Kokkinis, D. Danopoulos, I. Oroutzoglou, A. Nanos, S. Karanastasis, M. Sipos, J. Ghotbi, J. J. Vegas Olmos, D. Masouros and K. Siozios, “The SERRANO platform: Stepping towards seamless application development & deployment in the heterogeneous edge-cloud continuum”, Design, Automation and Test in Europe (DATE), April 17-21, 2023 (download local copy from here).

The need for real-time analytics and faster decision-making mechanisms has led to the adoption of hardware accelerators such as GPUs and FPGAs within the edge cloud computing continuum. However, their programmability and lack of orchestration mechanisms for seamless deployment make them difficult to use efficiently. We address these challenges by presenting SERRANO, a project for transparent application deployment in a secure, accelerated, and cognitive cloud continuum. In this work, we introduce the SERRANO platform and its software, orchestration, and deployment services, focusing on its methods for automated GPU/FPGA acceleration and efficient, isolated, and secure deployments. By evaluating these services against representative use cases, we highlight SERRANO’s ability to simplify the development and deployment process without sacrificing performance.

A. Kokkinis, A. Ferikoglou, I. Oroutzoglou, D. Danopoulos, D. Masouros, and K. Siozios, “HW/SW Acceleration of Multiple Workloads Within the SERRANO’s Computing Continuum”, Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), Vol. 13511, pp. 394-405, 2022 (download local copy from here).

Nowadays, we witness emerging cloud technologies and a growth of cloud computing services that are used for numerous applications with diverse requirements. Despite the technological innovations in the field of cloud computing, a power-efficient and automatic deployment of different applications in a multi-cloud environment is still a major challenge. SERRANO aims to take important steps in providing a transparent way of deploying applications in the Edge-Cloud-HPC computing continuum, by providing an abstraction layer that automates the process of application deployment across the various computing platforms and realizing an intent-based paradigm of operating federated infrastructures. In this paper, the acceleration process of different algorithms in the edge and cloud infrastructure of the SERRANO platform is described. Specifically, we showcase the benefits of HW and SW acceleration in four different algorithms from three use-case scenarios. The achieved results show that an increase in application performance ranging from 7x and 6.6x up to 229x and 113.14x for the cloud and edge devices respectively can be achieved when the evaluated workloads are executed in the SERRANO infrastructure.

I. Oroutzoglou, A. Kokkinis, A. Ferikoglou, D. Danopoulos, D. Masouros, K. Siozios, “Optimizing Savitzky-Golay Filter on GPU and FPGA Accelerators for Financial Applications”, International Conference on Modern Circuits and Systems Technologies (MOCAST), 2022, pp. 1-4 (download local copy from here).

Over the last few years, computational power and intelligence have become more and more necessary in the finance sector. More specifically, computational finance has become a very popular topic for both academia and industry, where numerous published works from this field, especially in investment and risk management, showcase the effects of these technological advancements. At the same time, the ever-increasing computational demands have led to the deployment of various accelerators in order to meet both latency and power constraints for financial applications that vary from special-purpose ones, made by economists, to general-purpose Digital Signal Processing (DSP) applied to financial time series. One of the most widely used applications belonging to the second category is the Savitzky-Golay algorithm, a filter used for smoothing time-series data. In this work, we propose a mechanism that automatically creates different accelerated Savitzky-Golay filters for GPUs and FPGAs, based on a set of pre-accelerated templates. By evaluating the provided templates with a set of real use-case parameters, a speedup of 33.5x on the NVIDIA T4 GPU and 21.9x on the Alveo U50 FPGA is achieved compared with an Intel Xeon Gold 5218R CPU as a baseline, while achieving a decrease in power consumption of 89% and 70% respectively, revealing a real latency-power trade-off between the two accelerators.
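For readers unfamiliar with the filter, a Savitzky-Golay smoother fits a low-order polynomial to each sliding window by least squares, which reduces to a fixed convolution. The sketch below is a plain CPU reference for the standard 5-point quadratic case and is not one of the paper's accelerated templates; leaving the edge samples unfiltered is a simplification made here.

```python
# Standard convolution coefficients for a 5-point window and a
# quadratic (or cubic) least-squares fit.
SG_5_QUADRATIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth(samples, coeffs=SG_5_QUADRATIC):
    """Smooth a sequence with a Savitzky-Golay convolution."""
    half = len(coeffs) // 2
    out = list(samples)       # edge samples are left unfiltered in this sketch
    for i in range(half, len(samples) - half):
        window = samples[i - half:i + half + 1]
        out[i] = sum(c * v for c, v in zip(coeffs, window))
    return out
```

Each output sample depends only on its own window, so the loop iterations are independent of one another. That independence is what makes the filter a natural fit for data-parallel hardware such as GPUs and FPGAs.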

A. Kretsis, P. Kokkinos, P. Soumplis, J.J. Vegas-Olmos, M. Feher, M. Sipos, D. Lucani, D. Khabi, D. Masouros, K. Siozios, P. Bourgos, S. Tsekeridou, F. Zyulkyarov, E. Karanastasis, E. Chondrogiannis, V. Andronikou, A. Fernandez-Gomez, S. Panica, G. Iuhasz, A. Nanos, C. Chalios, M. Varvarigos, “SERRANO: Transparent Application Deployment in a Secure, Accelerated and Cognitive Cloud Continuum,” 2021 IEEE International Mediterranean Conference on Communications and Networking (MeditCom), 2021, pp. 55-60 (download local copy from here).

We are witnessing a wave of emerging cloud computing technologies and services that empower advanced applications from different vertical sectors, with diverse requirements. These trends give rise to a number of fundamental challenges that relate to the application deployment, the support of heterogeneous infrastructures and the provided security. In this setting, the SERRANO project steps in to define an intent-based paradigm of operating federated infrastructures consisting of edge, cloud and HPC resources, which will be realized through the SERRANO platform. Applications’ high-level requirements will be translated to infrastructure-aware configuration parameters. SERRANO orchestration will then provide adaptive and efficient access to secure by design and accelerated resources. In this way, SERRANO will support cloud-native applications and services towards the cloud continuum.

A. Kokkinis, A. Ferikoglou, D. Danopoulos, D. Masouros and K. Siozios, “Leveraging HW approximation for exploiting performance-energy trade-offs within the edge-cloud computing continuum”, 16th Workshop on Virtualization in High-Performance Cloud Computing, July, 2021 (download local copy from here).

Today, the need for real-time analytics and faster decision-making mechanisms has led to the adoption of hardware accelerators, such as GPUs and FPGAs, within the edge-cloud computing continuum. Moreover, the need for energy-efficient yet high-performance solutions both in the edge and cloud has led to the rise of approximate computing as a promising paradigm, where “acceptable errors” are introduced to error-tolerant applications, thus providing significant power savings. In this work, we leverage approximate computing for exploiting performance-energy trade-offs of FPGA-accelerated kernels with faster design time through an extended source-to-source HLS compiler based on the Xilinx Vitis framework. We introduce a novel programming interface that operates at a high level of abstraction, thus enabling automatic optimizations to the existing HLS design flow and supporting both embedded and cloud devices through a common API. We evaluate our approach over three different applications from the DSP and machine learning domains and show that a decrease of 27% and 28% in power consumption, 61% and 69% in DSP utilization and 7% in clock period is achieved for Alveo U200 and ZCU104 FPGA platforms, on average.

A. Ferikoglou, I. Oroutzoglou, A. Kokkinis, D. Danopoulos, D. Masouros, E. Chondrogiannis, A. Gomez, A. Kretsis, P. Kokkinos, E. Varvarigos, K. Siozios, “Towards efficient HW acceleration in edge-cloud infrastructures: The SERRANO approach”, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 5-7 July 2021 (download local copy from here).

Nowadays, we witness an ever-increased number of applications deployed over Edge, Cloud and HPC infrastructures. This rapid explosion of computing devices across the computing continuum poses new challenges in terms of providing a power-efficient, secure and automatic way for deployment of different applications in such heterogeneous environments. Moreover, the need for performance efficient deployments within such environments, has introduced the presence of hardware accelerators over the entire computing stack. In this paper, we present SERRANO’s approach for providing efficient HW accelerated deployments over edge-cloud infrastructures. First, we give a brief overview of the SERRANO project, describing its goals and objectives, providing a high-level overview of SERRANO’s platform architecture and presenting the use-cases involved. Then, we describe SERRANO’s approach for providing efficient HW accelerators by identifying trade-offs between performance, accuracy and power consumption and also demonstrate how SERRANO aims to automate the optimization process through machine learning models in order to construct a generic optimization heuristic to fine-tune programs for both GPU and FPGA accelerators. Through some illustrative examples, we showcase that by applying approximation and optimization techniques, we are able to achieve an average decrease of 28% in power consumption for FPGA devices and trade-off between performance and power usage for GPUs, achieving up to 1.21x speedups and 8% power improvement.

E. Ruggeri, C. Vagionas, N. Karagiorgos, A. Tsakiridis, Y. Leiba, G. Kalfas, A. Mesodiakaki, K. Siozios, A. Miliou, N. Pleros, “Multi-user IFoF uplink transmission over a 32-element 60GHz phased array antenna enabling both Frequency and Spatial Division Multiplexing”, 45th European Conference on Optical Communication (ECOC 2019), Dublin, Ireland, 2019, pp. 1-4, doi: 10.1049/cp.2019.0780 (download local copy from here).

We experimentally present the A-RoF/IFoF uplink transmission of three 60GHz terminals with 200Mb/s QPSK through a 32-element antenna with beamsteering up to a 60° angle and 10km fiber, comparing for the first time the frequency and spatial division multiplexing for multi-user uplink for 5G mmWave networks.

G. Lentaris, I. Stamoulias, D. Diamantopoulos, K. Maragos, K. Siozios, D. Soudris, M. Aviles Rodrigalvarez, M. Lourakis, X. Zabulis, I. Kostavelis, L. Nalpantidis, E. Boukas and A. Gasteratos, “SPARTAN/SEXTANT/COMPASS: Advancing Space Rover Vision via Reconfigurable Platforms“, International Workshop on Applied Reconfigurable Computing (ARC), pp 475-486, Apr. 2015, Bochum, Germany (download local copy from here).

Targeting enhanced navigational speed and autonomy for the space exploration rovers, researchers are gradually turning to reconfigurable computing and FPGAs. High-density space-grade FPGAs will enable the acceleration of high-complexity computer vision algorithms for improving the localization and mapping functions of the future Mars rovers. In the projects SPARTAN/SEXTANT/COMPASS of the European Space Agency, we study the potential use of FPGAs for implementing a variety of stereo correspondence, feature extraction, and visual odometry algorithms, all with distinct cost-performance tradeoffs. The most efficient of the developed accelerators will assist the slow space-grade CPU in completing the visual tasks of the rover faster, by one order of magnitude, and thus, will allow the future missions to visit larger areas on Mars. Our work is based on a custom HW/SW co-design methodology, parallel architecture design, optimization techniques, tradeoff analysis, and system tuning with Martian-like scenarios.

P. Figuli, C. Tradowsky, J. A. Lucio Martinez, H. Sidiropoulos, K. Siozios, H. Stenschke, D. Soudris, and J. Becker, “A Novel Concept for Adaptive Signal Processing on Reconfigurable Hardware“, International Workshop on Applied Reconfigurable Computing (ARC), pp. 311-320, April 2015, Bochum, Germany (download local copy from here).

Today, digital signal processing systems for applications like audio or video production are restricted as they do not exhaust the possibilities given by modern hardware. Reconfigurable hardware exploits a huge degree of parallelism and provides flexibility at an affordable energy budget, thus becoming a competitive alternative for high performance Digital Signal Processing (DSP) applications, previously dominated by general purpose processing cores and Application-Specific Integrated Circuits (ASICs). This paper describes the design and evaluation of a novel concept for adaptive signal processing on reconfigurable hardware by using an adaptive reverberation algorithm targeting real time streams. Novel solutions were adopted in several critical parts of the signal processing chain in order to achieve a high level of accuracy under real time constraints. Experimental results show the efficiency of the introduced implementation on a Virtex-7 FPGA, as we can provide reality accurate reverberation with ultra low latency of about 20.8μsec.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos and D. Soudris, “Hardware Accelerated Rician Denoise Algorithm for High Performance Magnetic Resonance Imaging“, International Conference on Wireless Mobile Communication and Healthcare (MOBIHEALTH), pp. 222-225, Nov. 2014, Athens, Greece (download local copy from here).

Rician denoising is a mandatory task of Magnetic Resonance Imaging (MRI), as it enables higher-quality image processing, which is crucial for correct diagnosis. However, denoising is a slow task, especially because of the increased image resolution and the need for high image clarity. A solution towards this need is the implementation of the Rician denoise algorithm in hardware. In this paper, we propose a hardware implementation of Rician denoise, which processes the MR image in segments in a pipelined manner, while avoiding further processing on already denoised pixels of the image. Using a synthetic MRI scan separated into 16 segments, the proposed implementation achieves a speedup of 6.8× with comparable image quality, as compared to a software-only approach running on Intel Core2Duo.

I. Kostavelis, L. Nalpantidis, E. Boukas, M. Aviles Rodrigalvarez, I. Stamoulias, G. Lentaris, D. Diamantopoulos, K. Siozios, D. Soudris, A. Gasteratos, “SPARTAN: Developing a Vision System for Future Autonomous Space Exploration Robots“, Journal of Field Robotics, pp. 1–34, Wiley Periodicals, Oct. 2013 (download local copy from here).

Mars exploration is expected to remain a focus of the scientific community in the years to come. A Mars rover should be highly autonomous because communication between the rover and the terrestrial operation center is difficult, and because the vehicle should spend as much of its traverse time as possible moving. Autonomous behavior of the rover implies that the vision system provides both a wide view to enable navigation and three-dimensional (3D) reconstruction, and at the same time a close-up view ensuring safety and providing reliable odometry data. The European Space Agency funded project “SPAring Robotics Technologies for Autonomous Navigation” (SPARTAN) aimed to develop an efficient vision system to cover all such aspects of autonomous exploratory rovers. This paper presents the development of such a system, starting from the requirements up to the testing of the working prototype. The vision system was designed with the intention of being efficient, low-cost, and accurate and to be implemented using custom-designed vectorial processing by means of field programmable gate arrays (FPGAs). A prototype of the complete vision system was developed, mounted on a basic mobile robot platform, and tested. The results on both real-world Mars-like and long-range simulated data are presented in terms of 3D reconstruction and visual odometry accuracy, as well as execution speed. The developed system is found to fulfill the set requirements.

G. Lentaris, D. Diamantopoulos, G. Stamoulias, K. Siozios, D. Soudris and M. Avilés Rodrigálvarez, “FPGA-based Path-planning of High Mobility Rover for Future Planetary Missions”, International Conference on Electronics, Circuits, and Systems (ICECS), pp. 85-88, Dec. 2012, Sevilla, Spain (download local copy from here).

Work in navigational autonomy for planetary exploration rovers has focused on achieving safe and efficient path-planning through unknown, rocky terrain. Computer vision is playing an important role in increasing autonomy of both spacecraft and robotic vehicles; however, due to their increased computational complexity, computer vision algorithms are rarely implemented onto conventional computing systems. This problem becomes far more severe whenever real-time constraints also have to be considered. This paper addresses progress on rover path planning for high mobility Mars rover missions. More specifically, in the SPARTAN project, a demonstrator is being developed that meets ESA’s specifications for the problem of rover Localization. This problem accounts for the generation of location estimates. This process involves four distinct applications: Imaging, 3D-Reconstruction, Visual Odometry (VO) and Visual Simultaneous Localization and Map reconstruction (SLAM). To support these functions, we developed a novel HW/SW co-design methodology that extracts parallelism at a higher algorithmic level, and thus allows the simultaneous execution of computationally intensive kernels on an FPGA device, while the control flow is maintained by a low-performance CPU. Experimental results show that our implementation achieves the goals of the SPARTAN project.

G. Lentaris, D. Diamantopoulos, K. Siozios, D. Soudris and M. Aviles, “Hardware Implementation of Stereo Correspondence Algorithm for the ExoMars Mission“, International Conference on Field-Programmable Logic and Applications (FPL), pp. 667-670, Aug. 2012, Oslo, Norway (download local copy from here).

Computer vision algorithms exhibit increased complexity introducing significant implementation problems in conventional computing systems, especially whenever real-time constraints are imposed. This paper describes the ESA compatible VHDL development of a stereo correspondence algorithm for rover navigation in the SPARTAN system. The design is implemented on a Xilinx Virtex-6 FPGA and the evaluation results validate the efficiency of the applied methodology by showing real-time performance with minimal hardware utilization.

D. Diamantopoulos, K. Siozios, G. Lentaris, D. Soudris and M. Aviles Rodrigalvarez, “SPARTAN Project: On Profiling Computer Vision Algorithms for Rover Navigation“, NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 174-181, June 2012, Erlangen, Germany (download local copy from here).

The exploration of Mars is one of the main goals for NASA/ESA, as confirmed by past and recent activities. One of the most challenging tasks for these missions is the autonomous robot’s navigation. Existing approaches incorporate vision-based solutions and exhibit remarkable results in terms of accuracy. Unfortunately, these approaches mostly rely on computationally and memory-intensive algorithms running at software level. In this paper, we introduce a novel methodology for efficient implementation of computer vision algorithms for the SPARTAN project (ExoMars 2018 mission). Experimental results prove the effectiveness of the introduced solution, as compared to a software-based implementation.

M. Aviles, K. Siozios, D. Diamantopoulos, L. Nalpantidis, I. Kostavelis, E. Boukas, D. Soudris and A. Gasteratos, “A Co-design Methodology for Implementing Computer Vision Algorithms for Rover Navigation onto Reconfigurable Hardware“, Workshop on Computer Vision on Low-Power Reconfigurable Architectures at FPL, Sept. 2011, Chania, Greece (download local copy from here).

Vision-based robotics applications have been widely studied in recent years. However, the solutions proposed so far have mostly targeted the software level. The SPARTAN project focuses on the tight and optimal implementation of computer vision algorithms targeting rover navigation. For evaluation purposes, these algorithms will be implemented with a co-design methodology onto a Virtex-6 FPGA device.

K. Siozios, D. Diamantopoulos, I. Kostavelis, E. Boukas, L. Nalpantidis, D. Soudris, A. Gasteratos, M. Aviles and I. Anagnostopoulos, “SPARTAN Project: Efficient Implementation of Computer Vision Algorithms onto Reconfigurable Platform Targeting to Space Applications”, International Workshop on Reconfigurable Communication Centric Systems-on-Chip (ReCoSoC), pp. 1-9, Montpellier, France, June 2011 (download local copy from here).

Vision-based robotic applications exhibit increased computational complexity. This problem becomes even more important in mission-critical application domains. The SPARTAN project focuses on the tight and optimal implementation of computer vision algorithms targeting rover navigation for space applications. For evaluation purposes, these algorithms will be implemented with a co-design methodology onto a Virtex-6 FPGA device.

Reconfigurable Architectures, Platforms and Tools

A. Michailidis, T. Noulis and K. Siozios, “CMOS Noise Analysis and Simulation From Low Frequency and Baseband to RF and Millimeter Wave“, IEEE Access, vol. 11, pp. 39807-39823, 2023 (download local copy from here).

Generic guidelines for noise simulation of CMOS integrated circuits and advanced noise signal integrity analyses were addressed in this work. The provided noise simulation guidelines are referring to a wide application spectrum, from baseband designs to high frequency RF/mmWave applications. Linear, periodic state and non-linear circuits were used to categorize the noise simulation guidelines to the respective topology in which they can be applied. In addition, thermal, flicker and gate induced noise were addressed and classified with respect to the application field. Three test case vehicles were designed; each vehicle corresponds to a linear, periodic state or non-linear circuit topology. As a linear vehicle, an Operational Transconductance Amplifier (OTA) was designed; as a periodic state vehicle, a Voltage Controlled Oscillator (VCO) was designed; and as a non-linear vehicle, a DC-to-DC Boost Converter was designed. Small-signal and large-signal noise analyses were benchmarked in terms of simulation time, compatibility and obtained accuracy. As small-signal analyses, AC noise, PNOISE and ENVLP analyses were performed, whereas as large-signal analyses, direct time domain (TRAN), PSS and QPSS analyses were performed in the respective topologies. Furthermore, using as a victim the VCO test case circuit, noise interference due to mutual coupling (mixing interference) was thoroughly addressed and simulation compatibility check was presented. Finally, noise simulation guidelines were efficiently summarized and categorized with respect to the circuit type and the application field.

C. Sad, A. Michailidis, T. Noulis and K. Siozios, “A Hybrid GA/ML-based end-to-end Automated Methodology for Design Acceleration of Wireless Communications CMOS LNAs”, Electronics, Vol. 12, No. 11, pp. 2428, 2023 (download local copy from here).

A new methodology for the RF/mmWave analog design process, automation and acceleration, is presented in this work. The proposed framework was implemented so as to accelerate the design cycle of analog/RF circuits by creating a dataset in a fully automated manner and training a combination of machine learning models for the optimal design parameters’ prediction. Machine learning polynomial regression was adopted to accelerate the design process, predicting the optimal design parameters’ values while genetic algorithm optimization was exploited for the dataset creation automation. To evaluate the efficiency of the proposed methodology, the framework was implemented for the design of a common source Low-Noise-Amplifier, using a 65 nm CMOS process node. The proposed methodology successfully tackles the design cycle speed-up, automation, and acceleration, utilizing machine learning prediction for the design parameters and genetic algorithm for the dataset creation automation instead of the classical, simulation-based, standard design methodology. The provided experimental results have shown the effectiveness of the proposed hybrid approach, creating very precise RF matching networks for LNA designs and achieving >99% wave transmission efficiency while reaching >99% accuracy on the parameters’ prediction task.

H. Sidiropoulos, I. Koutras, D. Soudris and K. Siozios, “Algorithmic and memory optimizations on multiple application mapping onto FPGAs“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 146-153, Pythagorion, 2017 (download local copy from here).

Field Programmable Gate Arrays (FPGAs) offer a low power flexible accelerator alternative due to their inherent parallelism. Reprogrammability, although it is their key feature, is used almost exclusively at design time due to the constraints imposed by the modern CAD tools that require even days to run and tens of GB of RAM. In order to effectively utilize FPGAs at run time we propose a novel methodology and the supporting toolflow that enable efficient mapping of multiple applications onto heterogeneous FPGAs. With the use of a floorplanning step, memory optimizations and custom memory allocators, we alleviate the constraints imposed by CAD tools, and provide a proof of concept that application mapping onto FPGAs can be done at run time. Experimental results prove the efficiency of the introduced solution, as we achieve application’s mapping 40× faster on average compared to a state-of-art approach, without performance degradation and with 12× on average reduced memory usage.

K. Siozios, P. Figuli, H. Sidiropoulos, C. Tradowsky, K. Maragos, S. P. Delicia, D. Soudris and J. Becker, “TEAChER: TEach AdvanCEd Reconfigurable architectures and tools“, International Workshop on Applied Reconfigurable Computing (ARC), pp. 103-114, April 2015, Bochum, Germany (download local copy from here).

This paper presents an on-going collaboration project, named TEAChER for providing breakthrough knowledge to students and young researchers on reconfigurable computing and advanced digital systems. The project is intended to cover topics like architectures and capabilities of field-programmable gate arrays, languages for the specification, modeling, and synthesis of digital systems. Furthermore design methods, computer-aided design tools, reconfiguration techniques and practical applications are taught. The virtual laboratory enables the remote students to easily interact with a set of reconfigurable platforms in order to control experiments through the internet. By using the user-friendly interface, the remote user can change predefined system parameters and observe system response either in textual, or graphical format. In addition such a virtual laboratory includes a booking system, which enables remote users to conduct experiments in advance.

H. Sidiropoulos, K. Siozios, D. Soudris, “On supporting rapid exploration of memory hierarchies onto FPGAs“, Journal of Systems Architecture (JSA), Vol. 59, No. 2, pp. 78-90, Feb. 2013 (download local copy from here).

This paper introduces a novel methodology for enabling fast yet accurate exploration of memory organizations onto FPGA devices. The proposed methodology is software supported by a new open-source tool framework, named NAROUTO. This framework is the only publicly available solution for performing architecture-level exploration, as well as application mapping onto FPGA devices with different memory organizations, under a variety of design criteria (e.g. delay improvement, power optimization, area savings, etc.). Experimental results with a number of industrial oriented kernels prove the efficiency of the proposed solution, as compared to similar approaches, since it provides better manipulation of memory blocks, leading to architectures with higher performance in terms of area, power and delay.

H. Sidiropoulos, K. Siozios and D. Soudris, “A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs“, International Conference on Field Programmable Logic and Applications (FPL), pp. 238-243, Sept. 2011, Chania, Greece (download local copy from here).

This paper introduces a novel methodology for enabling rapid exploration of memory hierarchies onto FPGA devices. The methodology is software supported by a new open-source tool framework, named NAROUTO. Among others, the proposed framework enables critical tasks during architecture’s design, such as memory hierarchy and floor-planning. Furthermore, NAROUTO framework is the only available solution for power/energy evaluation of different memory organizations. Experimental results show that NAROUTO framework leads to significant area, power (about 82%) and performance (about 46%) improvements, as compared to existing solutions.

M. Hubner, P. Figuli, R. Girardey, D. Soudris, K. Siozios and J. Becker, “A Heterogeneous Multicore System on Chip with Run-Time Reconfigurable Virtual FPGA Architecture”, International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp. 143-149, Shanghai, May 2011 (download local copy from here).

System design, especially for low power embedded applications, often profits from a heterogeneous target hardware platform. The application can be partitioned into modules with specific requirements, e.g. parallelism or performance, in relation to the provided hardware blocks on the multicore hardware. The result is an optimized application mapping and a parallel processing with lower power consumption on the different cores on the hardware. This paper presents a heterogeneous platform consisting of a microprocessor and a field programmable gate array (FPGA) connected via a standard AMBA bus. The novelty of this approach is that the FPGA is realized as virtual reconfigurable hardware upon a traditional off the shelf FPGA device. The advantage of this approach is that the specification of the virtual FPGA stays unchanged, independent of the underlying hardware, and therefore provides features which the exploited physical host FPGA cannot provide. A special feature of the presented virtual FPGA, amongst others, is the dynamic reconfigurability, which is for example not available with all off the shelf FPGAs. Furthermore, the concept of FPGA virtualization enables the re-use of hardware blocks on other physical FPGA devices. This paper presents the hardware platform and describes the tool chain for the heterogeneous system on chip.

H. Sidiropoulos, K. Siozios and D. Soudris, “NAROUTO: An Open-Source Framework for Supporting Architecture-Level Exploration at Heterogeneous FPGAs“, International Conference on Electronics, Circuits, and Systems (ICECS), pp. 527-530, Dec. 2010, Athens, Greece (download local copy from here).

A novel framework for supporting architecture-level exploration for heterogeneous FPGA devices is introduced. This framework, named NAROUTO, is based on open-source tools in order to support further extensions and improvements. As compared to previous works, the introduced framework provides higher flexibility for application implementation, while it can also evaluate power/energy consumption. Experimental results prove that NAROUTO leads to significant area, power (about 82%) and performance (about 46%) improvements, as compared to existing solutions.

K. Siozios and D. Soudris, “Designing a Novel High-Performance FPGA Architecture for Data Intensive Applications“, Journal of Real-Time Image Processing, Springer Berlin/Heidelberg, Vol. 4, No. 2, pp. 155–166, June, 2009 (download local copy from here).

A wide variety of real-time applications (e.g. multimedia, communication, etc.) require implementations that meet tight timing constraints. This work introduces a novel high-performance FPGA architecture capable of implementing efficiently any time critical application. The fundamental contribution of the proposed reconfigurable architecture is the design of a highly efficient (performance and power consumption) interconnection structure, taking into consideration the statistical and spatial data extracted from applications, which are implemented on Virtex FPGAs. The derived architecture is software-supported by the MEANDER design framework. Using a number of real-time applications, an extensive comparison study in terms of several design parameters proves the effectiveness of the proposed architecture against the Virtex one. More specifically, the proposed architecture achieves performance improvement and power savings up to 20 and 16%, respectively. Moreover, compared to a Virtex architecture with the same power budget, our architecture achieves performance improvement by 42%.

K. Siozios, D. Soudris and A. Thanailakis, “Designing a General-Purpose Heterogeneous Interconnection Architecture for FPGAs“, Journal of Low Power Electronics (JOLPE), Vol. 4, No. 1, pp. 34-47, April 2008 (download local copy from here).

A novel methodology for designing interconnection architectures, targeting general-purpose FPGAs, is introduced. The design parameters for this architecture are extracted after the implementation and profiling of numerous applications on Virtex FPGAs. Among the design parameters, we study the connectivity demand for each (x,y) point of the interconnection network. The derived data are appropriately manipulated to build a topology consisting of different Switch Boxes (SB), which are assigned over the FPGA architecture taking into account the connectivity requirements for each part of the device. The selected architecture resulted from the specific Pareto optimal points which trade off the maximum operation frequency versus the energy consumption. The methodology is software supported by three new CAD tools, part of the MEANDER Framework. Using numerous applications, an extensive comparison study in terms of several design parameters proves the efficiency of the proposed interconnection compared to Virtex-style FPGAs. More specifically, we achieve an average increase in operation frequency by 31%, a reduction in leakage power and energy consumption by 8% and 22%, respectively.
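
As a rough illustration of the Pareto selection mentioned in the abstract above, the following sketch keeps only the candidate architectures that are not dominated when maximum operating frequency is maximized and energy consumption is minimized. The candidate names and numbers are hypothetical placeholders, not data from the paper, and the function is not part of the MEANDER tools.

```python
# Minimal Pareto-front sketch (assumption: two objectives only,
# maximize frequency in MHz and minimize energy in arbitrary units).
# All candidate values below are made up for illustration.

CANDIDATES = [
    ("A", 100.0, 50.0),  # (name, max_frequency_mhz, energy)
    ("B", 120.0, 60.0),
    ("C", 110.0, 65.0),
    ("D", 90.0, 55.0),
]


def dominates(q, p):
    """Return True if q is at least as good as p in both objectives
    and strictly better in at least one."""
    no_worse = q[1] >= p[1] and q[2] <= p[2]
    strictly_better = q[1] > p[1] or q[2] < p[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = pareto_front(CANDIDATES)
print([name for name, _, _ in front])
```

With these placeholder values, C is dominated by B and D is dominated by A, so only A and B remain as trade-off points from which a designer could pick.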

K. Tatas, K. Siozios, and D. Soudris, “A Survey of Existing Fine-Grain Reconfigurable Architectures and CAD Tools“, Book Chapter in “Fine- and Coarse-Grain Reconfigurable Systems“, Editors: S. Vassiliadis and D. Soudris, pp. 3–88, Springer, 2007 (download local copy from here).

This chapter contains an introduction to FPGA technology that includes architecture, power consumption and configuration models, as well as an extensive survey of existing fine-grain reconfigurable architectures that have emerged from both academia and industry. All aspects of the architectures, including logic block structure, interconnect, and configuration methods are discussed. Qualitative and quantitative comparisons in terms of testability, technology portability, design flow completeness and configuration type are shown. Additionally, the implementation techniques and CAD tools (synthesizers, LUT-mapping tools and placement and routing tools) that have been developed to facilitate the implementation of a system in reconfigurable hardware by the industry (both by FPGA manufacturers and third-party EDA tool vendors) and academia are described.

K. Siozios, S. Mamagkakis, D. Soudris and A. Thanailakis, “Designing Heterogeneous FPGAs with Multiple SBs“, International Workshop on Applied Reconfigurable Computing (ARC), pp. 91-96, Brazil, March 2007 (download local copy from here).

The novel design of high-speed and low-energy FPGA routing architecture consisting of appropriate wire segments and multiple Switch Boxes is introduced. For that purpose, we develop a new methodology consisting of two steps: (i) Exploration and determination of the optimal wire length and (ii) Exploration and determination of the optimal combination of multiple switch-boxes, considering the optimal choice of the former step. The proposed methodology for designing the high performance interconnection architecture is fully-supported by the software tool called EX-VPR. For both steps, the selection criterion for a minimal Energy×Delay Product is chosen. Depending on the localized performance and energy consumption requirements of each specific region of FPGA architecture, we derive a set of corresponding spatial routing information of the applications mapped onto FPGA. We achieved Energy×Delay Product reduction by 55%, performance increase by 52%, reduction in total energy consumption by 8%, at the expense of increase of channel width by 20%.
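
The two-step exploration described above can be sketched as follows, assuming each candidate has already been characterized by an energy and a delay figure. The candidate names, the numbers, and the helper `sb_options_for` are hypothetical and only illustrate selection by minimal Energy×Delay Product; they do not reproduce EX-VPR or the paper's measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    energy: float  # hypothetical energy figure (arbitrary units)
    delay: float   # hypothetical delay figure (arbitrary units)


def edp(c):
    """Energy x Delay Product of a candidate."""
    return c.energy * c.delay


def select_min_edp(candidates):
    """Pick the candidate with the smallest Energy x Delay Product."""
    return min(candidates, key=edp)


# Step (i): hypothetical wire segment length options.
WIRE_OPTIONS = [
    Candidate("L1", 10.0, 5.0),  # EDP = 50
    Candidate("L2", 8.0, 6.0),   # EDP = 48
    Candidate("L4", 7.0, 8.0),   # EDP = 56
]


def sb_options_for(wire):
    """Hypothetical switch box combinations evaluated for the chosen wire."""
    return [
        Candidate(f"{wire.name}/SB-A", 8.0, 6.0),       # EDP = 48
        Candidate(f"{wire.name}/SB-A+SB-B", 7.5, 5.0),  # EDP = 37.5
        Candidate(f"{wire.name}/SB-B", 9.0, 5.0),       # EDP = 45
    ]


# Step (ii) builds on the choice made in step (i).
best_wire = select_min_edp(WIRE_OPTIONS)
best_sb = select_min_edp(sb_options_for(best_wire))
print(best_wire.name, best_sb.name)
```

The point of the sketch is the ordering: the switch box combinations are only explored for the wire length that already won the first step, which keeps the search space small at the cost of not evaluating every pairing.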

K. Siozios, D. Soudris and A. Thanailakis, “Designing Alternative FPGA Implementations Using Spatial Data from Hardware Resources“, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 403-414, Montpellier, France, Sept. 2006 (download local copy from here).

A novel approach for efficient implementation of applications onto reconfigurable architectures is introduced. The proposed methodology can be applied both for designing an interconnection architecture as well as for making a thermal-aware placement. In the first case, the dominant parameters that affect performance and energy (segment length and switch boxes) are examined. This approach is based on finding the optimal wire length and then making exploration in order to determine the appropriate combination of multiple switch boxes. In the second case, a new technique for thermal-aware placement is introduced. The main goal of this technique is to spread out the power consumption across the whole device, as well as to minimize it. Both of the methodologies are fully-supported by the software tool called EX-VPR. For the purposes of this paper, the Energy×Delay Product (EDP) is chosen as selection criterion for the optimal interconnection network, while the total power consumption is the criterion for the thermal-aware routing. For the designing of the interconnection network we achieved EDP reduction by 45%, performance increase by 40% and reduction in total energy consumption by 8%, at the expense of increase of channel width by 20%. On the other hand, for the thermal-aware approach, we spread the heat and power across the whole FPGA, while we achieve about 20% reduction in total power consumption. In this case, the penalty in channel width is about 10%.

K. Siozios, K. Tatas, D. Soudris and A. Thanailakis, “Platform-based FPGA Architecture: Designing High-Performance and Low-Power Routing Structure for Realizing DSP Applications“, International Parallel & Distributed Processing Symposium, pp. 10, Rhodes Island, Greece, 2006 (download local copy from here).

The novel design of an efficient FPGA interconnection architecture with multiple switch boxes (SB) and hardwired connections for realizing data intensive applications (i.e. DSP applications), is introduced. For that purpose, after exhaustive exploration, we modify the routing architecture through efficient selection of the appropriate switch box with hardwired connections, taking into account the statistical and spatial routing restrictions of DSP applications mapped onto FPGA. More specifically, we propose a new technique for selecting the appropriate combination of switch boxes, depending on the localized performance and power consumption requirements of each specific region of FPGA architecture. In order to perform the mapping, we developed a novel algorithm, which takes into account the modified architectural routing features. This algorithm was implemented within a new tool called EX-VPR. Using a number of DSP applications, extensive comparison study of various combinations of switch boxes in terms of total power consumption, performance, Power * Delay product prove the effectiveness of the proposed approach.

K. Siozios, D. Soudris and A. Thanailakis, “A Novel Methodology for Designing High-Performance and Low-Power FPGA Interconnection Targeting DSP Applications“, International Symposium on Circuits and Systems (ISCAS), pp. 4383-4386, Kos, Greece, 2006 (download local copy from here).

The novel design of an efficient FPGA interconnection architecture with multiple switch boxes (SB) for realizing data intensive applications (i.e. DSP applications), is introduced. For that purpose, after exhaustive exploration, we modify the routing architecture through efficient selection of the appropriate switch box, taking into account the statistical and spatial routing restrictions of DSP applications mapped onto FPGA. More specifically, we propose a new technique for selecting the appropriate combination of switch boxes, depending on the localized performance and power consumption requirements of each specific region of FPGA architecture. In order to perform the mapping, we developed a novel algorithm, which takes into account the modified architectural routing features. This algorithm was implemented within a new tool called EX-VPR. Using a number of DSP applications, an extensive comparison study of various combinations of switch boxes in terms of total power consumption and performance proves the effectiveness of the proposed approach.

K. Siozios and D. Soudris, “Wire Segment Length and Switch Box Co-Optimization for FPGA Architectures“, International Conference on Field Programmable Logic and Applications (FPL), pp. 1-4, Madrid, Spain, Aug. 2006 (download local copy from here).

The novel methodology for designing a high-performance and low-energy FPGA interconnection architecture consisting of appropriate wire segments and multiple switch boxes is introduced. Depending on the localized performance and energy consumption requirements of each specific region of FPGA architecture, we derive a set of corresponding spatial routing information of the applications mapped onto the reconfigurable device. In this paper, an interconnection network with segments L1&L2 and 3 different switch box regions is used. The selection criterion for our approach is the minimal energy×delay product (EDP). The proposed methodology is fully-supported by the software tool called EX-VPR. With this interconnection architecture we achieved EDP reduction by 56%, performance increase by 47%, reduction in leakage power by 18%, reduction in total energy consumption by 9%, at the expense of increase of channel width by 15% compared to conventional FPGA architectures.

K. Siozios, K. Tatas, G. Koutroumpezis, D. Soudris, and A. Thanailakis, “An Integrated Framework for Architecture Level Exploration of Reconfigurable Platform“, International Conference on Field Programmable Logic and Applications (FPL), pp. 658-661, 2005, Tampere, Finland (download local copy from here).

In this paper, the EX-VPR tool, which is used for architecture-level exploration, is presented. This tool belongs to an integrated framework (MEANDER) for mapping applications onto a fine-grain reconfigurable platform (FPGA). Taking as input a VHDL description of an application, the framework produces the appropriate configuration bitstream. The proposed tool framework supports a variety of FPGA architectures. Additionally, a novel power-aware switch box is proposed. Quantitative comparisons with existing switch boxes are provided, yielding promising results.

D. Soudris, S. Nikolaidis, S. Siskos, K. Tatas, K. Siozios, G. Koutroumpezis, N. Vasiliadis, V. Kalenteridis, H. Pournara, I. Pappas, and A. Thanailakis, “AMDREL: A Novel Low-Energy FPGA Architecture and Supporting CAD Tool Design Flow“, Asia South Pacific Design Automation Conference (ASP-DAC), pp. D3-D4, 2005, Shanghai, China (download local copy from here).

The design of a novel embedded FPGA reconfigurable hardware architecture is introduced. The architecture features a number of circuit-level low-power techniques, since power consumption is considered a primary concern. Additionally, a complete set of tools facilitating implementation of applications on the proposed FPGA is presented, starting from an RTL description and producing the actual configuration bitstream. The designed full-custom FPGA is under fabrication in 0.18 μm STM CMOS technology. The prototype supports partial and dynamic reconfiguration. The efficiency of the entire system (FPGA and tools) was proven by comparisons with commercial systems.

V. Kalenteridis, H. Pournara, K. Siozios, K. Tatas, I. Pappas, S. Nikolaidis, S.Siskos, D. Soudris and A. Thanailakis, “A Complete Platform and Toolset for System Implementation on Fine-Grain Reconfigurable Hardware“, Microprocessors and Microsystems, Elsevier Publishers, Vol. 29, No. 6, pp. 247-259, Aug. 2005 (download local copy from here).

In this paper, a complete system for the implementation of digital logic in a fine-grain reconfigurable platform is introduced. The system is composed of two parts: the fine-grain reconfigurable hardware platform (FPGA) on which the logic is implemented and the set of CAD tools for mapping logic to the FPGA platform. It is the first such complete academic system. The novel energy-efficient FPGA architecture was designed and simulated in STM 0.18 μm CMOS technology. The detailed design and circuit characteristics of the Configurable Logic Block as well as the interconnection network are determined and evaluated for energy, delay and area. Concerning the tool flow, each tool can operate as a standalone program as well as part of a complete design framework, composed of existing and new tools.

V. Kalenteridis, H. Pournara, K. Siozios, K. Tatas, I. Pappas, S. Nikolaidis, S. Siskos, D. J. Soudris and A. Thanailakis, “An Integrated FPGA Design Framework: Custom Designed FPGA Platform and Application Mapping Toolset Development“, International Parallel and Distributed Processing Symposium, pp. 138-145, April 26-27, 2004, Santa Fe, New Mexico, USA (download local copy from here).

A complete system for the implementation of digital logic in a fine-grain reconfigurable platform is introduced. The system is composed of two parts: the fine-grain reconfigurable hardware platform (FPGA) on which the logic is implemented and the set of CAD tools for mapping logic to the FPGA platform. The novel energy-efficient FPGA architecture was designed and simulated in STM 0.18 μm CMOS technology. Concerning the tool flow, each tool can operate as a standalone program as well as part of a complete design framework, composed of existing and new tools.

I. Pappas, N. Vassiliadis, V. Kalenteridis, H. Pournara, S. Nikolaidis, S. Siskos, K. Siozios, G. Koutroumpezis, K. Tatas, D. J. Soudris and A. Thanailakis, “Fine-Grain Reconfigurable Platform: FPGA Hardware Design and Software Toolset Development“, Conference on Microelectronics Microsystems and Nanotechnology, pp. 352-356, Nov. 2004, Greece (download local copy from here).

A complete system for the implementation of digital logic in a fine-grain reconfigurable platform is introduced. The system is composed of two parts: the fine-grain reconfigurable hardware platform (FPGA) on which the logic is implemented and the set of CAD tools for mapping logic to the FPGA platform. A novel energy-efficient FPGA architecture (CLB, interconnect network, configuration hardware) is presented and simulated in STM 0.18 μm CMOS technology. Concerning the tool flow, each tool can operate as a standalone program as well as part of a complete design framework, composed of existing and new tools.

3D Integration

H. Sidiropoulos, K. Siozios and D. Soudris, “A Novel 3-D FPGA Architecture Targeting Communication Intensive Applications“, Journal of Systems Architecture (JSA), Vol. 60, No. 1, pp. 32-39, Jan. 2014 (download local copy from here).

The interconnection structures in FPGA devices contribute increasingly to the delay, power consumption and area overhead. The demand for even higher clock frequencies makes this problem even more important. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronics products. However, the benefits of such a new integration paradigm have not been sufficiently explored yet. In this paper, a novel 3-D architecture, as well as the supporting software tools for exploring and evaluating application implementation, are introduced. More specifically, by assigning logic and I/O resources to different layers, we achieve a notable wire-length reduction. Experimental results prove the effectiveness of such a selection, since the target architectures outperform conventional 2-D FPGAs.

K. Siozios, V.F. Pavlidis, and D. Soudris, “A Novel Framework for Exploring 3-D FPGAs with Heterogeneous Interconnect Fabric“, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 5, No. 1, pp. 4:1-4:23, March 2012 (download local copy from here).

A heterogeneous interconnect architecture can be a useful approach for the design of 3-D FPGAs. A methodology to investigate heterogeneous interconnection schemes for 3-D FPGAs under different 3-D fabrication technologies is proposed. Application of the proposed methodology on benchmark circuits demonstrates an improvement in delay, power consumption, and total wire-length of approximately 41%, 32%, and 36%, respectively, as compared to 2-D FPGAs. These improvements are additional to reducing the number of interlayer connections. The fewer interlayer connections are traded off for a higher yield. An area model to evaluate this trade-off is presented. Results indicate that a heterogeneous 3-D FPGA requires 37% less area as compared to a homogeneous 3-D FPGA. Consequently, the heterogeneous FPGAs can exhibit a higher manufacturing yield. A design toolset is also developed to support the design and exploration of various performance metrics for the proposed 3-D FPGAs.

D. Diamantopoulos, K. Siozios and D. Soudris, “Framework for Performing Rapid Evaluation of 3D SoCs“, IET Electronics Letters, Vol. 48, No. 12, pp. 679-681, June 2012 (download local copy from here).

Integrating more functionality in a smaller form factor with lower power consumption pushes traditional semiconductor technology scaling to its limits. Three-dimensional (3D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronic products. Introduced is a framework that enables rapid evaluation of 3D SoCs with existing physical design tools.

H. Sidiropoulos, K. Siozios and D. Soudris, “A Framework for Architecture-Level Exploration of 3-D FPGA Platforms“, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 298-307, Madrid, Spain, Sept. 2011 (download local copy from here).

Interconnection structures in FPGAs increasingly contribute more to the delay and power consumption. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronics products. However, the benefits of such a technology have not been sufficiently explored yet. This paper introduces a novel 3-D FPGA, where logic, memory and I/O resources are assigned to different layers. Experimental results prove the efficiency of our architecture for a wide range of application domains, since we achieve average performance improvement and power saving of 30% and 10%, respectively.

H. Sidiropoulos, K. Siozios and D. Soudris, “A Framework for Architecture-level Exploration of Communication Intensive Applications onto 3-D FPGAs“, International Conference on Field Programmable Logic and Applications (FPL), pp. 30-33, Chania, Greece, Sept. 2011 (download local copy from here).

The interconnection structures in FPGA devices contribute increasingly to the delay, power consumption and area overhead. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronics products. However, the benefits of such an integration technology have not been sufficiently explored yet. In this paper, we introduce a novel 3-D architecture, as well as the supporting software tools for exploring and evaluating application mapping onto 3-D FPGAs, where logic and I/O resources are assigned to different layers. Experimental results show that such a 3-D architecture is especially suitable for communication-intensive applications, since a device with two layers achieves a delay reduction of up to 87% compared to conventional 2-D FPGAs, without any overhead in power dissipation.

K. Siozios, A. Papanikolaou and D. Soudris, “CAD Tools for Designing 3D Integrated Systems“, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2229-2232, Rio de Janeiro, Brazil, May 2011 (download local copy from here).

Consumer expectations for future electronics devices put significant strain on conventional design and manufacturing processes. Integrating more functionality in a smaller form factor with lower power consumption and cost is pushing traditional semiconductor technology scaling to its limits. Three-dimensional chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronics products. This paper outlines a generic methodology to design 3D systems.

D. Diamantopoulos, K. Siozios, D. Bekiaris, and D. Soudris, “A Novel Methodology for Architecture-Level Exploration of 3D SoCs“, Design & Technology of Integrated Systems in Nanoscale Era (DTIS), pp. 1-6, Athens, Greece, Apr. 2011 (download local copy from here).

Three-dimensional (3D) integration is an emerging technology that is expected to lead to tremendous benefits in terms of power, delay and silicon area. Moreover, 3D technology continues interconnect advances beyond the CMOS scaling predicted by Moore’s Law, which enable new capabilities not available in 2D ICs. This paper proposes a physical design framework that enables rapid evaluation of 3D SoCs under different optimization goals. For demonstration purposes we apply the proposed framework to the 3D physical design of an embedded processor. Experimental results show that 3D integration can alleviate the constraints posed by increased wire-length, such as power consumption, by about 20% compared to the 2D implementation.

K. Siozios, V. Pavlidis, and D. Soudris, “A Software-Supported Methodology for Exploring Interconnection Architectures Targeting 3-D FPGAs“, Proceedings in Design, Automation and Testing in Europe (DATE), pp. 172-177, Nice, France, April 2009 (download local copy from here).

Interconnect structures significantly contribute to the delay, power consumption, and silicon area of modern reconfigurable architectures. The demand for higher clock frequencies and logic densities is also important for the field-programmable gate array (FPGA) paradigm. Three-dimensional (3-D) integration can alleviate such performance limitations by accommodating a number of additional silicon layers. However, the benefits of 3-D integration have yet to be sufficiently investigated. In this paper, we propose a software-supported methodology to explore and evaluate 3-D FPGAs fabricated with alternative technologies. Based on the evaluation results, the proposed FPGA device improves speed and energy dissipation by approximately 38% and 26%, respectively, as compared to 2-D FPGAs. Furthermore, these gains are achieved in addition to reducing the interlayer connections, as compared to existing design approaches, leading to cheaper and more reliable architectures.

K. Siozios, D. Soudris and G. Economakos, “Three-Dimensional FPGA Architectures: A Shift Paradigm for Energy-Performance Efficient DSP Implementations“, International Conference on Digital Signal Processing (DSP), pp. 1-6, Santorini, Greece, July 2009 (download local copy from here).

Modern applications exhibit increased complexity, which introduces extra constraints during implementation related to delay, power consumption and silicon area. This problem is even more important when we deal with digital signal processing (DSP) kernels, as there are demands for even higher clock frequencies and logic densities, which cannot be satisfied with existing design technologies. Three-dimensional (3D) integration is an emerging technology that promises to alleviate problems related to performance improvement, but up to now this new design approach has not been sufficiently explored. In this paper we propose a novel 3D FPGA architecture able to efficiently implement DSP applications. The proposed architecture is supported by a software methodology for exploring DSP-enhanced 3D FPGA devices. During our study we quantify a number of design parameters, such as the selected number of layers, the proper bonding approach, the process technology for each layer, etc. Comparison results prove the efficiency (in terms of performance and power consumption) of the new design paradigm, as compared to existing commercial devices with similar hardware resources.

K. Siozios, A. Bartzas and D. Soudris, “Architecture-Level Exploration of Alternative Interconnection Schemes Targeting 3D FPGAs: A Software-Supported Methodology“, International Journal of Reconfigurable Computing, Vol. 2008, Article ID 76942, 18 pages (download local copy from here).

In current reconfigurable architectures, the interconnection structures increasingly contribute more to the delay and power consumption. The demand for increased clock frequencies and logic density (smaller area footprint) makes the problem even more important. Three-dimensional (3D) architectures are able to alleviate this problem by accommodating a number of functional layers, each of which might be fabricated in different technology. However, the benefits of such integration technology have not been sufficiently explored yet. In this paper, we propose a software-supported methodology for exploring and evaluating alternative interconnection schemes for 3D FPGAs. In order to support the proposed methodology, three new CAD tools were developed (part of the 3D MEANDER Design Framework). During our exploration, we study the impact of vertical interconnection between functional layers in a number of design parameters. More specifically, the average gains in operation frequency, power consumption, and wirelength are 35%, 32%, and 13%, respectively, compared to existing 2D FPGAs with identical logic resources. Also, we achieve higher utilization ratio for the vertical interconnections compared to existing approaches by 8% for designing 3D FPGAs, leading to cheaper and more reliable devices.

K. Siozios, K. Sotiriadis, V. F. Pavlidis and D. Soudris, “A Software-Supported Methodology for Designing High-Performance 3D FPGA Architectures“, International Conference on Very Large Scale Integration (VLSI-SoC), pp. 54-59, Atlanta, USA, Oct. 2007 (download local copy from here).

A software-supported systematic methodology for exploring and evaluating alternative 3D reconfigurable FPGA architectures is introduced. Two new software tools were developed: (i) a placement and routing tool for 3D FPGAs (3DPRO) and (ii) a power/energy consumption estimation tool for such architectures (3DPower). Both of them are part of the new Design Framework, named 3D-MEANDER. We mainly focus our exploration on parameters that dominate the maximum operation frequency of the 3D FPGAs (i.e. vertical interconnections, number of layers, etc.). We evaluate the efficiency of the proposed methodology by making an exhaustive exploration for device delay, power consumption and utilized number of vertical connections for alternative 3D interconnection schemes. Experimental results demonstrate the effectiveness of our methodology, considering the 20 largest MCNC benchmarks. We achieve an average decrease in the delay, the wire length, and the energy consumption of 27%, 26%, and 34%, respectively, as compared to traditional 2D FPGAs, considering 3D architectures with 50% and 70% of fabricated vias. Also, we show that the via links actually utilized are practically independent of the number of fabricated vias of a 3D FPGA architecture.

K. Siozios, K. Sotiriadis, V. F. Pavlidis and D. Soudris, “Exploring Alternative 3D FPGA Architectures: Design Methodology and CAD Tool Support“, International Conference on Field Programmable Logic and Applications (FPL), pp. 652-656, Amsterdam, Netherlands, Aug. 2007 (download local copy from here).

This paper introduces a software-supported methodology for exploring and evaluating 3D FPGA architectures. Two new CAD tools are developed: (i) 3DPRO for placement and routing on 3D FPGAs and (ii) 3DPower for power/energy estimation on such architectures. We mainly focus our exploration on the total number of layers and the amount of vertical interconnects (or vias). The efficiency of the proposed architecture is evaluated by making an exhaustive exploration for via connections under the Energy×Delay Product criterion. Experimental results demonstrate the effectiveness of our solution, considering the 20 largest MCNC benchmarks. Considering 3D architectures with 4 layers and two scenarios of fabricated via densities (30% and 70%), we achieve an average decrease in the delay, the wire length, and the energy consumption of 18%, 17%, and 31%, respectively, as compared to 2D FPGAs. We also achieved high utilization of via links.

Fault-Tolerance and Reliability Improvement

I. Stratakos, K. Maragos, G. Lentaris, D. Soudris and K. Siozios, “Aging Evaluation and Mitigation Techniques Targeting FPGA Devices“, Chapter 7 in “Low Power Circuits for Emerging Applications in Communications, Computing, and Sensing”, Editors: K. Iniewski and F. Yuan, CRC Press, Boca Raton, 2019 (download local copy from here).

Digital circuit downscaling is driven by a continuous need for integrated solutions that deliver higher performance in the smallest possible size. Field-programmable gate arrays (FPGAs) are an attractive solution for use in the implementation of digital systems, because they exploit the latest fabrication processes on complementary metal-oxide semiconductor technology in order to provide the highest possible performance with power consumption as low as possible. The mechanisms behind aging in digital circuits have been well known for a long time, but a systematic exploration of their effect in FPGA devices started to become relevant, because of the continuous market demand for more performance in the most compact size. After an electronic device has been shipped to customers, it is expected to operate correctly under predefined conditions, determined during the test phase, for its full operation life cycle. However, continuous operation in line with vendor specifications is hard to achieve.

K. Maragos, G. Lentaris, K. Siozios, D. Soudris and V. Pavlidis, “Application Performance Improvement By Exploiting Process Variability On FPGA Devices“, Proceedings in Design, Automation and Testing in Europe (DATE), pp. 452-457, Lausanne, Switzerland, March 2017 (download local copy from here).

Process variability is known to be increasing with technology scaling in IC fabrication, thereby degrading the overall performance of the manufactured devices. The current paper focuses on the variability effect in FPGAs and the possibility to boost the performance of each device at run-time, after fabrication, based on the individual characteristics of this device. First, we develop a sensing infrastructure involving a wide network of customized ring oscillators to measure intra-chip and inter-chip variability in 28nm FPGAs, i.e., in eight Xilinx Zynq XC7Z020T-1CSG324 devices. Second, we develop a closed-loop framework based on dynamic reconfiguration of clock tiles, I/O data sniffing, HW/SW communication, and verification with test vectors, to dynamically increase the operating frequency in Zynq while preserving its correctness. Our results show intra-chip variability in the area of 5.2% to 7.7% and inter-chip variability up to 17%. Our framework improves the performance of example FIR designs by up to 90.3% compared to the SW tool reports and shows speed difference among devices by up to 12.4%.

K. Siozios, D. Soudris and M. Hubner, “A Framework for Supporting Adaptive Fault-Tolerant Solutions“, ACM Transactions on Embedded Computing Systems (TECS), Vol. 13, No. 5s, Article 169, 22 pages, Dec. 2014 (download local copy from here).

For decades, computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore’s law were translated into remarkable gains in operation frequency and power consumption. However, the device-level size and architecture complexity impose several new challenges, including a decrease in dependability level due to physical failures. In this article we propose a software-supported methodology based on game theory for adapting the aggressiveness of fault tolerance at runtime. Experimental results prove the efficiency of our solution since it achieves comparable fault masking to relevant solutions, but with significantly lower mitigation cost. More specifically, our framework speeds up the identification of suspicious failure resources on average by 76% as compared to the HotSpot tool. Similarly, the introduced solution leads to average Power×Delay (PDP) savings against an existing TMR approach by 53%.

K. Siozios and D. Soudris, “A low-cost fault tolerant solution targeting commercial FPGA devices“, Journal of Systems Architecture (JSA), Vol. 59, No. 10, Part D, pp. 1255-1265, Nov. 2013 (download local copy from here).

Technology scaling, in conjunction with the trend towards higher operation frequency, results in increased thermal stress, which in turn leads to upsets due to reliability degradation. In this paper, we introduce a software-supported framework that aims to enable sufficient fault coverage against upsets that occur due to aging phenomena. Experimental results with a number of industry-oriented DSP kernels show the effectiveness of our framework, since we achieved average improvements in maximum operation frequency and power consumption of 15% and 70%, respectively, as compared to a well-established commercial solution, for comparable fault masking.

K. Siozios, D. Soudris and M. Hubner, “On Supporting Adaptive Fault Tolerant at Run-Time with Virtual FPGAs“, International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 2206–2211, Cambridge, MA, May 2013 (download local copy from here).

For decades, computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore’s law were translated into remarkable gains in operation frequency and power consumption. However, the device-level size and architecture complexity impose several new challenges, including a decrease in dependability level due to physical failures. This makes the use of fault tolerance crucial. Existing solutions are applied statically at design time, ignoring constraints posed during the execution phase, while they also introduce notable overheads in terms of delay and power consumption. In this paper we introduce a software-supported methodology based on game theory for adapting the aggressiveness of fault tolerance at run-time. Experimental results prove the efficiency of our methodology since it achieves comparable fault masking to relevant solutions, but with significantly lower mitigation cost.

K. Siozios and D. Soudris, “A low-cost fault tolerant solution targeting to commercial FPGA devices“, NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Special session on dependability by reconfigurable hardware, pp. 46-53, Erlangen, Germany, June 2012 (download local copy from here).

Technology scaling, in conjunction with the trend towards higher performance, introduces an increased number of upsets due to reliability degradation. This problem becomes an important design concern, not only for safety-critical systems, but for the majority of architectures. In this paper, a novel software-supported framework is introduced that aims to provide sufficient fault masking in SRAM-based FPGAs against reliability degradation, without the excessive mitigation cost of similar approaches. Experimental results with a number of industry-oriented DSP kernels prove the effectiveness of our solution, since we achieve considerable delay and power improvements for comparable fault masking.

D. Diamantopoulos, K. Siozios, S. Xydis and D. Soudris, “A Systematic Methodology for Reliability Improvements on SoC-based Software Defined Radio Systems“, VLSI Design, Vol. 2012, Article ID 784945, 15 pages (download local copy from here).

Shrinking silicon technologies, together with increasing logic densities and clock frequencies, lead to a rapid elevation in power density. Increased power density results in higher on-chip temperature, which creates numerous problems tightly linked to reliability degradation. Since typical low-power design has proved insufficient to tackle the temperature increase by itself, device architects are facing the challenge of developing new methodologies to guarantee timing, power, and thermal integrity of the chip. In this paper, we propose a thermal-aware exploration framework targeting the elimination of temperature hotspots through the efficient exploration of multiple microarchitecture selections over the temperature-area trade-off curve. By carefully planning at design time the resources of the initial microarchitecture that should be replicated, the proposed methodology optimizes the system’s thermal profile and flattens on-chip temperature under various design constraints. The introduced framework does not impose any architectural or compiler modification, and it is orthogonal to any other thermal-aware methodology. For evaluation purposes, we employ a software-defined radio executed on a thermal-aware instance of the LEON3 processor. Based on experimental results, we found that our methodology leads to an architecture that exhibits a temperature reduction of 17 kelvin, which leads to an improvement against aging phenomena of about 14%, with a controllable overhead in silicon area of about 15%, compared to the initial LEON3 instance.

K. Siozios, D. Rodopoulos and D. Soudris, “On Supporting Rapid Thermal Analysis“, IEEE Computer Architecture Letters, Vol. 10, No. 2, pp. 53-56, Jul.-Dec. 2011 (download local copy from here).

Detailed thermal analysis is usually performed exclusively at design time since it is a computationally intensive task. In this paper, we introduce a novel methodology for fast, yet accurate, thermal analysis. The introduced methodology is software supported by a new open source tool that enables hierarchical thermal analysis with adaptive levels of granularity. Experimental results prove the efficiency of our approach since it leads to average reduction of the execution overhead up to 70% with a penalty in accuracy ranging between 2% and 8%.

D. Diamantopoulos, K. Siozios, S. Xydis and D. Soudris, “Thermal optimization for micro-architectures through selective block replication“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 59-66, Samos, Greece, July 2011 (download local copy from here).

Increased power densities result in higher on-chip temperatures, which in turn create numerous problems tightly linked to reliability issues. This problem is expected to become even more severe for deep-submicron technologies. In this paper, we propose a thermal-aware exploration framework at the microarchitecture level for the elimination of temperature hotspots through selective resource replication. Experimental results based on the LEON3 processor synthesized with a 45 nm technology library show that the proposed methodology leads to designs with fewer hotspots, while the maximal temperatures at these hotspots are also reduced. Specifically, a temperature reduction of 17 kelvin is feasible, which leads to an improvement against aging phenomena of about 14%, with a controllable overhead in silicon area of about 15%, as compared to conventional architecture design.

K. Siozios, D. Rodopoulos and D. Soudris, “Quick_Hotspot: A Software Supported Methodology for Supporting Run-Time Thermal Analysis at MPSoC Designs“, International Conference on Architecture of Computing Systems (ARCS), Como, Italy, Feb. 2011 (download local copy from here).

Detailed thermal analysis and exploration has recently received significant attention since it is directly related to numerous reliability issues. Furthermore, thermal profiling is a critical challenge for supporting efficient power management, especially in multi-processor systems-on-chip (MPSoCs). This problem becomes even more important if we take into account the computational complexity of existing thermal profiling and analysis approaches. Among other things, this limitation means that thermal analysis is performed solely at design time. However, such a static exploration does not take into account constraints posed during application execution that lead to temperature variations. Hence, new algorithms and software tools able to provide accurate yet fast thermal analysis are urgently required. In this paper, we introduce a new software-supported methodology for performing thermal analysis at run-time with different levels of granularity. Additional performance improvement is feasible by applying thermal analysis only to device regions with blocks that operate under high power densities. For demonstration purposes we show how this methodology is applied to an Altera Stratix-based FPGA device. Experimental results prove the efficiency of the proposed methodology, since the average execution time ranges between 41% and 78%, as compared to a state-of-the-art relevant solution, without any accuracy degradation in the derived thermal profile.

K. Siozios and D. Soudris, “A Methodology for Alleviating the Performance Degradation of TMR Solutions“, IEEE Embedded Systems Letters, Vol. 2, No. 4, pp. 111-114, Dec. 2010 (download local copy from here).

With the shrinking of process technologies, manufacturing and transient faults may become abundant in high-density reconfigurable architectures. Design of reliable applications on such unreliable architectures requires techniques able to provide a balance between the desired fault masking and the associated performance and power degradation. Starting from a well-established solution for reliability improvement in the field-programmable gate array (FPGA) domain, we discuss a software-supported methodology that removes as much redundancy as possible from the design without affecting its efficiency in terms of fault masking. Based on experimental results, our proposed methodology achieves comparable fault masking to commercial solutions, but at a reasonably lower mitigation cost.
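For readers unfamiliar with the baseline named in the paper title, triple modular redundancy (TMR) runs three replicas of a module and takes a bitwise 2-out-of-3 majority of their outputs. The sketch below illustrates only this generic baseline, not the redundancy-removal methodology of the paper. The `majority_vote` and `tmr` helpers, the toy `adder` module and the injected `fault_mask` are hypothetical names introduced for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, x: int, fault_mask: int = 0) -> int:
    """Evaluate three replicas of `module` and vote on their outputs.

    `fault_mask` flips bits in the output of the first replica only,
    which models a single faulty replica (an assumption of this sketch).
    """
    r0 = module(x) ^ fault_mask  # replica with injected bit flips
    r1 = module(x)               # fault-free replica
    r2 = module(x)               # fault-free replica
    return majority_vote(r0, r1, r2)


def adder(x: int) -> int:
    """Toy 8-bit module used only for demonstration."""
    return (x + 3) & 0xFF


golden = adder(10)                                # fault-free reference
masked = tmr(adder, 10, fault_mask=0b0000_0100)   # one replica is corrupted
print(golden, masked)
```

The voter masks any fault confined to a single replica, at the price of roughly tripling the protected logic, which is the performance and power overhead the abstract refers to.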

K. Siozios, D. Soudris and D. Pnevmatikatos, “Towards Supporting Fault-Tolerance in FPGAs“, IEEE Annual Symposium on VLSI (ISVLSI), pp. 446-447, Kefalonia, Greece, July 2010 (download local copy from here).

This paper proposes a novel methodology for improving the reliability of FPGAs without requiring special-purpose hardware. In contrast to related approaches that are applied uniformly over the target architecture, the proposed one inserts redundancy only into the resources that are critical for failure. Such an approach leads to reasonable performance improvement.

K. Siozios, D. Soudris and D. Pnevmatikatos, “A Framework for Enabling Fault Tolerance in Reconfigurable Architectures“, International Workshop on Applied Reconfigurable Computing (ARC), pp. 257-268, Bangkok, Thailand, March 2010 (download local copy from here).

Fault tolerance is a prerequisite not only for safety-critical systems, but for the majority of applications. However, the additional hardware elements impose performance degradation. In this paper we propose a software-supported methodology for protecting reconfigurable architectures against Single Event Upsets (SEUs), even if the target device is not aware of this feature. This methodology initially predicts areas of the target architecture where faults are most likely to occur and then selectively inserts redundancy only there. Based on experimental results, we show that our proposed selective fault tolerance results in a better tradeoff between the desired level of reliability and the area, delay and power overhead.

CAD Algorithms

A. Kokkinis, D. Diamantopoulos and K. Siozios, “Dynamic Heap Management in High-Level Synthesis for Many-Accelerator Architectures“, in International Conference on Field-Programmable Logic and Applications (FPL), 29 Aug.- 2 Sept. 2022, United Kingdom (download local copy from here).

Dynamic Memory Management (DMM) in High-Level Synthesis has been introduced as a promising solution for optimizing the accelerators’ memory usage and reducing the occupied on-chip area. Schemes for dynamic memory allocation have been suggested for many-accelerator architectures where memory sharing and resource reuse have the potential to increase the number of synthesized accelerators, raising the throughput-per-Watt ratio. However, in those architectures, the simultaneous execution of many accelerators may reduce memory efficiency, increasing the Memory Allocation Failures (MAFs) as a consequence of the sub-optimal utilization of the shared memories. MAFs due to memory fragmentation can reach up to 38.5% of the overall memory allocation failures when accelerators with heterogeneous allocation sizes are executed in parallel in a shared memory space. In this manuscript we propose an HLS methodology for minimizing the MAFs in many-accelerator DMM frameworks that are caused by inefficient on-chip memory utilization. Our proposed methodology is orthogonal to the static memory allocation techniques of the Xilinx Vitis suite and was evaluated using Xilinx Vitis/Vitis HLS 2020.1 on an Alveo U200 FPGA device as an extension of the Memluv DMM framework. In the experimental results we show that our proposed methodology may decrease the MAFs due to fragmentation by up to 38.5% and the overall allocation failures by up to 91%, with a controllable increase in the utilized resources and in the accelerators’ latency.

A. Kokkinis, D. Diamantopoulos and K. Siozios, “Dynamic Optimization of On-Chip Memories for HLS Targeting Many-Accelerator Platforms,” in IEEE Computer Architecture Letters, vol. 21, no. 2, pp. 41-44, 1 July-Dec. 2022 (download local copy from here).

Many-accelerator platforms have been introduced for maximizing FPGA throughput. However, as the high saturation rate of the FPGA’s on-chip memories limits the number of synthesized accelerators, frameworks for Dynamic Memory Management (DMM) that allow the synthesized designs to allocate/de-allocate on-chip memory resources during run-time have been suggested. Although those frameworks manage to increase the accelerators’ density by minimizing the utilized memory resources, the parallel execution of many accelerators may cause severe memory fragmentation and thus memory allocation failures. In this work, a framework is proposed that optimizes memory usage by performing memory defragmentation operations in HLS many-accelerator architectures that share on-chip memories. Experimental results highlight the effectiveness of the proposed solution in eliminating memory allocation failures due to memory fragmentation, reducing memory allocation failures by up to 32% on average and decreasing the memory size requirements by up to 5%, with controllable latency and resource utilization overhead.
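
To make the fragmentation problem concrete, the following toy Python model shows a request failing although enough total memory is free, and then succeeding after compaction. It is only a sketch under stated assumptions: the `Heap` class, the first-fit policy and the compaction step are hypothetical illustrations, not the framework's actual allocator, and a real design would also have to relocate data and update the accelerators' references.

```python
class Heap:
    """Toy shared heap: (owner, size) blocks in address order; owner None is a hole."""

    def __init__(self, size):
        self.blocks = [(None, size)]

    def alloc(self, owner, size):
        # First fit: carve the request out of the first hole that is large enough.
        for i, (who, sz) in enumerate(self.blocks):
            if who is None and sz >= size:
                rest = sz - size
                self.blocks[i:i + 1] = [(owner, size)] + ([(None, rest)] if rest else [])
                return True
        return False  # memory allocation failure

    def free(self, owner):
        # Turn the owner's blocks into holes (adjacent holes are not merged here).
        self.blocks = [(None, sz) if who == owner else (who, sz)
                       for who, sz in self.blocks]

    def defragment(self):
        # Compaction: slide live blocks together and merge all holes into one.
        live = [(who, sz) for who, sz in self.blocks if who is not None]
        hole = sum(sz for who, sz in self.blocks if who is None)
        self.blocks = live + ([(None, hole)] if hole else [])


heap = Heap(8)
for name in "ABCD":
    heap.alloc(name, 2)          # heap is now full
heap.free("A")
heap.free("C")                   # 4 units free, but split into two holes of 2
assert not heap.alloc("E", 4)    # fails because of fragmentation
heap.defragment()
assert heap.alloc("E", 4)        # succeeds once the holes are merged
```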

K. Siozios and D. Soudris, “A Customizable Framework for Application Implementation onto 3-D FPGAs“, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 35, No. 11, pp. 1783-1796, Nov. 2016 (download local copy from here).

Integrating more functionality in a smaller form factor with higher performance and lower power consumption is pushing semiconductor technology scaling to its limits. 3-D chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronic products. Additionally, the complexity of digital designs means that computer-aided design algorithms are becoming harder and slower. This paper introduces a framework for application implementation onto 3-D reconfigurable architectures. In contrast to existing approaches, the proposed solution is customizable according to constraints posed by the application and the target 3-D device in order to improve performance metrics. Experimental results highlight the effectiveness of our framework, as we achieve average enhancements in terms of maximum operation frequency and power consumption of 35% and 47%, respectively, as compared to state-of-the-art algorithms.

P. Danassis, K. Siozios and D. Soudris, “ANT3D: Simultaneous Partitioning and Placement for 3-D FPGAs based on Ant Colony Optimization“, IEEE Embedded Systems Letters, Vol. 8, No. 2, pp. 41–44, June 2016 (download local copy from here).

Three-dimensional technologies offer great promise in providing improvements in the overall circuit performance. This letter introduces a novel netlist partitioning and placement algorithm, named ANT3D, targeting 3-D reconfigurable architectures, based on ant colony optimization (ACO). Experimental results show the effectiveness of the ANT3D algorithm, as we achieve a performance enhancement of 10% on average, compared to state-of-the-art tools, while using significantly fewer through-silicon vias (TSVs). Finally, by exploiting the inherent parallelism found in ACO algorithms, it is feasible to notably reduce the execution run-time of our algorithm.

P. Danassis, K. Siozios and D. Soudris, “Parallel Application Placement onto 3-D Reconfigurable Architectures“, International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1–4, May 2016, Thessaloniki, Greece (download local copy from here).

Placement is considered one of the most arduous and time-consuming processes in physical implementation flows for reconfigurable architectures, while it highly affects the quality of the derived application implementation as it is tightly coupled to the total wirelength and hence the maximum operating frequency. This problem becomes more acute for three-dimensional (3-D) architectures since the complexity of such architectures imposes additional challenges that have to be sufficiently addressed. Throughout this paper we introduce a novel placement algorithm, targeting 3-D reconfigurable architectures, based on Ant Colony Optimization (ACO). Experimental results validate the effectiveness of our algorithm since it achieves a 10% reduction in the critical path delay on average. Additionally, in contrast to relevant approaches which are executed sequentially, the proposed algorithm exhibits inherent parallelism and can take full advantage of today’s multi-core architectures.

K. Maragos, K. Siozios and D. Soudris, “An Evolutionary Algorithm for Netlist Partitioning Targeting 3-D FPGAs“, IEEE Embedded Systems Letters, Vol. 7, No. 4, pp. 117–120, Dec. 2015 (download local copy from here).

Three-dimensional (3-D) chip stacking is considered as the silver bullet technology to preserve Moore’s momentum and fuel the next wave of consumer electronics. However, the benefits of such an integration technology have not yet been explored due to limitations posed mostly by the lack of efficient tools to support application mapping onto these devices. This letter introduces a framework based on a genetic algorithm for netlist partitioning targeting 3-D reconfigurable platforms. Experimental results prove the efficiency of our solution, as we achieve average reduction of the number of utilized through-silicon vias (TSVs) up to 17% for comparable performance metrics against relevant state-of-the-art algorithms.
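
As an illustration of the kind of cost a partitioner for 3-D devices might evaluate, the Python sketch below estimates the TSV count of a layer assignment. It assumes, purely for illustration, that each net needs one TSV per layer boundary it crosses; the function `count_tsvs` and the example netlist are hypothetical and are not the letter's fitness function.

```python
def count_tsvs(nets, layer_of):
    """Estimate TSVs for a layer assignment (one TSV per crossed layer boundary)."""
    total = 0
    for cells in nets:
        layers = [layer_of[c] for c in cells]
        # A net spanning layers lo..hi crosses (hi - lo) boundaries.
        total += max(layers) - min(layers)
    return total


# Hypothetical 4-cell netlist with three nets.
nets = [("a", "b"), ("b", "c", "d"), ("a", "d")]
clustered = {"a": 0, "b": 0, "c": 1, "d": 1}
scattered = {"a": 0, "b": 1, "c": 0, "d": 2}
assert count_tsvs(nets, clustered) == 2
assert count_tsvs(nets, scattered) == 5
```

A search heuristic such as a genetic algorithm could use a metric of this kind, alongside performance estimates, to rank candidate partitions.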

D. Diamantopoulos, S. Xydis, K. Siozios and D. Soudris, “Mitigating Memory-induced Dark Silicon in Many-Accelerator Architectures”, IEEE Computer Architecture Letters, Vol. 14, No. 2, pp. 136–139, July-Dec. 2015 (download local copy from here).

Many-Accelerator (MA) systems have been introduced as a promising architectural paradigm that can boost performance and improve power of general-purpose computing platforms. In this paper, we focus on the problem of resource under-utilization, i.e. Dark Silicon, in FPGA-based MA platforms. We show that, apart from the typically expected peak power budget, on-chip memory resources form a severe under-utilization factor in MA platforms, leading to up to 75 percent of dark silicon. Recognizing that static memory allocation (the de-facto mechanism supported by modern design techniques and synthesis tools) forms the main source of memory-induced Dark Silicon, we introduce a novel framework that extends conventional high level synthesis (HLS) with dynamic memory management (DMM) features, enabling accelerators to dynamically adapt their allocated memory to the runtime memory requirements, thus maximizing the overall accelerator count through effective sharing of the FPGA’s memory resources. We show that our technique delivers significant gains in the FPGA’s accelerator density, i.e. 3.8×, and application throughput up to 3.1× and 21.4× for shared and private memory accelerators.

D. Diamantopoulos, S. Xydis, K. Siozios and D. Soudris, “High-Level-Synthesis extensions for scalable Single-Chip Many-Accelerators on FPGAs“, International Conference on Field-Programmable Logic and Applications (FPL), pp.1-2, Sept. 2015, London, England (download local copy from here).

Accelerator-coupled systems have been introduced as a promising architectural paradigm that can boost performance and improve power of general-purpose computing platforms. This research focuses on the accelerators’ scalability problem due to resource under-utilization in FPGA-based accelerator-coupled platforms. By recognizing that static memory allocation, the de-facto memory management mechanism supported by modern design techniques and synthesis tools, forms the main source of memory-induced under-utilization, i.e. leading to up to 75% of dark silicon, we propose the development of a) a Single-Chip Many-Accelerator (SCMA) architecture that reduces the energy budget by providing high-throughput processing nodes hooked under the same low-latency FPGA die and b) a novel design framework that extends conventional RTL and High Level Synthesis (HLS) design flows with dynamic memory management (DMM) features to leverage scalability by enabling accelerators to dynamically adapt their allocated memory to the runtime memory requirements, thus maximizing the overall accelerator count through effective sharing of the FPGA’s memory resources. By applying these novel techniques in the state-of-the-art Vivado-HLS tool, we managed to increase accelerator density up to 3.8× for a Xilinx Ultrascale device and deliver architecture solutions that trade off per-accelerator latency overhead (1.2×-19.9×) with the overall system’s throughput (2.6×-23.1×) and performance-per-watt (0.09×-21.7×).

D. Diamantopoulos, S. Xydis, K. Siozios and D. Soudris, “Dynamic Memory Management in Vivado-HLS for Scalable Many-Accelerator Architectures“, International Workshop on Applied Reconfigurable Computing (ARC), pp. 117-128, April 2015, Bochum, Germany (download local copy from here).

This paper discusses the incorporation of dynamic memory management during High-Level-Synthesis (HLS) for effective resource utilization in many-accelerator architectures targeting FPGA devices. We show that in today’s FPGA devices, the main limiting factor of scaling the number of accelerators is the starvation of the available on-chip memory. For many-accelerator architectures, this leads to severe inefficiencies, i.e. memory-induced resource under-utilization of the rest of the FPGA’s resources. Recognizing that static memory allocation – the de-facto mechanism supported by modern design techniques and synthesis tools – forms the main source of “resource under-utilization” problems, we introduce the DMM-HLS framework that extends conventional HLS with dynamic memory allocation/deallocation mechanisms to be incorporated during many-accelerator synthesis. We integrated the proposed framework with the industrial-strength Vivado-HLS tool, and we evaluate its effectiveness with a set of key accelerators from emerging application domains. DMM-HLS delivers a significant increase in the FPGA’s accelerator density (3.8× more accelerators) in exchange for affordable overheads in terms of delay and resource count.

K. Siozios, H. Sidiropoulos and D. Soudris, “Architectures and CAD Tools for 3D FPGAs“, Book Chapter in “Reconfigurable Logic: Architecture, Tools and Applications“, pp. 489–511, CRC press, 1st Edition, 2016 (download local copy from here).

Field-Programmable Gate Arrays (FPGAs) have become the implementation medium for the majority of digital circuits. The key to FPGAs’ popularity is their ability to support application implementation by appropriately (re-)configuring the functionality of hardware resources. This allows FPGAs to provide higher flexibility, rapid product prototyping and significantly reduced non-recurring engineering (NRE) costs, as compared to ASIC (Application-Specific Integrated Circuit) devices. Additionally, this situation makes the FPGA paradigm grow in importance, as there is a stronger demand for faster, smaller, cheaper, and lower-energy devices. For decades, semiconductor manufacturers have been shrinking transistor size in Integrated Circuits (ICs) to achieve the yearly increases in performance described by Moore’s Law, which exists only because the RC delay was negligible, as compared to the signal propagation delay. For sub-micron technology, however, the RC delay becomes a dominant factor. Furthermore, previous studies showed that at the 130nm technology node approximately 51% of the microprocessor’s power is consumed by the interconnect fabric. This has generated many discussions concerning the end of device scaling as we know it, and has hastened the search for solutions beyond the perceived limits of current 2-D devices.

D. Diamantopoulos, K. Siozios, S. Xydis and D. Soudris, “GENESIS: Parallel Application Placement onto Reconfigurable Architectures“, ACM Transactions on Embedded Computing Systems, Vol. 14, No. 1, Article 18, 26 pages, January 2015 (download local copy from here).

Placement is thought of as one of the most time-consuming processes in physical implementation flows for reconfigurable architectures, while it highly affects the quality of the derived application implementation, as it has an impact on the maximum operating frequency. Throughout this article, we propose a novel placer, based on a genetic algorithm, targeting FPGAs. Unlike relevant approaches, which are executed sequentially, the new placer exhibits inherent parallelism, which can benefit from multicore processors. Experimental results prove the effectiveness of this solution, as it achieves average reductions of execution runtime and application delay of 67× and 16%, respectively.

H. Sidiropoulos, K. Siozios, P. Figuli, D. Soudris, M. Hubner and J. Becker, “JITPR: A Framework for Supporting Fast Application’s Implementation onto FPGAs“, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 6, No. 2, Article 7, pp. 7:1-7:12, August 2013 (download local copy from here).

The execution runtime is usually a headache for designers performing application mapping onto reconfigurable architectures. In this article we propose a methodology, as well as the supporting toolset, aiming to provide fast application implementation onto reconfigurable architectures with the usage of a Just-In-Time (JIT) compilation framework. Experimental results prove the efficiency of the introduced framework, as we reduce the execution runtime compared to the state-of-the-art approach by 53.5× on average. Additionally, the derived solutions achieve operation frequencies that are higher by 1.17×, while they also exhibit significantly lower fragmentation ratios of hardware resources.

H. Sidiropoulos, K. Siozios, P. Figuli, D. Soudris and M. Hubner, “On Supporting Efficient Partial Reconfiguration with Just-In-Time Compilation“, International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 328-335, 2012, Shanghai (download local copy from here).

Partial reconfiguration can deliver virtually unlimited hardware resources since it enables dynamic allocation and de-allocation of tasks onto a reconfigurable architecture, while the remaining tasks continue to operate. However, in order to benefit from this flexibility, partial reconfiguration has to be appropriately applied. Among others, the placement of partial configuration data is a critical issue since it affects the fragmentation of hardware resources. In this paper we introduce a novel methodology for supporting partial reconfiguration with the usage of a Just-in-Time (JIT) Compilation framework. Experimental results with a number of benchmarks showed that the introduced solution performs application P&R 7.34× faster, as compared to the state-of-the-art tools, while it also leads to significantly lower fragmentation of hardware resources.

K. Siozios and D. Soudris, “A Tabu-based Partitioning And Layer Assignment Algorithm for 3-D FPGAs“, IEEE Embedded Systems Letters, Vol. 3, No. 3, pp. 97-100, Sept. 2011 (download local copy from here).

Integrating more functionality in a smaller form factor with higher performance and lower power consumption is pushing semiconductor technology scaling to its limits. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronics products. This letter introduces a TSV-aware partitioning algorithm that enables higher performance for application implementation onto 3-D field-programmable gate arrays (FPGAs). Unlike other algorithms that minimize the number of connections among layers, our solution leads to a more efficient utilization of the available (fabricated) interlayer connectivity. Experimental results show average reductions in delay and power consumption, as compared to similar 3-D computer-aided design (CAD) tools, about 28% and 26%, respectively.

K. Siozios and D. Soudris, “A Temperature-Aware Placement and Routing Algorithm Targeting 3D FPGAs“, International Conference on Very Large Scale Integration (VLSI-SoC), pp. 221-231, Oct. 2008, Rhodes, Greece (download local copy from here).

In current reconfigurable architectures, the interconnect structures increasingly contribute to the delay and power consumption budget. The demand for increased clock frequencies and logic availability (smaller area footprint) makes the problem even more important, leading among others to rapid elevation in power density. Three-dimensional (3D) architectures are able to alleviate this problem by accommodating a number of functional layers, each of which might be fabricated in a different technology. Since power consumption is a critical challenge for implementing applications onto reconfigurable hardware, a novel temperature-aware placement and routing (P&R) algorithm targeting 3D FPGAs is introduced. The proposed algorithm redistributes the switched capacitance over identical hardware resources in a rather “balanced” profile, reducing among others the number of hotspot regions, the maximal values of power sources at hotspots, as well as the percentage of device area that consumes high power. For evaluation purposes, the proposed approach is realized as a new CAD tool, named 3DPRO (3D-Placement-and-Routing-Optimization), which is part of the complete framework, named 3D MEANDER. Compared to alternative solutions, the proposed one reduces the percentage of silicon area that operates under high power by 63%, while it leads to energy savings (about 9%), with an almost negligible penalty in application’s delay ranging from 1% up to 5%.

K. Siozios, D. Soudris and A. Thanailakis, “A Novel Allocation Methodology for Partial and Dynamic Bitstream Generation of FPGA Architectures“, Journal of Circuits, Systems, and Computers (JCSC), Vol. 19, No. 3, pp. 701-717, May 2010 (download local copy from here).

Partial re-configuration is the process of configuring a portion of an FPGA while the rest of the device is still running/operating. This paper proposes a novel allocation methodology for realizing applications with partial and dynamic features on FPGAs. The methodology was implemented as a manager that incorporates two stages: the first one modifies the configuration data of each partial bitstream by replacing the associated application’s functionalities (or slices), its goal being to compact the slice distribution, while keeping the same functionality. The second one determines the appropriate spatial location over the FPGA device where the previously optimized configuration data should be placed. The proposed manager is device independent, since it derives partial configuration data that can dynamically program any island-style or hierarchical FPGA. For demonstration purposes, the proposed manager was implemented as part of an existing bitstream generator tool, named DAGGER (part of the MEANDER framework), targeting Virtex-like architectures.

K. Siozios and D. Soudris, “A Power-Aware Placement and Routing Algorithm Targeting 3D FPGAs“, Journal of Low-Power Electronics (JOLPE), Vol. 4, No. 3, pp. 275-289, Dec. 2008 (download local copy from here).

In current reconfigurable architectures, the interconnect structures increasingly contribute to the delay and power consumption budget. The demand for increased clock frequencies and logic availability (smaller area footprint) makes the problem even more important, leading among others to rapid elevation in power density. Three-dimensional (3D) architectures are able to alleviate this problem by accommodating a number of functional layers, each of which might be fabricated in a different technology. Since power consumption is a critical challenge for implementing applications onto reconfigurable hardware, a novel power-aware placement and routing (P&R) algorithm targeting 3D FPGAs is introduced. The proposed algorithm redistributes the switched capacitance over identical hardware resources in a rather “balanced” profile, reducing among others the number of hotspot regions, the maximal values of power sources at hotspots, as well as the percentage of device area that consumes high power. For evaluation purposes, the proposed approach is realized as a new CAD tool, named 3DPRO (3D-Placement-and-Routing-Optimization), which is part of the complete framework, named 3D MEANDER. Compared to alternative solutions, the proposed one reduces the percentage of silicon area that operates under high power by 63%, while it leads to energy savings (about 9%), with an almost negligible penalty in application’s delay ranging from 1% up to 5%.

K. Siozios and D. Soudris, “An Efficient Approach for Managing Power Consumption Hotspots Distribution on 3D FPGAs“, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 439–448, Sep. 2008, Lisbon, Portugal (download local copy from here).

Using new silicon technologies, increasing logic densities and clock frequencies on FPGAs leads to rapid elevation in power density. Since power consumption is a critical challenge for application implementation, a novel power-aware partitioning, placement and routing (P&R) algorithm targeting 3D FPGAs is introduced. The proposed methodology redistributes the switched capacitance over the hardware resources in a rather “balanced” profile, reducing among others the maximal on-chip temperatures. Due to the relation between switched capacitance and power consumption, the proposed P&R algorithm can be considered as a power management approach. This algorithm is realized as part of the 3DPRO tool. Compared to alternative P&R solutions, we reduce the hotspot area by about 68%, while we achieve savings in delay and energy consumption of about 9% and 11% on average, respectively.

K. Siozios and D. Soudris, “A Novel Methodology for Temperature-Aware Placement and Routing of FPGAs“, IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 55-60, May 2007, Porto Alegre, Brazil (download local copy from here).

Shrinking silicon technologies, increasing logic densities and clock frequencies on FPGAs lead to rapid elevation in power density, which translates to higher on-chip temperature. Recently, the FPGA industry (e.g. Xilinx, Altera) recognized the dominance of the heat problem as one of its key design issues, which should be tackled immediately. In this paper, considering a novel temperature-aware placement and routing algorithm, a systematic methodology to achieve a more “balanced” temperature distribution in the whole FPGA device is introduced. Since the temperature is directly related to the switching activity of the FPGA hardware resources, the main goal of the proposed methodology is to appropriately manipulate the switching activity appearing in different regions of the FPGA. Using the temperature-aware algorithm, we redistribute the switching activity over the FPGA resources, resulting in a rather “balanced” profile. Comparing with a conventionally placed and routed FPGA (e.g. VPR), we proved that up to 33% temperature reduction in hotspots can be achieved with negligible side effects in circuit delay, energy/power consumption and silicon area. The proposed methodology is fully supported by the software tool called EX-VPR.

D. Soudris, K. Tatas, K. Siozios, G. Koutroumpezis, S. Nikolaidis, S. Siskos, N. Vasiliadis, V. Kalenteridis, H. Pournara and I. Pappas, “AMDREL: A Novel Low-Energy FPGA Architecture and Supporting CAD Tool Design Flow“, Chapter 3 in “Fine- and Coarse-Grain Reconfigurable Systems“, Editors: S. Vassiliadis and D. Soudris, pp. 153-180, Springer, 2007 (download local copy from here).

This chapter describes a complete system for the implementation of digital logic in a fine-grain reconfigurable platform (FPGA). The energy-efficient FPGA architecture is designed and simulated in STM 0.18μm CMOS technology. The detailed design and circuit characteristics of the Configurable Logic Block and the interconnection network are determined and evaluated in terms of energy, delay and area. A number of circuit-level low-power techniques are employed because power consumption is the primary concern. Additionally, a complete tool framework for the implementation of digital logic circuits in FPGA platforms is introduced. The framework is composed of i) non-modified academic tools, ii) modified academic tools and iii) new tools. The developed tool framework supports a variety of FPGA architectures. Qualitative and quantitative comparisons with existing academic and commercial architectures and tools are provided, yielding promising results.

K. Siozios, K. Tatas, D. Soudris and A. Thanailakis, “A Novel Methodology for Designing High-Performance and Low-Energy FPGA Interconnection Architecture“, ACM International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 224, 2006, Monterey, USA (download local copy from here).

FPGA architecture characteristics and capabilities have changed and improved significantly in recent years, from a homogeneous and regular architecture to a heterogeneous (or piece-wise homogeneous) and piece-wise regular architecture. The platform-based design allows the designer to build a customized FPGA architecture, using specific blocks, depending on the application domain requirements. Due to the fact that about 70-90% of a typical FPGA is occupied by routing resources, many researchers have spent much effort on minimizing energy consumption and on achieving higher frequencies. Their clear message was the fact that the interconnection structure dominates the total power dissipation and performance. A typical interconnection network of an FPGA consists of the Switch Boxes (SBs) and the wire segments. In this paper, we propose a novel methodology for designing a high-performance and low-energy routing architecture of an island-style FPGA platform. The basic idea behind the new methodology is to choose the corresponding performance- and energy-efficient combination of multiple SBs and the appropriate wire segment, taking into account the considered application-domain characteristics. More specifically, the proposed methodology selects the appropriate combination of SBs, depending on the localized performance and energy consumption requirements of each specific region of the FPGA architecture. The efficiency of an SB and wire segment is characterized by analyzing parameters such as performance, energy dissipation and the number of required tracks for successful application routing. Using MCNC benchmarks, an extensive comparison study proves the effectiveness of the proposed approach, achieving a performance increase of 52% and a reduction of energy consumption of 12%, on average.

K. Siozios, D. Soudris and A. Thanailakis, “Efficient Power Management Strategy of FPGAs Using a Novel Placement Technique“, IFIP International Conference on Very Large Scale Integration (VLSI-SoC), pp. 204-209, Oct. 2006, Nice, France (download local copy from here).

Power consumption is one of the major headaches that designers have to tackle. The parameters that significantly affect performance and power are the configurable logic blocks (CLBs) and the interconnection components. A novel approach for efficient implementation of applications onto reconfigurable architectures is introduced. The main goal of this technique is to spread out the power consumption across the whole device, as well as to minimize it, achieving a more uniform power consumption map across the whole FPGA. This approach is based on finding the optimal CLB placement according to the resource utilization map. The proposed methodology can be applied for mapping applications with an efficient power management strategy. Furthermore, the proposed placement algorithm reduces the total power consumption, the leakage power, the total energy and silicon area. The proposed methodology is fully supported by the software tool called EX-VPR. Applying this placement strategy reduces power consumption by about 5%, while distributing it in a rather “uniform” fashion across the whole device and minimizing power spikes.

K. Siozios, G. Koutroumpezis, K. Tatas, N. Vasiliadis, V. Kalenteridis, H. Pournara, I. Pappas, D. Soudris, A. Thanailakis, S. Nikolaidis and S. Siskos, “A Novel FPGA Architecture and an Integrated Framework of CAD Tools for Implementing Applications“, IEICE Transactions on Information and Systems, Vol. E88-D, No. 7, pp. 1369-1380, July 2005 (download local copy from here).

A complete system for the implementation of digital logic in a Field-Programmable Gate Array (FPGA) platform is introduced. The novel power-efficient FPGA architecture was designed and simulated in STM 0.18μm CMOS technology. The detailed design and circuit characteristics of the Configurable Logic Block, the interconnection network, the switch box and the connection box were determined and evaluated in terms of energy, delay and area. A number of circuit-level low-power techniques were employed because power consumption was the primary concern. Additionally, a complete tool framework for the implementation of digital logic circuits in FPGA platforms is introduced. Having as input the VHDL description of an application, the framework derives the reconfiguration bitstream of the FPGA. The framework consists of: i) non-modified academic tools, ii) modified academic tools and iii) new tools. Furthermore, the framework can support a variety of FPGA architectures. Qualitative and quantitative comparisons with existing academic and commercial architectures and tools are provided, yielding promising results.

K. Siozios, G. Koutroumpezis, K. Tatas, D. Soudris, and A. Thanailakis, “DAGGER: A Novel Generic Methodology for FPGA Bitstream Generation and its Software Tool Implementation“, International Parallel and Distributed Processing Symposium, pp. 165b, 2005, Colorado, USA (download local copy from here).

A novel bitstream generation algorithm and its software implementation are introduced. Although this tool was developed for the configuration of the AMDREL FPGA reconfigurable platform, it could be used to program any other compatible device. This tool is the only known academic implementation for FPGA configuration with such features. Among them are run-time, partial and dynamic reconfiguration, memory management, bitstream compression and encryption, the read-back technique, bitstream reallocation, the low-power techniques used, as well as the graphical user interface.

K. Siozios, G. Koutroumpezis, K. Tatas, D. Soudris, and A. Thanailakis, “A Novel FPGA Configuration Bitstream Generation Algorithm and Tool Development“, International Conference on Field Programmable Logic and Applications (FPL), pp. 1116-1118, Aug. 2004, Belgium (download local copy from here).

A novel configuration bitstream generation tool for a custom FPGA platform is presented. It can support a variety of devices of similar architecture. The tool exhibits technology independence and is easily modifiable. The tool also allows partial reconfiguration as long as the target platform also does.

Platform Virtualization

A. Nanos, A. Kretsis, C. Mainas, G. Ntoutsos, A. Ferikoglou, D. Danopoulos, A. Kokkinis, D. Masouros, K. Siozios, P. Soumplis, P. Kokkinos, J.J. Vegas-Olmos, E. Varvarigos, “Hardware-Accelerated FaaS for the Edge-Cloud Continuum”, International Conference on Network Protocols (ICNP), pp.1–6, 2023, Reykjavik, Iceland.

We present an end-to-end solution to facilitate the seamless execution of hardware-accelerated compute-intensive tasks on heterogeneous hardware platforms spanning the Cloud-Edge continuum. Our approach includes a programming interface, orchestration, application management components, the vAccel framework, and a library of hardware-accelerated kernels. These components enable a Function-as-a-Service (FaaS) based operational flow that supports numerous diverse use cases while minimizing the time required for the developer to integrate their code and for the vendor to provide hardware acceleration capabilities to end users. Experimental results showcase the merits of our approach.

A. Kokkinis, A. Nanos and K. Siozios, “Enabling an Isolated and Energy-Aware Deployment of Computationally Intensive Kernels on Multi-Tenant Environments”, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Samos, Greece, 2023 (download local copy from here).

Nowadays, hardware acceleration can be used as a service for maximizing the applications’ performance and achieving significant speedup in time-critical scenarios. FPGA devices inherently consume less power than GPUs and HPC systems and are candidate solutions for performing low-energy yet high-performance computations. However, hardware acceleration services require a private, isolated and flexible execution of the accelerators in multi-tenant environments without compromising the platform’s energy and performance efficiency. In this paper we aim to address this issue by providing an end-to-end methodology for the generation, virtualization and deployment of High-Level Synthesis accelerators in multi-tenant environments. We leverage approximate computing techniques and utilize the vAccel framework. Our proposed methodology was evaluated on the Xilinx Alveo U50 acceleration card, achieving energy savings up to 5.2x compared to the initial non-energy-optimized and non-virtualized designs.

E. Sotiriou-Xanthopoulos, L. Masing, S. Xydis, K. Siozios, J. Becker and D. Soudris, “OpenCL-based Virtual Prototyping and Simulation of Many-Accelerator Architectures“, ACM Transactions on Embedded Computing Systems (TECS), Vol. 17, No. 5, Article 86, pp. 1-27, Nov. 2018 (download local copy from here).

Heterogeneous architectures featuring multiple hardware accelerators have been proposed as a promising solution for meeting the ever-increasing performance and power requirements of embedded systems. However, the existence of numerous design parameters may result in different architectural schemes and thus in extra design effort. To address this issue, OpenCL-based frameworks have been recently utilized for FPGA programming, to enable the portability of a source code to multiple architectures. However, such OpenCL frameworks focus on RTL design, thus not enabling rapid prototyping and abstracted modeling of complex systems. Virtual Prototyping aims to overcome this problem by enabling the system modeling in higher abstraction levels. This article combines the benefits of OpenCL and Virtual Prototyping, by proposing an OpenCL-based prototyping framework for data-parallel many-accelerator systems, which (a) creates a SystemC Virtual Platform from OpenCL, (b) provides a co-simulation environment for the host and the Virtual Platform, (c) offers memory and interconnection models for parallel data processing, and (d) enables the system evaluation with alternative real number representations (e.g., fixed-point or 16-bit floating-point).

I. Koutras, K. Maragos, D. Diamantopoulos, K. Siozios, D. Soudris, “On supporting rapid prototyping of embedded systems with reconfigurable architectures“, Integration, the VLSI Journal, Vol. 58, June 2017, pp. 91-100 (download local copy from here).

Reducing time-to-market while improving product quality is a big challenge. This paper proposes a software-supported framework for rapid prototyping that offers a concurrent fast hardware/software system-level design. The introduced framework enables the constant evaluation and verification of the prototype under development, while it provides automatic functionality mapping to hardware via High-Level Synthesis techniques. We evaluate our framework and its software instantiation with a computer vision algorithm. Based on our experimentation, we show that our approach reduces the development time by almost 64×, it prunes the hardware design space by 34×, while maintaining designs that trade-off high Quality-of-Report on the Pareto frontier.

P. Figuli, W. Ding, S. Figuli, K. Siozios, D. Soudris, and J. Becker, “Parameter Sensitivity in Virtual FPGA Architectures“, International Symposium on Applied Reconfigurable Computing (ARC), pp. 141-153, Netherlands, April 3-7, 2017 (download local copy from here).

Virtual FPGAs add the benefits of increased flexibility and application portability on bitstream level across any underlying commercial off-the-shelf FPGAs at the expense of additional area and delay overhead. Hence it becomes a priority to tune the architecture parameters of the virtual layer. Thereby, the adoption of parameter recommendations intended for physical FPGAs can be misleading, as they are based on transistor level models. This paper presents an extensive study of architectural parameters and their effects on area and performance by introducing an extended parameterizable virtual FPGA architecture and deriving suitable area and delay models. Furthermore, a design space exploration methodology based on these models is carried out. An analysis of over 1400 benchmark-runs with various combinations of cluster and LUT size reveals high parameter sensitivity with variances up to ±95.9% in area and ±78.1% in performance and a discrepancy to the studies on physical FPGAs.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos and D. Soudris, “An Integrated Exploration and Virtual Platform Framework for Many-Accelerator Heterogeneous Systems“, ACM Transactions on Embedded Computing Systems (TECS), Vol. 15, No. 3, Article 43, 26 pages, July 2016 (download local copy from here).

The recent advent of many-accelerator systems-on-chip (SoC), driven by the need for maximizing throughput and power efficiency, has led to an exponential increase in the hardware/software co-design complexity. The reason for this increase is that the designer has to explore a vast number of architectural parameter combinations for each single accelerator, as well as inter-accelerator configuration combinations under specific area, throughput, and power constraints, given that each accelerator has different computational requirements. In such a case, the design space size explodes. Thus, existing design space exploration (DSE) techniques give poor-quality solutions, as the design space cannot be adequately covered in a reasonable time. This problem is aggravated by the very long simulation time of the many-accelerator virtual platforms (VPs). This article addresses these design issues by (a) presenting a virtual prototyping solution that decreases the exploration time by enabling the evaluation of multiple configurations per VP simulation and (b) proposing a DSE methodology that efficiently explores the design space of many-accelerator systems. With the use of two fully developed use cases, namely an H.264 decoding server for multiple video streams and a parallelized denoising system for MRI scans, we show that the proposed DSE methodology either leads to Pareto points that dominate over those of a typical DSE scenario or finds new solutions that might not be found by the typical DSE. In addition, the proposed virtual prototyping solution leads to a DSE runtime reduction reaching 10× for H.264 and 5× for Rician denoising.
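
For readers unfamiliar with the term, the notion of Pareto points that "dominate" others can be illustrated with a minimal sketch. It is not the DSE methodology of the article; the two objectives (area and latency, both minimized) and the design points are made-up assumptions used only to show how a non-dominated set is extracted.

```python
from typing import List, Tuple

# Hypothetical objectives: (area, latency), both to be minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up design points, for illustration only.
designs = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 8.0), (10.0, 6.0)]
front = pareto_front(designs)
print(front)
```

Here (9.0, 8.0) is dominated by (8.0, 7.0) and (10.0, 6.0) by (10.0, 5.0), so only three points remain on the front. The quadratic scan is adequate for small illustrative sets; large design spaces would call for a more efficient filter.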

E. Sotiriou-Xanthopoulos, L. Masing, K. Siozios, G. Economakos, D. Soudris, J. Becker, “An OpenCL-based Framework for Rapid Virtual Prototyping of Heterogeneous Architectures“, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), pp. 372-377, July 2016, Samos, Greece (download local copy from here).

The increasing performance and power requirements in embedded systems have led to a variety of heterogeneous hardware architectures, featuring many different types of processing elements. This heterogeneity however induces extra effort on system development and programming. To address this heterogeneity, OpenCL provides a portable programming model which enables the use of one source code in various architectures featuring different types of processors. Also, such systems impose higher design complexity due to the existence of an increased number of hardware components. Virtual Prototyping aims to alleviate this issue by enabling the hardware modeling in higher abstraction levels. This paper combines the benefits of OpenCL with Virtual Prototyping, by proposing an OpenCL-based framework for rapid prototyping, which (a) automatically derives a virtual prototype from an OpenCL code; (b) executes the OpenCL application by running the host program along with the hardware simulation; and (c) proposes a design flow for faster system evaluation, as compared to the state-of-the-art FPGA-based flow. Using a set of OpenCL benchmarks, it is shown that the proposed framework enables faster prototyping by up to 18x, as compared to the state-of-the-art flow.

K. Siozios, I. Savvidis and D. Soudris, “A Framework for Exploring Alternative Fault-Tolerant Schemes Targeting 3-D Reconfigurable Architectures“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), July 2016, pp. 336-341, Samos, Greece (download local copy from here).

For decades computer architects pursued one primary goal: performance. Transistor scaling has translated into remarkable gains in operating frequency and reduction in power consumption. However, increased complexity from the device to architecture levels imposes several new challenges, including a decrease in dependability/reliability due to physical failures. Reconfigurable platforms are highly susceptible to scaling-related complexity, typically resulting in higher power consumption as compared to application-specific integrated circuits. The concern becomes far more important in the 3-D integrated circuit (IC) domain as vertically stacked blocks exhibit increased thermal resistance to the heat sink. The degradation in dependability becomes an important design challenge, not only for safety-critical systems, but for the majority of architectures. In this paper, a framework used to explore alternative fault-tolerant schemes is proposed that masks the degradation in reliability for 3-D FPGA platforms. Simulation results at the RTL level highlight the benefits of the introduced solution, as the maximum operating frequency and power consumption are improved by 33% and 26%, respectively, as compared to similar state-of-the-art solutions.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos and D. Soudris, “Rapid prototyping and Design Space Exploration methodologies for many-accelerator systems“, International Conference on Field-Programmable Logic and Applications (FPL), pp. 1–2, Sept. 2015, London, England (download local copy from here).

The ever-growing design complexity of modern embedded systems and the need for lower energy consumption have led to design techniques which aim to bridge the gap between the designer’s productivity and the design complexity. In particular, Virtual Prototyping enables the system modeling and simulation in multiple abstraction levels, while the automated Design Space Exploration (DSE) aims to find optimized design solutions in a reasonable time. However, there is a need for more efficient techniques for prototyping and co-simulation, as rapid simulation has become a stringent requirement. In addition, as emerging heterogeneous architectures expose even higher design complexity, typical DSE techniques may not achieve high-quality design solutions. To this end, the proposed design flow introduces (a) a set of prototyping techniques which target faster but accurate simulation, also supporting the system co-simulation with other environments, and (b) a number of DSE methodologies for high-complexity computation and communication architectures.

E. Sotiriou-Xanthopoulos, G. Shalina, P. Figuli, K. Siozios, G. Economakos, J. Becker, “A Power Estimation Technique for Cycle-Accurate Higher-Abstraction SystemC-based CPU Models“, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), pp. 70–77, July 2015, Samos, Greece (download local copy from here).

Due to the ever-increasing complexity of embedded system design and the need for rapid system evaluations in early design stages, the use of simulation models known as Virtual Platforms (VPs) has been of utmost importance as they enable system modeling at higher abstraction levels. Since a typical VP features multiple interdependent components, VP libraries have been utilized in order to provide off-the-shelf models of commonly-used hardware components, such as CPUs. However, CPU power estimation is not adequately supported by existing VP libraries. In addition, existing power characterization techniques require architectural details which are not always available in early design stages. To address this issue, this paper proposes a technique for power annotation of CPU models targeting SystemC/TLM libraries in order to enable accurate power estimation at higher abstraction levels. By using a set of benchmarks on a power-annotated SystemC/TLM model of the Xilinx MicroBlaze soft-processor, it is shown that the proposed approach can achieve accurate power estimation in comparison to real-system power measurements, as the estimation error ranges from 0.47% up to 6.11% with an average of 2%.
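
As a reading aid, an estimation error of this kind is commonly computed as the absolute difference between estimated and measured power, relative to the measured value. The sketch below assumes that definition, which the abstract does not spell out; the function name and the sample values are hypothetical and are not data from the paper.

```python
def relative_error_percent(estimated: float, measured: float) -> float:
    """Absolute estimation error as a percentage of the measured value."""
    return abs(estimated - measured) / measured * 100.0


# Hypothetical (estimated, measured) power pairs in milliwatts; not paper data.
samples = [(101.0, 100.0), (196.0, 200.0), (153.0, 150.0)]
errors = [relative_error_percent(est, meas) for est, meas in samples]
average_error = sum(errors) / len(errors)

print(errors, average_error)
```

Reporting the minimum, maximum and mean of such per-benchmark errors gives the kind of summary quoted in the abstract (a range plus an average).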

D. Diamantopoulos, E. Sotiriou-Xanthopoulos, K. Siozios, G. Economakos and D. Soudris, “Plug&Chip: A Framework for Supporting Rapid Prototyping of 3D Hybrid Virtual SoCs“, ACM Transactions on Embedded Computing Systems, Vol. 13, No. 5s, Article 168, pp. 1-25, Nov. 2014 (download local copy from here).

In the embedded system domain there is a continuous demand towards providing higher flexibility for application development. This trend calls for virtual prototyping solutions capable of performing fast system simulation. Among other benefits, such a solution supports concurrent hardware/software system design by enabling the development, testing, and validation of the embedded software to start substantially earlier than has been possible in the past. To this end, throughout this article we introduce a new framework, named Plug&Chip, which supports rapid prototyping of 2D and 3D digital systems. In contrast to other relevant approaches, our solution provides higher flexibility by enabling incremental system design, while also handling platforms developed with the usage of 3D integration technology.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios and G. Economakos, “Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 1-8, July 2014, Samos, Greece (download local copy from here).

Modern multiprocessor heterogeneous systems incorporating multiple hardware accelerators on chip have resulted in an excessive increase in the complexity of hardware/software co-design. Designers now have to explore a design space including both per-accelerator architectural parameters as well as inter-accelerator combinations, i.e. different design configurations among the allocated accelerators, as each accelerator instance has different computational requirements, according to different input data, while throughput and area constraints should be met as well. Under such a system scenario, virtual platform prototyping suffers from increased design time phases, since it requires an exponentially larger number of evaluations to achieve adequate coverage of the design space. In this paper, we propose a co-design framework on top of a virtual prototyping solution, customized for many-accelerator heterogeneous systems. The proposed framework defines separate configurations for each accelerator component of the virtual platform, instead of using only one common configuration, thus succeeding in meeting both the area and the throughput constraints. In addition, as the design space size increases exponentially, the proposed framework utilizes process-based reconfigurable SystemC modules to intelligently bypass the non-productive simulation stages, thus delivering faster hardware/software co-design cycles. A case study emulating a heterogeneous server system for simultaneous video decoding of multiple streams shows the efficiency of the proposed approach, delivering design solutions with up to 1.58× improved area or 1.59× improved throughput, while achieving simulation time gains of 40%.
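
The exponential growth mentioned above follows from simple counting: if each of n accelerators can independently take any of k configurations, there are k^n combinations, against only k when one common configuration is shared. The following sketch merely works through that arithmetic; the value k = 6 is an arbitrary assumption, not a figure from the paper.

```python
def design_space_size(configs_per_accelerator: int, num_accelerators: int) -> int:
    """Number of combinations when every accelerator is configured independently."""
    return configs_per_accelerator ** num_accelerators


# Hypothetical setting: 6 candidate configurations per accelerator.
for n in (1, 2, 4, 8):
    print(n, design_space_size(6, n))
```

Going from 4 to 8 accelerators in this toy setting grows the space from 1296 to 1679616 points, which is why exhaustive simulation of every combination quickly becomes impractical.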

K. Siozios, D. Soudris and M. Hubner, “A Framework for Customizing Virtual 3-D Reconfigurable Platforms at Run-Time“, International Parallel & Distributed Processing Symposium Workshops (IPDPSW), pp. 183-188, May 2014, Phoenix, USA (download local copy from here).

Existing application domains exhibit variations in terms of complexity, performance and power consumption, whereas their efficient implementation onto general-purpose reconfigurable platforms is not always a viable solution. To address this, throughout this paper we introduce a software-supported framework for efficient customization of these platforms. Unlike similar approaches, where customization takes place at the design phase (design-time), our solution provides post-fabrication customization of architectural parameters based on the application’s inherent requirements through a virtualization layer. For evaluation purposes, the introduced framework was applied to 3-D reconfigurable architectures. Experimental results with applications from various domains show the effectiveness of our solution, as we achieve average delay and power reduction by 1.43× and 1.15×, respectively, as compared to the existing way of implementing applications.

H. Sidiropoulos, K. Siozios and D. Soudris, “A Framework for Mapping Dynamic Virtual Kernels onto Heterogeneous Reconfigurable Platforms“, International Parallel & Distributed Processing Symposium Workshops (IPDPSW), pp. 170-175, May 2014, Phoenix, USA (download local copy from here).

Field Programmable Gate Arrays (FPGAs) promise a low-power, flexible alternative for today’s heterogeneous systems. In order to be widely accepted, novel solutions and approaches are required for fast and flexible application implementation. In this paper we propose a methodology, as well as the supporting toolflow, for fast implementation of multiple applications onto heterogeneous FPGAs. For this purpose we introduce the concept of dynamic virtual kernels. Experimental results show the efficiency of the introduced solution, as we achieve application mapping 30× faster on average compared to a state-of-the-art approach, with negligible performance degradation. Additionally, we enable the dynamic mapping of multiple applications onto a single FPGA with only a small penalty of 4.7% in the maximum operation frequency of those applications compared with our reference solution.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos and D. Soudris, “Effective Platform-Level Exploration for Heterogeneous Multicores Exploiting Simulation-Induced Slacks“, Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM), pp. 13-16, Jan. 2014, Vienna, Austria (download local copy from here).

Heterogeneous Multi-Processor Systems-on-Chip (MPSoC) exhibit increased design complexity due to numerous architectural parameters and hardware/software partitioning schemes. Automated Design Space Exploration (DSE) becomes an essential design procedure to discover optimized solutions in a reasonable time. For high-quality DSE, accurate solution evaluation is a strong requirement. To this end, High-Level Synthesis (HLS) can be used for the characterization of the design solutions. In this paper, we propose (a) a platform design methodology that exploits simulation-induced slacks generated by avoiding simulation re-initializations and uses the gained time for HLS, and (b) a DSE tool-flow which takes into account multiple HW/SW partitioning schemes and intelligently schedules system evaluations. Experimental results show that the proposed methodology achieves 17% simulation improvements together with 77% higher accuracy, in comparison to a typical exploration approach.

H. Sidiropoulos, P. Figuli, K. Siozios, D. Soudris and J. Becker, “A Platform-Independent Runtime Methodology for Mapping Multiple Applications onto FPGAs Through Resource Virtualization“, International Conference on Field-Programmable Logic and Applications (FPL), pp. 1-4, Sept. 2013, Porto, Portugal (download local copy from here).

Field Programmable Gate Arrays (FPGAs) promise a low-power, flexible alternative for implementing parallel applications. Compared to CPUs and GPUs, they suffer from slow development cycles due to the high complexity of application development and hardware incompatibilities. To address this, we propose a platform-independent methodology and the supporting framework targeting efficient run-time application mapping onto FPGAs. Experimental results show that the introduced solution performs application placement and routing of multiple applications without any performance penalty as compared to state-of-the-art tools. Scalability of the framework was verified by mapping up to 73 applications per minute when it is executed on an 8-core system.

E. Sotiriou-Xanthopoulos, K. Siozios, G. Economakos and D. Soudris, “A Process-Based Reconfigurable SystemC Module for Simulation Speedup“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 72-79, July 2013, Samos, Greece (download local copy from here).

As Multi-Processor Systems-on-Chip (MPSoC) architectures become more and more complex, Design Space Exploration (DSE) becomes the only viable solution for finding the Pareto-optimal designs. To evaluate each solution with a real dataset, DSE has to simulate the design under test, which is modeled as a Virtual Platform usually written in SystemC. However, the simulation is a very slow task which includes non-productive time periods like system initialization, while the platform re-compilation also imposes a significant overhead. In this paper, a Process-based Reconfigurable Module is used in order to bypass the non-productive simulation parts, thus accelerating the simulation. The effectiveness of the proposed methodology is demonstrated with a series of computationally intensive multimedia applications, where the simulation time improvements reach 34% on average.

D. Diamantopoulos, K. Siozios, E. Sotiriou-Xanthopoulos, G. Economakos and D. Soudris, “HVSoCs: A Framework for Rapid Prototyping of 3-D Hybrid Virtual System-on-Chips“, International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, pp. 2194–2199, May 2013, Cambridge, MA (download local copy from here).

In the embedded system domain there is a continuous trend towards providing higher flexibility for application development. This implies that the development of distinct components cannot be thought of as affordable for System-on-Chip platforms, whereas a more holistic approach is necessary for deriving optimal solutions. At the same time, the requirement for integrating more functionality in a smaller form factor, or the integration of different technologies (e.g. memory, logic and sensors) into a single chip, pushes traditional semiconductor technology scaling to its limits. Three-dimensional (3-D) chip stacking is touted as the silver bullet technology that can keep Moore’s momentum and fuel the next wave of consumer electronic products. Despite this flexibility, up to now there is a lack of tools with which designers can effectively produce these new platforms. This paper introduces a new design paradigm, named Hybrid Virtual System-on-Chip, in order to support rapid evaluation of different technologies for IC product development. Our framework starts from SystemC, whereas the target architecture consists of a 3-D chip. Unlike similar approaches, which are mainly based on academic tools, the 3-D HVSoCs is evaluated with the usage of Cadence tools.

Network-on-Chip (NoC)

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos and D. Soudris, “A Framework for Interconnection-Aware Domain-Specific Many-Accelerator Synthesis“, ACM Transactions on Embedded Computing Systems (TECS), Vol. 16, No. 1, Article 8, 26 pages, October 2016 (download local copy from here).

Many-accelerator Systems-on-Chip (SoC) have recently emerged as a promising platform paradigm that combines parallelization with heterogeneity, in order to cover the increasing demands for high performance and energy efficiency. To exploit the full potential of many-accelerator systems, automated design verification and analysis frameworks are required, targeted to both computational and interconnection optimization. Accurate simulation of interconnection schemes should use real stimuli, which are produced from fully functional nodes, requiring the prototyping of the processing elements and memories of the many-accelerator system. In this article, we argue that the Hierarchical Network-on-Chip (HNoC) scheme forms a very promising solution for many-accelerator systems in terms of scalability and data-congestion minimization. We present a parameterizable SystemC prototyping framework for HNoCs, targeted to domain-specific many-accelerator systems. The framework supports the prototyping of processing elements, memory modules, and underlying interconnection infrastructure, while it provides an API for their easy integration to the HNoC. Finally, it enables holistic system simulation using real node data. Using as a case study a many-accelerator system of an MRI pipeline, an analysis of the proposed framework is presented to demonstrate the impact of the system parameters on the system. Through extensive experimental analysis, we show the superiority of HNoC schemes in comparison to typical interconnection architectures. Finally, we show that, adopting the proposed many-accelerator design flow, significant performance improvements are achieved, from 1.2× up to 26×, as compared to an x86 software implementation of the MRI pipeline.

E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios and G. Economakos, “A virtual platform for exploring hierarchical interconnection for many-accelerator systems“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 384–389, July 2015, Samos, Greece (download local copy from here).

The advent of many-accelerator Systems-on-Chip (SoC), as a result of the ever increasing demands for high performance and energy efficiency, has led to the need for new interconnection schemes among the system components, which minimize the communication overhead. To meet this need, Hierarchical Networks-on-Chip (HNoCs) can provide an efficient communication paradigm for such systems: each node is an autonomous sub-network including the hardware accelerators needed by the respective application thread, thus retaining data locality and minimizing congestion. However, HNoC design may lead to an exponential increase in the design space size, due to the numerous parameter combinations of the sub-networks and the overall HNoC. In addition, a prototyping framework supporting HNoC simulation with real stimuli is crucial for accurate system evaluation. Therefore, the goal of this paper is to present (a) a SystemC framework for cycle-accurate simulation of Hierarchical NoCs, accompanied by a NoC API for node mapping on the HNoC; and (b) an exploration flow that aims to reduce the increased design space size. By using the Rician Denoising algorithm for MRI scans as a case study, the proposed DSE flow could achieve up to 2× and 1.48× time and power improvements respectively, as compared to a typical DSE flow.

“Designing 2D and 3D Network-on-Chip Architectures“, Editors: K. Tatas, K. Siozios, A. Jantsch and D. Soudris, Springer Publishing, 2014 (download local copy from here).

E. Sotiriou-Xanthopoulos, D. Diamantopoulos, K. Siozios, G. Economakos and D. Soudris, “A Framework for Rapid Evaluation of Heterogeneous 3-D NoC Architectures“, Microprocessors and Microsystems (MICPRO), Vol. 38, No.4, pp.292-303, June 2014 (download local copy from here).

The scalability of communication infrastructure in modern Integrated Circuits (ICs) becomes a challenging issue, which might be a significant bottleneck if not carefully addressed. Towards this direction, the usage of Networks-on-Chip (NoC) is a preferred solution. In this work, we propose a software-supported framework for quantifying the efficiency of heterogeneous 3-D NoC architectures. In contrast to existing approaches for NoC design, the introduced heterogeneous architecture consists of a mixture of 2-D and 3-D routers, which reduces the delay and power consumption with a slight impact on packet hops. More specifically, the experimental results with a number of DSP applications show the effectiveness of the introduced methodology, as we achieve on average 25% higher maximum operation frequency and 39% lower power consumption compared to the uniform 3-D NoCs.

K. Tatas, K. Siozios, A. Bartzas, C. Kyriacou and D. Soudris, “A Novel Prototyping and Evaluation Framework for NoC-based MPSoC“, International Journal of Adaptive, Resilient and Autonomic Systems (IJARAS), Vol. 4, No. 3, pp. 1-24, 2013 (download local copy from here).

This paper presents a framework for high-level exploration, Register Transfer-Level (RTL) design and rapid prototyping of Network-on-Chip (NoC) architectures. From the high-level exploration, a selected NoC topology is derived, which is then implemented in RTL using an automated design flow. Furthermore, self-checking testbenches for verifying the RTL, as well as architecture files for the semi-automatic implementation of the system in Xilinx EDK, are also generated, significantly reducing design and verification time, and therefore Non-Recurring Engineering (NRE) cost. Simulation and FPGA implementation results are given for four multimedia case studies, proving the validity of the proposed approach.

K. Siozios, I. Anagnostopoulos and D. Soudris, “Multiple Vdd on 3D NoC Architectures“, International Conference on Electronics, Circuits, and Systems (ICECS), pp. 831-834, Dec. 2010, Athens, Greece (download local copy from here).

The communication problem is a challenging issue for Integrated Circuits (ICs), which usually becomes a bottleneck for performance improvement. Three-dimensional integration (3D), as well as network-on-chip (NoC), are two recent design approaches that promise to alleviate the consequences of interconnection degradation. This paper introduces a new methodology for power-efficient application mapping onto 3D NoC-based devices. By clustering IP cores with similar communication demands into the same router, it is possible to achieve reasonable energy savings while meeting timing constraints. Experimental results prove the efficiency of the proposed methodology since we achieve energy savings and temperature reduction up to 19% and 11%, respectively.

K. Siozios, I. Anagnostopoulos and D. Soudris, “A High-Level Mapping Algorithm Targeting 3D NoC Architectures with Multiple Vdd“, IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 444–445, July 2010, Kefalonia, Greece (download local copy from here).

The communication problem in modern ICs is becoming a challenging issue. This paper introduces a high-level mapping algorithm targeting low-power 3D NoC devices. By appropriately assigning the application’s functionalities to layers with different supply voltages, we achieve reasonable energy savings and temperature reduction. Additionally, our methodology supports real-time adaptation to different traffic scenarios. Experimental results show that energy savings up to 19% are feasible, without any area and delay overhead, as compared to architectures powered by only one supply voltage.

A. Richard, D. Milojevic, F. Robert, A. Bartzas, A. Papanikolaou, K. Siozios and D. Soudris, “Fast Design Space Exploration Environment Applied on NoC’s for 3D-Stacked MPSoC’s“, International Conference on Architecture of Computing Systems (ARCS), pp. 1–6, Hannover, Germany, Feb. 2010 (download local copy from here).

In this paper we present a design methodology and associated tool chain for efficient design of complex MPSoC architectures implemented using 3D-Stacked Integrated Circuits (3D-SIC). The proposed framework is based on a three step methodology that combines relatively accurate high-level, and two more accurate low-level prototyping tools. The high-level exploration tool, Nessie, developed at the ULB, allows designers to quickly simulate several system architecture and application scenarios. Using Nessie, the designer can easily explore many different system level options before deciding on the design space points he would like to explore in more detail. In this paper, the high-level estimations are subsequently validated using an existing C++, transaction-level, 3D-SIC aware NoC simulator. Then, floorplanning and global routing of the system are performed using a novel tool for 3D physical prototyping. The low-level performance metrics of the system are derived from the resulting physical prototype and can be compared to the results predicted by Nessie. We demonstrate our approach using the example of a fairly complex MPSoC platform dedicated to advanced high-performance and low power video coding applications (AVC/H.264 encoder). The MPSoC platform is prototyped as traditional 2D-IC and then as 3D-SIC design using various 3D stack configurations and stack assignment schemes.

A. Bartzas, K. Siozios, and D. Soudris, “Three Dimensional Network-on-Chip Architectures“, Book Chapter “Networks on Chips: Theory and practice“, Editors: F. Gebali, H. Elmiligi, M. W. El-Kharashi, pp. 1–28, CRC Press, 2008 (download local copy from here).

The implementation of networks-on-chip (NoC) technology in VLSI integration presents a variety of unique challenges. To deal with specific design solutions and research hurdles related to intra-chip data exchange, engineers are challenged to invoke a wide range of disciplines and specializations while maintaining a focused approach. Networks-on-Chips: Theory and Practice facilitates this process, detailing the NoC paradigm and its benefits in separating IP design and functionality from chip communication requirements and interfacing. It starts with an analysis of 3-D NoC architectures and progresses to a discussion of NoC resource allocation, processor traffic modeling, and formal verification, with an examination of protocols at different layers of abstraction. With an exploration of design methodologies, CAD tool development, system testing, and communication protocols, the text highlights important emerging research issues.

A. Bartzas, N. Skalis, K. Siozios, D. Soudris, “Exploration of Alternative Topologies for Application-Specific 3D Networks-on-Chip“, Workshop on Application Specific Processors (WASP), October 4, 2007 (download local copy from here).

Three dimensional (3D) Network-on-Chip (NoC) architectures combine the benefits of new integration technologies with NoC-style interconnection of a large number of IP cores in a single package. In this work, we propose a fully software-supported exploration methodology capable of defining pattern-based, alternative, interconnection topologies for application-specific multi-layered 3D NoC architectures. The focus of our exploration is on the number of vertical interconnects (or through silicon vias) connecting grids of different layers, considering the mesh and torus architectures. Existing 3D NoCs assume that every router of a grid can communicate directly with the neighboring routers of the same grid and with the ones of the adjacent layers. We show that this full vertical connectivity is not needed. The exploration methodology is able to evaluate pattern-based 3D topologies and propose the ones that meet the design constraints best. We evaluate the exploration employing and extending the WormSim NoC simulator and feeding it with various types of traffic. In this way, we achieve a decrease in the number of 3D routers and in the number of vertical vias, resulting in a decrease in the area occupied by the switch blocks, reducing energy dissipation and paying a negligible penalty in the latency of the 3D NoC.
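As a rough illustration of why the vertical-connectivity pattern matters, the sketch below counts vertical links in a stacked mesh when only every k-th router of a layer is a 3D router. The function name, the "every k-th router" pattern, and the assumption that all layers share the same pattern are hypothetical simplifications and are not taken from the paper.

```python
def count_vertical_links(cols, rows, layers, period):
    """Count vertical links in a stack of identical cols x rows meshes.

    Hypothetical pattern: routers are numbered row by row, and only every
    `period`-th router (0, period, 2*period, ...) is a 3D router with a
    link to the same position in the layer above.
    """
    if min(cols, rows, layers, period) < 1:
        raise ValueError("all arguments must be positive integers")
    routers_per_layer = cols * rows
    # Number of 3D routers in one layer under the assumed pattern.
    three_d_per_layer = len(range(0, routers_per_layer, period))
    # Each 3D router contributes one link per pair of adjacent layers.
    return three_d_per_layer * (layers - 1)


if __name__ == "__main__":
    for period in (1, 2, 4):
        links = count_vertical_links(4, 4, 3, period)
        print(f"period={period}: {links} vertical links")
```

With period 1 every router is a 3D router (full vertical connectivity); larger periods trade vertical links, and hence via area, against possibly longer paths between layers.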

Energy Harvesting and Power Optimization

K. Kozalakis, I. Sofianidis, V. Gogolou, V. Konstantakos, K. Siozios, S. Siskos, T. Laopoulos, “Wide-range light harvesting module for autonomous sensor nodes“, IEEE Access, Vol. 10, pp. 32180-32189, 2022 (download local copy from here).

A large number of autonomous devices is nowadays supported by renewable and green energy sources. A vital sub-circuit in such systems is the power converter circuit, which should efficiently transform and store the available energy. In order to obtain the maximum efficiency under varying energy conditions, various maximum power point tracking (MPPT) methods are used. In this work a complete harvesting module with battery management and MPPT is presented, suitable for a plethora of autonomous applications. A novel, low-complexity and ultra-low power consumption design is proposed, which offers very wide operating voltage and power range with high MPPT efficiency and very low power consumption. It can be combined with different harvesters, such as thermoelectric generators or photovoltaic panels and is able to work under widely varying energy conditions. As supported by experimental results, the proposed module covers a very wide working input power range, from 40 μW up to 4 W, as well as a very wide input voltage range, from 650 mV up to 2.8 V with 96.5% average MPPT efficiency and a total power consumption of 3.9 μW at 3.6 V. The module relies on an embedded ultra-low power microcontroller unit (MCU) to perform the power management and MPPT operations, which can also be used for extra tasks (e.g., sensor reading). Using the proposed module, an autonomous sensor node was built, able to acquire acceleration measurements, and wirelessly communicate with a remote user in order to send an alert or stream the acquired sensor data in real time.

A. Stefanou, K. Siozios, A. Hatzopoulos, “Design of a 10-Bit, 2GS/S Current-Steering Digital-to-Analog Converter with On-Line Current Calibration“, IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, pp. 1319-1322, 2022 (download local copy from here).

This paper presents a 2 GS/s digital-to-analog converter designed at 1V in 28nm, targeting an Ethernet PHY IP. The 10-bit current-steering digital-to-analog converter (DAC) is a segmented implementation, comprising 6+4 bits of binary-weighted and thermometer coded sections for the LSB and MSB codes respectively. The converter is based on a unit cell with cascoded current source and always-ON switches, to enhance static linearity performance and minimize dynamic errors. The calibration approach for amplitude correction partitions the DAC unit cells into larger blocks and compensates the tail current source variation with an internal current DAC (CALDAC). The design approach demonstrates good linearity with SFDR in the region of 70 dB for low-frequency input signals, which degrades to 49.4 dB at Nyquist frequency, and the DAC shows a maximum DNL of less than 0.5. The maximum differential output swing is 0.35 Vpp.
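The 6+4 segmentation mentioned above can be sketched in a few lines: the 4 MSBs drive 15 equally weighted thermometer-coded cells, and the 6 LSBs drive binary-weighted cells. This is an idealized illustration only; the function names are hypothetical, and mismatch, calibration, and dynamic effects are not modeled.

```python
LSB_BITS = 6  # binary-weighted section (from the abstract)
MSB_BITS = 4  # thermometer-coded section (from the abstract)


def segment_code(code):
    """Split a 10-bit input code into thermometer and binary control bits."""
    if not 0 <= code < 1 << (LSB_BITS + MSB_BITS):
        raise ValueError("code must fit in 10 bits")
    msb = code >> LSB_BITS
    lsb = code & ((1 << LSB_BITS) - 1)
    # 2**4 - 1 = 15 unit cells; the first `msb` of them are switched on.
    thermometer = [1 if i < msb else 0 for i in range((1 << MSB_BITS) - 1)]
    # One control bit per binary-weighted cell, least significant first.
    binary = [(lsb >> i) & 1 for i in range(LSB_BITS)]
    return thermometer, binary


def ideal_output(thermometer, binary):
    """Ideal output in LSB units: each thermometer cell weighs 2**6 LSBs."""
    msb_weight = 1 << LSB_BITS
    return sum(thermometer) * msb_weight + sum(b << i for i, b in enumerate(binary))


if __name__ == "__main__":
    thermo, binary = segment_code(0b1010_000011)
    print(sum(thermo), binary, ideal_output(thermo, binary))
```

In the ideal case the reconstructed value equals the input code for every one of the 1024 codes. Thermometer coding the MSBs means a code step switches at most one large cell, which is the usual motivation for segmenting.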

K. Kozalakis, V. Gogolou, V. Konstantakos, K. Siozios, S. Siskos and T. Laopoulos, “A Low-complexity, FPGA-based Maximum Power Point Tracking Circuit, for Wide Range Light-energy Harvesting“, 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), pp. 694-699, 2021 (download local copy from here).

This work presents the design of a low-complexity FPGA-based maximum power point tracking (MPPT) circuit for wide power range light-energy harvesting applications. The circuit relies on a self-oscillating dc-dc boost (SOB) converter for the power conversion, which is controlled via an analog voltage. A custom-made digital-to-analog (D/A) converter circuit is used to produce the required analog voltage, while a resistor-free current-to-frequency (C/F) converter circuit provides feedback of the SOB converter’s output power. The FPGA-based digital part of the circuit controls the D/A converter and implements the perturb and observe MPPT logic using the frequency feedback of the C/F converter. The circuit was constructed and measured using discrete components. The digital MPPT control has been described in Verilog HDL and implemented on an Intel Cyclone 10CL016 FPGA. The circuit works in a very wide input power range from 15 uW up to 1 W with constant efficiency and is suitable for light energy harvesting both at indoor and outdoor lighting conditions. Experimental results prove this implementation to be suitable for on-chip integration.
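For readers unfamiliar with perturb and observe, a minimal software sketch of the generic technique follows. It is not the Verilog implementation from the paper: the toy power curve, the function names, and all parameter values are assumptions chosen for illustration, and `measure_power` merely stands in for whatever feedback the hardware provides (in the paper, a frequency from the C/F converter).

```python
def toy_power_curve(v, v_mpp=2.0, p_max=1.0):
    """Hypothetical concave power curve with its peak p_max at v_mpp."""
    return max(0.0, p_max * (1.0 - ((v - v_mpp) / v_mpp) ** 2))


def perturb_and_observe(measure_power, v_start, step=0.05, iterations=200):
    """Generic perturb-and-observe hill climbing on a control voltage."""
    v = v_start
    direction = 1  # +1 raises the control voltage, -1 lowers it
    prev_p = measure_power(v)
    for _ in range(iterations):
        v += direction * step   # perturb
        p = measure_power(v)    # observe
        if p < prev_p:          # power dropped: reverse the perturbation
            direction = -direction
        prev_p = p
    return v


if __name__ == "__main__":
    v_final = perturb_and_observe(toy_power_curve, v_start=0.5)
    print(f"settled near {v_final:.2f} V")
```

Once the peak is reached, the operating point keeps oscillating within about one step of it, which is the characteristic steady-state behavior of this method; a smaller step narrows the oscillation at the cost of slower tracking.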

K. Kozalakis, I. Sofianidis, V. Konstantakos, K. Siozios and S. Siskos, “73.5 uW Indoor-Outdoor Light Harvesting System with Global Maximum Power Point Tracking“, Journal of Low-Power Electronics and Applications, Vol. 11, pp. 1-10, February 2021 (download local copy from here).

This work introduces a light harvesting system with battery management. In contrast to relevant solutions that operate in limited ranges, the proposed system covers a wide operating input power range from 10 uW up to 300 mW. Specifically, experimental results highlight that, combined with a 73 × 94 mm flexible light harvester, it can harness light in a range from 50 LUX (indoor lighting) up to 120,000 LUX (outdoor lighting). The introduced system consists of a boost converter and an ultra-low power microcontroller (MCU). The MCU performs Global Maximum Power Point Tracking (GMPPT), using a resistor-free time-based input power sensing method, to calculate the input power of the converter, which does not interfere with the operation of the boost converter. The efficiency of the GMPPT system was evaluated with detailed experimentation, where we achieved 99.75% average GMPPT tracking efficiency while consuming only 73.5 uW at 4.2 V.

K. Tatas, K. Siozios, D. Soudris, and A. Thanailakis, “Power-Efficient Implementations of Multimedia Applications on Reconfigurable Platforms“, International Conference on Field Programmable Logic and Applications (FPL), pp. 1032-1035, Sep. 1-3, 2003, Lisbon, Portugal (download local copy from here).

The power-efficient implementation of motion estimation algorithms on a system comprising an FPGA and an external memory is presented. Low power consumption is achieved by implementing an optimum on-chip memory hierarchy inside the FPGA, and serving the bulk of the required memory transfers from the internal memory hierarchy instead of the external memory. Comparisons among implementations with and without this optimization prove that great power efficiency is achieved while satisfying performance constraints.

K. Tatas, K. Siozios, N. Vasiliadis, D. J. Soudris, S. Nikolaidis, S. Siskos, and A. Thanailakis, “FPGA Architecture Design and Toolset for Logic Implementation“, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 607-616, Sep. 2003, Torino, Italy (download local copy from here).

In this paper, the design of an embedded FPGA architecture (i.e. configurable logic blocks) is presented and a complete tool-supported design flow starting from architecture level (i.e. RT-level) and ending with the derivation of the reconfiguration bitstream for the FPGA programming is introduced. The proposed design flow consists of new, modified, and extended academic tools. In particular, new tools were developed in order to complement certain critical steps in the implementation flow, since existing academic tools do not combine into a cohesive and complete flow. The remaining design steps are implemented by modified existing academic tools. The FPGA architecture and the tool development is an interactive task, depending on what architectures can be supported by the tools. Using this design support tool set, we designed and simulated in 0.18 TSMC technology an FPGA architecture. More specifically, the detailed design characteristics of the Configurable Logic Block Architecture as well as the interconnect network are determined. Finally, experimental results in terms of energy consumption and delay are given.

K. Tatas, K. Siozios, D. Soudris, K. Masselos, K. Potamianos, S. Blionas and A. Thanailakis, “Power Optimization Methodology for Multimedia Applications Implementation on Reconfigurable Platforms“, International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 430-439, Sep. 2003, Torino, Italy (download local copy from here).

A methodology for the power-efficient implementation of multimedia kernels based on reconfigurable hardware (FPGA) is introduced. The methodology combines various types of algorithmic transformations and high-level memory hierarchy exploration with register-transfer level design and implementation. An FPGA with an external memory was used for obtaining experimental results which prove the viability of the methodology. Comparisons among implementations with and without this optimization prove that great power efficiency is achieved.

Internet-of-Things (IoT) & CyberPhysical Systems (CPS)

C. Sad, D. Masouros and K. Siozios, “FEARLESS: a Federated Reinforcement Learning Orchestrator for Serverless Edge Swarms”, in IEEE Embedded Systems Letters, 2024, DOI:10.1109/LES.2024.3410892 (download local copy from here).

The rise of edge computing, characterized by swarms of edge devices, marks a significant shift in cloud-edge computing landscapes, moving data processing closer to the source of data generation. However, this paradigm introduces complexities in orchestration, as traditional centralized methods become inadequate for effectively managing distributed, dynamic edge environments. In this article, we introduce FEARLESS, a distributed orchestration framework tailored for swarms of edge devices. FEARLESS employs a vertical federated reinforcement learning approach to efficiently orchestrate function invocation requests in serverless swarms. Experimental results demonstrate that FEARLESS reduces the QoS violations of the scheduled tasks by up to 57%, compared to a centralized “least-CPU-utilization” and a “local-execution” approach, while it also achieves up to approximately 20% average total energy reduction.

D. Kanelis, V. Liolios, F. Papadopoulou, M. Rodopoulou, D. Kampelopoulos, K. Siozios and C. Tananaki, “Decoding the Behavior of a Queenless Colony Using Sound Signals”, Biology (Basel), Vol. 12, No. 11, pp. 1392, Oct. 2023 (download local copy from here).

Honeybees are reported to be the most vital pollinators of agricultural and horticultural crops. However, their widespread decline has raised great attention to the need to monitor their activity in order to identify the causes and implement countermeasures. The recording and analysis of signals used by honeybees for their communication in their hive can be a very helpful tool to the beekeepers for the remote control of the hives. Thus, in the present study, we used a set of sound recording data taken inside the hives to automatically detect the sounds of the bees over a certain period, to distinguish between queenright and queenless states, and to find the gradual changes in the queenless state. Unlike what was commonly believed, noticeable changes in the sound signals of all experimental colonies were observed just one hour after the queens’ removal from the hive, while the sound signals were intensified over a period of 5 h, after which the transmitted signal stabilized to the equivalent of a queenless state. The colonies seemed to return to their normal sounds 9-10 days after the reintroduction of the queens in the hives. Our study concluded that timely detection of the queen’s absence combined with the immediate intervention of the beekeeper may be a determining factor in mitigating the adverse effects that occur from the queen’s loss.

D. Watari, C. Marantos, I. Taniguchi, F. Catthoor, K. Siozios, D. Soudris and T. Onoye, “Online Energy Management Framework for Smart Buildings with Low-Complexity Estimators”, IEEE Embedded Systems Letters (download local copy from here).

This letter proposes an online energy management framework for smart buildings. Two low-complexity estimators for thermal comfort and electrical load are investigated and integrated into an optimization framework. The proposed framework optimizes operations of a battery system and multiple heating, ventilation, and air conditioning (HVAC) systems to minimize energy consumption and power peak while maximizing occupants’ thermal comfort. The effectiveness of the proposed framework is demonstrated using data measured in actual campus buildings in terms of system costs, thermal comfort, and computational complexity. The results show that the proposed framework can reduce electricity costs by 5.7% compared to the baseline.

V. Serasidis, I. Sofianidis, G. Margaritis, C. Sad, V. Konstantakos and K. Siozios, “ICE: A Low-Cost IoT Platform Targeting Real-Time Anonymous Visitors Flow Tracking at Museums”, IEEE 32nd International Symposium on Industrial Electronics (ISIE), Helsinki, Finland, pp. 1-6, June 2023 (download local copy from here).

Museums are uniquely positioned to blend education and recreation in ways that can both challenge and catalyze communities. In recent years, IoT technology has revolutionized the way that museums enable visitors to engage with cultural institutions. Among others, sensor-based technology enables museums to provide visitor flow solutions that offer smart people tracking. Typically these systems rely on Bluetooth and WiFi beacons with accuracy around 1 meter. In this paper, we introduce a low-cost IoT platform that relies on Ultra-WideBand technology to enable real-time accurate indoor positioning and navigation. Experimental results with different scenarios highlight the superiority of the proposed platform, since the mean error between estimated and actual path can be up to 15 cm, which in turn is sufficient to enable new services to museums of the future.

C. Marantos, C. Lamprakos, K. Siozios and D. Soudris, “Model-Free HVAC Optimizer based on Reinforcement Learning”, IEEE 32nd International Symposium on Industrial Electronics (ISIE), Helsinki, Finland, pp. 1-8, June 2023 (download local copy from here).

Recently, there has been a continuous demand for embedded systems that automate buildings’ operation, such as the control of Heating Ventilation and Air-Conditioning system (HVAC) operation. These systems exhibit increased complexity and their operation relies less on human decision-making and more on computational intelligence. The efficiency of these systems is usually bounded by the orchestrators’ flexibility to optimize simultaneously multiple, and usually contrary, objectives. This paper introduces a novel framework for designing model-free orchestrators that target the optimization of HVAC system operation. The proposed orchestrator relies on Reinforcement Learning in order to support self-adaptive customization. Experimental results highlight the superiority of the introduced orchestrator, as it achieves comparable performance to state-of-the-art relevant controllers without any prior detailed modeling.

A. Kokkinis, G. Zervakis, K. Siozios, M. B. Tahoori, and J. Henkel, “Hardware-Aware Automated Neural Minimization for Printed Multilayer Perceptrons”, Design, Automation and Test in Europe (DATE), April 17-21, 2023 (download local copy from here).

The demand of many application domains for flexibility, stretchability, and porosity cannot typically be met by silicon VLSI technologies. Printed Electronics (PE) has been introduced as a candidate solution that can satisfy those requirements and enable the integration of smart devices on consumer goods at ultra-low cost, also enabling in situ and on-demand fabrication. However, the large feature sizes in PE constrain those efforts and prohibit the design of complex ML circuits due to area and power limitations. Classification, though, is typically the core task in printed applications. In this work, we examine, for the first time, the impact of neural minimization techniques, in conjunction with bespoke circuit implementations, on the area-efficiency of printed Multilayer Perceptron classifiers. Results show that, for up to 5% accuracy loss, up to 8× area reduction can be achieved.

D. Watari, I. Taniguchi, F. Catthoor, C. Marantos, K. Siozios, E. Shirazi, D. Soudris and T. Onoye, “Thermal Comfort Aware Online Energy Management Framework for a Smart Residential Building“, Proceedings in Design, Automation and Testing in Europe (DATE), Feb. 2021 (download local copy from here).

Energy management in buildings equipped with renewable energy is vital for reducing electricity costs and maximizing occupant comfort. Despite several studies on the scheduling of appliances, a battery, and heating, ventilating, and air-conditioning (HVAC), there is a lack of a comprehensive and time-scalable approach that integrates predictive information such as renewable generation and thermal comfort. In this paper, we propose an online energy management framework to incorporate the optimal energy scheduling and prediction model of PV generation and thermal comfort by the model predictive control (MPC) approach. The energy management problem is formulated as three coordinated optimization problems covering a fast and slow time-scale. This reduces the time complexity without a significant negative impact on the global nature and quality of the result. Experimental results show that the proposed framework achieves optimal energy management that takes into account the trade-off between the electricity bill and thermal comfort.

K. Balaskas and K. Siozios, “Fatigue Detection Using Deep Long Short-Term Memory Autoencoders“, International Conference on Modern Circuits and Systems Technologies, Thessaloniki, 5-7 July, 2021 (download local copy from here).

Efficient time series data mining techniques are an essential part of real world measurement systems and can yield meaningful results from unlabeled data by taking advantage of feature extraction principles. In this paper, we perform kinematic analysis on time series data from IMU sensors for fatigue detection on runners, using several unsupervised machine learning techniques. We propose a robust feature extraction scheme composed of an LSTM Autoencoder, to exploit the advantages of recurrent neural networks and the data compression capabilities of an Autoencoder. The proposed model combines the advantages of several clustering algorithms for accurate fatigue detection in real time, making it suitable for implementation in an embedded device. Experimental evaluation of the feature extraction algorithms showcased their capabilities to produce meaningful features, overcoming the obstacle of extremely limited training data. The inference procedure yielded successful detection in 43% of our representative sample, indicating the efficiency of our model in extracting robust features from unseen kinematic data.

C. Marantos, K. Siozios and D. Soudris, “Rapid Prototyping of Low-Complexity Orchestrator Targeting CyberPhysical Systems: The Smart-Thermostat Usecase“, IEEE Transactions on Control Systems Technology, Vol. 28, No. 5, pp. 1831-1845, Sept. 2020 (download local copy from here).

Recently, a new generation of systems with integrated computational and physical capabilities, also known as CyberPhysical Systems (CPSs), has been introduced. The control of these systems often results in very high-order models imposing great challenges to the analysis and design problems. In the context of this paper, a decision-making mechanism for these systems is proposed. Moreover, we introduce a virtual prototyping framework for the physical implementation and customization of these orchestrators. For evaluation purposes, the introduced solution is applied to design a low-cost smart thermostat in a microgrid environment. Experimental results highlight the superiority of the introduced orchestrator, as it achieves comparable performance to state-of-the-art relevant decision-making approaches, but with considerably lower computational and storage complexities.

K. Siozios and S. Siskos, “A Low-Complexity Framework for Distributed Energy Market Targeting Smart-Grid“, Design, Automation & Test in Europe (DATE), April 2019, pp. 878-883, Florence, Italy (download local copy from here).

With the increasing connection of distributed energy resources, traditional energy consumers are becoming prosumers, who can both dissipate and generate energy in a smart-grid environment. This enables the wide adoption of dynamic pricing schemes, where demand and price forecasts are applied for estimating energy cost and load scheduling. In this paper we propose a Peer-to-Peer (P2P) platform, as well as a lightweight system orchestrator based on game theory to support the energy trading. Additionally, we discuss the hardware implementation of the proposed solution onto a low-cost reconfigurable device. Experimental results based on real data validate the efficiency of the proposed framework, as we achieve considerable energy savings (on average cost reduction by 87%) compared to the corresponding cost from the main-grid.

“IoT for Smart Grids: Design Challenges and Paradigms“, Editors: K. Siozios, D. Anagnostos, D. Soudris and E. Kosmatopoulos, Springer International Publishing, 2019 (download local copy from here).

K. Balaskas and K. Siozios, “ECG Analysis and Heartbeat Classification Based on Shallow Neural Networks“, International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4, Thessaloniki, Greece, 2019 (download local copy from here).

Accurate detection of cardiac pathological events is an important part of electrocardiogram (ECG) evaluation and subsequent correct treatment of the patient. For this purpose, several adaptive filter structures were proposed during the past decades for noise cancellation and arrhythmia detection. Currently there are a lot of devices on the market that analyze ECGs, such as patient monitors, stress test systems, and Holter analysis systems, that are able to detect beats and classify arrhythmia. This paper proposes a system for ECG analysis and heartbeat classification. The proposed solution relies on a combination of a machine learning algorithm and a wavelet transformation in order to maximize its performance with the minimum possible training phase. Experimental results with publicly available data for arrhythmia indicate the efficiency in classifying heartbeats, whereas its low computational and memory requirements make it suitable for being implemented as part of an embedded (IoT) system.

C. Marantos, C. Lamprakos, V. Tsoutsouras, K. Siozios and D. Soudris, “Towards plug&play Smart Thermostats inspired by Reinforcement Learning“, Workshop on INTelligent Embedded Systems Architectures and Applications (INTESA), pp. 39-44, Oct. 2018, Turin, Italy (download local copy from here).

Buildings are immensely energy-demanding, and their energy consumption is expected to increase even further in the future. In order to mitigate this problem, a low-cost, flexible and high-quality Decision-Making Mechanism for supporting the tasks of a Smart Thermostat is proposed. Energy efficiency and thermal comfort are the two primary quantities regarding control performance of a building’s HVAC system. Apart from demonstrating a conflicting relationship, they depend not only on the building’s dynamics, but also on the surrounding climate and weather, thus rendering the problem of finding a long-term control scheme hard, and of stochastic nature. The introduced mechanism is inspired by Reinforcement Learning techniques and aims at satisfying both occupants’ thermal comfort and limiting energy consumption. In contrast to existing methods, this approach focuses on a plug&play solution that does not require detailed building models and is applicable to a wide variety of buildings, as it learns the dynamics using gathered information from the environment. The proposed control mechanisms were evaluated via a well-known building simulation framework and implemented on ARM-based, low-cost embedded devices.

“Cyber-Physical Systems: Decision Making Mechanisms and Applications“, Editors: K. Siozios, D. Soudris and E. Kosmatopoulos, River Publishers Series in Circuits and Systems, ISBN: 9788793609099, Nov. 2017 (download local copy from here).

H. Marantos, K. Siozios and D. Soudris, “A Flexible Decision-Making Mechanism Targeting Smart Thermostats“, IEEE Embedded Systems Letters, Vol. 9, No. 4, pp. 105-108, Dec. 2017 (download local copy from here).

Buildings are immensely energy-demanding and are expected to consume even more in the near future. The operation of cooling/heating mechanisms contributes substantially to this consumption, since a non-optimal configuration of temperature set-points usually leads to increased energy cost, as well as violations of occupants’ thermal comfort. In this letter, we introduce a flexible decision-making mechanism for supporting the proper configuration of these devices. The competitive advantage of our solution is its remarkably lower computational complexity without any degradation in the quality of the derived decisions.

P. Danassis, K. Siozios, C. Korkas, D. Soudris and E. Kosmatopoulos, “A Low-Complexity Control Mechanism Targeting Smart Thermostats“, Energy and Buildings, Vol. 139, No. 15, pp. 340-350, March 2017 (download local copy from here).

This paper introduces a low-cost, high-quality Decision Making Mechanism for supporting the tasks of temperature regulation of existing HVAC installations in a smart building environment. It incorporates Artificial Neural Networks and Fuzzy Logic in order to improve the occupants’ thermal comfort while maintaining the total energy consumption. Contrary to existing approaches, it focuses on achieving significantly low computational complexity, which in turn enables its hardware implementation on low-cost embedded platforms, such as the ones used in smart thermostats. Both the software components and the hardware implementation are described in detail. To demonstrate its effectiveness, the proposed method was compared to rule-based controllers, as well as state-of-the-art control techniques. A simulation model was developed using the EnergyPlus building simulation suite, a detailed model of a micro-grid environment of buildings located in Chania, Greece, and historic weather and energy pricing data. Simulation results validate the effectiveness of our approach.

D. Masouros, I. Bakolas, V. Tsoutsouras, K. Siozios and D. Soudris, “From Edge To Cloud: Design and Implementation of a Healthcare Internet of Things Infrastructure“, International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 1-6, 2017, Thessaloniki, Greece (download local copy from here).

Lately, advances in circuit technology combined with the design of low-cost embedded devices have resulted in the infiltration of the latter into people’s everyday lives. To exploit the full potential of ubiquitous embedded devices, a network is used for their inter-communication, offering advanced real-time monitoring. This paradigm, known as Internet of Things (IoT), is steadily being consolidated and promises to offer a wide variety of applications. However, with the adoption of IoT, new challenges arise, such as the design of architectures able to support the requirements of the new applications. Towards this goal, we explore a three-layered architecture, able to acquire, process and store Healthcare data as well as to provide real-time decision making. We use ECG signal arrhythmia detection as our use case evaluation scenario, and compare different techniques for wireless communication, storage and data classification. Experimental results show that our architecture provides real-time decision making with an average delay of 15 μs, and that different communication technologies achieve up to 10% lower power consumption on the monitoring devices.

C. Marantos, I.S. Paraskevas, K. Siozios, J. Mothe, C. Menou and D. Soudris, “FabSpace 2.0: A platform for application and service development based on Earth Observation data“, International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4, Thessaloniki, 2017 (download local copy from here).

The goal of the FabSpace 2.0 project is to transform Universities into Open Innovation Centers for their corresponding regions and to improve their contribution to socio-economic and environmental performance. For this purpose, the project provides students, researchers, as well as citizens or even companies, with a framework and a digital platform which gives open access to a range of geodata, i.e. Earth Observation data, as well as computing and visualization tools. In this way, it is expected that the participants can work in an open co-working space, where they can interact with each other. Additionally, the participants will be assisted in assessing the business aspects of any potential idea. The outcome of such an activity is an increased possibility for participants to find a revolutionary concept, which can result in new and sustainable applications and/or services. Moreover, the target group extends beyond the usual Earth Observation professionals to professionals of other disciplines, where the use of Earth Observation data can create novel technologies.

K. Siozios, P. Danassis, N. Zompakis, E. Kosmatopoulos and D. Soudris, “Supporting Decision Making for Large-Scale IoTs: Trading Accuracy with Computational Complexity“, Book Chapter in “Components and Services for IoT Platforms: Paving the Way for IoT Standards“, Editors: G. Keramidas, N. Voros and M. Hubner, pp. 233–250, Springer International Publishing, Sept. 2017 (download local copy from here).

As systems continue to evolve they rely less on human decision-making and more on computational intelligence. This trend, in conjunction with the available technologies for providing advanced sensing, measurement, process control, and communication, leads towards the new field of Internet-of-Things (IoT). IoT systems are expected to play a major role in the design and development of future engineering platforms with new capabilities that far exceed today’s levels of autonomy, functionality, and usability. Although these systems exhibit remarkable characteristics, their design and implementation are challenging, as numerous (heterogeneous) components and services have to be appropriately designed. The problem of designing efficient IoT systems becomes far more challenging when the target system also has to meet timing constraints. This chapter discusses an advanced framework for implementing decision-making mechanisms for large-scale IoT platforms. In order to demonstrate the efficiency of the introduced framework, it was applied to customize a building’s cooling and heating in a smart-grid environment. For this purpose, a number of connected smart thermostats are employed, which should facilitate intelligent control to fulfill occupants’ needs, such as the energy consumption and the comfort level in a building environment. Towards this direction, appropriate mechanisms that enable smart thermostats to monitor their own performance, to classify, to learn, and to take proper actions were developed in a systematic way. Experimentation with various configuration setups highlights the superiority of the introduced solution compared to static temperature values, as well as existing control techniques. Additionally, the significantly low computational complexity enables the efficient implementation of this mechanism as part of a low-cost embedded system, which can be integrated into existing smart thermostats.

N. Zompakis and K. Siozios, “A Framework for Reducing the Modeling and Simulation Complexity of Cyberphysical Systems“, International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pp. 360–365, July 2015, Samos, Greece (download local copy from here).