Publications


Below is a list of publications produced over the course of the CloudLightning project.

 

Self-Healing in a Decentralised Cloud Management System

Authors

P. Stack (UCC), H. Xiong (UCC), D. Mersel, M. Makhloufi, G. Terpend, D. Dong (UCC)

Abstract

With the advent of heterogeneous resources and increasing scale, present cloud environments are becoming more and more complex. In order to manage heterogeneous cloud infrastructures at scale, in a reliable and robust manner, systems and services with autonomic behaviours are advantageous. In this paper, self-healing concepts are introduced for autonomic cloud management. A layered master-slave structure is proposed, providing reliability and high availability for a decentralised, hierarchical cloud architecture.

Open Access: No

Characterization of hardware in self-managing self-organizing Cloud environment

Authors

C. K. Filelis-Papadopoulos (DUTH), E. N. G. Grylonakis (DUTH), P. E. Kyziropoulos (DUTH), G. A. Gravvanis (DUTH), J. P. Morrison (UCC)

Abstract

During the last decade, multiple applications and services have been migrated to Cloud computing infrastructures. Cloud infrastructures offer flexibility in terms of the variety of applications they can service. Moreover, integrity of data and virtually unlimited storage space are attractive features, especially for end-users requiring massive amounts of storage. Recently, many HPC applications have also been migrated to the Cloud. Such applications include Oil and Gas exploration, Genomics and Ray-tracing. However, the problem of underutilization of computational resources, as well as the choice of adequate computational equipment as a function of input data, computational work, pricing and energy consumption, poses a major problem in modern Cloud environments. A technique for the characterization of hardware with respect to application and hardware parameters, e.g. computational efficiency versus power consumption, is proposed. The technique is based on indexes built upon ratios to baseline hardware with respect to three of the applications involved in the CloudLightning project: Oil and Gas, Ray-Tracing, and Dense and Sparse Matrix Computations.
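
The ratio-to-baseline idea can be illustrated with a minimal sketch (purely illustrative, not the paper's exact formulation or measurements): each hardware type is scored for a given application class by relating its measured performance and power draw to those of a reference server.

```python
# Illustrative sketch of a ratio-based hardware characterization index.
# All numbers are hypothetical; they are not measurements from the paper.

def characterization_index(perf, power, perf_baseline, power_baseline):
    """Score hardware relative to a baseline: values above 1.0 mean more
    useful work per watt than the baseline for this application class."""
    perf_ratio = perf / perf_baseline        # speedup over the baseline
    power_ratio = power / power_baseline     # relative power draw
    return perf_ratio / power_ratio          # efficiency relative to the baseline

# Hypothetical measurements for a ray-tracing workload.
baseline_cpu = {"perf": 1.0, "power": 200.0}     # reference CPU server
candidates = {
    "cpu+gpu":  {"perf": 6.0, "power": 450.0},
    "cpu+mic":  {"perf": 3.5, "power": 400.0},
    "cpu+fpga": {"perf": 2.5, "power": 230.0},
}

for name, m in candidates.items():
    idx = characterization_index(m["perf"], m["power"],
                                 baseline_cpu["perf"], baseline_cpu["power"])
    print(f"{name}: index = {idx:.2f}")
```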

Open Access: No

Cloud Deployment and Management of Dataflow Engines

Authors

N. Trifunovic (Maxeler), H. Palikareva (Maxeler), T. Becker (Maxeler), G. Gaydadjiev (Maxeler)

Abstract

Maxeler Technologies successfully commercialises high-performance computing systems based on dataflow technology. Maxeler dataflow computers have been deployed in a wide range of application domains including financial data analytics, geoscience and low-latency transaction processing. In the context of cloud computing's steadily growing acceptance in new domains, we illustrate how Maxeler dataflow systems can be integrated and employed in a self-organising self-managing heterogeneous cloud environment.

Open Access: No

On the power consumption modeling for the simulation of Heterogeneous HPC clouds

Authors

K. Giannoutakis (CERTH), A. Makaratzis (CERTH), D. Tzovaras (CERTH), C. K. Filelis-Papadopoulos (DUTH), G. A. Gravvanis (DUTH)

Abstract

In recent years, in addition to traditional CPU-based hardware servers, hardware accelerators have become widely used in various HPC application areas. More specifically, Graphics Processing Units (GPUs), Many Integrated Cores (MICs) and Field-Programmable Gate Arrays (FPGAs) have shown great potential in HPC and have been widely adopted in supercomputing. With the adoption of HPC by cloud environments, the realization of HPC-Clouds is evolving, since many vendors provide HPC capabilities on their clouds. With the increase of interest in clouds, there has been an analogous increase in cloud simulation frameworks.

Cloud simulation frameworks offer a controllable environment for experimentation with various workloads and scenarios, while providing several metrics such as server utilization and power consumption. To provide these metrics, cloud simulators propose mathematical models that estimate the behavior of the underlying hardware infrastructure. This paper focuses on the power consumption modeling of the main compute elements of heterogeneous HPC servers, i.e. CPU-only servers and CPU–accelerator pairs. The modeling approaches of existing cloud simulators are examined and extended, and new models are proposed for estimating the power consumption of accelerators.
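
A common starting point for such models, used by several cloud simulators, is a utilization-based interpolation between idle and peak power. The sketch below is a generic illustration of that idea only; the constants are hypothetical, and the accelerator models proposed in the paper are more detailed.

```python
# Generic utilization-based power model, as commonly used in cloud simulators.
# The wattage constants are hypothetical placeholders.

def linear_power(p_idle, p_max, utilization):
    """Linear interpolation between idle and peak power (0 <= utilization <= 1)."""
    return p_idle + (p_max - p_idle) * utilization

def server_power(cpu_util, acc_util=None):
    """Total power of a heterogeneous server: CPU host plus optional accelerator."""
    power = linear_power(p_idle=100.0, p_max=250.0, utilization=cpu_util)      # CPU host
    if acc_util is not None:
        power += linear_power(p_idle=25.0, p_max=235.0, utilization=acc_util)  # e.g. a GPU
    return power

print(server_power(cpu_util=0.6))                  # CPU-only server
print(server_power(cpu_util=0.3, acc_util=0.9))    # CPU + accelerator pair
```
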
Open Access: No

CloudLightning Simulation and Evaluation Roadmap

Authors

C. K. Filelis-Papadopoulos (DUTH), G. A. Gravvanis (DUTH), J. P. Morrison (UCC)

Abstract

The CloudLightning (CL) system, designed in the frame of the CloudLightning project, is a service-oriented architecture for the emerging large-scale heterogeneous cloud. It facilitates a clear distinction between service-lifecycle management and resource-lifecycle management. This separation of concerns is used to make resource management issues tractable at scale and to enable functionality that is currently not naturally covered by the cloud paradigm. In particular, the CL project seeks to maximize the computational efficiency of the cloud in a number of specific ways: by exploiting prebuilt HPC environments, by dynamically building HPC instances, by improving server utilization, by reducing power consumption and by improving service delivery. Given the scale and complexity of this project, its utility can presently only be measured through simulation. This paper outlines the parameters, constraints and limitations being considered as part of the design and construction of that simulation environment.

Open Access: No

Elastic Cloud Services Compliance with Gustafson’s and Amdahl’s Laws

Authors

S. Ristov, R. Prodan, D. Petcu (IeAT)

Abstract

The speedup that can be achieved with parallel and distributed architectures is limited by at least two laws: Amdahl's and Gustafson's. The former limits the speedup to a constant value when a fixed-size problem is executed on a multiprocessor, while the latter limits the speedup to linear growth for fixed-time problems, which means that it is bounded by the number of processors used. However, a superlinear speedup (a speedup greater than the number of processors used) can be achieved due to insufficient memory, while parallel and, especially, distributed systems can even slow down the execution, compared to the sequential one, due to communication overhead. Since cloud performance is uncertain and can be influenced by the available memory and networks, in this paper we investigate whether it follows the same speedup pattern as other traditional distributed systems. The focus is to determine how elastic cloud services behave in different scaled environments. We define several scaled systems and model the corresponding performance indicators. The analysis shows that both laws limit the speedup for a specific range of input parameters and type of scaling. Moreover, the speedup in cloud systems follows Gustafson's extreme cases, i.e. the insufficient-memory and communication-bound domains.
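
For reference, the two laws can be stated as follows, with f the parallelizable fraction of the work and N the number of processors (standard textbook forms, not notation taken from the paper):

```latex
% Amdahl's law (fixed problem size): speedup is bounded by the serial fraction.
S_{A}(N) = \frac{1}{(1 - f) + f/N} \;\le\; \frac{1}{1 - f}

% Gustafson's law (fixed execution time, problem scales with N): speedup grows
% at most linearly in the number of processors.
S_{G}(N) = (1 - f) + f N \;\le\; N
```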

Open Access: No

On Issues Concerning Cloud Environments in Scope of Scalable Multi-Projection Methods

Authors

B.E. Moutafis, C.K. Filelis-Papadopoulos (DUTH), G.A. Gravvanis (DUTH) and J.P. Morrison (UCC)

Abstract

Over the last decade, Cloud environments have gained significant attention by the scientific community, due to their flexibility in the allocation of resources and the various applications hosted in such environments. Recently, high performance computing applications are migrating to Cloud environments. Efficient methods are sought for solving very large sparse linear systems occurring in various scientific fields such as Computational Fluid Dynamics, N-Body simulations and Computational Finance. Herewith, the parallel multi-projection type methods are reviewed and discussions concerning the implementation issues for IaaS-type Cloud environments are given. Moreover, phenomena occurring due to the “noisy neighbor” problem, varying interconnection speeds as well as load imbalance are studied. Furthermore, the level of exposure of specialized hardware residing in modern CPUs through the different layers of software is also examined. Finally, numerical results concerning the applicability and effectiveness of multi-projection type methods in Cloud environments based on OpenStack are presented.

Open Access: No

An approach for scaling cloud resource management

Authors

D.C. Marinescu, A. Paya, J. P. Morrison (UCC), S. Olariu

Abstract

Given its current development trajectory, the complexity of cloud computing ecosystems is evolving to a point where traditional resource management strategies will struggle to remain fit for purpose. These strategies have to cope with ever-increasing numbers of heterogeneous resources, a proliferation of new services, and a growing user base with diverse and specialized requirements. This growth not only significantly increases the number of parameters needed to make good decisions, it increases the time needed to take these decisions. Consequently, traditional resource management systems are increasingly prone to poor decision making. Devolving resource management decisions to the local environment of each resource can dramatically increase the speed of decision making; moreover, the cost of gathering global information can thus be eliminated, saving communication costs. Experimental data, provided in this paper, illustrate that extant cloud deployments can be used as effective vehicles for devolved decision making. This finding strengthens the case for the proposed paradigm shift, since it does not require a change to the architecture of existing cloud systems. This shift would result in systems in which resources decide for themselves how best they can be used. This paper takes this idea to its logical conclusion and proposes a system for supporting self-managing resources in cloud environments. It introduces the concept of coalitions, consisting of collaborating resources, formed for the purpose of service delivery. It suggests the utility of restricting the interactions between the end-user and the cloud service provider to a well-defined services interface. It shows how clouds can be considered, functionally, as engines for delivering an appropriate set of resources in response to service requests. And finally, since modern applications are increasingly constructed from sophisticated workflows of complex components, it shows how combinatorial auctions can be used to effectively deliver packages of resources to support those workflows.

Open Access: No

Topics in cloud incident management

Authors

T. F. Fortis (IeAT), V. I. Munteanu (UVT)

Abstract

Continuous advancement of cloud technologies, alongside their ever-increasing stability, adoption, and ease of use, has led to a rise in native cloud applications, possibly spanning a larger pool of heterogeneous resources or multi-cloud approaches. This, in turn, has brought unprecedented levels of complexity to the context of cloud computing. Such complexity may cause a series of events and incidents that are difficult to intercept or manage in time, in a manner that also ensures the overall Quality of Service and existing Service-Level Agreements. Our special issue presents advances in several key areas that are highly relevant for automated cloud incident management: a ‘continuous approach’ for reliable cloud-native applications, novel approaches for Metal-as-a-Service centered around an advanced reservation system, and the development of a framework based on the concept of secure SLAs to deal with specific cloud security issues.

Open Access: No

Characterizing numascale clusters with GPUs: MPI-based and GPU interconnect benchmarks

Authors

M. M. Khan (NTNU), A. C. Elster (NTNU)

Abstract

Modern HPC clusters are increasingly heterogeneous, both in processor types and in the topologies of their computing, communication and storage resources. In this paper, we describe how to use benchmarking to characterize the high-speed interconnect, NumaConnect, of a shared-memory Numascale cluster system with GPUs, constituting a novel testbed at NTNU. Numascale systems include a unique node controller, NumaConnect, based on the FPGA- or ASIC-based NumaChip, depending on system vendor requirements. The system's interconnect uses AMD's HyperTransport protocol and provides a cache-coherent shared-memory single-image operating system. Our system has, in addition, a GPU added to each server blade. Our characterization efforts target the NumaConnect, which includes an RDMA-type Block Transfer Engine (BTE). The BTE is used by byte transfer layers such as the NumaConnect BTL (NC-BTL) for message passing (MPI) or BLACS. To characterize our Numascale system, we use several benchmark suites, including our own SimpleBench, which includes ping-pong, MPI-Reduce and MPI-Barrier tests; two well-known MPI benchmark suites, the NAS Parallel Benchmarks (NPB-MPI) and the OSU micro-benchmarks; as well as Nvidia's bandwidth test for GPUs. Our results show that it is generally very beneficial to use MPI or other libraries that use the NC-BTL library. In fact, on selected OSU and NPB benchmarks, we achieve order-of-magnitude improvements in communication and synchronization costs when using NC-BTL.
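
A ping-pong latency test of the kind included in SimpleBench can be sketched in a few lines. The version below uses mpi4py and is only a generic illustration of such a test, not the project's benchmark code.

```python
# Minimal MPI ping-pong latency sketch (mpi4py).
# Run with: mpirun -np 2 python pingpong.py
# Generic illustration only; not the SimpleBench implementation used in the paper.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000
buf = np.zeros(1024, dtype=np.uint8)   # 1 KiB message

comm.Barrier()
start = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    print(f"average round-trip latency: {elapsed / reps * 1e6:.2f} us")
```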

Open Access: No

Machine learning-based auto-tuning for enhanced performance portability of OpenCL applications

Authors

T. L. Falch (NTNU), A. C. Elster (NTNU)

Abstract

Heterogeneous computing, combining devices with different architectures such as CPUs and GPUs, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. However, it suffers from poor performance portability, because applications must be retuned for every new device. In this paper, we use machine learning-based auto-tuning to address this problem. Benchmarks are run on a random subset of the tuning parameter spaces, and the results are used to build a machine learning-based performance model. The model can then be used to find interesting subspaces for further search. We evaluate our method using five image processing benchmarks, with tuning parameter space sizes up to 2.3 M, using different input sizes, on several devices, including an Intel i7 4771 (Haswell) CPU, an Nvidia Tesla K40 GPU, and an AMD Radeon HD 7970 GPU. We compare different machine learning algorithms for the performance model. Our model achieves a mean relative error as low as 3.8% and is able to find solutions on average only 0.29% slower than the best configuration in some cases, evaluating less than 1.1% of the search space. The source code of our framework is available at https://github.com/acelster/ML-autotuning.
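
The overall approach, which samples a random subset of the tuning space, fits a model, and then uses it to rank the remaining configurations, can be sketched as follows. This is a generic illustration using scikit-learn; the tuning parameters are hypothetical, measure_runtime is a stand-in for actually timing an OpenCL kernel, and the paper evaluates several model types and a staged search.

```python
# Sketch of model-based auto-tuning: random sampling plus a learned performance model.
import itertools
import random
from sklearn.ensemble import RandomForestRegressor

# Hypothetical tuning parameters for an image-processing kernel.
space = list(itertools.product([8, 16, 32, 64, 128],   # work-group width
                               [1, 2, 4, 8],            # items per work-item
                               [0, 1]))                 # use local memory?

def measure_runtime(cfg):
    """Stand-in for compiling and timing the OpenCL kernel with this configuration.
    Returns a synthetic runtime so the sketch is runnable."""
    width, per_item, local_mem = cfg
    return 100.0 / (width * per_item) + (0.0 if local_mem else 5.0)

# 1. Benchmark a small random subset of the tuning space.
sample = random.sample(space, k=16)
times = [measure_runtime(cfg) for cfg in sample]

# 2. Fit a performance model on the sampled configurations.
model = RandomForestRegressor(n_estimators=100).fit(sample, times)

# 3. Rank all configurations with the model; only the most promising
#    ones would then be benchmarked for real.
predicted = model.predict(space)
best_predicted = sorted(zip(predicted, space))[:10]
print(best_predicted[0])
```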

Open Access: No

Exposing HPC services in the Cloud: the CloudLightning Approach

Authors

I. Dragan (IeAT), T. F. Fortis (IeAT), M. Neagul (IeAT)

Abstract

Nowadays we are noticing important changes in the way High Performance Computing (HPC) providers are dealing with the demand. The growing requirements of modern data- and compute-intensive applications ask for new models for their development, deployment and execution. New approaches related with Big Data, peta- and exa-scale computing are going to dramatically change the design, development and exploitation of highly demanding applications, such as the HPC ones. Due to the increased complexity of these applications and their outstanding requirements which cannot be supported by the classical centralized cloud models, novel approaches, inspired by autonomic computing, are investigated as an alternative. In this paper, we offer an overview of such an approach, undertaken by the CloudLightning initiative. In this context, a novel cloud delivery model that offers the capabilities to describe and deliver dynamic and tailored services is being considered. This new delivery model, based on a self-organizing and self-managing approach, will allow provisioning and delivery of coalitions of heterogeneous cloud resources, built on top of the resources hosted by a cloud service provider.

Open Access: Yes

A Cloud Reservation System for Big Data Applications

Authors

D. Marinescu, A. Paya, J. Morrison (UCC)

Abstract

Emerging Big Data applications increasingly require resources beyond those available from a single server and may be expressed as a complex workflow of many components and dependency relationships – each component potentially requiring its own specific, and perhaps specialized, resources for its execution. Efficiently supporting this type of Big Data application is a challenging resource management problem for existing cloud environments. In response, we propose a two-stage protocol for solving this resource management problem. We exploit spatial locality in the first stage by dynamically forming rack-level coalitions of servers to execute a workflow component. These coalitions only exist for the duration of the execution of their assigned component and are subsequently disbanded, allowing their resources to take part in future coalitions. The second stage creates a package of these coalitions, designed to support all the components in the complete workflow. To minimize the communication and housekeeping overhead needed to form this package of coalitions, the technique of combinatorial auctions is adapted from market-based resource allocation. This technique has a considerably lower overhead for resource aggregation than the traditional hierarchically organized models. We analyze two strategies for coalition formation: the first, history-based, uses information from past auctions to pre-form coalitions in anticipation of predicted demand; the second, just-in-time, builds coalitions only when support for specific workflow components is requested.
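
The first stage, greedily grouping servers within one rack until a component's resource demand is covered, can be sketched as follows. This is an illustrative simplification; the server and demand descriptions are hypothetical and not the data structures of the proposed protocol.

```python
# Illustrative sketch of stage one: form a rack-level coalition of servers
# whose combined capacity covers one workflow component's demand.

def form_coalition(rack_servers, demand):
    """Greedily pick free servers from one rack until the demand
    (e.g. {'cores': 64, 'mem_gb': 256}) is covered; return None if impossible."""
    coalition, remaining = [], dict(demand)
    for server in rack_servers:
        if not server["free"]:
            continue
        coalition.append(server)
        for resource, amount in server["capacity"].items():
            remaining[resource] = remaining.get(resource, 0) - amount
        if all(v <= 0 for v in remaining.values()):
            for s in coalition:
                s["free"] = False          # reserved for the component's lifetime
            return coalition
    return None                            # this rack cannot host the component

rack = [{"free": True, "capacity": {"cores": 32, "mem_gb": 128}} for _ in range(4)]
print(form_coalition(rack, {"cores": 64, "mem_gb": 256}))
```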

Open Access: No

Supporting Heterogeneous Pools in a Single Ceph Storage Cluster

Authors

S. Meyer (UCC), J. Morrison (UCC)

Abstract

In a general-purpose cloud system, efficiencies are yet to be gained from supporting diverse application requirements within a heterogeneous storage system. Such a system poses significant technical challenges, since storage systems are traditionally homogeneous. This paper uses the Ceph distributed file system, and in particular its concept of storage pools, to show how a storage solution can be partitioned to provide the heterogeneity needed to support these application requirements.
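
At the application level, directing data to a pool that matches its requirements reduces to opening an I/O context on the appropriate pool. The sketch below uses the python-rados bindings and assumes pools named 'ssd-pool' and 'hdd-pool' already exist and have been mapped to the corresponding hardware through CRUSH rules; the pool names and policy are examples, not the paper's configuration.

```python
# Sketch: write objects to different Ceph pools depending on application needs.
# Assumes a running cluster with 'ssd-pool' and 'hdd-pool' already created and
# mapped to SSD/HDD hardware via CRUSH rules (pool names are hypothetical).
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def store(obj_name, data, latency_sensitive):
    """Place latency-sensitive objects on the SSD-backed pool, others on HDD."""
    pool = "ssd-pool" if latency_sensitive else "hdd-pool"
    ioctx = cluster.open_ioctx(pool)
    try:
        ioctx.write_full(obj_name, data)   # write the whole object to that pool
    finally:
        ioctx.close()

store("session-cache-0001", b"hot data", latency_sensitive=True)
store("backup-archive-0001", b"cold data", latency_sensitive=False)
cluster.shutdown()
```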

Open Access: No

Benchmarking the Numascale Shared Memory Cluster System with MPI

Authors

M. Khan (NTNU), A. Elster (NTNU)

Abstract

Modern clusters and HPC systems are becoming increasingly complex in their topology and in the heterogeneity of the computing devices used. Additionally, they are affected by energy usage and interconnect performance. This situation exposes parallel applications to the difficult challenge of taking maximum advantage of the superior computing potential offered by such systems. MPI has been one of the most popular models for parallel scientific applications over the years. In this paper, we present evaluation results for a couple of well-known benchmark suites, using the standard OpenMPI implementation compared with the (vendor-provided) NumaConnect-specific BTL (NumaConnect byte transfer layer), on a modern shared-memory multi-GPU Numascale cluster system. We run a series of standard benchmark kernels and pseudo-applications that are widely used to benchmark clusters and supercomputers. In particular, we present our results from running two benchmarks, one each from the OSU and the NPB benchmark suites (iBarrier and Conjugate Gradient), as representative of our benchmarking effort on the Numascale machine. Our results show order-of-magnitude improvements in communication and synchronization costs on standard benchmarks when using the NumaConnect-specific NC BTL.

Open Access: No

Reusing Resource Coalitions for Efficient Scheduling on the Intercloud

Authors

T. Selea (IeAT), A. Spataru (IeAT), M. Frincu (IeAT)

Abstract

The envisioned intercloud, bridging numerous cloud providers and offering clients the ability to run their applications on specific configurations unavailable to single clouds, poses challenges with respect to selecting the appropriate resources for deploying VMs. Reasons include the large distributed scale and VM performance fluctuations. Reusing previously “successful” resource coalitions may be an alternative to the brute-force search employed by many existing scheduling algorithms. The reason for reusing resources is motivated by an implicit trust in previous successful executions that have not experienced the VM performance fluctuations described in many research papers on cloud performance. Furthermore, the data deluge coming from services monitoring the load and availability of resources forces a shift in traditional centralized and decentralized resource management by emphasizing the need for edge computing. In this way, only metadata is sent to the resource management system for resource matchmaking. In this paper we propose a bottom-up monitoring architecture and a proof-of-concept platform for scheduling applications based on resource coalition reuse. We consider static coalitions and neglect any interference from other coalitions by considering only the historical behavior of a particular coalition, and not the overall state of the system in the past and now. We test our prototype on real traces, comparing with a random approach, and discuss the results by outlining its benefits as well as some future work on run-time coalition adaptation and global influences.
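
The reuse idea can be illustrated with a small sketch: keep a history of coalitions that completed similar requests successfully, and prefer the one with the best past record before falling back to forming a new coalition. The history entries, matching criterion and scoring here are hypothetical simplifications, not the paper's algorithm.

```python
# Illustrative sketch of coalition reuse: prefer a previously successful coalition
# that can satisfy the request, falling back to fresh selection otherwise.

history = [
    # (coalition_id, resources, successful_runs, failed_runs) -- hypothetical entries
    ("c-17", {"cores": 64, "mem_gb": 256}, 12, 1),
    ("c-42", {"cores": 64, "mem_gb": 256}, 5, 4),
    ("c-08", {"cores": 16, "mem_gb": 64}, 20, 0),
]

def pick_coalition(request):
    """Return the matching coalition with the highest past success rate, or None."""
    matches = [
        (succ / (succ + fail + 1e-9), cid)
        for cid, res, succ, fail in history
        if all(res.get(k, 0) >= v for k, v in request.items())
    ]
    return max(matches)[1] if matches else None   # None => form a new coalition

print(pick_coalition({"cores": 64, "mem_gb": 128}))   # reuses "c-17" in this example
```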

Open Access: No

On Autonomic HPC Clouds

Authors

D. Petcu (IeAT)

Abstract

The long tail of science using HPC facilities is nowadays looking to instantly available HPC Clouds as a viable alternative to the long waiting queues of supercomputing centers. While the name HPC Cloud suggests a Cloud service, the current HPC-as-a-Service is mainly an offer of bare metal, better named cluster-on-demand. The elasticity and virtualization benefits of the Clouds are not exploited by HPC-as-a-Service. In this paper we discuss how the HPC Cloud offer can be improved from a particular point of view, that of automation. After a reminder of the characteristics of the Autonomic Cloud, we project the requirements and expectations onto what we name Autonomic HPC Clouds. Finally, we point towards the expected results of the latest research and development activities related to the topics identified.

Open Access: Yes

On the Next Generations of Infrastructure-as-a-Services

Authors

D. Petcu (IeAT), M. Fazio, R. Prodan, Z. Zhao, M. Rak

Abstract

Following the wide adoption of cloud computing technologies by industry, we can talk about a second generation of cloud services and products that are currently in the design phase. However, it is not yet clear what the third generation of cloud products and services of the next decade will look like, especially at the delivery level of Infrastructure-as-a-Service. In order to answer, at least partially, such a challenging question, we initiated a literature overview and two surveys involving the members of a cluster of European research and innovation actions. The results are interpreted in this paper and a set of topics of interest for the third generation is identified.

Open Access: Yes

CLOUDLIGHTNING: A Framework for a Self-organising and Self-managing Heterogeneous Cloud

Authors

Lynn, T. (DCU), Xiong, H. (UCC), Elster, A. (NTNU), McGrath, M. (Intel), Khan, M. (NTNU), Kenny, D. (DCU), Becker, T. (Maxeler), Giannoutakis, K. (CERTH), Filelis-Papadopoulos, C. (DUTH), Dong, D. (UCC), Gravvanis, G. (DUTH), Gaydadjiev, G. (Maxeler), Tzovaras, D. (CERTH), Kuppuudaiyar, P. (Intel), Neagul, M. (IeAT), Momani, B. (UCC), Natarajan, S. (Intel), Petcu, D. (IeAT), Gourinovitch, A. (DCU), Dragan, I. (IeAT) and Morrison, J. (UCC)

Abstract

As clouds increase in size and as machines of different types are added to the infrastructure in order to maximize performance and power efficiency, heterogeneous clouds are being created. However, exploiting different architectures poses significant challenges. To efficiently access heterogeneous resources and, at the same time, to exploit these resources to reduce application development effort, to make optimisations easier and to simplify service deployment, requires a re-evaluation of our approach to service delivery. We propose a novel cloud management and delivery architecture based on the principles of self-organisation and self-management that shifts the deployment and optimisation effort from the consumer to the software stack running on the cloud infrastructure. Our goal is to address inefficient use of resources and consequently to deliver savings to the cloud provider and consumer in terms of reduced power consumption and improved service delivery, with hyperscale systems particularly in mind. The framework is general but also endeavours to enable cloud services for high performance computing. Infrastructure-as-a-Service provision is the primary use case; however, we posit that genomics, oil and gas exploration, and ray tracing are three downstream use cases that will benefit from the proposed architecture.

Open Access: Yes
