# Highlights

# $The \, NAS \, Parallel \, Benchmarks \, for \, Evaluating \, C++ \, Parallel \, Programming \, Abstractions \, on \, Shared-Memory \, Architectures$

Júnior Löff, Dalvan Griebler, Gabriele Mencagli, Gabriell Alves de Araujo, Massimo Torquati, Marco Danelutto, Luiz Gustavo Fernandes

- The NAS Parallel Benchmarks for evaluating C++ parallel programming
- Structured parallel programming with OpenMP, Intel TBB and FastFlow
- Performance analysis of the parallel implementations on different multicore platforms

# The NAS Parallel Benchmarks for Evaluating C++ Parallel Programming Abstractions on Shared-Memory Architectures

Júnior Löff<sup>a</sup>, Dalvan Griebler<sup>a,b,\*</sup>, Gabriele Mencagli<sup>c</sup>, Gabriell Alves de Araujo<sup>a</sup>, Massimo Torquati<sup>c</sup>, Marco Danelutto<sup>c</sup> and Luiz Gustavo Fernandes<sup>a</sup>

#### ARTICLE INFO

#### Keywords: NAS Parallel Benchmarks Parallel programming Multicore architectures Performance evaluation

#### ABSTRACT

The NAS Parallel Benchmarks (NPB), originally implemented in Fortran, is a distinguished and consolidated suite containing several benchmarks extracted from Computational Fluid Dynamics (CFD) models, which targets shared- and distributed-memory parallel architectures. The benchmark suite has important characteristics such as intensive memory communications, complex data dependencies, different memory access patterns, and hardware components/sub-systems overload (e.g., integer and floating-point units, memory bandwidth, caches). Parallel programming abstractions such as APIs, libraries, and frameworks that are written in C++ as well as new optimizations and parallel processing techniques can benefit if NPB is made fully available in this programming language. Besides a thorough translation following the official reports, which we named NPB-CPP, we provide parallel implementations of all the NPB kernels and pseudo-applications targeting shared-memory architectures (commodity multicores) with OpenMP, Intel TBB, and FastFlow. A structured design of each NPB-CPP benchmark in terms of parallel patterns (notably Map and MapReduce constructs) is provided as well as a discussion is presented on how this design can be expressed for the different C++-based parallel programming abstractions selected in this paper. The results show that NPB-CPP is consistent and reliable with respect to the original NPB. It is also platform-independent and enables the parallelization using Intel TBB and FastFlow and possibly others can be evaluated in the future thanks to this work.

#### 1. Introduction

Parallel programming is a popular paradigm to design high-performance applications that leverage the capabilities offered by modern parallel hardware, both shared-memory architectures like multicores and NUMA of multicores, and distributed architectures like clusters, where shared-memory nodes are interconnected via fast networking technologies. The complexity of the available hardware has increased considerably over the years, with processors enhanced with out-of-order computing capabilities, memory hierarchies composed of different coherent private and shared levels of caches, and interconnection networks equipped with smart network interface cards used as co-processors for accelerating networking tasks.

In this ecosystem of complex hardware resources, parallel programming has evolved consequently, with the availability of frameworks that support the user in the development of parallel applications and in taking advantage of the underlying hardware in a productive way. Nonetheless, the so-called *programmability wall* [7] is still the main challenge

ORCID(s): 0000-0003-4824-4621 (J. Löff); 0000-0002-4690-3964 (D. Griebler); 0000-0002-6263-7723 (G. Mencagli); 0000-0001-8179-2318 (G.A.d. Araujo); 0000-0001-6323-3459 (M. Torquati); 0000-0002-7433-376X (M. Danelutto); 0000-0002-7506-3685 (L.G. Fernandes)

in parallel programming. Developing parallel applications productively requires *high-level programming tools* that hide the complexity of coping with complex hardware. Furthermore, they should enforce *performance portability*, e.g., they shall achieve satisfactory performance on different machines, both in terms of absolute performance like IPS/FLOPS as well as scalability with more processes/threads composing the parallel computation.

In the area of shared-memory parallel programming, parallel programming abstractions such as Intel TBB [38] (in C++) and OpenMP [29] (C++ and multi-language support) are of great popularity, with the former used for *task-based parallel programming* while the latter to annotate the sequential code with *pragma-based directives* for loop parallelizations and, from the most recent versions, providing a *task-based model* similar to TBB. In addition, the research community has proposed other families of programming tools fostering high-level abstractions to drive parallel programming. Notably, *Algorithmic Skeletons* [16, 10] (generally belonging to the field of Structured Parallel Programming), applied by some C++ research frameworks like FastFlow [1] and SkePU [14], are worthy of consideration, with active communities promoting this parallel programming style.

Assessing the effectiveness of parallel programming abstractions in terms of performance portability on new hardware architectures, which have become available day by day, requires proper *benchmark suites* composed of parallel workloads with sufficiently heterogeneous features to stress different hardware components/sub-systems (e.g., caches, floating-point units, memory bandwidth). Examples of benchmark

<sup>&</sup>lt;sup>a</sup>School of Technology, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre, Brazil 30332–0250

<sup>&</sup>lt;sup>b</sup>Laboratory of Advanced Research on Cloud Computing (LARCC), Três de Maio Faculty (SETREM), Três de Maio, Brazil

<sup>&</sup>lt;sup>c</sup>Department of Computer Science, University of Pisa, Pisa, Italy I-56127

<sup>\*</sup>Corresponding author at: School of Technology, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre, Brazil 30332–0250

 $<sup>\</sup>label{eq:continuity} $$ gabriell.araujo@edu.pucrs.br (G.A.d. Araujo); torquati@di.unipi.it (M. Torquati); marcod@di.unipi.it (M. Danelutto); luiz.fernandes@pucrs.br (L.G. Fernandes)$ 

suites are PARSEC [6], SPLASH [39], SPEC [19], which include workloads from different domains but all focusing on High-Performance Computing.

Among the existing benchmark suites, we are interested in the NAS Parallel Benchmarks [5] (briefly, NPB). NPB is a set of programs originally developed by NASA Advanced Supercomputing division to evaluate the performance of parallel machines. The benchmarks in the suite (three pseudoapplications and five basic kernels) are derived from Computational Fluid Dynamics (CFD) models and can be configured to work with predefined problem sizes (called "classes"). The available official implementation of NPB is written in Fortran, with parallelizations developed in OpenMP (based on the Fortran code) for shared-memory systems and MPI (Message Passing Interface) for clusters. The parallel workloads in the NPB suite have received great attention over the years, in different kinds of research activities targeting the performance evaluation of new parallel programming techniques and optimizations for specific architectures. An overview of such works will be given in Section 2.2. However, despite the wide use of NPB, no comprehensive porting of all the kernels and pseudo-applications in C++ has been released and made publicly available. Such porting is not only useful as a mere translation between programming languages, but it is of great interest from the research perspective because it enhances the possibility to systematically study, evaluate, and compare the performance of C++-based parallel programming abstractions, either consolidated or research ones, on a wide set of complex and real-world computational kernels like the ones in the NPB suite.

Our main contributions are the following:

- We provide a *thorough* translation (i.e. which respects the original sequential code structure in terms of data structures, loops and original programming style) of the five kernels and the three pseudo-applications from the original serial Fortran code to C++, making NPB-CPP usable and extendable also for future works.
- We provide parallel implementations of the entire benchmark suite in NPB-CPP. In addition to the original OpenMP version (currently maintained by NAS <sup>1</sup>), which is available in Fortran and directly translated, we provide parallel implementations with Intel TBB and FastFlow, aiming at covering both a consolidated and a research-based parallel programming abstraction.
- We show how the concept of structured parallel programming [25] can be applied for efficiently parallelizing NPB-CPP. We model each one of the parallel benchmarks in NPB-CPP using compositions of Map and MapReduce parallel patterns. This is a higher-level programming approach whose expressiveness and flexibility is discussed in this paper for OpenMP, Intel TBB and FastFlow.
- We provide a careful analysis of the performance obtained by NPB-CPP on different multicore machines

and compilers, demonstrating the quality of the porting from the (sequential and parallel) performance and functional correctness perspectives. Experiments are collected on a set of platforms (Intel Xeon, AMD Epyc, and IBM Power8) to show the performance portability on different multicore architectures.

A preliminary, partial version of NPB-CPP was first introduced in our previous work [18], where we presented a research on the C++ porting for the five serial kernels. The code we used was mainly based on an outdated NPB version (v2.4, released in the 2000s). With the current work, we replaced the code with a new NPB version (v3.4). More details about the changes can be found in NPB's official website <sup>2</sup>. This paper extends the previous work by: i) re-implementing all the sequential kernels to be compliant to the last NPB version (v3.4); ii) we complete the sequential porting with the three pseudo-applications that were not studied in our prior work; iii) we re-implement the five kernels to adhere to the structured parallel programming style, so building the new kernel implementations, and the three pseudo-applications parallel code, in terms of parallel patterns (Map and MapReduce), while our prior parallel implementations were based on ad-hoc parallelization approaches; iv) NPB-CPP can be configured to use all the workload sizes (classes) of the original suite, while the kernels in our prior work worked only for small class sizes (up to class B).

The source code of NPB-CPP is provided within a repository made accessible to the community<sup>3</sup>. NPB-CPP has recently been used by other independent research works as a baseline for experimental comparisons [27, 31, 22, 24, 12, 23, 15, 4], highlighting the importance of having a reliable implementation of the distinguished NPB suite.

The outline of the paper is the following. Section 2 provides a summary of the NPB kernels and pseudo-applications to make the paper self-contained, and provides an overview of past research activities where NPB has been used in the experimental evaluation for different purposes. Section 3 discusses our implementation, both the sequential porting from Fortran to C++ and the design of parallel versions using C++-based parallel programming abstractions. Section 4 shows the results of our experimental analysis, while Section 5 concludes our work.

# 2. Background

In this section, we briefly explain the basic structure of NPB in terms of kernels and pseudo-applications. Then, we will review some recent papers that used NPB.

# 2.1. NAS Parallel Benchmarks (NPB)

The NPB suite has five kernels and three pseudo-applications. The code poses several different challenges, as far as performance is concerned, such as irregular memory accesses, complex data dependencies, and short- and long-distance

hines <sup>2</sup>https://www.nas.nasa.gov/publications/npb\_changes.html

<sup>&</sup>lt;sup>3</sup>NPB-CPP GitHub repository: https://github.com/GMAP/NPB-CPP

<sup>&</sup>lt;sup>1</sup>https://www.nas.nasa.gov/publications/npb.html

data communications [5], where the former stresses data locality while the latter memory bandwidth capacity. The five kernels, all in Fortran except IS which is in C, are described as follows:

- Embarrassingly Parallel (EP). It generates a large number of Gaussian random deviates and enumerates them to compute the Gaussian deviation by utilizing the acceptance-rejection method. Finally, the number of pairs that lie in the square annulus is computed. This method is useful to measure the capacity of the floating-point operations of the target architecture [5].
- Multi Grid (MG). It utilizes the V-cycle MultiGrid method to compute a 3D scalar Poisson equation where the kernel continuously computes restriction and prolongation when alternating between coarse and fine grids. The goal is to stress short- and long-distance data communications [5, 20].
- Conjugate Gradient (CG). It computes an approximation of the smallest eigenvalue of a large, sparse, and unstructured matrix, utilizing the Conjugate Gradient method. This kernel stresses data communication mechanisms as well as memory locality and caches [20].
- Discrete 3D Fast Fourier Transform (FT). It computes a Fast Fourier Transform (FFT) of a 3D partial differential equation using the spectra and inverse methods in an iterative loop. It simulates an intensive long-distance communication [5, 20].
- Integer Sort (IS). It performs an integer sorting among a sparse set of numbers, which simulates an important computation for particle-in-cell applications. By default, it is based on the Bucket-Sorting algorithm. This kernel simulates and measures integer computation and data communication capabilities [5].

The pseudo-applications implement three different iterative methods to solve a 3D Navier-Stokes system of differential equations describing the flow of incompressible fluids. We summarize them as follows:

- Block Tri-diagonal solver (BT). It is an expensive implicit algorithm to numerically solve 3D Navier-Stokes equations. The solution is based on an Alternating Direction Implicit (ADI) factorization on a 3D matrix, which produces block-tridiagonal systems that, along each direction, solve the unknown vectors using the back substitution method [20].
- Scalar Penta-diagonal solver (SP). It uses the Beam-Warming approximate factorization to decompose the 3D matrix. The output consists of a particular case of band matrices known as Scalar Pentadiagonal matrices. Then, the tridiagonal matrix algorithm is applied over the block-tridiagonal systems and the back substitution method solves the remaining vectors [20].

• Lower-Upper Gauss-Seidel solver (LU). It utilizes the Symmetric Successive Over-Relaxation (SSOR) method, which combines two SOR computations. The latter is a variation of the Gauss-Seidel method that solves a linear system of equations. First, a forward SOR sweep is performed followed by a backward SOR sweep to update the unknown variables in reverse order.

# 2.2. NPB in prior works

The current available official NPB version is implemented in Fortran, which was elected over other languages considering its popularity in CFD applications. The selected papers for discussing and comparing are those based on C language. Previous works, such as [33, 11] and [30], have used NPB for evaluating specific architectures and systems. The authors of [33] have ported the NPB to C and re-implemented the applications in OpenCL, to leverage heterogeneous architectures equipped with GPUs. However, no information about how the conversion was performed is given. Also, the authors did not consider the largest NPB workloads (e.g., class C) because of the limited memory capacity of the GPU considered (GeForce GTX 480). More recently, a new research has extended this work by proposing an improved GPU version for OpenCL and CUDA [11]. Again, no information about the pitfalls of the porting was provided, while the focus was only on the parallelization. The performed evaluation demonstrates increased performance compared to the original version on GPU-based platforms, while no results have been presented on standard multicore platforms.

A prior work [30] presents a methodology to improve a strong scalability evaluation for OpenMP. They implemented the methodology in the PCERE (Parallel Codelet Extractor and REplayer) tool that extracts and executes OpenMP parallel regions. Their evaluation was performed on a C version developed starting from an early NPB version (2.3) [8], which was later made available. Some research works have focused on the Unified Parallel C (UPC) language, an extension of the C programming language designed for HPC platforms. A version of NPB was ported to the UPC language, and it is composed of two main parts. The first contains all the five NPB kernels and was developed by the HPC Laboratory from the George Washington University as part of the Berkeley UPC Compiler project [13, 40]. The second contains the three pseudo-applications and was developed by the NASA Ames Research Center [21]. This NPB-UPC version is distributed with the Berkeley UPC project. The five previous research works [33, 30, 11, 13, 21] put a significant effort to convert and extend the NPB towards C-based languages.

In recent studies, C-based NPB versions have been used for evaluation purposes. The work in [27] investigates the performance and potential of the parallel STL (Standard Template Library) using NPB kernels. Parallel STL is an Intel library developed using Intel TBB as a backend for parallel algorithms. This work used our previous version of the NPB ported to C++ [18], which was limited to the five kernels without the pseudo-applications. It also highlights the importance of having such sequential porting in C++. In

| Table 1                                                                  |  |
|--------------------------------------------------------------------------|--|
| Summary of past papers using NPB in their experimental evaluation. $ \\$ |  |

| Reference | Year | Benchmarks                  | NPB<br>Version | Languages | Parallel<br>Versions        | Target<br>Architecture |
|-----------|------|-----------------------------|----------------|-----------|-----------------------------|------------------------|
| [13]      | 2002 | CG,EP,FT,IS,<br>MG          | NPB 2.4        | UPC       | UPC                         | Cluster                |
| [21]      | 2009 | BT,LU,SP                    | NPB 3.3        | UPC       | UPC                         | Cluster                |
| [33]      | 2011 | BT,CG,EP,FT,<br>IS,LU,MG,SP | NPB 3.3        | С         | OpenMP,<br>OpenCL           | GPU                    |
| [30]      | 2015 | BT,CG,EP,FT,<br>IS,LU,MG,SP | NPB 2.4        | С         | OpenMP                      | Multicore              |
| [11]      | 2019 | BT,CG,EP,FT,<br>IS,LU,MG,SP | NPB 3.3        | С         | OpenMP,<br>OpenCL,<br>CUDA  | GPU                    |
| Ours      | 2020 | BT,CG,EP,FT,<br>IS,LU,MG,SP | NPB 3.4        | C++       | OpenMP,<br>FastFlow,<br>TBB | Multicore              |

our work, besides completing the NPB porting, we study the performance of NPB directly implemented using Intel TBB without the additional layer provided by Parallel STL, for a fairer comparison with OpenMP and FastFlow.

The work in [31] uses the NPB to evaluate five autoparallelizing compilers (Cetus, Par4all, Rose, ICC, and Pluto). Also in this case, the authors used our previous version of the NPB ported to C++. An optimized memory allocator for the Single-Assignment C (SaC) compiler was proposed in [37]. The authors re-implemented the NPB in SaC to evaluate their optimizations. A tool that systematically analyzes sharedmemory accesses of UPC applications has been proposed in [9]. The authors tested NPB kernels and applications to fine-tune data redistribution, and for enhancing the use of private variables for improving local accesses. A locality-aware framework for thread affinity placement based on hierarchical data locality has been presented in [3]. The authors selected four NPB kernels (IS, FT, CG, and MG) for covering a wide range of communication patterns. Hardware support mechanisms to efficiently manipulate PGAS address mapping and to improve data access overhead have been introduced in [34]. Experiments were conducted on a subset of NPB kernels.

Among the previously cited papers, we selected five papers that are closely related to our work for comparing specific features. Table 1 provides a summary of the characteristics of those five papers and this work. The first and second columns report the reference and the year of publication. The third column reports the target kernels and pseudo-applications supported. The fourth column shows the NPB version on which they are based. The fifth column shows the language used and the sixth column reports the adopted parallel programming abstractions. The seventh column highlights the targeted parallel architectures.

In past papers [33, 11], the authors have translated all the kernels and applications of the NPB 3.3 targeting the C language. In [30] and [13] authors used an outdated NPB version. In [21] authors have implemented three pseudo-applications using the NPB 3.3 version to target the UPC programming

language. Differences are also observed concerning the target architectures. Although [30] is targeting multicores as our work, their version is a raw translation to test a compiler tool. Most significantly, none of these previous works aimed to provide a consistent and generic NPB version (kernels and pseudo-applications) in C++ that is platform agnostic (see Sect. 3). Our benchmark code can be easily extended to support other computer architectures through the use of existing C++ parallel programming abstractions. Furthermore, new C++ compiler tools can also use our NPB-CPP to evaluate the impact of new code optimization and autoparallelization techniques, which represents an additional, indirect contribution of our work.

#### 2.3. Parallel Programming Abstractions

Modern systems can generate millions of data per day, that require to be processed timely. To follow this trend, parallelization is crucial for extracting the maximum performance in the underlying architecture, specially in modern multicores. On the other hand, parallel programming still lacks accessibility and is a challenge to developers, since they must deal with low-level details (e.g. scheduling, load balancing and synchronizations).

Parallel programming in multicore architectures relies on efficient abstractions enabling different parallelism exploitation patterns in a relatively easy-to-use manner by the high-level programmer. The C++ programming language already has consolidated parallel frameworks that offer the structured parallel programming paradigm. This parallel programming style is based in *algorithmic skeletons* or *parallel patterns* for hiding from the developer many of the low-level complexities intrinsic to parallelism. In the following, we introduce three popular parallel programming abstractions: OpenMP, Intel TBB and FastFlow.

*OpenMP* OpenMP [29] is a parallel programming framework for multicore systems. It is based on *pragma* annotations to be applied directly on the source code before regions that

require a significant portion of the overall execution time, like *for* and *while* loops. The OpenMP compiler is in charge of transforming the annotations in a code leveraging multicores by using threads that split the execution of the loop iterations. Recently, starting from the 4.0 standard, OpenMP supports the *task-based parallelism paradigm* with proper pragmas to create tasks and link them with dependencies. This allows the programmer to create a task, e.g. a portion of code plus its surrounding data environment, that can be scheduled for the execution on an available thread provided that all its input dependencies are satisfied.

Intel TBB TBB [38] is the Intel suite for parallel programming on multicores. In the original idea, TBB provides the programmer with abstractions to create tasks connected by dependencies and to schedule them transparently on a pool of threads. Complex work-stealing techniques have been adopted to balance the parallel workload of the underlying threads in an effective and cache-friendly manner. A higher level of abstraction has been added to represent graphs of special importance in the parallel programming practice, like pipelines of filter stages. Complex graphs can be easily developed by leveraging the FlowGraph interface, introduced in TBB from version 4.0.

FastFlow FastFlow [1] is an open source<sup>4</sup> structured parallel programming framework. It provides the application programmer with a variety of ready-to-use stream and data parallel patterns (from this the structured keyword) that may be freely composed and customized to implement complex parallel applications. FastFlow is a header-only library implemented on top of POSIX/Pthread/C++11, and parallel patterns are used by instantiating proper classes of the library. The framework has been designed to target multicore machines with two main goals in mind: performance and programmability. More recently [36], it has been extended to target GPUs and FPGAs. Recent advances include the possibility to use alternative implementations of the communications primitives (e.g., switching between blocking and non-blocking synchronizations). Furthermore, the library provides with its most recent version v3.0 an intermediate layer (called Building Blocks) for system programmers aimed at being used to develop novel parallel run-time systems for specific application domains.

# 3. Implementation

The goal of this section is twofold. First, we describe how the porting has been developed, and the principles behind our thorough translation from Fortran (C in case of IS) to C++. Then, in the second part, we show the strategies we followed for parallelizing NPB with different C++ parallel programming abstractions targeting the structured parallel programming approach based on compositions of parallel patterns. Our design choices aim at simplifing the usability

and extensibility of NPB-CPP. Therefore, we provide C++ code that can be easily converted to C-like code with minimal modification effort. Consequently, NPB-CPP is not object-oriented, however, it can be considered as a future work for evaluating modern C++ features and compare them with well-established parallel programming abstractions.

#### 3.1. C++ porting and conventions

Our porting was conducted following the official documentation [5, 20]. When implementing the C++ code of NPB, the first guideline was: every time a global array is declared in the Fortran code, we allocate it as dynamic memory in C++ since NPB uses very large arrays that may result in memory stack overflow errors. Furthermore, we implemented the global arrays to be allocated in one single dimension for all kernels and pseudo-applications. However, for internal multi-dimensional accesses, we perform conversions from linear to multidimensional indexes. NPB reports do not recommend the use of fixed multidimensional arrays because the benchmarks implement different data access patterns. In our design, we use linear arrays as the main layout for our data structures because, although it is not necessarily the best choice to optimize the cache hierarchy utilization (given the irregular access pattern of some of the benchmarks), our goal is to provide a generic code that can be extensible for a wide spectrum architectures with minimal code changes to provide custom optimizations for specific platforms.

Another important guideline was to follow as much as possible the original Fortran code semantics during the sequential code porting. However, it was not always possible to literally translate the code. In particular, we often needed to modify the ordering of accesses to arrays. The reason is that Fortran is column-major ordered (each complete column of the matrix is stored before the next one), while C++ is row-major ordered (each complete row of the matrix is stored contiguously). Furthermore, the interval of valid positions in the Fortran arrays is [1; n] while it is [0; n-1] in C++, which required straightforward changes in the loops. NPB code has also many goto statements, which have been removed by implementing while loops with proper stop conditions.

Although the IS kernel was already implemented in C, it uses some Fortran routines shared with other kernels. During our porting to C++, we re-implemented such routines in C++. The porting to the FT kernel required a substantial code refactoring effort to remove the batch mechanism which gathers a set of FFT operations to increase the computational granularity. Consequently, there is no need to manually specify the batch size parameter. Other code modifications (e.g., removing data dependencies, nested loops, low-level cache optimizations) were necessary to keep the coherence between the sequential and OpenMP code structures for FT and IS since they were the only kernels having different sequential and OpenMP versions. This contributes to have a fair performance comparison when parallelizing the code.

# 3.2. Parallel implementation

The original NPB code in Fortran is shipped with an official OpenMP parallel implementation, and this parallel

<sup>&</sup>lt;sup>4</sup>The source code of the library is publicly available at: https://github.com/fastflow

```
Map parallel pattern on FastFlow
Map parallel pattern on OpenMP
  #pragma omp parallel for
                                         ff::ParallelFor pf;
    for(i=0; i<n;i++){
                                       2. pf.parallel_for(0, n, 1, chunk, [&](int i){
        //computation
                                      4. }.
5. }
                                       5. nworkers);
Map parallel pattern on TBB
 1. tbb::parallel for( tbb::blocked_range <size_t> (0,n,chunk), [&](const
tbb::blocked_range < size_t>& r){
    for (i=r.begin(); i<r.end(); i++){
5. }):
                                        (a) Map
MapReduce parallel pattern on OpenMP
                                            MapReduce parallel pattern on FastFlow
  #pragma omp parallel for reduction (+:x)

    ff::ParallelForReduce <double> pf;

  for(i=0; i<n;i++){
                                             . pf.parallel_reduce(x,0, n, 1, chunk, [&A](int i){
  }
                                             . }, [&A](int i, double& x) { x += A[i];},
5. }
MapReduce parallel pattern on TBB
1. Z = tbb::parallel_reduce( tbb::blocked_range <size_t> (0,n), 0.0, [&](const tbb::blocked_range
 <size t>& r. double x){
   for (i=r.begin(); i<r.end(); i++){
   return x.
```

**Figure 1:** Basic routines to implement the Map and MapReduce parallel patterns using OpenMP, FastFlow, and Intel TBB.

(b) MapReduce

strategy was adopted as a reference in our C++ version owing to the OpenMP multi-language support for both Fortran and C/C++. However, in addition to this, our C++ porting allows the evaluation of strategies based on different parallel abstractions available in C++. We consider Intel TBB [38], which is a mainstream tool, and FastFlow [1], which is a pattern-based research library based on Algorithmic Skeletons [16, 10]. We chose FastFlow because we compare it against state-of-the-art solutions for shared-memory architectures like OpenMP and TBB, and the results of this comparison also help to improve other higher-level APIs and DSLs (Domain-Specific Languages) that rely on FastFlow as its parallelism runtime [17, 26, 32].

#### 3.2.1. Parallel design choices

5. }. std::plus<double>()):

The design choices and principles adopted in the implementation of our parallel versions are the following: (1) the first goal is to make the parallel implementations as faithful as possible to the structured parallel programming approach using data-parallel patterns such as Map and MapReduce (Fig. 1); (2) our second goal is to be uniform in the design, by making use of the features available in the different parallel programming abstractions. For example, although we have intrinsic mechanisms such as std::mutex in C++, we choose tbb::mutex from Intel TBB; (3) our third goal is to avoid architecture-specific optimizations and let the code be portable on a range of different platforms. However, people interested to obtain the maximum performance on specific architectures or parallel programming abstractions may in

the future easily extend NPB-CPP to implement their specific optimizations. Examples of them are the use of custom task-parallelism patterns, memory and thread affinity strategies, or even targeting different platforms like distributed architectures (e.g., clusters) or GPUs (see [4]).

OpenMP, TBB, and FastFlow provide different parallel programming APIs. In order to generalize the presentation of the parallelizations and introduce the use of structured parallel programming, we describe them using abstract *Map* and *MapReduce* data-parallel patterns as well as *Critical Sections* and *Barriers* that are all present in the different tools (each API has its own specific implementation). They are briefly introduced in the following list:

- the Map pattern [25], consists of the replication of a function that applies over all elements of an indexed set. This can be used to parallelize for loops when iterations are independent. OpenMP, TBB, and FastFlow offer an API called "parallel for" for this purpose, see Figure 1a. In OpenMP, programmers annotate parallelizable for loops using compiler pre-processor directives. In TBB and FastFlow, programmers replace parallelizable for loops using routines that are implemented by a C++ template library. The main difference is that the parallel region for OpenMP and TBB is thread-private while FastFlow executes directly the parallel Map operation. The scheduling type is optional, however, OpenMP and FastFlow, by default apply a static assignment of iterations to the underlying threads, while TBB uses a dynamic distribution;
- the MapReduce is the union of a Map and a Reduce pattern. According to [25], the Reduce pattern combines all elements from a collection and produces a single element (or a subset) using an associative binary operator. Therefore, in the MapReduce, every element of the Map is combined into a single element. This pattern can be used to parallelize for loops when iterations exhibit specific data dependencies and synchronization is required. OpenMP, TBB and FastFlow offer an API called "parallel\_reduce" as shown in Figure 1b. In OpenMP, programmers use reduction along with the parallel\_for directive for specifying the operation type and the reduced variable. This parameter only accepts predefined types. In TBB and FastFlow, programmers replace the target for loop with a specific routine receiving as arguments a set of parameters such as the lambda functions implementing the Map and the Reduce steps;
- the Barrier is a synchronization primitive for a group of threads [25]. The barrier guarantees a synchronization point where any thread must stop and cannot proceed until all other threads reach the barrier. The barrier definition considers both implicit and explicit barriers. By default, implicit barriers are found at the end of Map and MapReduce patterns on all parallel programming abstractions. OpenMP has a nowait directive that allows threads to avoid this synchronization.

Table 2
Structure of each NPB benchmark in terms of parallel patterns in FastFlow(FF), OpenMP (OMP), and Intel TBB (TBB).

| Benchmark    | Мар |     |     | Ν  | 1apRec | luce | Barriers |     |     |
|--------------|-----|-----|-----|----|--------|------|----------|-----|-----|
| Delicilliark | FF  | OMP | TBB | FF | OMP    | TBB  | FF       | OMP | TBB |
| EP           | 1   | 1   | 1   | 1  | 1      | 1    | 1        | 1   | 1   |
| MG           | 11  | 11  | 10  | 1  | 1      | 1    | 11       | 11  | 10  |
| CG           | 7   | 18  | 7   | 4  | 6      | 4    | 7        | 11  | 7   |
| FT           | 8   | 8   | 8   | 1  | 1      | 1    | 8        | 10  | 8   |
| IS           | 6   | 7   | 6   | 1  | 1      | 1    | 6        | 7   | 6   |
| ВТ           | 23  | 23  | 23  | -  | -      | -    | 23       | 23  | 23  |
| SP           | 19  | 19  | 19  | -  | -      | -    | 19       | 19  | 19  |
| LU           | 22  | 23  | 22  | 1  | 1      | 1    | 22       | 19  | 22  |

However, TBB and FastFlow do not have this option. Explicit *barriers* are implemented when synchronization is required across threads. In OpenMP we used the #pragma omp barrier. In TBB and FastFlow, we used the standard Pthread library barrier mechanisms.

Table 2 shows the number of instances of the parallel patterns and synchronization primitives that have been used in our NPB-CPP parallelization with the three parallel programming abstractions. The FastFlow and TBB versions show a different number of patterns and synchronization primitives in some kernels and pseudo-applications (MG, CG, IS, and LU). The reason is that some instances of *Map* and *MapReduce* do not bring any performance improvement (their loop body does very small computation). Better performance can be achieved with the TBB and FastFlow versions by removing such few fine-grained Map and MapReduce.

Although the same could be done in OpenMP, and because no performance penalty was observed, we decided to maintain the code of our C++ OpenMP version identical to the original OpenMP Fortran code provided by the NAS experts. Other asymmetries in the implementation with the different tools are very specific to each kernel and pseudoapplication, and we omit to describe them since they do not represent a central point for this work.

In the rest of this section, we describe the main aspects of the parallelization for each kernel and pseudo-application.

#### 3.2.2. EP kernel

The EP kernel has a single compute-intensive code region that we parallelized using a *MapReduce* with static scheduling. This choice has been applied in all our implementations for the different parallel programming abstractions. Additionally, this kernel needs a special synchronization at the end of the parallel computation. As the default *MapReduce* implementation (described in Section 3.2.1) accepts only standard types and there is a reduction over an array, we manually implemented the data synchronization in OpenMP, TBB, and FastFlow.

#### 3.2.3. MG kernel

The MG kernel uses the multigrid V-cycle operation with a residual computation. The strategy adopted is to parallelize with *Maps* the intensive computational regions of the V-cycle method, which are the restriction, prolongation, residual, and smoother routines. Then, we also parallelized some less intensive routines such as the communications along borders (with a *Map*) and the approximation to the L2 and uniform norm values (with a *MapReduce*). This last *MapReduce* needed special care since threads synchronize on two variables for different operation types (sum and max). In OpenMP, this is done with two different reduction directives, while in FastFlow and TBB we manually implement the *MapReduce*. Finally, we point out that the MG kernel exhibits limited scalability with more threads, as it will be shown in the experimental part. The reason is that the main computational step works on a small grid and requires non-local accesses to the memory, leading to poor cache exploitation.

#### 3.2.4. CG kernel

The most computationally demanding step in CG is the sparse matrix-vector multiplication q = Ap of the Conjugate Gradient method, which we parallelize using a Map. In this case, the static scheduling works well although CG has an irregular workload. This is because the workload follows a random Gaussian distribution, and when the slices are statically divided they tend to store equally balanced workload. Furthermore, we used multiple Maps and MapReduces to parallelize other less compute-intensive steps, as in Table 2. The consequence of including several Map and MapReduce is that synchronization barriers are implicitly added to the code, and this synchronization overhead dominates the small improvement obtained by introducing such parallel steps. We mitigated this in OpenMP, where some implicit barriers in the Maps were removed using the nowait directive.

#### 3.2.5. FT kernel

This kernel contains three independent symmetric FFT routines to compute each dimension, which we parallelized using a *Map*. However, communications may represent a bottleneck since the kernel must decompose slices of the main 3D matrix into a 1D local array each time it applies an FFT resolution, and then it requires to copy the results back. In addition, we manually implemented the *MapReduce* for computing the checksum in all the parallel programming abstractions. This needs special treatments because the reduce operation is applied over complex numbers.

#### 3.2.6. *IS kernel*

The IS kernel uses the bucket sorting approach. The sequential code (originally written in C) has a synchronization protocol for the parallel OpenMP version. This protocol needs to be adapted for TBB and FastFlow possibly by maintaining the code as closely as possible to the original one. To this end, we adapted the default *Map* (described in Sect. 3.2.1) by using it just to replicate the same function without specifying which elements of the indexed set are computed in parallel. Then, the scheduling is performed manually inside the threads' private region. This splits the computation and determines which indexes of the set each thread computes. An explicit *Barrier* is added after each instance of the modi-

fied *Map* for synchronization. We point out that in Table 2 the number of *Maps* in IS matches the number of *Barriers* for this reason. We also instantiated the standard *Map* in some loops, as for the sorting operation inside the buckets. In this case, we used the dynamic scheduling approach to better balance the workload between buckets. Finally, we instantiated the *MapReduce* to implement the verification function after the computation.

# 3.2.7. BT and SP pseudo-applications

Implicit methods are typically used to get an approximation of the solution in CFD equations through iterative techniques, as implemented in the BT, SP, and LU pseudoapplications. Besides the divergent factorization methods used for BT and SP, those pseudo-applications were very similarly designed. Both are originally implemented from a parallel viewpoint. Therefore, their data have already been organized to avoid dependencies. In OpenMP, FastFlow and TBB, we only instantiated sequence of *Maps* to parallelize them, as shown in Table 2. A practical example of the data organization's importance concerning performance is found in BT and SP for the residual computation routine. Dimensions x and y are implemented with a single Map due to the lack of data dependencies since data are organized along the z direction. However, the z dimension itself is fragmented to avoid data dependencies, requiring six *Maps* instead of a single one as in x and y directions. Finally, the most compute-intensive steps of both BT and SP, along with the residual computation, are the three solving functions for computing each of the three dimensions. We parallelized them instantiating the *Map* pattern in the outermost loops.

#### 3.2.8. LU pseudo-application

The LU pseudo-application implements the Symmetric Successive Over-Relaxation (SSOR) method. This algorithm extends the Gauss-Seidel method to solve a linear system of equations. Its intensive computation relies on the decomposition of the 3D matrix system in triangular lower/upper matrices and then solving this matrix system. Both computations, upper and lower, are performed in two similar steps on the same iteration. Previous studies [20, 21] suggest two main ways to parallelize the LU application: hyperplane and pipelining. In the first, points from the same hyperplane defined by l = i + j + k can be computed in parallel. The so-called *pipeline* strategy instead implements a synchronization structure using data-parallelism mechanisms to control the computational flow in a way that mimics a multidimensional *pipeline*. The LU cannot be parallelized efficiently with traditional data-parallelism techniques because all three dimensions must deal with data dependencies. This means that any modification would send a lot of update messages between threads, impairing parallel scalability.

We adapted the original OpenMP parallelization in Fast-Flow and TBB. First, we modify the *Map* as we did for the IS kernel (see Section 3.2.6). We also implemented synchronization mechanisms using *locks* to control the data flow in FastFlow and TBB, where a thread starts the computa-

tion only when its previous neighbor has finished. OpenMP uses a different approach based on  $omp\ flush$  directives. We graphically show the parallel data flow outcome in Fig. 2. Parallelism is achieved when for each advance along the k direction, a new thread starts computing the next block of elements from the indexed set.



Figure 2: The parallel data flow illustration for LU.

# 4. Experiments

The set of experiments was carried out on three different multicore platforms, namely: Xeon, Epyc, and Power8. We used the GNU compiler since it represents the most popular one in the open source community, and also the Intel compiler (ICPC) since it represent the most popular proprietary compiler. The architectural specifications and environment settings of the three machines are reported in Table 3. The NPB workload sizes are expressed through classes, where classes A, B, and C are standard test problems having about 4x size increase going from one class to the next one. We select class C for our tests because it represents a significant workload size for modern multicores [28]. For compiling the benchmarks, we specified -std=c++14 and -03. Each experiment configuration was repeated 5 times. The plots are reporting the arithmetic mean value and the standard deviation using error-bars. We also apply statistical analysis using 95% of reliability to compare the differences in the execution times. We ensure the functional correctness of the results provided by NPB-CPP using the built-in verification functions implemented in the NPB. Such functions compare the results obtained in one execution with the correct ones stored within the benchmark. The execution is successful when the result is identical or within a tolerated error range (e.g., CG tolerates less than  $10^{-10}$ ). In all cases, our C++ porting successfully passes all those verification checks.

We will first discuss the performance of the sequential versions. Second, we examine the performance of the parallel porting using the same OpenMP runtime (distributed within compilers). Then, we discuss the performance obtained by using Intel TBB (2020.1) and the FastFlow (3.0.0) library. Finally, we compare the performance with our previous work's benchmark versions.

#### 4.1. Sequential porting

In this section, the goal is to evaluate our sequential porting, observing the performance behavior of NPB-CPP compared to the original NPB. Although we only present the

Table 3
Multicore platforms and their environment settings (OS and compilers).

| Name   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Xeon   | It has a Dual-socket Intel Xeon E5-2695 Ivy Bridge CPUs running at 2.40GHz, featuring 24 cores (12 per socket). Each hyper-threaded core has 32KB private L1, 256KB private L2 and 30MB of L3 shared with the cores on the same socket. The machine has 64GB of RAM. System. Linux 4.15.0-72, Ubuntu 18.04.3LTS. Compilers. GNU gcc 9.1.0 and Intel icc 19.0.5.281                                                                                             |
| Ерус   | It has a Dual-socket AMD EPYC 7551 CPUs Zen microarchitecture running at 2.40GHz, featuring 64 cores (32 per socket). Each core has 2 HW threads, 64KB private L1, 512KB private L2 and 8MB of L3 shared with other three cores. Each socket has 4 NUMA nodes. The machine has 128GB of RAM. <b>System.</b> Linux 4.15.0-101, Ubuntu 18.04.4LTS. <b>Compiler.</b> GNU gcc 9.1.0                                                                                |
| Power8 | It has a Dual-socket IBM server 8247-42L with two Power8 processors each with ten cores organized in two CMPs of 5 cores working at 3.69GHz. Each core (8-way SMT) has private L1d and L2 caches of 64 KB and 512 KB, and a shared on-chip L3 cache of 8 MB per core. The total number of cores per CPU is 20 physical and 80 logical ones. The machine has 64 GB of RAM. <b>System.</b> Linux 4.4.0-47, Ubuntu 16.04. <b>Compiler.</b> GNU gcc version 9.1.0. |

plots for the Xeon platform, we also describe in a nutshell the outcome for the other two platforms. Fig. 3 shows the results obtained with the GNU gfortran/g++ compilers on the Xeon platform. X-axis presents the NPB benchmarks in both graphs, while the Y-axis presents the execution time in seconds (using a logarithmic scale) in one plot while the other shows the relative difference in percentage. The relative difference is obtained by normalizing NPB-CPP benchmarks' execution time with respect to the original NPB, where positive bars stand for C++ faster and negative bars the opposite.

On average, the C++ version achieves similar performance compared with the original Fortran code. The normalized difference is less than 1.5%, except for BT, LU, and FT. One of the major differences between NPB-CPP code with respect to NPB is that we dynamically allocate single dimensional arrays, while in NPB they were allocated statically using multi-dimensions. As already explained in Sect. 3, we use linear arrays to provide a generic code that can be extensible for a wide spectrum architectures. Also, since the benchmarks implement irregular data access patterns, using linear memory allocation is less intrusive than multi-dimensional arrays for this purpose. In BT, this modification has a positive impact-C++ is 7.43% faster than Fortran-while in LU the impact is negative, since C++ is 4.3% slower. Concerning FT, the C++ version is 2.15% slower than the Fortran version, mainly because of the complex type (intrinsic data type only in Fortran).

To observe the performance using different compilers on the same platform, Fig. 4 shows the execution time in seconds (using a logarithmic scale) obtained with the Intel compiler on the Xeon platform. The relative differences be-





**Figure 3:** Comparing the performance of NPB vs NPB-CPP on the Xeon platform using the GNU compiler with Class C.

tween Fortran and C++ are also shown. In this case, the differences between the two versions are more evident. The FT kernel presents the biggest difference: the C++ version is 27.49% slower than the Fortran one. The MG kernel also shows a high normalized difference: the C++ version is 18.26% faster than the Fortran version. Both FT and MG are the two benchmarks that allocate more memory and stress long-distance data communications (see Section 2.1). We observed that the most significant performance differences are in those benchmarks that are more memory intensive. Besides the FT and MG kernels, the C++ version of the BT and SP pseudo-applications are 7.13% and 5.83% slower than the Fortran ones. These pseudo-applications are faster when using ifort than gfortran, while the C++ version (g++ and icpc) is similar for both compilers, which explains the small increase in the difference. For the remaining benchmarks, the average difference is less than 3%.

The experiments conducted for the other two platforms using the GNU compiler are summarized as follows. We present the percentage numbers normalizing the difference between NPB-CPP and NPB execution times. The outcome is that NPB's benchmarks are on average 0.86% faster than





**Figure 4:** Comparing the performance of NPB vs NPB-CPP on the Xeon platform using the Intel compiler with Class C.

NPB-CPP ones in Epyc. In Power8, NPB-CPP's benchmarks are on average 0.73% faster than NPB ones. As discussed for the Xeon platform, the differences are less than 1% on average. We can conclude that the sequential porting of NPB-CPP is reliable and efficient compared to the NPB in different compilers and platforms.

#### 4.2. Parallel porting

This section aims at comparing the performance of the parallel porting on shared-memory architectures, which was first implemented with OpenMP in Fortran by the NAS experts. Our parallel porting translates from *Fortran directives* to *C++ directives* as discussed previously. Consequently, we are comparing the behavior of OpenMP parallelism between NPB-CPP and NPB benchmarks. Graphs in Fig. 5 show the execution time in seconds (using a logarithmic scale), varying the number of threads from 1 to 48 on the Xeon platform and with workload Class C. The plots have a second Y-axis to report the normalized difference in percentage between NPB-CPP and the NPB using bars. A positive difference means that the NPB-CPP version is faster than the NPB while a negative value vice-versa.

To better evaluate the differences for each parallel execution, we colored the bars with red and green colors. The assigned color is the result obtained by the p-value statistical analysis [35]. The smaller the p-value, the stronger the evidence that the null hypothesis should be rejected. In our statistical analysis, the null hypothesis ( $H_0$ ) is NPB=NPB-CPP (e.g., the NPB-CPP provides the same level of parallel performance of the original NPB). The alternative hypothesis ( $H_1$ ) is then NPB≠NPB-CPP. To reject  $H_0$ , the p-value must be less than 0.05 (this is a commonly used threshold in the literature for software experiments). When  $H_0$  (green color) is rejected, we assume  $H_1$  (red color). In this way, it is possible to quickly identify which results are significantly different from the statistical standpoint considering 95% confidence in Fig 5.

As sketched in Fig. 5, the performance among NPB and NPB-CPP benchmarks are very close. The EP and MG kernels (Figs. 5a and 5b), followed by BT and LU pseudoapplications Figs. 5f and 5h), are the benchmarks presenting more cases with significant statistical difference. For LU the primary reason is that the NPB-CPP sequential version is, on average, 4.3% slower than the NPB version. This difference persists in the parallel version (Fig. 5h). In BT, while in the NPB-CPP sequential version was 7.43% faster, the parallel version shows that NPB-CPP with OpenMP configured in one thread is 4.15% slower than NPB, which is in Fortran. The BT and SP pseudo-applications are both bounded to the sequential PDE solver. However, in the Xeon platform, only SP stops scaling around 12 threads, which explains why its results tend to diverge less.

FT and MG benchmarks presented larger performance differences. The first in favor of the NPB-CPP version whereas the second in favor of the NPB version. Something similar happens to CG, where results are different since it executes over irregular workloads that require memory locality. For IS, on average, the execution time is similar except for the low and high parallelism degree. Finally, in the EP kernel, NPB-CPP is slightly faster than NPB in all cases.

To observe how the performance is impacted by the use of different compilers, Table 4 summarizes the best parallel execution times obtained on the Xeon platform using both the GNU and Intel compilers with workload Class C. On average, the results are close to each other, except for EP, FT, and MG. EP executes 25.6% faster with icc than gcc. Both FT and MG stress long-distance data communications and their performance relies mainly on how much the compiler is capable of optimizing memory accesses.

After presenting and discussing these experiments, we can conclude that the parallel porting achieved reliable results and efficient performance compared to the original version. The statistical analysis revealed that 58.8% of the tests (combining the number of threads and benchmarks) stand equal for NPB and NPB-CPP. In addition to that, we can see that the differences were small in average. The next section will extend the parallelism performance analysis on other parallel programming abstractions.



**Figure 5:** Experimental results for NPB-CPP and NPB benchmarks parallelized with OpenMP on the Xeon platform by using the GNU compiler and workload Class C. Lines show the execution time in seconds (using log scale) while bars show the normalized difference. Bar's colors show if this difference is significant from the statistical standpoint under 95% of reliability: green color means NPB=NPB-CPP whereas red means NPB=NPB-CPP.

#### 4.3. NPB-CPP with Intel TBB and FastFlow

As the previous experiments highlighted the reliability and efficiency of the sequential and parallel porting, this section will evaluate and discuss the performance scalability outcomes of the benchmarks for other parallel programming abstractions, which now can be tested on these benchmarks due to this work. Therefore, we used as the baseline the performance results of the NPB-CPP with OpenMP (previously discussed) to compared with the NPB-CPP parallelized using Intel Threading Building Blocks (TBB) and FastFlow (FF) (see Sect. 3.2). Since our experiment environment uses NUMA architectures, thread affinity can significantly affect performance. We used the default configuration for each parallel programming abstraction, therefore, OpenMP and Intel TBB let the OS handle thread affinity while FastFlow places threads to physical cores first from 0 to N in ascend-

ing order. Table 5 reports the best speedups of each version computed over the NPB-CPP sequential version, the best execution time in seconds, the number of threads used, and the standard deviation. These metrics are presented for the three platforms.

EP is an embarrassingly parallel computation implemented by a single *MapReduce* pattern. As expected, it reaches its best speedup by using all available cores of the machines in all parallel programming tools and platforms. TBB shows the highest speedup due to its work-stealing scheduling. The MG kernel requires access to non-linear addresses in memory, preventing the best cache usage. Since Power8 has a higher memory bandwidth capacity [2], it provides better results. FF and OMP provide higher speedup, mainly due to the static scheduling policy of loop iterations adopted. The main difference between OMP and FF is that OMP creates a parallel region

Table 4 The best execution times among NPB and NPB-CPP benchmarks on the Xeon platform by using gcc/icc compilers with workload Class C.

| Bench. | Metrics    | GNU G   | СС    | Intel ICC |       |  |
|--------|------------|---------|-------|-----------|-------|--|
| Bench. | ivietrics  | NPB-CPP | NPB   | NPB-CPP   | NPB   |  |
|        | N Threads  | 48      | 48    | 48        | 48    |  |
| EP     | Time (sec) | 13.60   | 13.75 | 10.40     | 10.35 |  |
|        | Std. Dev.  | 0.01    | 0.05  | 0.03      | 0.05  |  |
|        | N Threads  | 16      | 21    | 41        | 22    |  |
| MG     | Time (sec) | 7.94    | 7.28  | 8.352     | 8.02  |  |
|        | Std. Dev.  | 0.12    | 0.24  | 0.17      | 0.24  |  |
|        | N Threads  | 48      | 48    | 47        | 24    |  |
| CG     | Time (sec) | 19.27   | 19.18 | 19.32     | 18.29 |  |
|        | Std. Dev.  | 0.18    | 0.34  | 0.67      | 0.91  |  |
|        | N Threads  | 48      | 48    | 48        | 47    |  |
| FT     | Time (sec) | 20.67   | 20.54 | 22.11     | 23.71 |  |
|        | Std. Dev.  | 0.08    | 0.23  | 0.99      | 0.40  |  |
|        | N Threads  | 48      | 48    | 47        | 47    |  |
| IS     | Time (sec) | 1.034   | 0.956 | 1.02      | 1.08  |  |
|        | Std. Dev.  | 0.01    | 0.01  | 0.01      | 0.01  |  |
|        | N Threads  | 47      | 48    | 24        | 23    |  |
| ВТ     | Time (sec) | 71.74   | 68.61 | 71.96     | 66.16 |  |
|        | Std. Dev.  | 0.59    | 0.23  | 0.89      | 0.53  |  |
|        | N Threads  | 14      | 14    | 14        | 14    |  |
| SP     | Time (sec) | 90.75   | 91.20 | 90.05     | 88.86 |  |
|        | Std. Dev.  | 1.49    | 0.69  | 0.58      | 0.74  |  |
|        | N Threads  | 47      | 47    | 45        | 45    |  |
| LU     | Time (sec) | 45.72   | 44.18 | 45.72     | 44.88 |  |
|        | Std. Dev.  | 0.29    | 0.38  | 0.22      | 0.21  |  |

where threads are always active while FF disables and enables them each time a new parallel *Map* is executed.

The CG kernel requires a large number of synchronizations, primarily because it uses many *MapReduce* for sharing partial results (see Table 2). This periodically interrupts the computation and decreases the maximum performance achievable. CG has irregular data accesses, which benefit of bigger last-level caches shared between more cores. This explains why the results are better on the Xeon and Power8 platforms. The OMP version shows the highest speedups, mainly because we were able to add nowait directives to remove implicit barriers, as depicted in Table 2.

In the FT kernel, data communication phases have a significant impact on performance. Each FFT resolution copies a slice of data from the main 3D-matrix to a local 1D-array, solves it, and then copies it back into the main matrix. In the Xeon platform, results are similar for all versions because the memory bottleneck hides other sources of overhead. However, in the Epyc and Power8 platforms, although speedups are similar, the obtained sequential execution time in the Epyc machine is an order of magnitude faster than the one obtained in the Power8. This explains the better scaling in the Power8, specifically for the TBB version. It benefits from its dynamic work-stealing scheduling policy, which achieves a good workload balancing among the threads. However, since

the computation in FT is fine-grained, the extra overhead of such dynamic scheduling does not payback on the Epyc platform. As IS kernel is also memory-bound, it explains the higher speedups in Power8. The IS sequential code runs faster in Epyc than in Power8. This is the reason why FF and TBB obtain the best speedup earlier (with 24 and 28 threads, respectively) as the execution time is short (close to 1 second) and the runtime overhead has a higher impact. In the Power8, FF's thread pinning to cores mechanism showed poor performance for this benchmark.

As already discussed, BT and SP use analogous PDE solvers. Despite both are bound by the PDE solver, taking into account that they execute a different factorization, the bottleneck occurs in different situations. This explains why BT has a higher speedup than SP and why they stop scaling when the sequential bottleneck of the solver is reached. For SP, we parallelized it by using the same strategy for all parallel programming abstractions, implementing only Maps. Therefore, the maximum speedup is very close in all implementations. For BT, this is evident in the Epyc platform while not in Xeon and Power8. Concerning the runtime systems of the parallel programming abstractions, the static iteration scheduling employed by the OMP and FF versions provide better speedup with respect to the dynamic scheduling used by the TBB version. Specifically, the execution time is reduced significantly if the iteration space is equally divided among all threads. In the Power8 platform, this happens with 160 threads.

The LU pseudo-application has been parallelized by using an implicit multidimensional pipeline implemented with *Maps*. It uses the static scheduling policy for the iteration space and proper lock mechanisms for synchronizations. In all platforms, the maximum speedup is low because the benchmark is limited by the sequential solution of the linear system of equations. For FF and TBB, the execution flow management is done in the thread scope while for OMP it is done inside the loop iteration scope.

In summary, we conclude that other parallel programming abstractions can be effectively used to parallelize these benchmarks. Although the focus was not to optimize at maximum the parallelism exploitation to find or select the best parallel programming abstraction, the results revealed interesting insights. We can observe that the performance of each parallel programming abstraction depends on the shared-memory architecture design and benchmark characteristics. OpenMP is considered the *de-facto* standard framework for these environments, however, it does not always achieve the best speedups. These insights open space for future investigations regarding parallelism optimizations.

#### 4.4. Comparison with our prior work

In this section, we provide a performance comparison between NPB-CPP and our prior work [18] which, as already explained in Section 1, was based on the five kernels only and on an outdated version of NPB.

In terms of parallel implementation, the IS and FT kernels exhibit the most significant differences between the new

**Table 5**Experimental results for NPB-CPP showing the best execution time (in seconds) and speedup using Class C for Xeon, Epyc, and Power8. The speedup is calculated using the ratio between NPB-CPP's sequential version (no parallel abstraction) and the best execution time for each tool. Colored cells highlight the winner speedup.

| D I I     | NA - I - I |       | Xeon  |       |        | Ерус   |        |        | Power8 |        |
|-----------|------------|-------|-------|-------|--------|--------|--------|--------|--------|--------|
| Benchmark | Metrics    | TBB   | FF    | OMP   | TBB    | FF     | OMP    | TBB    | FF     | OMP    |
|           | N Threads  | 48    | 48    | 48    | 128    | 128    | 124    | 160    | 160    | 160    |
| EP        | Time (sec) | 13.52 | 15.04 | 13.60 | 4.92   | 5.02   | 5.36   | 8.58   | 10.27  | 8.89   |
| EP        | Speedup    | 41.20 | 37.02 | 40.94 | 128.77 | 126.35 | 118.38 | 124.03 | 103.60 | 119.74 |
|           | Std. Dev.  | 0.01  | 0.11  | 0.01  | 0.01   | 0.01   | 0.11   | 0.06   | 0.26   | 0.29   |
|           | N Threads  | 19    | 16    | 16    | 8      | 9      | 16     | 64     | 156    | 64     |
| MG        | Time (sec) | 10.18 | 7.69  | 7.94  | 14.76  | 15.28  | 13.64  | 18.28  | 20.73  | 15.83  |
| IVIG      | Speedup    | 5.15  | 6.82  | 6.60  | 2.25   | 2.17   | 2.43   | 16.51  | 14.55  | 19.05  |
|           | Std. Dev.  | 0.03  | 0.18  | 0.12  | 0.57   | 1.36   | 1.10   | 0.14   | 0.14   | 0.02   |
|           | N Threads  | 48    | 48    | 48    | 56     | 6      | 24     | 152    | 68     | 120    |
| CG        | Time (sec) | 20.11 | 20.52 | 19.27 | 37.03  | 38.88  | 35.62  | 34.80  | 35.19  | 30.47  |
| CG        | Speedup    | 16.99 | 16.65 | 17.73 | 3.09   | 2.94   | 3.21   | 28.81  | 28.49  | 32.90  |
|           | Std. Dev.  | 0.08  | 0.52  | 0.18  | 0.52   | 1.31   | 4.66   | 0.31   | 0.36   | 0.32   |
|           | N Threads  | 48    | 48    | 48    | 56     | 124    | 128    | 160    | 152    | 140    |
| FT        | Time (sec) | 20.62 | 20.06 | 20.67 | 12.00  | 11.42  | 11.00  | 20.60  | 23.51  | 25.10  |
|           | Speedup    | 21.32 | 21.92 | 21.28 | 64.44  | 67.71  | 70.23  | 72.70  | 63.73  | 59.72  |
|           | Std. Dev.  | 0.05  | 0.11  | 0.08  | 0.08   | 0.19   | 0.16   | 0.10   | 0.25   | 0.31   |
|           | N Threads  | 48    | 47    | 48    | 28     | 24     | 124    | 140    | 160    | 148    |
| IS        | Time (sec) | 1.21  | 1.06  | 1.03  | 1.32   | 1.16   | 0.87   | 1.09   | 1.27   | 0.98   |
| .5        | Speedup    | 14.95 | 17.06 | 17.56 | 13.97  | 15.94  | 21.16  | 43.81  | 37.6   | 48.53  |
|           | Std. Dev.  | 0.02  | 0.03  | 0.01  | 0.03   | 0.08   | 0.02   | 0.02   | 0.03   | 0.04   |
|           | N Threads  | 41    | 46    | 47    | 52     | 104    | 100    | 160    | 160    | 160    |
| ВТ        | Time (sec) | 86.31 | 68.15 | 71.74 | 60.70  | 60.34  | 59.63  | 210.73 | 105.69 | 86.60  |
| 5.        | Speedup    | 13.67 | 17.32 | 16.45 | 12.84  | 12.92  | 13.07  | 29.05  | 57.93  | 70.70  |
|           | Std. Dev.  | 0.36  | 0.31  | 0.59  | 0.25   | 1.53   | 1.11   | 0.75   | 2.02   | 0.78   |
|           | N Threads  | 13    | 14    | 14    | 8      | 7      | 12     | 40     | 32     | 32     |
| SP        | Time (sec) | 97.69 | 93.35 | 90.75 | 107.81 | 113.49 | 112.21 | 166.52 | 172.85 | 171.95 |
| ]         | Speedup    | 7.82  | 8.18  | 8.41  | 4.14   | 3.94   | 3.98   | 16.78  | 16.16  | 16.25  |
|           | Std. Dev.  | 1.22  | 2.65  | 1.49  | 1.92   | 7.48   | 13.41  | 0.31   | 1.30   | 3.56   |
|           | N Threads  | 41    | 46    | 47    | 28     | 24     | 32     | 80     | 80     | 80     |
| LU        | Time (sec) | 47.96 | 43.64 | 45.72 | 47.21  | 47.89  | 47.21  | 85.89  | 88.24  | 82.92  |
| LU        | Speedup    | 17.42 | 19.14 | 18.27 | 11.45  | 11.29  | 11.45  | 34.08  | 33.17  | 35.30  |
|           | Std. Dev.  | 0.23  | 0.15  | 0.29  | 0.58   | 0.80   | 2.60   | 0.35   | 1.75   | 0.97   |

code in NPB-CPP and the old one. In IS, we implement a new parallel strategy for FF and TBB designed from scratch. This strategy uses a single Map for parallelizing the complete compute-intensive region of the sorting routine using buckets. In contrast, the previous one used four Maps and serialized the synchronization between buckets, generating extra overhead. Furthermore, the previous implementation was not implemented in TBB due to issues related to getting the thread identifiers, which was not possible with task-based parallelism for TBB. In the FT kernel, we modified some routines such as the evolve and checksum, and included others like the initialization for warming up all data before execution. We identified an extra loop suitable to be parallelized using a Map. Additionally, in the CG kernel we identified four extra Maps that provide performance benefits with large problem sizes (starting from class C).

Table 6 shows the performance improvements between our two versions. Since the code in prior work does not work on large problem sizes starting from class C, we restricted the comparison with experiments using class B. Considering this size is relatively small (see [28]), we do not expect large performance differences. The table presents the best execution times for the five kernels on the Xeon platform with the GNU compiler. For the EP kernel, the implementation in NPB-CPP is faster than the prior one in all the considered parallel programming abstractions, more distinct in FF. The main reason is that our previous implementation was affected by false-sharing problems, which we have fixed in the NPB-CPP implementation using proper padding of our data structures and arrays (configured to automatically work in any multicore platform by reading the cache line sizes from the operating system). Also in the MG kernel, the new version is faster in all the tools. The main reason is that we standardized the implementation of linear arrays to all kernels, which work better in this case for MG's fine-grained workload and irregular memory accesses. In the old implementation we used multidimensional arrays.

In CG, FF and TBB had different performance result be-

**Table 6**Comparison with our prior work: best execution times on Xeon with Class B using the GNU compiler.

| Bench. | Metrics    | Cui  | rrent V | /ork | Previous Work [18] |      |      |  |
|--------|------------|------|---------|------|--------------------|------|------|--|
|        |            | FF   | OMP     | TBB  | FF                 | OMP  | TBB  |  |
|        | Nº Threads | 47   | 48      | 48   | 48                 | 48   | 48   |  |
| EP     | Time (sec) | 3.78 | 3.42    | 3.39 | 5.07               | 3.45 | 3.41 |  |
|        | Std. Dev.  | 0.05 | 0.01    | 0.00 | 0.03               | 0.00 | 0.01 |  |
|        | Nº Threads | 16   | 14      | 10   | 21                 | 16   | 9    |  |
| MG     | Time (sec) | 0.89 | 0.87    | 1.34 | 1.01               | 0.90 | 1.40 |  |
|        | Std. Dev.  | 0.04 | 0.01    | 0.02 | 0.02               | 0.00 | 0.06 |  |
|        | Nº Threads | 23   | 48      | 35   | 48                 | 48   | 47   |  |
| CG     | Time (sec) | 9.16 | 6.86    | 7.84 | 7.94               | 6.76 | 9.04 |  |
|        | Std. Dev.  | 0.15 | 0.01    | 0.04 | 0.03               | 0.14 | 0.07 |  |
|        | Nº Threads | 48   | 48      | 45   | 48                 | 47   | 48   |  |
| FT     | Time (sec) | 4.71 | 4.80    | 5.21 | 4.99               | 5.43 | 5.36 |  |
|        | Std. Dev.  | 0.02 | 0.11    | 0.04 | 0.06               | 0.11 | 0.06 |  |
|        | Nº Threads | 48   | 48      | 48   | 48                 | 48   | -    |  |
| IS     | Time (sec) | 0.25 | 0.21    | 0.25 | 0.26               | 0.21 | -    |  |
|        | Std. Dev.  | 0.00 | 0.00    | 0.00 | 0.00               | 0.00 | -    |  |

cause we now parallelized more loops than before. In FF, we used a different scheduler than the one used in the other tools, because it works better with its thread pinning mechanism in CG. However, it payoffs with bigger workloads as shown in Table 5, where FF is similar to the others. TBB also shows a higher overhead than OMP in CG. However, TBB overhead is much more evident in FT since it is the only tool configured to use a dynamic scheduler (work-stealing). Additionally, in FT we modified considerably the code, and consequently, all the new versions are faster. The IS execution time is very small, close to 250 ms, and the overhead of the parallel runtime becomes more evident. The difference between the FF versions is close to 5%, which depends on the new parallel strategy that paybacks even more with larger workload sizes. Finally, this new version allows the Intel TBB parallelization of IS, not available before.

# 5. Conclusions

This paper provided the NPB-CPP benchmark suite with parallel implementations following a structured parallel programming approach and using OpenMP, Intel TBB, and Fast-Flow for shared-memory architectures. We presented a comprehensive discussion about the benchmarks, the implementation, and also our parallel design choices for justifying all the results. For evaluating the C++ parallel programming abstractions, we first obtained a C++ version of the NPB, named NPB-CPP. Moreover, we characterized the basic structure of the benchmarks and their distinguished importance in the research, which relies on the fact that NPB contains many irregular workloads stressing different components of the underlying multicore architecture.

We evaluated NPB-CPP by performing a set of experiments using popular platforms nowadays (Xeon, Epyc, and

Power8) with two different compilers (GNU GCC and Intel ICC). The performance evaluation was guided using statistical analysis. The results demonstrated that there are minor performance differences among the sequential versions. OpenMP versions (NPB vs NPB-CPP) have shown a reliable outcome, presenting a similar pattern of behavior. Additionally, with 95% of confidence, the statistical analysis evaluating the parallel porting revealed that, at least, 58.8% of the samples stand for NPB-CPP equal to NPB benchmarks.

Finally, NPB-CPP's extensibility and its importance to the scientific community were shown in the experiments of the parallel versions using FastFlow and TBB. Other related abstractions can be evaluated, including those that were designed for distributed and heterogeneous parallel architectures. The experiments, using Xeon, Epyc, and Power8 revealed that the performance of the kernels and pseudo-applications varies among the platforms, benchmarks, and parallel programming abstractions. The main performance bottleneck in the benchmarks was due to the caches and memory bandwidth among the platforms, resulting in different behaviors between the parallel programming abstractions. FastFlow obtained better performance for Xeon when comparing among other platforms. Intel TBB obtained better performance in a few benchmarks and platforms mainly due to its optimized scheduling for balancing the workload among the threads. Although OpenMP is considered the state-of-the-art for multicores, this work revealed many situations in which TBB and FastFlow are more optimized and can perform better.

Researchers can use NPB-CPP to improve parallel programming abstractions based on the insights already revealed in this research. Possible future works are: 1) provide NPB-CPP with other parallel programming abstractions for shared, heterogeneous, and distributed memory architectures; 2) perform experiments on other related multicore architectures such as ARM processors; and 3) assess compilers and other code optimizations techniques.

# Acknowledgments

This research is partially funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, FAPERGS 01/2017-ARD project PARAE-LASTIC (Nº 17/2551-0000871-5), FAPERGS 05/2019-PQG project PARAS (Nº 19/2551-0001895-9), and Universal MC-TIC/CNPq Nº 28/2018 project SPARCLOUD (No. 437693/2018-0).

# CRediT authorship contribution statement

Júnior Löff: Software, Investigation, Validation, Visualization, Writing - Original Draft. Dalvan Griebler: Conceptualization, Formal analysis, Project administration, Validation, Writing - Review & Editing. Gabriele Mencagli: Conceptualization, Writing - Review & Editing. Gabriell Alves de Araujo: Software, Validation, Writing - Original Draft. Massimo Torquati: Conceptualization, Validation, Resources, Writing - Review & Editing. Marco Danelutto:

Resources, Writing - Review & Editing. Luiz Gustavo Fernandes: Supervision, Funding acquisition.

# **Declaration of competing interest**

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

#### References

- Aldinucci, M., Danelutto, M., Kilpatrick, P., Torquati, M., 2017. Fast-flow: High-Level and Efficient Streaming on Multicore. John Wiley & Sons, Ltd. chapter 13. pp. 261–280. doi:10.1002/9781119332015.ch13.
- [2] Analytics, C., 2014. Infrastructure Matters: POWER8 vs. Xeon x86. Technical Report. Clabby Analytics.
- [3] Anbar, A., Badawy, A.H.A., Serres, O., El-Ghazawi, T., 2014. Where should the threads go? leveraging hierarchical data locality to solve the thread affinity dilemma, in: Proceedings of the International Conference on Parallel and Distributed Systems, pp. 384–391.
- [4] de Araujo, G.A., Griebler, D., Danelutto, M., Fernandes, L.G., 2020. Efficient nas parallel benchmark kernels with cuda, in: 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 9–16. doi:10.1109/PDP50117.2020.00009.
- [5] Bailey, D., Barszcz, E., Barton, J., Browning, D., Carter, R., Dagum, L., Fatoohi, R., Fineberg, S., Frederickson, P., Lasinski, T., Schreiber, R., Simon, H., Venkatakrishnan, V., Weeratunga, S., 1994. The NAS Parallel Benchmarks. Technical Report. NASA Ames Research Center. Moffett Field, CA - USA.
- [6] Bienia, C., Kumar, S., Singh, J.P., Li, K., 2008. The parsec benchmark suite: Characterization and architectural implications, in: Proceedings of the 17 International Conference on Parallel Architectures and Compilation Techniques, ACM. pp. 72–81.
- [7] Chapman, B., Jost, G., van der Pas, R., 2007. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). MIT Press, London, UK.
- [8] Compiler, O., 2017. Omni compiler project. URL: http://www.hpcs.cs.tsukuba.ac.jp/omni-compiler/.
- [9] Cong, G., Wen, H., Murata, H., Negishi, Y., 2012. Tool-assisted optimization of shared-memory accesses in upc applications, in: 2012 IEEE 14th International Conference on High Performance Computing and Communication 2012 IEEE 9th International Conference on Embedded Software and Systems, pp. 104–111.
- [10] Danelutto, M., Mencagli, G., Torquati, M., González-Vélez, H., Kil-patrick, P., 2020. Algorithmic skeletons and parallel design patterns in mainstream parallel programming. International Journal of Parallel Programming URL: https://doi.org/10.1007/s10766-020-00684-w, doi:10.1007/s10766-020-00684-w.
- [11] Do, Y., Kim, H., Oh, P., Park, D., Lee, J., 2019. Snu-npb 2019: Parallelizing and optimizing npb in opencl and cuda for modern gpus, in: 2019 IEEE International Symposium on Workload Characterization (IISWC), pp. 93–105.
- [12] Ejjaaouani, K., Aumage, O., Bigot, J., Mehrenberger, M., Murai, H., Nakao, M., Sato, M., 2019. INKS, a Programming Model to Decouple Performance from Algorithm in HPC Codes, in: Euro-Par 2018: Parallel Processing Workshops, Springer International Publishing, Cham. pp. 757–768.
- [13] El-Ghazawi, T., Cantonnet, F., 2002. Upc performance and potential: A npb experimental study, in: SC '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 17–17.
- [14] Ernstsson, A., Li, L., Kessler, C., 2018. Skepu 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems. Int. J. Parallel Program. 46, 62–80. doi:10.1007/s10766-017-0490-5.
- [15] Garcia, A.M., Serpa, M., Griebler, D., Schepke, C., Fernandes, L.G., Navaux, P., 2020. The impact of cpu frequency scaling on power consumption of computing infrastructures, in: Computational Science

- and Its Applications ICCSA 2020, Springer International Publishing, Cham. pp. 142–157.
- [16] González-Vélez, H., Leyton, M., 2010. A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Software: Practice and Experience 40, 1135–1160. doi:10.1002/spe. 1026.
- [17] Griebler, D., Danelutto, M., Torquati, M., Fernandes, L.G., 2017. SPar: A DSL for High-Level and Productive Stream Parallelism. Parallel Processing Letters 27, 20.
- [18] Griebler, D., Löff, J., Mencagli, G., Danelutto, M., Fernandes, L.G., 2018. Efficient NAS benchmark kernels with C++ parallel programming, in: 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 733–740.
- [19] Henning, J.L., 2006. Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit. News 34, 1–17. doi:10.1145/1186736. 1186737.
- [20] Jin, H.Q., Frumkin, M., Yan, J., 1999. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance. Technical Report. NASA Ames Research Center. Moffett Field, CA - USA.
- [21] Jin, H.Q., Hood, R., Mehrotra, P., 2009. A practical study of upc using the nas parallel benchmarks, in: Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, pp. 8:1–7.
- [22] Lima, J.V.F., Raïs, I., Lefèvre, L., Gautier, T., 2019. Performance and energy analysis of openmp runtime systems with dense linear algebra algorithms. The International Journal of High Performance Computing Applications 33, 431–443. doi:10.1177/1094342018792079.
- [23] M. Garcia, A., Schepke, C., Girardi, A., 2020. Pampar: A new parallel benchmark for performance and energy consumption evaluation. Concurrency and Computation: Practice and Experience 32, e5504. doi:https://doi.org/10.1002/cpe.5504, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.5504. e5504 cpe.5504.
- [24] Maliszewski, A.M., Griebler, D., Schepke, C., Ditter, A., Fey, D., Fernandes, L.G., 2018. The nas benchmark kernels for single and multi-tenant cloud instances with lxc/kvm, in: 2018 International Conference on High Performance Computing Simulation (HPCS), pp. 359–366. doi:10.1109/HPCS.2018.00066.
- [25] McCool, M., Robison, A., Reinders, J., 2012. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science.
- [26] Mencagli, G., Torquati, M., Griebler, D., Danelutto, M., Fernandes, L.G., 2019. Raising the parallel abstraction level for streaming analytics applications. IEEE Access 7, 131944–131961. doi:10.1109/ACCESS. 2019.2941183.
- [27] Mietzsch, N., Fuerlinger, K., 2019. Investigating performance and potential of the parallel stl using nas parallel benchmark kernels, in: 2019 International Conference on High Performance Computing Simulation (HPCS), pp. 136–144. doi:10.1109/HPCS48598.2019.9188147.
- [28] NASA, 2020. The nas parallel benchmarks. URL: https://www.nas.nasa.gov/publications/npb.html.
- [29] OpenMP Architecture Review Board, 2018. OpenMP application program interface version 5.0. URL: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
- [30] Popov, M., Akel, C., Conti, F., Jalby, W., de Oliveira Castro, P., 2015. Pcere: Fine-grained parallel benchmark decomposition for scalability prediction, in: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 1151–1160.
- [31] Prema, S., Nasre, R., Jehadeesan, R., Panigrahi, B., 2019. A study on popular auto-parallelization frameworks. Concurrency and Computation: Practice and Experience 31.
- [32] del Rio Astorga, D., Dolz, M.F., Fernández, J., García, J.D., 2017. A generic parallel pattern interface for stream and data processing. Concurrency and Computation: Practice and Experience 29, e4175. doi:https://doi.org/10.1002/cpe.4175.
- [33] Seo, S., Jo, G., Lee, J., 2011. Performance characterization of the nas parallel benchmarks in opencl, in: 2011 IEEE International Symposium on Workload Characterization (IISWC), pp. 137–148. doi:10.1109/ITSWC.2011.6114174.
- [34] Serres, O., Kayi, A., Anbar, A., El-Ghazawi, T., 2015. Enabling pgas

productivity with hardware support for shared address mapping: A upc case study. ACM Transactions on Architecture and Code Optimization 12

- [35] Taeger, D., Kuhnt, S., 2014. Statistical Hypothesis Testing with SAS and R. 1st ed., Wiley Publishing.
- [36] Torquati, M., 2019. Harnessing Parallelism in Multi/Many-Cores with Streams and Parallel Patterns. Ph.D. thesis. University of Pisa.
- [37] Vießmann, H.N., Šinkarovs, A., Scholz, S.B., 2018. Extended memory reuse an optimisation for reducing memory allocations, in: ACM International Conference Proceeding, pp. 107–118.
- [38] Voss, M., Asenjo, R., Reinders, J., 2019. Pro TBB: C++ Parallel Programming with Threading Building Blocks. 1st ed., Apress, USA.
- [39] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A., 1995. The SPLASH-2 programs: Characterization and methodological considerations. SIGARCH Comput. Archit. News 23, 24–36.
- [40] Yelick, K., 2019. Berkeley upc unified parallel c. URL: https://upc.lbl.gov/.



Júnior Löff is a M.Sc student in Computer Science at the Pontifical Catholic University of Rio Grande do Sul (PUCRS), and research member of the Parallel Applications Modeling Group (GMAP) at PUCRS. He received his B.Sc Degree in Computer Engineering from PUCRS in 2020. His research interests include: Parallel and distributed systems, high-performance applications modeling and hardware/software co-design.



Dalvan Griebler is an Associate Professor at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). Also Associate Professor at Três de Maio Faculty (Setrem) and head of the Laboratory of Advanced Research on Cloud Computing (LARCC) at Setrem. He received the Ph.D. in computer science from both PUCRS and University of Pisa in 2016. His main research interests are: parallel and distributed computing, methodologies, languages and libraries for high-level parallel programming; benchmarking; and cloud computing.



Gabriele Mencagli is an Assistant Professor at the Computer Science Department of the University of Pisa, Italy. He got his Ph.D from the same University in 2012. He is co-author of more than 60 peer-reviewed papers appeared in international conferences, workshops and journals, and of one book. His research interests are in the area of Parallel and Distributed Systems, Autonomic Computing and Data Stream Processing. He is a member of the Editorial Board of Future Generation Computer Systems (Elsevier) and Cluster Computing (Springer).



Gabriell Alves de Araujo is a M.Sc student in Computer Science at the Pontifical Catholic University of Rio Grande do Sul (PUCRS), and research member of the Parallel Applications Modeling Group (GMAP) at PUCRS. He received his B.Sc Degree in Computer Science, Music, and Entrepreneurship and Innovation from the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and IPA Methodist University Center. His research interests are: Parallel algorithms and programming techniques for multicore and many-core architectures.



Massimo Torquati is an Assistant Professor at the Department of Computer Science, University of Pisa, Italy. He has published more than 100 peerreviewed papers in conference proceedings and international journals, mostly in the fields of parallel and distributed programming and run-time systems for parallel computing platforms. He has been involved in several Italian, EU, and industry-supported research projects. He is the maintainer and the main developer of the FastFlow parallel programming framework.



Marco Danelutto is a Full Professor at the Computer Science Department of the University of Pisa, Italy. His main research interests are in the field of parallel programming models, in particular in the area of parallel design patterns and algorithmic skeletons. He is author of more than 150 papers appearing in refereed international journals and conferences. He is the group leader of the Parallel Programming Model group at the same university, involved in a number of national and EU funded projects (CoreGRID, GRIDcomp, Para-Phrase, REPARA, RePhrase).



Luiz Gustavo Fernandes is an Associate Professor of the graduate program in computer science (PPGCC) at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). His primary research interests are Parallel and Distributed Computing, High Performance Applications Modeling, Green Computing and Parallel Programming Interfaces. Dr. Fernandes received his Ph. D. in Computer Science from the Institut National Polytechnique de Grenoble (France) in 2002. He currently leads the Parallel Applications Modeling Group (GMAP) at PUCRS.