Homework 5

Premise. This homework is optional but warmly advised to check the understanding of the student lecture-by-lecture. It should be used to improve the student preparation during the course. All the answers must be properly and clearly explained.

Exercise 1. Consider a computation $\Sigma$ at the process level composed of four processes $P_1, P_2, P_3$ and $P_4$. $P_1$ unpacks a data structure of size $R \times M^2$ integers and produces a stream of matrices $A^1, A^2, \ldots, A^R$ each one of size $M^2$ integers ($M = 512$). $P_2$, identical to $P_1$, produces a stream of matrices $B^1, B^2, \ldots, B^R$. The length of the two streams is $R = 10^3$. For each pair $A^k, B^k$ (with $k = 1, \ldots, R$) $P_3$ produces an output array $C^k$ of size $M$ according to the following computation:

```plaintext
for i = 0 to M-1 do
    C^k[i] = 0;
for j = 0 to M-1 do
    C^k[i] = F(C^k[i], A^k[i,j], B^k[j,i]);
```

Arrays $\{C^k\}_{k=1}^R$ are transmitted to $P_4$ that copies them into a private matrix $Z[R][M]$. The processing time of the sequential function $F$ is equal to $T_F = 1.5 \times 10^3 \tau$. The asynchrony degree of the channels is equal to one. The computation is executed on a multi-CMP NUMA-SMP machine with $N = 256$ PEs:

- 32 CMPs each one composed of 8 PEs, an internal crossbar and 4 MINF units. Each PE is a D-RISC pipelined CPU with a private primary cache (instructions 32K + data 32K, with blocks of $\sigma = 8$ words) and a private secondary cache of 2M. 4-parallel WW units outside the chip are interconnected to the 4 MINFs. The service time per instruction is $\tau$;
- the local memory sub-system of each CMP is composed of 4 macro-modules connected to the 4-parallel WW units through a crossbar. Each macro-module is composed of 8 memory modules with $\tau_M = 100\tau$;
- the external NUMA network is a 2-ary $n$-fly Generalized Fat Tree;
- automatic invalidation-based cache coherence with home flushing (limited to the secondary cache);
- exclusive mapping, process run-time support with RDY-ACK solution and I/O communications.

Answer the following points by providing all the necessary details in your analysis:

- **a)** study system $\Sigma$ by supposing that $P_1, P_2, P_3, P_4$ are executed on different CMPs and the PE of $P_3$ is the home node of the process run-time support data structures. Suppose that matrices $\{A^k\}_{k=1}^R$ and $\{B^k\}_{k=1}^R$ are stored in memory in such a way that the number of cache faults is minimized. Determine the ideal service times, the effective ones and the relative efficiencies of the modules according to the base latency. Calculate the optimal parallelism degree of the bottleneck ($P_3$). By supposing $R_q/R_{q=0} \sim 1$, evaluate the completion time of the program;

- **b)** study a system $\Sigma^n$ in which $P_3$ has been parallelized according to a data-parallel pattern. Suppose that $P_1, P_2, P_4$ and the set of processes of the parallel $P_3$ are executed on distinct CMPs, and that the homing strategy of the run-time support data structures is left to the programmer. Study the performance of the system according to the base latency and provide the values of the parameters $p, T_p, T_S, R_{q=0}$. Which is the most stressed secondary cache?