Finished conclusion

Zhengyi Chen 2024-03-26 18:07:22 +00:00
parent 62e5879d5e
commit 2a48463944
3 changed files with 71 additions and 19 deletions


@ -49,7 +49,7 @@
\begin{document}
\begin{preliminary}
\title{Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\title{Analysis of Software-Initiated Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\author{Zhengyi Chen}
@ -547,7 +547,7 @@ static void recv_done(
}
\end{minted}
\chapter{Software Coherency Latency}
\chapter{Software Coherency Latency} \label{chapter:sw-coherency}
Coherency must be maintained at the software level when hardware cache coherency cannot be guaranteed by a specific ISA (as established in subsection \ref{subsec:armv8a-swcoherency}). There is, therefore, interest in knowing the latency of coherence-maintenance operations for performance engineering purposes, for example OS jitter analysis for scientific computing in heterogeneous clusters and, more pertinently, comparative analysis between software and hardware-backed DSM systems (e.g. \cites{Masouros_etal.Adrias.2023}{Wang_etal.Concordia.2021}). Such an analysis is crucial to making well-informed decisions when designing a cross-architectural DSM system over RDMA.
The purpose of this chapter is hence to provide a statistical analysis of software coherency latency in ARM64 systems by instrumenting hypothetical scenarios of software-initiated coherence maintenance on ARM64 test benches.
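A concrete illustration may help fix ideas. The sketch below is not taken from the instrumented kernel sources (the helper name \texttt{clean\_to\_poc} is an assumption for illustration); it shows what a software-initiated coherence operation looks like at the instruction level, cleaning a buffer to the \textit{Point of Coherency} one cache line at a time, in the same spirit as the arm64 kernel's clean-to-PoC routine.
\begin{minted}{c}
/*
 * Hedged sketch: clean [buf, buf + len) to the Point of Coherency (PoC)
 * by virtual address. CTR_EL0.DminLine encodes log2 of the number of
 * 4-byte words in the smallest D-cache line, so (4 << DminLine) is the
 * stride in bytes between successive DC CVAC operations.
 */
static inline void clean_to_poc(void *buf, unsigned long len)
{
	unsigned long ctr, stride, addr, end;

	asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
	stride = 4UL << ((ctr >> 16) & 0xf);

	addr = (unsigned long)buf & ~(stride - 1);   /* align down to a line */
	end  = (unsigned long)buf + len;
	for (; addr < end; addr += stride)
		asm volatile("dc cvac, %0" :: "r"(addr) : "memory");
	asm volatile("dsb sy" ::: "memory");         /* wait for completion */
}
\end{minted}
Because each \texttt{DC CVAC} acts on a single cache line, the number of iterations, and hence the latency studied in this chapter, necessarily scales with the size of the region being made coherent.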
@ -609,7 +609,7 @@ The primary source of experimental data come from a virtualized machine: a virtu
\hline
Processors & AMD Ryzen 7 4800HS (8 $\times$ 2-way SMT) \\
\hline
Freuqnecy & 2.9 GHz (4.2 GHz Turbo) \\
Frequency & 2.9 GHz (4.2 GHz Turbo) \\
\hline
NUMA Topology & 1: $\{P_0,\ \dots,\ P_{15}\}$ \\
\hline
@ -663,7 +663,7 @@ The primary source of experimental data come from a virtualized machine: a virtu
\label{table:rose}
\end{table}
Additional to virtualized testbench, I have had the honor to access \texttt{rose}, a ARMv8 server rack system hosted by the \textcolor{red}{[TODO] PLACEHOLDER} at the \textit{Informatics Forum}, through the invaluable assistance of my primary advisor, \textit{Amir Noohi}, for instrumentation of similar experimental setups on server-grade bare-metal systems.
In addition to the virtualized testbench, I have had the honor of accessing \texttt{rose}, an ARMv8 server rack system hosted by the \href{https://systems-nuts.com}{\textit{Systems Nuts Research Group}} at the \textit{Informatics Forum}, through the invaluable assistance of my primary advisor, \textit{Amir Noohi}, for instrumentation of similar experimental setups on server-grade bare-metal systems.
The specifications of \texttt{rose} are listed in table \ref{table:rose}.
@ -1194,25 +1194,68 @@ On the other hand, it may be argued that analyzing software coherency operation
On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature a PCIe interface as part of their I/O provisions, for example \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which have been selected for use in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU-DMA interoperation on such non-\textit{SystemReady SR} systems.
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA implements hardware-level guarantee of DMA cache coherence, if no other constraints exist, it may be possible for a ``loose'' emulation of the ARMv8-A ISA to define \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without any cross-correlation with \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-op instructions would likely not cause differing latency magnitude over variable-sized contiguous allocations.
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA provides a hardware-level guarantee of DMA cache coherence, it may be possible, absent other constraints, for a ``loose'' emulation of the ARMv8-A ISA to implement \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without cross-correlation against \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also strongly disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-ops would be unlikely to produce latencies that scale with the size of contiguous allocations.
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize \dots. We deduce this is due to one important variable across all experiments that we failed to control -- power supply to host machine.
% \paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
% Bad visualization work on 2, arguably more instructive to DSM design. THis is because ex.2 is an afterthought to ex.1 and is conducted without sufficient time for proper data analysis -- ftrace takes time to analyze and visualize, notably. Maybe add a ftraced max-min etc. table!
% Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
% Should experiment over a variety of hardware. rose is system-ready which supports HW coherency, so prob. not reflective of anything real. Maybe take a raspberry pi now that they have PCIe. Regardless, ARM with PCIe without system-readyness is growing, so may be of more significance in future?
% Note the difference in magnitudes in latency. This may be because of whether laptop is plugged or not. Admit your mistake and lament that you should really really really used a separate hardware with reliable energy source for these data. Note on the otherhand that the growth rate remains consistent whether plugged or not.
% Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize that latencies differ over similar experimental setups between the two subsections of \ref{sec:sw-coherency-results}. We strongly suspect that this is due to uncontrolled power supply to the host machine, allowing the \textit{System Management Unit} of the host system to downclock or otherwise alter the performance envelope of the host CPU. A similar correlation between power source and CPU performance has been observed across different \textit{Zen 2} chipset laptops \cite{Salter.AMD-Zen2-Boost-Delay.2020}. Though the reduced performance envelope would result in worse ARM64 emulation performance, the relative performances observed in figures \ref{fig:coherency-op-per-page-alloc} and \ref{fig:coherency-op-multi-page-alloc} should still hold, as this power management quirk should cause a performance reduction of similar order of magnitude across instructions (in terms of latency via reduced clock frequencies). Nevertheless, further studies should rely on a controlled power source to eliminate variance caused by system power management functionality.
\chapter{Conclusion}
\section{Summary}
This thesis hence accomplishes the following:
\begin{itemize}
\item {
It provides a timeline of developments in software distributed shared memory systems, from the early (but still inspiring) \textit{Munin} to contemporary developments enabled by RDMA hardware, such as \textit{MENPS}. Using this timeline, it introduces a novel approach to DSM systems that takes a heterogeneous-multiprocessing view of the traditional DSM problem, which serves as the rationale and context behind the primary contributions of this thesis.
}
\item {
It underscores the interaction between the two coherence ``domains''\footnotemark[5] relevant to a DSM system -- the larger domain (between different nodes in the DSM abstraction) depends on the correct behavior of the smaller domain (within each node, between the RDMA NIC and the CPU) for the DSM system as a whole to exhibit correct consistency behavior. From here, it focuses on cache coherence in ARMv8 ISA systems after establishing that x86-64 systems already define DMA as transparently cache coherent.
}
\item {
It describes the implementation of software-initiated ARMv8-A cache coherence operations inside the contemporary Linux kernel, on which the thesis (and its contextual project) focuses because it is open-source and popular across all computing contexts. Specifically, it pinpoints the exact procedures in the Linux kernel relevant to cache coherence maintenance due to DMA and explains their interoperation with the upstream DMA-capable hardware drivers.
}
\item {
It establishes a method to re-export architecture-specific assembly routines inside the Linux kernel as dynamically-traceable C symbols and constructs a kernel module wrapper to conduct a series of experiments that explore the relationship between software coherence routine latency and allocation size (a minimal sketch of such a wrapper follows this list). From this, it establishes that the latency of these routines grows with the size of the memory region to be made coherent, but with a non-linear growth rate.
}
\end{itemize}
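A minimal sketch of the kernel-module measurement wrapper summarized in the final item is given below. It is illustrative only: \texttt{dcache\_clean\_poc\_wrapper} is a hypothetical stand-in for the re-exported routine, and the fixed one-page buffer is a simplification of the experimental setup.
\begin{minted}{c}
/*
 * Hedged sketch of the measurement wrapper: a kernel module that times a
 * re-exported arm64 cache-maintenance routine with ktime. The symbol
 * dcache_clean_poc_wrapper() is hypothetical; the real re-export differs.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

extern void dcache_clean_poc_wrapper(unsigned long start, unsigned long end);

static int __init latency_probe_init(void)
{
	size_t size = 4096;                        /* one page; varied per run */
	void *buf = kmalloc(size, GFP_KERNEL);
	ktime_t t0, t1;

	if (!buf)
		return -ENOMEM;

	t0 = ktime_get();
	dcache_clean_poc_wrapper((unsigned long)buf, (unsigned long)buf + size);
	t1 = ktime_get();

	pr_info("clean-to-PoC over %zu bytes: %lld ns\n",
		size, ktime_to_ns(ktime_sub(t1, t0)));
	kfree(buf);
	return 0;
}

static void __exit latency_probe_exit(void) { }

module_init(latency_probe_init);
module_exit(latency_probe_exit);
MODULE_LICENSE("GPL");
\end{minted}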
\section{Future Work}
\footnotetext[5]{Not to be confused with ARM's definition of \textit{coherence domain}, though the two are theoretically similar. Here, a \textit{domain} refers to a level of abstraction where, given a set of nodes, each constituent node is \emph{internally} coherent as a whole, but nodes are not guaranteed to be coherent with one another. The term is reused for lack of a better descriptor.}
\section{Future \& Unfinished Work}
The main contribution of this thesis has shifted significantly since the beginning of semester 1\footnotemark[6]. During this thesis's incubation, the following directions were explored, which the author hopes may serve as pointers for future contributions to the in-kernel, RDMA-based DSM system:
\footnotetext[6]{Educationally speaking it's, well, educative, but it would be a lie to claim this did not hurt morale.}
\paragraph*{Cache/Page Replacement Policies wrt. DSM Systems} Much as this thesis proposed that \emph{two coherence domains exist for a DSM system -- inter-node and intra-node}, the cache replacement problem also exhibits a (theoretical) duality:
\begin{itemize}
\item {
The \textbf{intra-node} cache replacement problem -- i.e., the \emph{page replacement problem} inside the running OS kernel -- is made complex by kernel-level DSM, which makes a multitude of replacement targets available in parallel to the traditional page swap mechanism.
Consider, for example, that \texttt{kswapd} scans some page for replacement. Instead of the traditional swapping mechanism, which solely considers intra-node resources, we may establish swap files over RDMA-reachable resources such that, at placement time, we have the following options:
\begin{enumerate}
\item {
intra-node \texttt{zram}\footnotemark[7]
}
\item {
inter-node \texttt{zram} over RDMA
}
\item {
intra-node swapfile on-disk
}
\end{enumerate}
Consequently, even swapped-page placement becomes an optimization problem! (A sketch of such a selection policy, covering all four options, follows this list.) To the author's knowledge, the Linux kernel currently does not support dynamic selection of swap targets -- a static priority ordering is instead defined in \texttt{/etc/fstab}.
}
\item {
\textbf{Inter-node} cache replacement problem, which arises because we may as well bypass \texttt{kswapd} altogether when pages are already transferred over the \textit{HMM} mechanism. This leads to one additional placement option during page replacement:
\begin{enumerate}
\setcounter{enumi}{3}
\item inter-node page transfer via DSM-on-RDMA
\end{enumerate}
Because of the significant overhead incurred by the Linux swap mechanism, this option is likely the most lightweight choice for working-set optimization. Interoperation between this mechanism and the existing \texttt{kswapd}, however, is non-trivial.
}
\end{itemize}
\footnotetext[7]{A compressed ramdisk abstraction in Linux. See \url{https://docs.kernel.org/admin-guide/blockdev/zram.html}}
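To make the placement options above concrete, the following hypothetical sketch (not an existing kernel interface; every identifier here is an assumption) frames swapped-page placement as a cost-based selection among the four targets, where the cost and capacity fields stand in for quantities a real policy would have to measure.
\begin{minted}{c}
/* Hedged sketch: choosing a placement target for a page selected for
 * eviction, among the four options listed above. Purely illustrative. */
enum swap_target {
	LOCAL_ZRAM,        /* 1: intra-node compressed RAM               */
	REMOTE_ZRAM_RDMA,  /* 2: inter-node zram reached over RDMA       */
	LOCAL_SWAPFILE,    /* 3: intra-node on-disk swapfile             */
	REMOTE_DSM_PAGE,   /* 4: inter-node transfer via DSM-on-RDMA     */
	NR_TARGETS,
};

struct target_state {
	unsigned long est_latency_ns;  /* measured or modelled cost per page */
	unsigned long free_pages;      /* remaining capacity at this target  */
};

/* Pick the cheapest target that still has capacity; a real policy would
 * also weigh page hotness, compressibility and network congestion. */
static enum swap_target pick_target(const struct target_state s[NR_TARGETS])
{
	enum swap_target best = LOCAL_SWAPFILE;   /* always-available fallback */
	unsigned long best_cost = ~0UL;
	int t;

	for (t = 0; t < NR_TARGETS; t++) {
		if (s[t].free_pages == 0)
			continue;
		if (s[t].est_latency_ns < best_cost) {
			best_cost = s[t].est_latency_ns;
			best = (enum swap_target)t;
		}
	}
	return best;
}
\end{minted}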
\paragraph*{RDMA-Optimized Coherence Protocol} A coherence protocol design (and corresponding consistency model target) was drafted during this thesis's creation. However, designing a correct and efficient coherence protocol that takes advantage of RDMA's one-sided communication proved to be non-trivial. We identify, however, that \emph{single-writer protocols} model the device-CPU dichotomy of memory access well and ease the protocol design significantly, as is the case for \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017}.
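To illustrate why the single-writer discipline eases the design, the hypothetical sketch below outlines the minimal per-page metadata such a protocol would track; the names are illustrative assumptions and are not drawn from \textit{Hotpot} or from the drafted protocol.
\begin{minted}{c}
/*
 * Hedged sketch: per-page metadata for a single-writer protocol. At any
 * time exactly one node (or the DMA-capable device) may hold the page
 * writable; all other copies are read-only and must be invalidated
 * before ownership transfers, which keeps write serialization trivial.
 */
enum page_mode { PAGE_INVALID, PAGE_SHARED_RO, PAGE_OWNED_RW };

struct dsm_page_meta {
	unsigned int   owner_node;   /* the current single writer          */
	unsigned long  copyset;      /* bitmap of nodes holding RO copies  */
	enum page_mode local_mode;   /* this node's view of the page       */
};
\end{minted}
Acquiring write ownership then reduces to invalidating the copyset (potentially with one-sided RDMA operations) and updating \texttt{owner\_node}, with no multi-writer merge step required.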
% \bibliographystyle{plain}