Finished abstract, intro

Zhengyi Chen 2024-03-26 19:35:32 +00:00
parent 2a48463944
commit 37728272e8
2 changed files with 17 additions and 28 deletions

@@ -49,7 +49,7 @@
\begin{document}
\begin{preliminary}
\title{Analysis of Software-Maintained Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\author{Zhengyi Chen}
@@ -77,7 +77,7 @@
\date{\today}
\abstract{
Advancements in network interface hardware and operating system capabilities sometimes render historically unpopular computer system architectures feasible again. One such example is software DSM, which may regain relevance as a unique solution to sharing hardware acceleration resources beyond hypervisor-level allocation, made feasible by exploiting existing Linux kernel features such as \textit{heterogeneous memory management} alongside RDMA-capable network interfaces. Building such a DSM system between compute nodes of different ISAs, however, is even less trivial. We note in particular that, unlike x86, many RISC ISAs (e.g., ARMv8, RISC-V) do not guarantee cache coherence between the CPU and DMA engines at the hardware level. Instead, these ISAs define cache coherency maintenance operations at the instruction level, which in turn consume CPU time. To better inform DSM design for such systems, we measure the latency of these operations on an emulated, non-hardware-cache-coherent ARMv8 processor across a variety of scenarios, relying solely on the Linux kernel's source-level implementation and instrumentation mechanisms. We find that the latency of software-initiated cache coherency operations grows with the size of the address subspace being made coherent, though the relationship between contiguous allocation size and latency is non-linear.
}
\maketitle
@@ -123,25 +123,14 @@ from the Informatics Research Ethics committee.
\end{preliminary}
\chapter{Introduction}
\dots
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) to more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within one cluster node, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017} -- all while using the same \textit{libc} function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement managed automatically by the OS kernel. However, while HMM promises a distributed-shared-memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little effort has been made to incorporate HMM into other kinds of accelerators across various system topologies.
Orthogonally, allocating hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload on the cluster requests hardware acceleration from time to time, but never for very long. Many job-scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. This form of migration naturally incurs a large overhead -- for starters, accelerator nodes that strictly perform computation on in-memory data, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally. Moreover, must \emph{all} computation be performed near data? \textit{Adrias} \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), compared to node-local setups, has a negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.
This thesis hence builds upon an ongoing research effort in implementing an in-kernel DSM system on top of a tightly coupled cluster, where \textit{HMM} (\textit{Heterogeneous Memory Management}) abstractions allow for transparent RDMA access from accelerator nodes to local data and for migration of data near computation. More specifically, this thesis explores the latency incurred by the OS-initiated software cache coherency maintenance procedures common to all (R)DMA programs. The findings of this thesis are expected to inform the software coherence protocol and consistency model design of the in-kernel DSM system for accelerator sharing, using a simple, reusable testing framework.
\chapter{Background}\label{chapter:background}
In this chapter, we introduce the following aspects pertaining to the in-kernel DSM project:
\begin{itemize}
\item {
We identify and discuss notable developments in software-implemented DSM systems, highlighting the key features of contemporary DSM techniques that differentiate them from their predecessors.
@@ -1176,7 +1165,7 @@ We identify the following weaknesses within our experiment setup that undermine
\paragraph*{Do Instrumented Statistics Reflect Real Latency?} It remains debatable whether the method described in section \ref{sec:sw-coherency-method}, specifically exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
For one, we specifically opt not to disable IRQs when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which:
\begin{enumerate}
\item {
is (at least) called under process context.
@@ -1190,16 +1179,16 @@ Similar context is also observed for upstream function calls, for example \\ \te
On the other hand, it may be argued that analyzing software coherency operation latency at the hardware level better reveals the ``real'' latency incurred by coherency maintenance operations at runtime. Indeed, the latencies of the \texttt{clflush} family of instructions on x86 chipsets, measured in clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than the microsecond-grade function call latencies of any GHz-capable CPU. We argue that, because an in-kernel implementation of a DSM system would more likely call into the exposed driver API functions rather than emit individual instructions -- i.e., it would not write inline assembly that ``reinvents the wheel'' -- instrumenting these relatively low-level, synchronous procedure calls is more crucial than instrumenting individual instructions.
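To make this argument concrete, the sketch below shows the kind of kernel-module wrapper such instrumentation implies: it times a single driver-level call to \texttt{dcache\_clean\_poc} using \texttt{ktime}. It is a simplified illustration rather than a reproduction of our actual test harness, and it assumes a kernel patched, as in section \ref{sec:sw-coherency-method}, to export the routine to GPL modules (mainline Linux does not export it).
\begin{verbatim}
/*
 * Simplified latency probe (illustrative; assumes an arm64 kernel patched to
 * export dcache_clean_poc() to GPL modules -- mainline does not export it).
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/ktime.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>

static int order;                 /* allocate 2^order pages per run */
module_param(order, int, 0444);

static int __init poc_probe_init(void)
{
    size_t len = PAGE_SIZE << order;
    void *buf = kmalloc(len, GFP_KERNEL);
    u64 t0, t1;

    if (!buf)
        return -ENOMEM;

    memset(buf, 0xa5, len);       /* dirty the cache lines first */

    t0 = ktime_get_ns();
    dcache_clean_poc((unsigned long)buf, (unsigned long)buf + len);
    t1 = ktime_get_ns();

    pr_info("poc_probe: cleaned %zu bytes to PoC in %llu ns\n", len, t1 - t0);
    kfree(buf);
    return 0;
}

static void __exit poc_probe_exit(void) { }

module_init(poc_probe_init);
module_exit(poc_probe_exit);
MODULE_LICENSE("GPL");
\end{verbatim}
Loading such a module with different \texttt{order} values reproduces, in miniature, the variable-sized contiguous allocations used in the experiments.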
\paragraph*{Lack of Hardware Diversity} The majority of the data gathered throughout the experiments comes from a single, virtualized setup, which may not be reflective of the real latencies incurred by software coherency maintenance operations. While similar experiments have been conducted on bare-metal systems such as \texttt{rose}, we note that \texttt{rose}'s \textit{Ampere Altra} is certified \textit{SystemReady SR} by ARM \cite{ARM.SystemReady_SR.2024} and hence supports hardware-coherent DMA access (by virtue of the \textit{ARM Server Base System Architecture}, which stipulates hardware-coherent memory access as implemented via the MMU) \cite{ARM.SBSAv7.1.2022}; it, too, may therefore not reflect any real latencies incurred by coherency maintenance.
On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature PCIe interfaces as part of their I/O provisions -- for example, \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which have been selected for use in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU-DMA interoperation on such non-\textit{SystemReady SR} systems.
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA provides a hardware-level guarantee of DMA cache coherence, it may be possible, absent other constraints, for a ``loose'' emulation of the ARMv8-A ISA to implement \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without cross-referencing \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also strongly disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-ops would be unlikely to produce latencies of differing magnitude across variable-sized contiguous allocations.
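For context, a clean-to-\textit{PoC} over an address range lowers to a loop of \texttt{dc cvac} (clean by virtual address to Point of Coherency) operations followed by a \texttt{dsb}. The illustrative sketch below -- not the kernel's actual assembly routine, and assuming a fixed 64-byte cache line rather than reading \texttt{CTR\_EL0} -- shows the per-line work that such a no-op emulation would have to elide:
\begin{verbatim}
/*
 * Illustrative only: roughly what a clean-to-PoC over [start, end) expands
 * to on ARMv8-A. Assumes 64-byte D-cache lines instead of querying CTR_EL0.
 */
static inline void clean_to_poc(unsigned long start, unsigned long end)
{
    const unsigned long line = 64;
    unsigned long addr;

    for (addr = start & ~(line - 1); addr < end; addr += line)
        asm volatile("dc cvac, %0" : : "r" (addr) : "memory");

    asm volatile("dsb sy" : : : "memory"); /* complete before DMA observes memory */
}
\end{verbatim}
The per-line loop is also consistent with the observation that latency grows with the size of the range being made coherent.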
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize that latencies differ across similar experimental setups between the two subsections of section \ref{sec:sw-coherency-results}. We strongly suspect that this is due to an uncontrolled power supply to the host machine, allowing the \textit{System Management Unit} of the host system to downclock the host CPU or otherwise alter its performance envelope. A similar correlation between power source and CPU performance has been observed across different \textit{Zen 2} laptops \cite{Salter.AMD-Zen2-Boost-Delay.2020}. Though the reduced performance envelope would result in worse ARM64 emulation performance, the relative performance observed in figures \ref{fig:coherency-op-per-page-alloc} and \ref{fig:coherency-op-multi-page-alloc} should still hold, as this power management quirk should degrade performance by a similar order of magnitude across instructions (in terms of latency, via reduced clock frequencies). Nevertheless, further studies should rely on a controlled power source to eliminate variance caused by system power management functionality.
\chapter{Conclusion}
This thesis hence accomplishes the following:
\begin{itemize}
\item {
It provides a timeline of the development of software distributed shared memory systems, from the early (but still inspiring) \textit{Munin} to contemporary designs enabled by RDMA hardware, such as \textit{MENPS}. Using this timeline, it introduces a novel approach to DSM systems that takes a heterogeneous-multiprocessing view of the traditional DSM problem, which serves as the rationale and context for the primary contributions of this thesis.
@@ -1208,7 +1197,7 @@ This thesis hence accomplishes the following:
It underscores the interaction between the two coherence ``domains''\footnotemark[5] relevant to a DSM system -- the larger domain (between different nodes in the DSM abstraction) depends on the correct behavior of the smaller domain (within each node, between the RDMA NIC and the CPU) for the DSM system as a whole to exhibit correct consistency behavior. From here, it focuses on cache coherence in ARMv8 systems, after establishing that x86-64 systems already define DMA as transparently cache coherent.
}
\item {
It describes the implementation of software-initiated ARMv8-A cache coherence operations inside the contemporary Linux kernel, which the thesis (and its contextual project) focuses on because it is open-source and popular across all computing contexts. Specifically, it pinpoints the exact procedures relevant to DMA-related cache coherence maintenance in the Linux kernel and explains their interoperation with upstream DMA-capable hardware drivers.
}
\item {
It establishes a method to re-export architecture-specific assembly routines inside the Linux kernel as dynamically traceable C symbols, and constructs a kernel-module wrapper to conduct a series of experiments that explore the relationship between software coherence routine latency and allocation size. From this, it shows that routine latency grows with the size of the memory subspace to be made coherent, but at a non-linear rate.
@@ -1218,16 +1207,16 @@ This thesis hence accomplishes the following:
\footnotetext[5]{Not to be confused with ARM's definition of a \textit{coherence domain} -- though the concepts are theoretically similar. Here, a \textit{domain} refers to a level of abstraction where, given a set of nodes, each constituent node is \emph{internally} coherent as a whole but not guaranteed to be coherent with the others. We reuse the term for lack of a better descriptor.}
\section{Future \& Unfinished Work}
The main contribution of this thesis has shifted significantly since the beginning of semester 1\footnotemark[6]. During this thesis's incubation, the following directions were explored, which the author hopes may serve as pointers for future contributions to the in-kernel, RDMA-based DSM system:
\footnotetext[6]{Educationally speaking it's, well, educative, but it would be a lie to claim this did not hurt morale.}
\paragraph*{Cache/Page Replacement Policies wrt. DSM Systems} Much like how this thesis proposed that \emph{two coherence domains exist for a DSM system -- inter-node and intra-node}, the cache replacement problem also exhibits a (theoretical) duality:
\begin{itemize}
\item {
\textbf{Intra-node} cache replacement -- i.e., the \emph{page replacement problem} inside the running OS kernel -- is made more complex by the existence of kernel-level DSM, which makes a multitude of replacement targets available in parallel to the traditional page swap mechanism.
Consider, for example, that \texttt{kswapd} scans some page for replacement. Instead of the traditional swapping mechanism, which solely considers intra-node resources, we may instead establish swap files over RDMA-reachable resources such that, at placement time, we have the following options:
\begin{enumerate}
\item {
intra-node \texttt{zram}\footnotemark[7]
@@ -1240,7 +1229,7 @@ The main contribution of this thesis had swayed significantly since the beginnin
}
\end{enumerate}
Consequently, even swapped-page placement becomes an optimization problem! To the author's knowledge, the Linux kernel currently does not support dynamic selection of the swap target -- a static ordering is defined inside \texttt{/etc/fstab} instead.
}
\item {
\textbf{Inter-node} cache replacement problem, which arises because we may as well bypass \texttt{kswapd} altogether when pages are already transferred over the \textit{HMM} mechanism. This leads to one additional placement option during page replacement: