Added defs
parent 017006fe4e, commit 54b3b3064a
2 changed files with 32 additions and 2 deletions

Binary file not shown.
@@ -1,17 +1,21 @@
\documentclass{article}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}
\usepackage{graphicx}
\usepackage[justification=centering]{caption}
\usepackage{hyperref}
\usepackage{amsthm}

\addbibresource{background_draft.bib}
\theoremstyle{definition}
\newtheorem{definition}{Definition}

\begin{document}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement managed automatically by the OS kernel. However, while HMM promises a distributed shared memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Limited effort has been devoted to incorporating HMM into other kinds of accelerators in various system topologies.
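
To illustrate the programming model that HMM aims to enable, the sketch below passes a buffer obtained from an ordinary \texttt{malloc()} call directly to an accelerator, with the kernel migrating or mapping the backing pages on demand; the device node and \texttt{ioctl} interface here are hypothetical placeholders rather than an existing driver API.

\begin{verbatim}
/* Hypothetical sketch: under an HMM-style unified address space, memory
 * from plain malloc() is handed to a device without explicit copies;
 * page placement and ownership are managed by the OS kernel. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

struct accel_job { void *buf; size_t len; };       /* hypothetical ABI   */
#define ACCEL_RUN _IOW('A', 1, struct accel_job)   /* hypothetical ioctl */

int main(void)
{
    size_t len = 1 << 20;
    char *buf = malloc(len);               /* ordinary libc allocation   */
    memset(buf, 1, len);

    int fd = open("/dev/accel0", O_RDWR);  /* hypothetical device node   */
    struct accel_job job = { buf, len };
    ioctl(fd, ACCEL_RUN, &job);            /* device reads/writes buf    */

    return buf[0];                         /* CPU sees results directly  */
}
\end{verbatim}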
Orthogonally, allocating hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster system there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload performed on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. This style of migration naturally incurs large overhead -- for starters, accelerator nodes which strictly perform computation on in-memory data, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally. Moreover, must \emph{all} computation be performed near data? \textit{Adrias}~\cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, results in negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.

This thesis builds upon an ongoing research effort in implementing a tightly coupled cluster where HMM abstractions allow for transparent RDMA access from accelerator nodes to local data and for migration of data near computation, leveraging different consistency models and coherence protocols to amortize the communication cost of shared data. More specifically, this thesis explores the following:
@@ -566,8 +570,34 @@ We identify that home-based protocols are conceptually straightforward compared
The advent of high-speed RDMA-capable network interfaces introduces opportunities for designing more performant DSM systems over RDMA (as established in \ref{sec:msg-passing}). Orthogonally, RDMA-capable NICs fundamentally perform direct memory access against main memory to realize one-sided RDMA operations and reduce the effect of OS jitter on RDMA latencies. For modern computer systems with cached multiprocessors, this poses a potential cache coherence problem at the local level: RDMA operations happen concurrently with memory accesses by CPUs, which store copies of memory data in cache lines that may \cites{Kjos_etal.HP-HW-CC-IO.1996}{Ven.LKML_x86_DMA.2008} or may not \cites{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018}{Corbet.LWN-NC-DMA.2021} be kept coherent by the DMA mechanism, so any DMA operation performed by the RDMA NIC may be incoherent with the cached copy of the same data inside the CPU caches (as is the case for accelerators, etc.). This issue is of particular concern to the kernel development community, which needs to ensure that the behavior of DMA operations remains identical across architectures regardless of support for cache-coherent DMA \cite{Corbet.LWN-NC-DMA.2021}. Like existing RDMA implementations, which make heavy use of architecture-specific DMA memory allocation routines, implementing an RDMA-based DSM system in the kernel also requires careful use of the kernel API functions that ensure cache coherency where necessary.
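
As a minimal, simplified sketch of what this entails in driver code (the buffer, device pointer, and surrounding logic are placeholders), the Linux streaming DMA-mapping API makes the CPU/device ownership hand-off explicit; on architectures without cache-coherent DMA these calls perform the required cache maintenance, while on coherent ones they are cheap, keeping the code portable:

\begin{verbatim}
/* Sketch: streaming DMA mapping around a device-to-memory transfer
 * (e.g. an RDMA NIC depositing data into a kernel buffer). */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int receive_into(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the NIC to DMA incoming data to 'handle' ... */

    /* Hand ownership back to the CPU: on non-coherent systems this
     * invalidates stale cache lines so the CPU does not read pre-DMA
     * data out of its caches. */
    dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

    /* ... the CPU may now safely inspect 'buf' ... */

    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
    return 0;
}
\end{verbatim}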
\subsection{Cache Coherence in ARMv8}
We specifically focus on the implementation of cache coherence in ARMv8. Unlike x86, which guarantees cache-coherent DMA \cites{Ven.LKML_x86_DMA.2008}{Corbet.LWN-NC-DMA.2021}, the ARMv8 architecture (like many other popular ISAs, e.g., RISC-V) \emph{does not} guarantee cache coherency of DMA operations across vendor implementations. ARMv8 defines a hierarchical model for organizing coherency to support \textit{heterogeneous} and \textit{asymmetric} multiprocessing systems \cite{ARM.ARMv8-A.v1.0.2015}.

\begin{definition}[cluster]
A \textit{cluster} defines a minimal cache-coherent region for Cortex-A53 and Cortex-A57 processors. Each cluster usually comprises one or more cores as well as a shared last-level cache.
\end{definition}
\begin{definition}[shareable domain]
A \textit{shareable domain} is a vendor-defined cache-coherent region. Shareable domains can be \textit{inner} or \textit{outer}; these limit the scope of broadcast coherence messages to the \textit{point-of-unification} and the \textit{point-of-coherence}, respectively.

Usually, the \textit{inner} shareable domain encompasses all (closely coupled) processors inside a heterogeneous multiprocessing system (see \ref{def:het-mp}), while the \textit{outer} shareable domain defines the largest memory-sharing domain of the system (e.g., the DMA bus).
\end{definition}
\begin{definition}[Point-of-Unification]
The \textit{point-of-unification} under ARMv8 defines a level of coherency such that all sharers inside the \textbf{inner} shareable domain see the same copy of data.
\end{definition}
\begin{definition}[Point-of-Coherence]\label{def:poc}
The \textit{point-of-coherence} under ARMv8 defines a level of coherency such that all sharers inside the \textbf{outer} shareable domain see the same copy of data.
\end{definition}
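
To make the distinction concrete, the following sketch (a hypothetical helper, typically run in kernel context, assuming a 64-byte cache line; production code would derive the line size from \texttt{CTR\_EL0}) cleans a buffer to the point-of-coherence with \texttt{DC CVAC} so that an observer in the outer shareable domain, such as a non-coherent DMA master, sees the CPU's latest writes; cleaning only to the point-of-unification would use \texttt{DC CVAU} instead.

\begin{verbatim}
/* Sketch: clean a buffer's cache lines to the Point-of-Coherence so a
 * non-coherent observer in the outer shareable domain (e.g. a DMA
 * master) sees the CPU's latest writes.  Assumes 64-byte cache lines. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption; read CTR_EL0 in real code */

static void clean_to_poc(const void *addr, size_t size)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + size;

    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("dc cvac, %0" : : "r"(p) : "memory");

    __asm__ volatile("dsb sy" : : : "memory");  /* wait for completion */
}
\end{verbatim}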
Using these definitions, a vendor could build \textit{heterogeneous} and \textit{asymmetric} multiprocessing systems as follows:
\begin{definition}[Heterogeneous Multiprocessing]\label{def:het-mp}
A \textit{heterogeneous multiprocessing} system incorporates ARMv8 processors of diverse microarchitectures that are fully coherent with one another, running the same system image.
\end{definition}
\begin{definition}[Asymmetric Multiprocessing]
An \textit{asymmetric multiprocessing} system need not contain fully coherent processors. For example, a system-on-a-chip may contain a non-coherent co-processor for secure computing purposes \cite{ARM.ARMv8-A.v1.0.2015}.
\end{definition}
% Experiment: ...
% Discussion: (1) Linux and DMA and RDMA (2) replacement and other ideas...