author={ARM},
year={2015}
}

@inproceedings{Zhang_etal.GiantVM.2020,
title={{GiantVM}: A type-{II} hypervisor implementing many-to-one virtualization},
author={Zhang, Jin and Ding, Zhuocheng and Chen, Yubin and Jia, Xingguo and Yu, Boshi and Qi, Zhengwei and Guan, Haibing},
booktitle={Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
pages={30--44},
year={2020}
}

@book{Holsapple.DSM64.2012,
title={{DSM64}: A Distributed Shared Memory System in User-Space},
author={Holsapple, Stephen Alan},
year={2012},
publisher={California Polytechnic State University}
}

@inproceedings{Eisley_Peh_Shang.In-net-coherence.2006,
title={In-network cache coherence},
author={Eisley, Noel and Peh, Li-Shiuan and Shang, Li},
booktitle={2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)},
pages={321--332},
year={2006},
organization={IEEE}
}

@inproceedings{Schoinas_etal.Sirocco.1998,
title={Sirocco: Cost-effective fine-grain distributed shared memory},
author={Schoinas, Ioannis and Falsafi, Babak and Hill, Mark D and Larus, James R and Wood, David A},
booktitle={Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. 98EX192)},
pages={40--49},
year={1998},
organization={IEEE}
}

@article{Schaefer_Li.Shiva.1989,
title={Shiva: An operating system transforming a hypercube into a shared-memory machine},
author={Li, Kai and Schaefer, Richard},
year={1989}
}

@article{Fleisch_Popek.Mirage.1989,
title={Mirage: A coherent distributed shared memory design},
author={Fleisch, Brett and Popek, Gerald},
journal={ACM SIGOPS Operating Systems Review},
volume={23},
number={5},
pages={211--223},
year={1989},
publisher={ACM New York, NY, USA}
}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}
\usepackage{graphicx}
\usepackage[justification=centering]{caption}
\usepackage{hyperref}

\addbibresource{background_draft.bib}

\begin{document}
Though large-scale cluster systems remain the dominant solution for request- and
data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been
a resurgence in applying HPC techniques (e.g., DSM) for more efficient
heterogeneous computation, with tightly coupled heterogeneous nodes providing
(hardware) acceleration for one another
\cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}.
Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory
management (HMM)} enables an OS-controlled, unified memory view across both main
memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all
while using the same libc function calls as one would with SMP programming, with
the underlying complexities of memory ownership and data placement managed
automatically by the OS kernel. However, while HMM promises a distributed shared
memory approach towards exposing CPU and peripheral memory, applications
(drivers and front-ends) that exploit HMM to provide ergonomic programming
models remain fragmented and narrowly focused. Existing efforts to exploit HMM
in Linux predominantly focus on exposing a global address space abstraction over
GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and
proprietary code
\cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}.
Little work has been done on incorporating HMM into other variants of
accelerators in various system topologies.

Orthogonally, allocation of hardware accelerator resources in a cluster
computing environment becomes difficult when the hardware accelerator resources
required by a workload cannot be easily determined and/or isolated as a
``stage'' of computation. Within a cluster system there may exist a large number
of general-purpose worker nodes and a limited number of hardware-accelerated
nodes. Further, it is possible that every workload performed on this cluster
asks for hardware acceleration from time to time, but never for long at a
stretch. Many job scheduling mechanisms within a cluster \emph{move data near
computation} by migrating the entire job/container between general-purpose and
accelerator nodes
\cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}.
Such migration naturally incurs large overheads -- for starters, accelerator
nodes that strictly perform computation on in-memory data without ever touching
the container's filesystem should not have to install the entire filesystem
locally. Moreover, must \emph{all} computation be performed near data?
\cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network
interfaces (25 Gbps $\times$ 8), when compared to node-local setups, has
negligible impact on tail latencies but a high impact on throughput when
bandwidth is maximized.

This thesis paper builds upon an ongoing research effort in implementing a
tightly coupled cluster where HMM abstractions allow for transparent RDMA access
from accelerator nodes to local data and migration of data near computation,
leveraging different consistency models and coherency protocols to amortize the
communication cost of shared data. More specifically, this thesis explores the
following:

\begin{itemize}
    \item {
        The effect of cache coherency maintenance, specifically OS-initiated,
        on RDMA programs.
    }
    \item {
        Implementation of cache coherency in cache-incoherent kernel-side RDMA
        clients.
    }
    \item {
        Discussion of memory models and coherence protocol designs for a
        single-writer, multi-reader RDMA-based DSM system.
    }
\end{itemize}

The rest of the chapter is structured as follows:
\begin{itemize}
    \item {
        We identify and discuss notable developments in software-implemented
        DSM systems, and thus identify key features of contemporary advancements
        in DSM techniques that differentiate them from their predecessors.
    }
    \item {
        We identify alternative (shared memory) programming paradigms and
        compare them with DSM, which sought to provide a transparent shared
        address space among participating nodes.
    }
    \item {
        We give an overview of coherency protocols and consistency models for
        multi-sharer DSM systems.
    }
    \item {
        We provide a primer on cache coherency in ARM64 systems, which
        \emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems
        \cite{Ven.LKML_x86_DMA.2008}.
    }
\end{itemize}

\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These developments follow from the success of the Stanford DASH project in the late 1980s -- a hardware distributed shared memory (specifically NUMA) implementation of a multiprocessor that first proposed the \textit{directory-based protocol} for cache coherence, which stores the ownership information of cache lines to reduce unnecessary communication that prevented previous multiprocessors from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}.
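To make the directory idea concrete, the following self-contained C sketch (our own illustration with invented names and sizes, not DASH's actual hardware structures) keeps a presence-bit vector per memory block so that a write invalidates only the true sharers instead of broadcasting to every node:

```c
#include <assert.h>
#include <stdint.h>

#define NODES 8

/* One directory entry per memory block: which nodes cache it, who owns it. */
struct dir_entry {
    uint8_t sharers; /* presence bit-vector, one bit per node */
    int     owner;   /* node holding the writable copy, or -1 */
};

/* Record that node n obtained a read-only copy of the block. */
static void dir_read(struct dir_entry *e, int n)
{
    e->sharers |= (uint8_t)(1u << n);
}

/* Node n wants to write: invalidate every other sharer and return how many
 * invalidation messages were sent -- the directory contacts only the true
 * sharers, rather than broadcasting to all NODES-1 peers. */
static int dir_write(struct dir_entry *e, int n)
{
    int msgs = 0;
    for (int i = 0; i < NODES; i++)
        if (i != n && (e->sharers & (1u << i))) {
            msgs++;                          /* send INVALIDATE to node i */
            e->sharers &= (uint8_t)~(1u << i);
        }
    e->sharers |= (uint8_t)(1u << n);
    e->owner = n;
    return msgs;
}
```

Tracking sharers this way is precisely what let directory schemes scale past snooping: invalidation traffic grows with the number of actual sharers, not with the node count.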
While developments in hardware DSM materialized into a universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra} \cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in clustered computing languished in favor of loosely coupled nodes performing data-parallel computation, communicating via message passing. The bandwidth of late-1990s network interfaces was insufficient to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.
New developments in network interfaces provide much-improved bandwidth and latency
compared to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve
cheaply synchronize states between unshared address spaces -- a much desired
property for highly scalable, loosely-coupled clustered systems.
\subsection{Treadmarks: Multi-Writer Protocol}
\textit{Treadmarks} \cite{Amza_etal.Treadmarks.1996} is a software DSM system developed in 1996, which featured an intricate \textit{interval}-based multi-writer protocol that allows multiple nodes to write to the same page without false sharing. The system follows a release-consistent memory model, which requires the use of either locks (via \texttt{acquire}, \texttt{release}) or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval} represents a time period in between page creation, a \texttt{release} to another processor, or a \texttt{barrier}; each also corresponds to a \textit{write notice}, which is used for page invalidation. Each \texttt{acquire} message is sent to the statically assigned lock-manager node, which forwards the message to the last releaser. The last releaser computes the outstanding write notices and piggy-backs them onto its reply, letting the acquirer invalidate its own cached page entries before entering the critical section. Consistency information, including write notices, intervals, and page diffs, is routinely garbage-collected, which forces each node to validate its cached pages.
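As a rough single-process approximation of the write-notice bookkeeping (invented names and simplifications, not TreadMarks code), the sketch below shows the two halves of the exchange: a releaser turns its dirty pages into write notices, and the acquirer applies the piggy-backed notices by invalidating its cached copies, deferring the actual diff fetch to a later access fault:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES 4

/* Per-node view: whether each locally cached page is still valid. */
struct node {
    bool valid[PAGES];
};

/* A write notice names a page modified during the releaser's interval. */
struct write_notice {
    int page;
};

/* Releaser side: emit one write notice per page it dirtied. */
static int make_notices(const bool dirty[PAGES], struct write_notice out[PAGES])
{
    int n = 0;
    for (int p = 0; p < PAGES; p++)
        if (dirty[p])
            out[n++].page = p;
    return n;
}

/* Acquirer side: apply piggy-backed notices by invalidating cached pages;
 * the real protocol fetches diffs lazily, on the next access fault. */
static void apply_notices(struct node *acq, const struct write_notice *wn, int n)
{
    for (int i = 0; i < n; i++)
        acq->valid[wn[i].page] = false;
}
```

Note how invalidation is decoupled from data movement: an acquire only poisons stale pages, which is what keeps the multi-writer protocol's synchronization messages small.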
Compared to \textit{Treadmarks}, the system described in this paper uses a
single-writer protocol, thus eliminating the concept of ``intervals'' --
is a major part of many studies in DSM systems throughout history
{Pinto_etal.Thymesisflow.2020}{Endo_Sato_Taura.MENPS_DSM.2020}
{Couceiro_etal.D2STM.2009}.
% \subsection{Common Consistency Models}
% ... should I even write this section? imo it's too basic for anyone to read
% and really just serves as a means to increase word count
to its performance benefits (e.g., in terms of coherence costs
consistency models, sometimes due to improved productivity offered to
programmers \cite{Kim_etal.DeX-upon-Linux.2020}.
\begin{table}[h]
\centering
\begin{tabular}{|l|c c c c c c|}
\hline
& Sequential
& TSO
& PSO
& Release
& Acquire
& Scope \\
\hline
Home; Invalidate
& \cites{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}{Zhang_etal.GiantVM.2020}
&
&
& \cites{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}
& \cites{Holsapple.DSM64.2012}
& \cites{Hu_Shi_Tang.JIAJIA.1999} \\
\hline
Home; Update
& & & & & & \\
\hline
Float; Invalidate
&
&
&
& \cites{Endo_Sato_Taura.MENPS_DSM.2020}
&
& \\
\hline
Float; Update
& & & & & & \\
\hline
Directory; Inval.
& \cites{Wang_etal.Concordia.2021}
&
&
&
&
& \\
\hline
Directory; Update
& & & & & & \\
\hline
Dist. Dir.; Inval.
& \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991}
&
& \cites{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}
& \\
\hline
Dist. Dir.; Update
&
&
&
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
&
& \\
\hline
\end{tabular}
\caption{
Coherence protocol vs. consistency model in selected disaggregated memory studies. ``Float'' is short for ``floating home''. Studies were selected for clearly described consistency models and coherence protocols.
}
\label{table:1}
\end{table}

We especially note the role of balancing productivity and performance in terms
of selecting the ideal consistency model for a system. It is common knowledge
data representation over disaggregated memory over network when compared to
contemporary DSM approaches.
\subsection{Coherence Protocol}
Coherence protocols hence become the means by which DSM systems implement their consistency model guarantees. As table \ref{table:1} shows, DSM studies tend to implement write-invalidate coherence via either a \textit{home-based} or a \textit{directory-based} protocol, while a subset sought to reduce communication overhead and/or improve data persistence by offering write-update protocol extensions \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Shan_Tsai_Zhang.DSPM.2017}.
\subsubsection{Home-Based Protocols}
\textit{Home-based} protocols define for each shared memory object a corresponding ``home'' node, under the assumption that a many-node network would distribute home-node ownership of shared memory objects across all hosts \cite{Hu_Shi_Tang.JIAJIA.1999}. On top of home-node ownership, each mutable shared memory object may additionally be cached by other nodes within the network, creating the coherence problem. To our knowledge, in addition to table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Fleisch_Popek.Mirage.1989}{Schaefer_Li.Shiva.1989}{Hu_Shi_Tang.JIAJIA.1999}{Nelson_etal.Grappa_DSM.2015}{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}.
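As an illustration (hypothetical structures and names, not drawn from any one cited system), a home-based write-invalidate protocol reduces to a statically assigned home node that hands out copies on read misses and invalidates every replica before committing a write:

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 4

/* Home-side record for one shared page: the home identity is static;
 * cached[] tracks which other nodes currently hold a replica. */
struct page_home {
    int  home;           /* statically assigned home node   */
    bool cached[NODES];  /* replicas held by non-home nodes */
    int  version;        /* bumped on every committed write */
};

/* Read miss: a node fetches a copy from the home. */
static int home_fetch(struct page_home *p, int node)
{
    if (node != p->home)
        p->cached[node] = true;
    return p->version;   /* the page data would travel with the version */
}

/* A write is forwarded to the home, which invalidates all other replicas
 * before committing -- the invariant that keeps caches coherent. */
static int home_write(struct page_home *p, int writer)
{
    int invalidations = 0;
    for (int n = 0; n < NODES; n++)
        if (n != p->home && n != writer && p->cached[n]) {
            p->cached[n] = false;   /* send INVALIDATE to node n */
            invalidations++;
        }
    p->version++;
    return invalidations;
}
```

Because every request for a given page goes to one fixed node, no distributed lookup is needed to find the page's metadata; the trade-off is that a popular page can make its home node a hotspot.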
We identify that home-based protocols are conceptually straightforward compared to directory-based protocols, centering communications on the storage of distributed metadata (in this case, the manager node for each shared memory object). This leads to
\subsubsection{Directory-Based Protocols}
\textit{Directory-based} protocols instead track, for each shared memory object, its set of sharers in a directory entry, in the manner pioneered by DASH \cite{Lenoski_etal.Stanford_DASH.1992}. To our knowledge, in addition to table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Schoinas_etal.Sirocco.1998}{Eisley_Peh_Shang.In-net-coherence.2006}{Hong_etal.NUMA-to-RDMA-DSM.2019}.
\subsection{DMA and Cache Coherence}
% Because this thesis specifically studies cache coherence in ARMv8, we