% UG project example file, February 2022
% A minor change in citation, September 2023 [HS]
% Do not change the first two lines of code, except you may delete "logo," if causing problems.
% Understand any problems and seek approval before assuming it's ok to remove ugcheck.
\documentclass[logo,bsc,singlespacing,parskip]{infthesis}
\usepackage{ugcheck}

% Include any packages you need below, but don't include any that change the page
% layout or style of the dissertation. By including the ugcheck package above,
% you should catch most accidental changes of page layout though.

\usepackage{microtype} % recommended, but you can remove if it causes problems
% \usepackage{natbib} % recommended for citations
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{hyperref}
\usepackage[justification=centering]{caption}
\usepackage{graphicx}
\usepackage[english]{babel}

% -> biblatex
\usepackage{biblatex} % full of mischief
\addbibresource{mybibfile.bib}
% <- biblatex

% -> nice definition listings
\usepackage{csquotes}
\usepackage{amsthm}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
% <- definition

% -> code listing
% [!] Requires external program: pypi:pygments
\usepackage{minted}
\usemintedstyle{vs}
% <- code listing

\begin{document}
\begin{preliminary}

\title{Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}

\author{Zhengyi Chen}

% CHOOSE YOUR DEGREE a):
% please leave just one of the following un-commented
% \course{Artificial Intelligence}
%\course{Artificial Intelligence and Computer Science}
%\course{Artificial Intelligence and Mathematics}
%\course{Artificial Intelligence and Software Engineering}
%\course{Cognitive Science}
\course{Computer Science}
%\course{Computer Science and Management Science}
%\course{Computer Science and Mathematics}
%\course{Computer Science and Physics}
%\course{Software Engineering}
%\course{Master of Informatics} % MInf students

% CHOOSE YOUR DEGREE b):
% please leave just one of the following un-commented
%\project{MInf Project (Part 1) Report} % 4th year MInf students
%\project{MInf Project (Part 2) Report} % 5th year MInf students
\project{4th Year Project Report} % all other UG4 students

\date{\today}

\abstract{
This skeleton demonstrates how to use the \texttt{infthesis} style for
undergraduate dissertations in the School of Informatics. It also emphasises the
page limit, and that you must not deviate from the required style.
The file \texttt{skeleton.tex} generates this document and should be used as a
starting point for your thesis. Replace this abstract text with a concise
summary of your report.
}

\maketitle

\newenvironment{ethics}
{\begin{frontenv}{Research Ethics Approval}{\LARGE}}
{\end{frontenv}\newpage}

\begin{ethics}
% \textbf{Instructions:} \emph{Agree with your supervisor which
% statement you need to include. Then delete the statement that you are not using,
% and the instructions in italics.\\
% \textbf{Either complete and include this statement:}}\\ % DELETE THESE INSTRUCTIONS
% %
% % IF ETHICS APPROVAL WAS REQUIRED:
% This project obtained approval from the Informatics Research Ethics committee.\\
% Ethics application number: ???\\
% Date when approval was obtained: YYYY-MM-DD\\
% %
% \emph{[If the project required human participants, edit as appropriate, otherwise delete:]}\\ % DELETE THIS LINE
% The participants' information sheet and a consent form are included in the appendix.\\
% %
% IF ETHICS APPROVAL WAS NOT REQUIRED:
% \textbf{\emph{Or include this statement:}}\\ % DELETE THIS LINE
This project was planned in accordance with the Informatics Research
Ethics policy. It did not involve any aspects that required approval
from the Informatics Research Ethics committee.

\standarddeclaration
\end{ethics}

\begin{acknowledgements}
Jordanian River to the Mediterranean Sea, maybe\dots
\end{acknowledgements}

\tableofcontents
\end{preliminary}
\chapter{Introduction}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming; the underlying complexities of memory ownership and data placement are automatically managed by the OS kernel. However, while HMM promises a distributed-shared-memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little work has been done on incorporating HMM into other variants of accelerators across different system topologies.

Orthogonally, allocating hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload on such a cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job-scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. This form of migration naturally incurs a large overhead -- for starters, accelerator nodes that strictly perform computation on in-memory data, without ever touching the container's filesystem, should not have to install the entire filesystem locally. Moreover, must \emph{all} computation be performed near data? \textit{Adrias} \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, has negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.

This thesis builds upon an ongoing research effort in implementing a tightly coupled cluster where HMM abstractions allow both transparent RDMA access from accelerator nodes to local data and migration of data near computation, leveraging different consistency models and coherence protocols to amortize the communication cost of shared data. More specifically, this thesis explores the following:

\begin{itemize}
\item The effect of cache-coherency maintenance, specifically OS-initiated maintenance, on RDMA programs.
\item A discussion of memory models and coherence-protocol designs for a single-writer, multiple-reader RDMA-based DSM system.
\end{itemize}

The rest of the chapter is structured as follows:
\begin{itemize}
\item We identify and discuss notable developments in software-implemented DSM systems, and from them identify the key features that differentiate contemporary DSM techniques from their predecessors.
\item We identify alternative (shared-memory) programming paradigms and compare them with DSM, which sought to provide a transparent shared address space among participating nodes.
\item We give an overview of coherence protocols and consistency models for multi-sharer DSM systems.
\item We provide a primer on cache coherency in ARM64 systems, which \emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems \cite{Ven.LKML_x86_DMA.2008}.
\end{itemize}

\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These developments follow from the success of the Stanford DASH project in the late 1980s -- a hardware distributed shared memory (specifically NUMA) multiprocessor that first proposed the \textit{directory-based protocol} for cache coherence, which stores the ownership information of cache lines to reduce the unnecessary communication that prevented previous multiprocessors from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}.

While developments in hardware DSM matured into the universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra} \cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSM in clustered computing languished in favor of loosely coupled nodes performing data-parallel computation and communicating via message passing. The bandwidth of late-1990s network interfaces was insufficient to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

Newer network interfaces provide much-improved bandwidth and latency compared to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC, scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed for \textit{Apache Spark} \cite{Lu_etal.Spark_over_RDMA.2014} and \textit{SMBDirect} \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a resurgence of interest in software DSM systems and programming models \cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

\subsection{Munin: Multi-Consistency Protocol}
\textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the older software DSM systems. Its authors identify that \textit{false sharing} -- multiple processors writing to different offsets of the same page and thereby triggering invalidations -- is strongly detrimental to the performance of shared-memory systems. To combat this, Munin exposes annotations as part of its programming model to support multiple consistency protocols on top of release consistency. A shared memory object that is immutable across readers, for example, can be safely copied without concern for coherence between processors. The \textit{write-shared} annotation, on the other hand, declares that a memory object is written by multiple processors without synchronization -- i.e., the programmer guarantees that only false sharing occurs within this granularity. Annotations such as these explicitly disable subsets of the consistency procedures to reduce communication in the network fabric, thereby improving the performance of the DSM system.
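As a concrete illustration, one can picture such annotations as an allocation-time parameter that selects which consistency procedures the runtime may skip. This is a minimal sketch in C; the names are ours and not Munin's actual (language-level) interface:

\begin{minted}[linenos]{c}
#include <stddef.h>

/* Illustrative sketch only: Munin's real annotations are part of its
 * programming model; these identifiers are hypothetical. */
enum share_pattern {
    SP_READ_ONLY,    /* immutable: replicate freely, no coherence     */
    SP_WRITE_SHARED, /* unsynchronized writes are false sharing only  */
    SP_CONVENTIONAL, /* default: full consistency machinery           */
};

/* Assumed runtime hook: the chosen pattern disables the subset of
 * consistency procedures that the annotation proves unnecessary. */
extern void *dsm_alloc(size_t size, enum share_pattern pattern);
\end{minted}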
Perhaps most importantly, experience from Munin shows that \emph{restricting the flexibility of the programming model can lead to more performant coherence models}, as exhibited by the now-foundational \textit{Resilient Distributed Datasets} paper \cite{Zaharia_etal.RDD.2012}, which underpins popular scalable data-processing frameworks in the lineage of \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023}, such as \textit{Apache Spark} \cite{WEB.APACHE..Apache_Spark.2023}. ``To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory [based on]\dots transformations rather than\dots updates to shared state'' \cite{Zaharia_etal.RDD.2012}. This allows transformation logs to cheaply synchronize state between unshared address spaces -- a much-desired property for highly scalable, loosely coupled clustered systems.

\subsection{TreadMarks: Multi-Writer Protocol}
\textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} is a software DSM system from 1996 featuring an intricate \textit{interval}-based multi-writer protocol that allows multiple nodes to write to the same page without false sharing. The system follows a release-consistent memory model, which requires the use of either locks (via \texttt{acquire}, \texttt{release}) or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval} represents a time period in between page creation, a \texttt{release} to another processor, or a \texttt{barrier}; each interval also corresponds to a \textit{write notice}, which is used for page invalidation. Each \texttt{acquire} message is sent to the statically assigned lock-manager node, which forwards the message to the last releaser. The last releaser computes the outstanding write notices and piggy-backs them onto the reply, so that the acquirer can invalidate its own cached page entries before entering the critical section. Consistency information, including write notices, intervals, and page diffs, is routinely garbage-collected, which forces cached pages on each node to become validated.
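A minimal sketch of the acquire-time bookkeeping follows; the names are illustrative, not taken from the TreadMarks source:

\begin{minted}[linenos]{c}
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 1024

/* "Page P was modified during interval I" -- piggy-backed on the
 * lock-grant message from the last releaser. */
struct write_notice {
    int page;
    int interval;
};

static bool page_valid[MAX_PAGES];

/* On acquire, invalidate every page named by an outstanding write
 * notice; diffs are then fetched lazily on the next access fault. */
void on_acquire(const struct write_notice *wn, size_t n)
{
    for (size_t i = 0; i < n; i++)
        page_valid[wn[i].page] = false;
}
\end{minted}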
Compared to \textit{TreadMarks}, the system described in this thesis uses a single-writer protocol, eliminating the concept of ``intervals'': with regard to synchronization, each page is either in-sync (in which case it can be safely shared) or out-of-sync (in which case it must be invalidated/updated). This comes with the following advantages:

\begin{itemize}
\item Less metadata for consistency-keeping.
\item Closer adherence to the CPU--accelerator dichotomy model.
\item A much simpler coherence protocol, which reduces communication cost.
\end{itemize}

In view of the still-considerable throughput and latency gap between local and remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, the simpler single-writer coherence protocol should provide better performance on the critical paths of remote memory access.
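A minimal sketch of the per-page metadata this reduces to (names are illustrative):

\begin{minted}[linenos]{c}
/* Single-writer metadata: no intervals, no per-writer diffs. */
enum page_state { PAGE_IN_SYNC, PAGE_OUT_OF_SYNC };

struct page_meta {
    enum page_state state;
    int writer; /* the sole node allowed to write, or -1 if none */
};

/* A remote-write notification simply drops readers out of sync;
 * the page is refetched (or updated) before the next safe read. */
void on_remote_write(struct page_meta *m, int writer_node)
{
    m->state = PAGE_OUT_OF_SYNC;
    m->writer = writer_node;
}
\end{minted}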
\subsection{Hotpot: Single-Writer \& Data Replication}
Newer works such as \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017} apply distributed shared memory techniques to persistent memory to provide ``transparent memory accesses, data persistence, data reliability, and high availability''. Leveraging persistent memory devices allows DSM applications to bypass checkpoints to block-device storage, ensuring both distributed cache coherence and data reliability at the same time \cite{Shan_Tsai_Zhang.DSPM.2017}.

We specifically discuss the single-writer portion of its coherence protocol. The data-reliability guarantees proposed by the \textit{Hotpot} system require each shared page to be replicated to some \textit{degree of replication}. Nodes that always store the latest replica of a shared page are referred to as ``owner nodes''; they arbitrate other nodes into storing additional replicas in order to meet the degree-of-replication quota. At acquisition time, the acquiring node asks the access-management node for single-writer access to a shared page, which is granted if no other critical section exists, alongside a list of current owner nodes. At release time, the releaser first commits its changes to all owner nodes, which in turn commit the received changes across lesser sharers to achieve the required degree of replication. These two operations are acknowledged back in reverse order. Once the releaser has received all acknowledgements from the owner nodes, it tells them to delete their commit logs and, finally, tells the manager node to exit the critical section.
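The release path can be sketched as follows; all names are ours, and the messaging primitives are assumed stubs rather than Hotpot's API:

\begin{minted}[linenos]{c}
#include <stddef.h>

struct commit { int page; const void *diff; size_t len; };

extern void send_commit(int node, const struct commit *c); /* stub */
extern void wait_ack(int node);                            /* stub */
extern void send_log_delete(int node);                     /* stub */
extern void notify_manager_exit_cs(void);                  /* stub */

void release_page(const struct commit *c, const int *owners, int n)
{
    /* 1. Commit changes to every owner node; owners replicate
     *    onward until the degree-of-replication quota is met. */
    for (int i = 0; i < n; i++)
        send_commit(owners[i], c);

    /* 2. Acknowledgements propagate back in reverse order. */
    for (int i = 0; i < n; i++)
        wait_ack(owners[i]);

    /* 3. Owners may now discard their commit logs... */
    for (int i = 0; i < n; i++)
        send_log_delete(owners[i]);

    /* 4. ...and the manager node is told to exit the critical section. */
    notify_manager_exit_cs();
}
\end{minted}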
The required degree of replication, together with commit logs that persist until explicit deletion, facilitates crash recovery at the expense of release-time I/O performance. While crash recovery in shared-memory systems is out of the scope of this thesis, this paper provides a good framework for a \textbf{correct} coherence protocol for a single-writer, multiple-reader shared memory system, particularly when the protocol needs to cater for a great variety of nodes, each with its own memory preferences (e.g., write-update vs. write-invalidate, prefetching, etc.).

\subsection{MENPS: A Return to DSM}
MENPS \cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable interconnects as a proof of concept that DSM systems and programming models can be as efficient as \textit{partitioned global address space} (PGAS) on today's network interfaces. It builds upon the coherence protocol of \textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} but crucially alters it into a \textit{floating home-based} protocol, based on the insight that diff transfers across the network are costly compared to RDMA intrinsics -- which implies a preference for local diff-merging. The home node then acts as the data supplier for every shared page within the system.

Compared to PGAS and message-passing frameworks (e.g., MPI), experimentation over a subset of the \textit{NAS Parallel Benchmarks} shows that MENPS obtains comparable speedup on some of the computation tasks while achieving much better productivity thanks to DSM's support for transparent caching \cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up the claim that DSM systems are at least as viable as traditional PGAS/message-passing frameworks for scientific computing, as corroborated by the later resurgence of DSM studies \cite{Masouros_etal.Adrias.2023}.
\section{PGAS and Message Passing}
While the feasibility of transparent DSM systems over multiple machines on a network has been apparent since the 1980s, the predominant approaches to ``scaling out'' programs over the network rely on message passing \cite{AST_Steen.Distributed_Systems-3ed.2017}. The reasons are twofold:

\begin{enumerate}
\item Programmers would rather resort to more intricate but more predictable approaches to scaling out programs over the network \cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies manual/controlled data sharding over nodes, separation of compute and communication ``stages'' of computation, etc., which benefit performance analysis and engineering.
\item Enterprise applications value throughput and uptime of relatively computationally inexpensive tasks/resources \cite{BOOK.Hennessy_Patterson.CArch.2011}, which requires easy scalability of tried-and-true, latency-inexpensive applications. Studies of transparent DSM systems mostly require exotic, specifically written programs to exploit the global address space, which is fundamentally at odds with the reusability and flexibility required.
\end{enumerate}

\subsection{PGAS}
\textit{Partitioned Global Address Space} (PGAS) is a parallel programming model that (1) exposes a global address space to all machines within a network and (2) makes the distinction between local and remote memory explicit \cite{De_Wael_etal.PGAS_Survey.2015}. Oftentimes, message-passing frameworks, for example \textit{OpenMPI}, \textit{OpenFabrics}, and \textit{UCX}, are used as backends to provide the PGAS model over various network interfaces/platforms (e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}{WEB.HPE.Chapel_Platforms-v1.33.2023}.

Notably, implementing a \emph{global} address space across machines already equipped with their own \emph{local} address spaces (e.g., cluster nodes running commercial Linux) necessitates a global addressing mechanism for shared data objects. DART \cite{Zhou_etal.DART-MPI.2014}, for example, utilizes a 128-bit ``global pointer'' that encodes the global memory object/segment ID and access flags in the upper 64 bits and a virtual address in the lower 64 bits for each (slice of a) memory object allocated within the PGAS model. A \textit{non-collective} PGAS object is allocated entirely in the allocating node's local memory, but registered globally; consequently, a single global pointer is recorded in the runtime, with corresponding permission flags, for the context of some user-defined group of associated nodes. Comparatively, a \textit{collective} PGAS object is allocated such that a partition of the object (i.e., a sub-array of the representation) is stored on each associated node -- for a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each pointing to the same object with a different offset and an (intuitively) independently chosen virtual address. Note that this design naturally requires virtual addresses within each node to be \emph{pinned}: the allocated object cannot be re-addressed to a different virtual address, which would spontaneously invalidate the global pointer recording the local virtual address.
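A sketch of such a global pointer in C follows; the field widths are illustrative, and DART's actual layout differs in detail:

\begin{minted}[linenos]{c}
#include <stdint.h>

/* Upper 64 bits: identification and flags; lower 64 bits: the pinned
 * node-local virtual address. */
typedef struct {
    uint32_t unit;    /* owning node (team-relative unit ID) */
    uint16_t segment; /* global memory segment ID            */
    uint16_t flags;   /* access permissions, etc.            */
    uint64_t vaddr;   /* pinned local virtual address        */
} gptr_t;

/* Direct dereference is only possible for local objects; remote
 * accesses must go through the communication backend. */
static inline void *gptr_deref_local(gptr_t g, uint32_t my_unit)
{
    return (g.unit == my_unit) ? (void *)(uintptr_t)g.vaddr : NULL;
}
\end{minted}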
Similar schemes can be observed in other PGAS backends/runtimes, though they may opt for a map-like data structure for addressing instead. In general, while both PGAS and DSM systems provide memory management over remote nodes, PGAS frameworks provide no transparent caching or transfer of remote memory objects accessed by local nodes. The programmer is still expected to handle data/thread movement manually when working with shared memory over the network in order to maximize their performance metrics of interest.

\subsection{Message Passing}
\label{sec:msg-passing}
\textit{Message passing} remains the predominant programming model for parallelism between loosely coupled nodes, much as it is ubiquitous in supporting all levels of abstraction within the concurrent components of a computer system. Specific to cluster computing is the message-passing programming model, in which parallel programs (or instances of the same parallel program) on different nodes communicate by exchanging messages over the network. Such models trade programming-model productivity for finer-grained control over the messages passed, as well as a more explicit separation between the communication and computation stages of a programming subproblem.

Commonly, message-passing backends function as \textit{middleware} -- communication runtimes -- to aid distributed software development \cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a backend exposes facilities for inter-application communication to frontend developers while transparently providing security, accounting, and fault tolerance, much like how an operating system provides resource management, scheduling, and security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}. This is the case for implementations of the PGAS programming model, which mostly rely on common message-passing backends to orchestrate data manipulation across distributed nodes. Likewise, message-passing backends, including the RDMA API, form the backbone of many research-oriented DSM systems \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message passing between network-connected nodes may be \textit{two-sided} or \textit{one-sided}. The former models an intuitive workflow for sending and receiving datagrams over the network: the sender initiates a transfer; the receiver copies a received packet from the network card into a kernel buffer; the receiver's kernel filters the packet and (optionally \cite{FreeBSD.man-BPF-4.2021}) copies the internal message into the message-passing middleware's address space; the middleware then inspects the copied message and acts accordingly, likely also copying slices of the message data into some registered distributed-shared-memory buffer for the distributed application to access. However intuitive, this model poses a fundamental performance issue: both the receiver's kernel and its userspace must spend CPU time on every received message to move the data from the NIC into userspace. Because this happens concurrently with other kernel and userspace routines, a preemptible kernel may incur significant latency if the packet-filtering routine is preempted by another kernel routine, userspace, or IRQs.

Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows the network interface card to bypass in-kernel packet filters and perform DMA on registered memory regions. The NIC can then notify the CPU via interrupts, allowing kernel and userspace programs to run callbacks at reception time with reduced latency. Because of this advantage, many recent studies leverage RDMA APIs to improve distributed data workloads and to build DSM middleware \cites{Lu_etal.Spark_over_RDMA.2014}{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
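For illustration, a one-sided write via the \texttt{libibverbs} API might be posted as below; queue pair setup and the out-of-band exchange of \texttt{rkey} and remote address are assumed to have already happened:

\begin{minted}[linenos]{c}
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a one-sided RDMA WRITE: the remote CPU is never involved. */
static int post_rdma_write(struct ibv_qp *qp, void *buf, uint32_t len,
                           uint32_t lkey, uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED, /* completion via CQ */
    }, *bad_wr = NULL;

    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}
\end{minted}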
\section{Consistency Model and Cache Coherence}
A consistency model specifies a contract on the allowed behaviors of multi-processing programs with regard to shared memory \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious conflict, which consistency models aim to resolve, lies in the interaction between processor-native programs and multiprocessors, all of which need to operate on a shared memory with heterogeneous cache topologies; here, a well-defined consistency model resolves the conflict at the architectural level. Beyond consistency models for bare-metal systems, programming languages \cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024} and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for parallel access to shared memory on top of program-order guarantees, making program behavior under shared-memory parallel programming explicit across underlying implementations.
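The classic store-buffering litmus test makes the contract concrete: under sequential consistency the outcome \texttt{r0 == 0 \&\& r1 == 0} is forbidden, while weaker models (e.g., TSO with store buffers) permit it. A minimal C11 rendition:

\begin{minted}[linenos]{c}
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r0, r1;

void *t0(void *arg) { /* store x, then load y */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}
void *t1(void *arg) { /* store y, then load x */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r0=%d r1=%d\n", r0, r1); /* "0 0" only on weak models */
    return 0;
}
\end{minted}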
Related to the definition of a consistency model is the coherence problem, which arises whenever multiple actors hold copies of some datum that must be synchronized upon write accesses \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. While less relevant to programming-language design, coherence must be maintained via a coherence protocol \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020} in systems of both microarchitectural and network scale. For DSM systems, the design of a correct and performant coherence protocol is of especially high priority and is a major part of many studies of DSM systems throughout history \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Pinto_etal.Thymesisflow.2020}{Endo_Sato_Taura.MENPS_DSM.2020}{Couceiro_etal.D2STM.2009}.

\subsection{Consistency Model in DSM}
Distributed shared memory systems with node-local caching naturally imply the consistency problem with regard to contending read/write accesses. Indeed, a significant subset of DSM studies explicitly characterize themselves as adhering to one of the well-known consistency models, both to better understand system behavior and to enable optimizations in their coherence protocols \cites{Amza_etal.Treadmarks.1996}{Hu_Shi_Tang.JIAJIA.1999}{Carter_Bennett_Zwaenepoel.Munin.1991}{Endo_Sato_Taura.MENPS_DSM.2020}{Wang_etal.Concordia.2021}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kim_etal.DeX-upon-Linux.2020}, each balancing communication costs against ease of programming.

In particular, we note that DSM studies tend to conform either to release consistency \cites{Amza_etal.Treadmarks.1996}{Endo_Sato_Taura.MENPS_DSM.2020}{Carter_Bennett_Zwaenepoel.Munin.1991} or weaker \cite{Hu_Shi_Tang.JIAJIA.1999}, or to sequential consistency \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991}{Wang_etal.Concordia.2021}{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}, with few works \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018} adopting moderately constrained consistency models in between. While older works, as well as works that center the performance of their proposed DSM systems over existing approaches \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, favor release consistency for its performance benefits (e.g., in terms of coherence costs \cite{Endo_Sato_Taura.MENPS_DSM.2020}), newer works tend to adopt stricter consistency models, sometimes for the improved productivity offered to programmers \cite{Kim_etal.DeX-upon-Linux.2020}.

\begin{table}[h]
\centering
\begin{tabular}{|l|c c c c c c|}
\hline
 & Sequential & TSO & PSO & Release & Acquire & Scope \\
\hline
Home; Invalidate
 & \cites{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}{Zhang_etal.GiantVM.2020}
 & &
 & \cites{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}
 & \cites{Holsapple.DSM64.2012}
 & \cites{Hu_Shi_Tang.JIAJIA.1999} \\
\hline
Home; Update & & & & & & \\
\hline
Float; Invalidate
 & & &
 & \cites{Endo_Sato_Taura.MENPS_DSM.2020}
 & & \\
\hline
Float; Update & & & & & & \\
\hline
Directory; Inval.
 & \cites{Wang_etal.Concordia.2021}
 & & & & & \\
\hline
Directory; Update & & & & & & \\
\hline
Dist. Dir.; Inval.
 & \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991}
 &
 & \cites{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
 & \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
 & \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}
 & \\
\hline
Dist. Dir.; Update
 & & &
 & \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
 & & \\
\hline
\end{tabular}
\caption{
Coherence protocol vs.\ consistency model in selected disaggregated-memory studies. ``Float'' is short for ``floating home''. Studies were selected for having clearly described consistency models and coherence protocols.
}
\label{table:1}
\end{table}
We especially note the role of balancing productivity against performance when selecting the ideal consistency model for a system. Weaker consistency models are notoriously harder to program with, but imply fewer coherence communications and thus better overall throughput: provided the programmer can guarantee correctness, a weaker model permits less invalidation of node-local cache entries, allowing multiple nodes to compute in parallel on (likely outdated) local copies of data while the result of the computation remains semantically correct with regard to the program. This point was made explicit in \textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991}, which (to reiterate) introduces consistency ``protocol parameters'' to annotate shared-memory access patterns and thereby reduce the coherence communication necessary between nodes computing on distributed shared memory. For example, a DSM object (a memory object accounted for by the DSM system) can be annotated with ``delayed operations'' to delay coherence operations past any write access, or shared without the ``write'' annotation to disable write access across sharing nodes, thereby disabling all coherence operations on that DSM object. Via programmer annotation of DSM objects, Munin makes explicit the relationship between weaker consistency and the amount of synchronization overhead necessary among shared-memory nodes. To our knowledge, no more recent DSM work has explored this interaction between consistency and coherence costs on DSM objects, though, relatedly, \textit{Resilient Distributed Datasets (RDD)} \cite{Zaharia_etal.RDD.2012} also highlights the performance and flexibility benefits of opting for an immutable representation of data disaggregated over the network, compared to contemporary DSM approaches.

\subsection{Coherence Protocol}
Coherence protocols are thus the means by which DSM systems implement their consistency-model guarantees. As Table \ref{table:1} shows, DSM studies tend to implement write-invalidate coherence under a \textit{home-based} or \textit{directory-based} protocol framework, while a subset of DSM studies seek to reduce communication overheads and/or improve data persistence by offering write-update protocol extensions \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Shan_Tsai_Zhang.DSPM.2017}.

\subsubsection{Home-Based Protocols}
\textit{Home-based} protocols assign each shared memory object a corresponding ``home'' node, under the assumption that a many-node network distributes home-node ownership of shared memory objects across all hosts \cite{Hu_Shi_Tang.JIAJIA.1999}. On top of home-node ownership, each mutable shared memory object may additionally be cached by other nodes within the network, creating the coherence problem. To our knowledge, in addition to Table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Fleisch_Popek.Mirage.1989}{Schaefer_Li.Shiva.1989}{Hu_Shi_Tang.JIAJIA.1999}{Nelson_etal.Grappa_DSM.2015}{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}.

We observe that home-based protocols are conceptually straightforward compared to directory-based protocols, centering communication on the storage of global metadata (in this case, the ownership of each shared memory object). This leads to greater flexibility in implementing coherence protocols: a shared memory object may, at creation, be made known globally via broadcast, or made known to only a subset of nodes (zero or more) via multicast; likewise, metadata storage may be cached locally on each node and invalidated alongside object invalidation, or fetched from a fixed node per object. This implementation flexibility is further exploited in \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017}, which refines the ``home node'' concept into an \textit{owner node} to provide replication and persistence, in addition to adopting a dynamic home protocol similar to that of \cite{Endo_Sato_Taura.MENPS_DSM.2020}.

\subsubsection{Directory-Based Protocols}
\textit{Directory-based} protocols instead take a shared-database approach, denoting each shared memory object with a globally shared entry describing ownership and sharing status. In its non-distributed form (e.g., \cite{Wang_etal.Concordia.2021}), a global, central directory is maintained for all nodes in the network, and this directory becomes a bottleneck that imposes latency and bandwidth constraints on parallel processing. Comparatively, a distributed directory scheme may delegate responsibilities across all nodes, mostly in accordance with a sharded address space \cites{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}. Though theoretically sound, this scheme performs no dynamic load-balancing for commonly shared memory objects, and in the worst case it behaves exactly like a non-distributed directory coherence scheme. To our knowledge, in addition to Table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Schoinas_etal.Sirocco.1998}{Eisley_Peh_Shang.In-net-coherence.2006}{Hong_etal.NUMA-to-RDMA-DSM.2019}.
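An illustrative directory entry follows; the 64-node sharer bitmap is purely for the sketch:

\begin{minted}[linenos]{c}
#include <stdint.h>

/* One entry per shared page: who owns it, and who caches it. */
struct dir_entry {
    uint16_t owner;   /* node holding the writable copy      */
    uint64_t sharers; /* bit i set => node i caches the page */
};

extern void send_invalidate(int node, int page); /* assumed stub */

/* Write-invalidate: before granting write access, invalidate every
 * cached copy recorded in the directory. */
void invalidate_sharers(struct dir_entry *e, int page)
{
    for (int i = 0; i < 64; i++)
        if (e->sharers & (1ULL << i))
            send_invalidate(i, page);
    e->sharers = 0;
}
\end{minted}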
\subsection{DMA and Cache Coherence}
The advent of high-speed RDMA-capable network interfaces introduces opportunities for designing more performant DSM systems over RDMA (as established in \ref{sec:msg-passing}). Orthogonally, RDMA-capable NICs fundamentally perform direct memory access on main memory to achieve one-sided RDMA operations, reducing the effect of OS jitter on RDMA latencies. For modern computer systems with cached multiprocessors, this poses a potential cache-coherence problem at the local level: RDMA operations happen concurrently with memory accesses by CPUs, which hold copies of memory data in cache lines that may \cites{Kjos_etal.HP-HW-CC-IO.1996}{Ven.LKML_x86_DMA.2008} or may not \cites{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018}{Corbet.LWN-NC-DMA.2021} be kept fully coherent by the DMA mechanism, so any DMA operation performed by the RDMA NIC may be incoherent with the cached copy of the same data inside the CPU caches (as is the case for accelerators, etc.). This issue is of particular concern to the kernel development community, which needs to ensure that the behavior of DMA operations remains identical across architectures regardless of support for cache-coherent DMA \cite{Corbet.LWN-NC-DMA.2021}. Like existing RDMA implementations, which make heavy use of architecture-specific DMA memory allocation, implementing RDMA-based DSM systems in the kernel also requires careful use of the kernel API functions that ensure cache coherency where necessary.

\subsection{Cache Coherence in ARMv8-A}
We specifically focus on the implementation of cache coherence in ARMv8-A. Unlike x86, which guarantees cache-coherent DMA \cites{Ven.LKML_x86_DMA.2008}{Corbet.LWN-NC-DMA.2021}, the ARMv8-A architecture (like many other popular ISAs, for example \textit{RISC-V}) \emph{does not} guarantee cache coherency of DMA operations across vendor implementations. ARMv8 defines a hierarchical model of coherency organization to support \textit{heterogeneous} and \textit{asymmetric} multiprocessing systems \cite{ARM.ARMv8-A.v1.0.2015}.

\begin{definition}[cluster]
A \textit{cluster} defines a minimal cache-coherent region for Cortex-A53 and Cortex-A57 processors. Each cluster usually comprises one or more cores as well as a shared last-level cache.
\end{definition}

\begin{definition}[shareable domain]
A \textit{shareable domain} defines a vendor-defined cache-coherent region. Shareable domains can be \textit{inner} or \textit{outer}, which limits the scope of broadcast coherence messages to the \textit{point-of-unification} and the \textit{point-of-coherence}, respectively.

Usually, the \textit{inner} shareable domain covers all (closely coupled) processors inside a heterogeneous multiprocessing system (see \ref{def:het-mp}), while the \textit{outer} shareable domain defines the largest memory-sharing domain of the system (e.g., inclusive of the DMA bus).
\end{definition}

\begin{definition}[Point-of-Unification]\label{def:pou}
The \textit{point-of-unification} (\textit{PoU}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{inner} shareable domain see the same copy of data.

Consequently, the \textit{PoU} defines a point at which every core of an ARMv8-A processor sees the same (i.e., a \emph{unified}) copy of a memory location, regardless of whether it is accessed via the instruction cache, the data cache, or the TLB.
\end{definition}

\begin{definition}[Point-of-Coherence]\label{def:poc}
The \textit{point-of-coherence} (\textit{PoC}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{outer} shareable domain see the same copy of data.

Consequently, the \textit{PoC} defines a point at which all \textit{observers} of memory (e.g., cores, DSPs, DMA engines) observe the same copy of a memory location.
\end{definition}
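These points are the targets of ARMv8-A's cache-maintenance instructions (e.g., \texttt{DC CVAU} cleans to the PoU; \texttt{DC CIVAC} cleans and invalidates to the PoC). As a sketch, a bare-metal routine handing a buffer to a non-coherent DMA master might look as follows; it assumes a 64-byte cache line rather than querying \texttt{CTR\_EL0}:

\begin{minted}[linenos]{c}
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption; real code reads CTR_EL0 */

/* Clean+invalidate [buf, buf+len) to the PoC so that a non-coherent
 * observer (e.g., a DMA engine) sees the CPU's latest data. */
static inline void clean_inval_to_poc(void *buf, size_t len)
{
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)buf + len; p += CACHE_LINE)
        asm volatile("dc civac, %0" :: "r"(p) : "memory");
    asm volatile("dsb sy" ::: "memory"); /* complete maintenance */
}
\end{minted}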
\subsubsection{Addendum: \textit{Heterogeneous} \& \textit{Asymmetric} Multiprocessing}
Using these definitions, a vendor could build \textit{heterogeneous} and \textit{asymmetric} multiprocessing systems as follows:

\begin{definition}[Heterogeneous Multiprocessing]\label{def:het-mp}
A \textit{heterogeneous multiprocessing} system incorporates ARMv8 processors of diverse microarchitectures that are fully coherent with one another, running the same system image.
\end{definition}

\begin{definition}[Asymmetric Multiprocessing]
An \textit{asymmetric multiprocessing} system need not contain fully coherent processors. For example, a system-on-a-chip may contain a non-coherent co-processor for secure computing purposes \cite{ARM.ARMv8-A.v1.0.2015}.
\end{definition}

\subsection{ARMv8-A Software Cache Coherence in the Linux Kernel}
Because of the lack of a hardware guarantee of DMA coherency (though such support exists \cite{Parris.AMBA_4_ACE-Lite.2013}), programmers need to invoke architecture-specific cache-coherency instructions, often encapsulated in problem-specific subroutines, when porting DMA hardware support across a diverse range of ARMv8 microarchitectures.

Notably, kernel (driver) programming warrants programmer attention to software-maintained coherency, because downstream userspace programmers expect data flows interspersed between CPU and DMA operations to follow program ordering and (driver vendor) specifications. One such example arises in the Linux kernel implementation of the DMA memory management API \cite{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024}\footnote[1]{Based on Linux kernel v6.7.0.}:

\begin{definition}[DMA Mappings]
The Linux kernel DMA memory allocation API, imported via
\begin{minted}[linenos]{c}
#include <linux/dma-mapping.h>
\end{minted}
defines two variants of DMA mappings:

\begin{itemize}
\item {\label{def:consistent-dma-map}
\textit{Consistent} DMA mappings:

These are guaranteed to be coherent between concurrent CPU/DMA accesses without explicit software flushing.\footnote[2]{This does not, however, preclude CPU store reordering, so memory barriers remain necessary in a multiprocessing context.}
}
\item {
\textit{Streaming} DMA mappings:

These provide no coherency guarantee between concurrent CPU/DMA accesses. Programmers need to manually apply coherency-maintenance subroutines for synchronization.
}
\end{itemize}
\end{definition}

Consistent DMA mappings can be created trivially by allocating non-cacheable memory, which guarantees coherence to the \textit{PoC} for all memory observers (though system-specific fast paths exist).
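A minimal (hypothetical) driver fragment using the consistent-mapping API:

\begin{minted}[linenos]{c}
#include <linux/dma-mapping.h>

/* Sketch: allocate one page of consistent DMA memory for `dev`. */
static int example_coherent_alloc(struct device *dev)
{
    dma_addr_t handle;
    void *cpu_buf;

    cpu_buf = dma_alloc_coherent(dev, PAGE_SIZE, &handle, GFP_KERNEL);
    if (!cpu_buf)
        return -ENOMEM;

    /* The CPU uses cpu_buf; the device is handed `handle`. Neither
     * party needs explicit cache maintenance (only ordering barriers). */

    dma_free_coherent(dev, PAGE_SIZE, cpu_buf, handle);
    return 0;
}
\end{minted}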
On the other hand, streaming DMA mappings require manual synchronization upon programmed CPU/DMA access. Take single-buffer synchronization for the CPU after DMA access, for example:
\begin{minted}[linenos, mathescape]{c}
/* In kernel/dma/mapping.c $\label{code:dma_sync_single_for_cpu}$*/
void dma_sync_single_for_cpu(
    struct device *dev,          // kernel repr for DMA device
    dma_addr_t addr,             // DMA address
    size_t size,                 // Synchronization buffer size
    enum dma_data_direction dir  // Data-flow direction
) {
    /* Translate DMA address to physical address */
    phys_addr_t paddr = dma_to_phys(dev, addr);

    if (!dev_is_dma_coherent(dev)) {
        arch_sync_dma_for_cpu(paddr, size, dir);
        arch_sync_dma_for_cpu_all(); // MIPS quirks...
    }

    /* Miscellaneous cases... */
}
\end{minted}

\begin{minted}[linenos]{c}
/* In arch/arm64/mm/dma-mapping.c */
void arch_sync_dma_for_cpu(
    phys_addr_t paddr,
    size_t size,
    enum dma_data_direction dir
) {
    /* Translate physical address to (kernel) virtual address */
    unsigned long start = (unsigned long)phys_to_virt(paddr);

    /* Early exit for DMA read: no action needed for CPU */
    if (dir == DMA_TO_DEVICE)
        return;

    /* ARM64-specific: invalidate CPU cache to PoC */
    dcache_inval_poc(start, start + size);
}
\end{minted}

This call chain, as well as its mirror case, which maintains cache coherency for the DMA device after CPU access: \mint[breaklines=true]{c}|dma_sync_single_for_device(struct device *, dma_addr_t, size_t, enum dma_data_direction)|, calls into the following procedures, respectively:

\begin{minted}[linenos]{c}
/* Exported @ arch/arm64/include/asm/cacheflush.h */
/* Defined  @ arch/arm64/mm/cache.S */
/* All functions accept virtual start, end addresses. */

/* Invalidate data cache region [start, end) to PoC.
 *
 * Invalidate CPU cache entries that intersect with [start, end),
 * such that data from external writers becomes visible to the CPU.
 */
extern void dcache_inval_poc(
    unsigned long start, unsigned long end
);

/* Clean data cache region [start, end) to PoC.
 *
 * Write back CPU cache entries that intersect with [start, end),
 * such that data from the CPU becomes visible to external observers.
 */
extern void dcache_clean_poc(
    unsigned long start, unsigned long end
);
\end{minted}
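Putting the pieces together, the typical lifecycle of a streaming mapping in a (hypothetical) driver receiving data from a device would be:

\begin{minted}[linenos]{c}
#include <linux/dma-mapping.h>

/* Sketch: device-to-memory transfer into `buf` of length `len`. */
static int example_streaming_rx(struct device *dev, void *buf, size_t len)
{
    dma_addr_t addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, addr))
        return -EIO;

    /* ... program the device to DMA into `addr`; wait for completion ... */

    /* Make the device's writes visible to the CPU before reading. */
    dma_sync_single_for_cpu(dev, addr, len, DMA_FROM_DEVICE);
    /* ... CPU reads buf ... */

    dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
    return 0;
}
\end{minted}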
\subsubsection{Addendum: \texttt{enum dma\_data\_direction}}

The Linux kernel defines four direction \texttt{enum} values for fine-tuning synchronization behavior:
\begin{minted}[linenos]{c}
/* In include/linux/dma-direction.h */
enum dma_data_direction {
    DMA_BIDIRECTIONAL = 0, // data transfer direction uncertain.
    DMA_TO_DEVICE     = 1, // data from main memory to device.
    DMA_FROM_DEVICE   = 2, // data from device to main memory.
    DMA_NONE          = 3, // invalid repr for runtime errors.
};
\end{minted}

These values allow certain fast paths to be taken at runtime. For example, \texttt{DMA\_TO\_DEVICE} implies that the device reads data from memory without modifying it, and hence precludes software coherence instructions from being run when synchronizing for the CPU after a DMA operation.

% TODO: Move to addendum section.
\subsubsection{Use-case: Kernel-space \textit{SMBDirect} Driver}
\textit{SMBDirect} is an extension of the \textit{SMB} (\textit{Server Message Block}) protocol for opportunistically establishing the communication protocol over RDMA-capable network interfaces \cite{many.MSFTLearn-SMBDirect.2024}.

We focus on two procedures inside the in-kernel SMBDirect implementation:

\paragraph{Before send: \texttt{smbd\_post\_send}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static int smbd_post_send(
    struct smbd_connection *info, // SMBDirect transport context
    struct smbd_request *request  // SMBDirect request context
) // ...
\end{minted}

This function sits downstream of \texttt{smbd\_send}, which sends the SMBDirect payload for transport over the network. Payloads are constructed and batched for maximal bandwidth, then \texttt{smbd\_post\_send} is called to signal the RDMA NIC for transport.

The function body is roughly as follows:
\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
    struct ib_send_wr send_wr; // "Write Request" for entire payload
    int rc, i;

    /* For each message in batched payload */
    for (i = 0; i < request->num_sge; i++) {
        /* Log to kmesg ring buffer... */

        /* RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ $\label{code:ib_dma_sync_single_for_device}$*/
        ib_dma_sync_single_for_device(
            info->id->device,       // struct ib_device *
            request->sge[i].addr,   // u64 (as dma_addr_t)
            request->sge[i].length, // size_t
            DMA_TO_DEVICE           // enum dma_data_direction
        );
    }

    /* Populate `request`, `send_wr`... */

    rc = ib_post_send(
        info->id->qp, // struct ib_qp * ("Queue Pair")
        &send_wr,     // const struct ib_send_wr *
        NULL          // const struct ib_send_wr ** (err handling)
    );

    /* Error handling... */

    return rc;
}
\end{minted}

Line \ref{code:ib_dma_sync_single_for_device} writes back CPU cache lines so that they are visible to the RDMA NIC, in preparation for the DMA operations performed when the posted \textit{send request} is serviced.

\paragraph{Upon reception: \texttt{recv\_done}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static void recv_done(
    struct ib_cq *cq, // "Completion Queue"
    struct ib_wc *wc  // "Work Completion"
) // ...
\end{minted}

This function is called when the RDMA subsystem works on a payload received over RDMA. Mirroring \texttt{smbd\_post\_send}, it invalidates CPU cache lines so that the DMA-ed data is visible to the CPU cores prior to any operation on the received data:

\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
    struct smbd_data_transfer *data_transfer;
    struct smbd_response *response = container_of(
        wc->wr_cqe,           // ptr: pointer to member
        struct smbd_response, // type: type of container struct
        cqe                   // name: name of member in struct
    ); // Cast member of struct into containing struct (C magic)
    struct smbd_connection *info = response->info;
    int data_length = 0;

    /* Logging, error handling... */

    /* Likewise, RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ */
    ib_dma_sync_single_for_cpu(
        wc->qp->device,
        response->sge.addr,
        response->sge.length,
        DMA_FROM_DEVICE
    );

    /* ... */
}
\end{minted}
\chapter{Software Coherency Latency}

\chapter{DSM System Design}

% \bibliographystyle{plain}
% \bibliographystyle{plainnat}
% \bibliography{mybibfile}
\printbibliography

% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix

\chapter{First appendix}

\section{First section}

Any appendices, including any required ethics information, should be included
after the references.

Markers do not have to consider appendices. Make sure that your contributions
are made clear in the main body of the dissertation (within the page limit).

% \chapter{Participants' information sheet}

% If you had human participants, include key information that they were given in
% an appendix, and point to it from the ethics declaration.

% \chapter{Participants' consent form}

% If you had human participants, include information about how consent was
% gathered in an appendix, and point to it from the ethics declaration.
% This information is often a copy of a consent form.

\end{document}
|