unnamed_ba_thesis/tex/misc/background_draft.tex

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}

\addbibresource{background_draft.bib}

\begin{document}
Though large-scale cluster systems remain the dominant solution for request and
data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011},
there have been a resurgence towards applying HPC techniques (e.g., DSM) for more
efficient heterogeneous computation with more tightly-coupled heterogeneous nodes
providing (hardware) acceleration for one another \cite{Cabezas_etal.GPU-SM.2015}
\textcolor{red}{[ADD MORE CITATIONS]} Orthogonally, within the scope of one
motherboard, \emph{heterogeneous memory management (HMM)} enables the use of
OS-controlled, unified memory view across both main memory and device memory
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would with SMP programming, the underlying complexities of
memory ownership and data placement automatically managed by the OS kernel.
On the other hand, while HMM promises a distributed shared memory approach towards
exposing CPU and peripheral memory, applications (drivers and front-ends) that
exploit HMM to provide ergonomic programming models remain fragmented and
narrowly-focused. Existing efforts in exploiting HMM in Linux predominantly focus
on exposing global address space abstraction to GPU memory -- a largely
non-coordinated effort surrounding both \textit{in-tree} and proprietary code
\cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}.
Limited effort have been done on incorporating HMM into other variants of
accelerators in various system topologies.

Orthogonally, allocation of hardware accelerator resources in a cluster computing
environment becomes difficult when the required hardware accelerator resources
of one workload cannot be easily determined and/or isolated as a ``stage'' of
computation. Within a cluster system there may exist a large amount of
general-purpose worker nodes and limited amount of hardware-accelerated nodes.
Further, it is possible that every workload performed on this cluster asks for
hardware acceleration from time to time, but never for a relatively long time.
Many job scheduling mechanisms within a cluster
\emph{move data near computation} by migrating the entire job/container between
general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}
{Oh_Kim.Container_Migration.2018}. This way of migration naturally incurs
large overhead -- accelerator nodes which strictly perform computation on data in memory
without ever needing to touch the container's filesystem should not have to install
the entire filesystem locally, for starters. Moreover, must \emph{all} computations be
performed near data? \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over
fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups,
result in negligible impact on tail latencies but high impact on throughput when
bandwidth is maximized.

This thesis paper builds upon an ongoing research effort in implementing a
tightly coupled cluster where HMM abstractions allow for transparent RDMA access
from accelerator nodes to local data and migration of data near computation,
leveraging different consistency model and coherency protocols to amortize the
communication cost for shared data. \textcolor{red}{
Specifically, this thesis explores the effect of memory consistency model and
coherency protocol on memory-sharing between cluster nodes }

\textcolor{red}{The rest of the chapter is structured as follows\dots}

\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s
\cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}
{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These
developments follow from the success of the Stanford DASH project in the late
1980s -- a hardware distributed shared memory (specifically NUMA) implementation of a
multiprocessor that first proposed the \textit{directory-based protocol} for
cache coherence, which stores the ownership information of cache lines to reduce
unnecessary communication that prevented previous multiprocessors from scaling out
\cite{Lenoski_etal.Stanford_DASH.1992}.

While developments in hardware DSM materialized into a universal approach to
cache-coherence in contemporary many-core processors (e.g., \textit{Ampere
Altra}\cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in clustered
computing languished in favor of loosely-coupled nodes performing data-parallel
computation, communicating via message-passing. Bandwidth limitations with the
network interfaces of the late 1990s was insufficient to support the high traffic
incurred by DSM and its programming model
\cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}
{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

New developments in network interfaces provides much improved bandwidth and latency
compared to ethernet in the 1990s. RDMA-capable NICs have been shown to improve
the training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC,
scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for \textit{APACHE Spark}\cite{Lu_etal.Spark_over_RDMA.2014}
\textcolor{red}{and what?}. Consequently, there have been a resurgence of interest
in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
% i.e., we do it in kernel as opposed to in userspace. Accelerator node can access
% local node's shared page like a DMA device do so via HMM.

\subsection{Munin: Multi-Consistency Protocol}
\textit{Munin}\cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the older
developments in software DSM systems. The authors of Munin identify that
\textit{false-sharing}, occurring due to multiple processors writing to different
offsets of the same page triggering invalidations, is strongly detrimental to the
performance of shared-memory systems. To combat this, Munin exposes annotations
as part of its programming model to facilitate multiple consistency protocols on
top of release consistency. An immutable shared memory object across readers,
for example, can be safely copied without concern for coherence between processors.
On the other hand, the \textit{write-shared} annotation explicates that a memory
object is written by multiple processors without synchronization -- i.e., the
programmer guarantees that only false-sharing occurs within this granularity.
Annotations such as these explicitly disables subsets of consistency procedures
to reduce communication in the network fabric, thereby improving the performance
of the DSM system.

Perhaps most importantly, experiences from Munin show that \emph{restricting the
flexibility of programming model can lead to more performant coherence models}, as
\textcolor{teal}{corroborated} by the now-foundational
\textit{Resilient Distributed Database} paper \cite{Zaharia_etal.RDD.2012} --
which powered many now-popular scalable data processing frameworks such as
\textit{Hadoop MapReduce}\cite{WEB.APACHE..Apache_Hadoop.2023} and
\textit{APACHE Spark}\cite{WEB.APACHE..Apache_Spark.2023}. ``To achieve fault
tolerance efficiently, RDDs provide a restricted form of shared memory
[based on]\dots transformations rather than\dots updates to shared state''
\cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to
cheaply synchronize states between unshared address spaces -- a much desired
property for highly scalable, loosely-coupled clustered systems.

\subsection{Treadmarks: Multi-Writer Protocol}
\textit{Treadmarks}\cite{Amza_etal.Treadmarks.1996} is a software DSM system
developed in 1996, which featured an intricate \textit{interval}-based
multi-writer protocol that allows multiple nodes to write to the same page
without false-sharing. The system follows a release-consistent memory model,
which requires the use of either locks (via \texttt{acquire}, \texttt{release})
or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval}
represents a time period in-between page creation, \texttt{release} to another
processor, or a \texttt{barrier}; they also each correspond to a
\textit{write notice}, which are used for page invalidation. Each \texttt{acquire}
message is sent to the statically-assigned lock-manager node, which forwards the
message to the last releaser. The last releaser computes the outstanding write
notices and piggy-backs them back for the acquirer to invalidate its own cached
page entry, thus signifying entry into the critical section. Consistency
information, including write notices, intervals, and page diffs, are routinely
garbage-collected which forces cached pages in each node to become validated.

Compared to \textit{Treadmarks}, the system described in this paper uses a
single-writer protocol, thus eliminating the concept of ``intervals'' --
with regards to synchronization, each page can be either in-sync (in which case
they can be safely shared) or out-of-sync (in which case they must be
invalidated/updated). This comes with the following advantage:

\begin{itemize}
    \item Less metadata for consistency-keeping.
    \item More adherent to the CPU-accelerator dichotomy model.
    \item Much simpler coherence protocol, which reduces communication cost.
\end{itemize}

In view of the (still) disparate throughput and latency differences between local
and remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018},
the simpler coherence protocol of single-writer protocol should provide better
performance on the critical paths of remote memory access.

% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

\subsection{Hotpot: Single-Writer \& Data Replication}
Newer works such as \textit{Hotpot}\cite{Shan_Tsai_Zhang.DSPM.2017} apply
distributed shared memory techniques on persistent memory to provide
``transparent memory accesses, data persistence, data reliability, and high
availability''. Leveraging on persistent memory devices allow DSM applications
to bypass checkpoints to block device storage \cite{Shan_Tsai_Zhang.DSPM.2017},
ensuring both distributed cache coherence and data reliability at the same time
\cite{Shan_Tsai_Zhang.DSPM.2017}.

We specifically discuss the single-writer portion of its coherence protocol. The
data reliability guarantees proposed by the \textit{Hotpot} system requires each
shared page to be replicated to some \textit{degree of replication}. Nodes
who always store latest replication of shared pages are referred to as
``owner nodes'', which arbitrate other nodes to store more replications in order
to reach the degree of replication quota. At acquisition time, the acquiring node
asks the access-management node for single-writer access to shared page,
who grants it if no other critical section exists, alongside list of current
owner nodes. At release time, the releaser first commits its changes to all owner
nodes which, in turn, commits its received changes across lesser sharers to
achieve the required degree of replication. These two operations are all
acknowledged back in reverse order. Once all acknowledgements are received from
owner nodes by commit node, the releaser tells them to delete their commit logs
and, finally, tells the manager node to exit critical section.

The required degree of replication and logged commit transaction until explicit
deletion facilitate crash recovery at the expense of worse performance over
release-time I/O. While the study of crash recovery with respect to shared
memory systems is out of the scope of this thesis, this paper provides a good
framework for a \textbf{correct} coherence protocol for a single-writer,
multiple-reader shared memory system, particularly when the protocol needs to
cater for a great variety of nodes each with their own memory preferences (e.g.,
write-update vs. write-invalidate, prefetching, etc.).

\subsection{MENPS: A Return to DSM}
MENPS\cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable
interconnects as a proof-of-concept that DSM systems and programming models can
be as efficient as \textit{partitioned global address space} (PGAS) using today's
network interfaces. It builds upon \textit{TreadMark}'s
\cite{Amza_etal.Treadmarks.1996} coherence protocol and crucially alters it to
a \textit{floating home-based} protocol, based on the insight that diff-transfers
across the network is comparatively costly compared to RDMA intrinsics -- which
implies preference towards local diff-merging. The home node then acts as the
data supplier for every shared page within the system.

Compared to PGAS frameworks (e.g., MPI), experimentation over a subset of
\textit{NAS Parallel Benchmarks} shows that MENPS can obtain comparable speedup
in some of the computation tasks, while achieving much
better productivity due to DSM's support for transparent caching, etc.
\cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up their claim that
DSM systems are at least as viable as traditional PGAS/message-passing frameworks
for scientific computing, also corroborated by the resurgence of DSM studies
later on\cite{Masouros_etal.Adrias.2023}.

\section{HPC and Partitioned Global Address Space}
Improvement in NIC bandwidth and transfer rate allows for applications that expose
global address space, as well as RDMA technologies that leverage single-writer
protocols over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
technologies for example Openshmem, OpenMPI, Cray Chapel, etc. that leverage
specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.


Contemporary works on DSM systems focus more on leveraging hardware advancements
to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
for example, implements a complex system for memory disaggregation over multiple
compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
they observed significant performance improvements over existing data-intensive
processing frameworks, for example APACHE Spark, Memcached, and Redis, over
no-disaggregation (i.e., using node-local memory only, similar to cluster computing)
systems.

\subsection{Programming Model}

\subsection{Move Data to Process, or Move Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks, and greatly hardens the replacement problem.)

\section{Replacement Policy}

In general, three variants of replacement strategies have been proposed for either
generic cache block replacement problems, or specific use-cases where contextual
factors can facilitate more efficient cache resource allocation:
\begin{itemize}
    \item General-Purpose Replacement Algorithms, for example LRU.
    \item Cost-Model Analysis
    \item Probabilistic and Learned Algorithms
\end{itemize}

\subsection{General-Purpose Replacement Algorithms}
Practically speaking, in the general case of the cache replacement problem,
we desire to predict the re-reference interval of a cache block
\cite{Jaleel_etal.RRIP.2010}. This follows from the Belady's algorithm -- the
optimal case for the \emph{ideal} replacement problem occurs when, at eviction
time, the entry with the highest re-reference interval is replaced. Under this
framework, therefore, the commonly-used LRU algorithm could be seen as a heuristic
where the re-reference interval for each entry is predicted to be immediate.
Fortunately, memory access traces of real computer systems agree with this
tendency due to spatial locality \textbf{[source]}. (Real systems are complex,
however, and there are other behaviors...) On the other hand, the hypothetical
LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the
textbook LFU algorithm suffers from needing to maintain a priority-queue for
frequency analysis, it was nevertheless useful for keeping recurrent (though
non-recent) blocks from being evicted from the cache \textbf{[source]}.

Derivatives from the LRU algorithm attempts to balance between frequency and
recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

Advancements in parallel/concurrent systems had led to a rediscovery of the benefits
of using FIFO-derived replacement policies over their LRU/LFU counterparts, as
book-keeping operations on the uniform LRU/LFU state proves to be (1) difficult
for synchronization and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
\textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
CLOCK here \dots]}

Finally, real-life experiences have shown the need to reduce CPU time in practical
applications, owing from one simple observation -- during the fetch-execution
cycle, all processors perform blocking I/O on the memory. A cache-unfriendly
design, despite its hypothetical optimality, could nevertheless degrade the performance
of a system during low-memory situations. In fact, this proves to be the driving
motivation behind Linux's transition away from the old LRU-2Q page replacement
algorithm into the more coarse-grained Multi-generation LRU algorithm, which has
been mainlined since v6.1.

\subsection{Cost-Model Analysis}
The ideal case for the replacement problem fails to account for invalidation of
cache entries. It also assumes for a uniform, dual-hierarchical cache-store model
that is insufficient to capture the heterogeneity of today's massively-parallel,
distributed systems. High-speed network interfaces are capable of exposing RDMA
interfaces between computer nodes, which amount to almost twice as fast RDMA transfer
when compared to swapping over the kernel I/O stack, while software that bypass
the kernel I/O stack is capable of stretching the bandwidth advantage even more
(source). This creates an interesting network topology between RDMA-enabled nodes,
where, in addition to swapping at low-memory situations, the node may opt to ``swap''
or simply drop the physical page in order to lessen the cost of page misses.

\textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

Traditionally, replacement policies based on cost-model analysis were utilized in
content-delivery networks, which had different consistency models compared to
finer-grained systems. HTTP servers need not pertain to strong consistency models,
as out-of-date information is considered permissible, and single-writer scenarios
are common. Consequently, most replacement policies for static content servers,
while making strong distinction towards network topology, fails to concern for the
cases where an entry might become invalidated, let along multi-writer protocols.
One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
page fault frequency as an indicator of preference towards working set inclusion
(which I personally think is highly flawed -- to be explained). Another paper
\cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
page fault into consideration for eviction, but fails to go beyond the obvious
implication that pages that have been faulted \emph{must} be evicted.

The concept of cost models for RDMA and NUMA systems are relatively underdeveloped,
too. (Expand)

\subsection{Probabilistic and Learned Algorithms for Cache Replacement}
Finally, machine learning techniques and low-cost probabilistic approaches have
been applied on the ideal cache replacement problem with some level of success.
\textbf{[Talk about LeCaR, CACHEUS here]}.

\section{Cache Coherence and Consistency in DSM Systems}

(I need to read more into this. Most of the contribution comes from CPU caches,
less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence
protocol.]}

Consistency and communication protocols naturally affect the cost for each faulted
memory access \dots

\textbf{[Talk about directory, transactional, scope, and library cache coherence,
which allow for multi-casted communications at page fault but all with different
levels of book-keeping.]}

\printbibliography
\end{document}