\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}

\addbibresource{background_draft.bib}

\begin{document}

Though large-scale cluster systems remain the dominant solution for request-level
and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has
been a resurgence in applying HPC techniques (e.g., DSM) for more efficient
heterogeneous computation, with tightly coupled heterogeneous nodes providing
(hardware) acceleration for one another
\cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}.
Orthogonally, within the scope of one motherboard,
\emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified
memory view across both main memory and device memory
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would with SMP programming; the underlying complexities of
memory ownership and data placement are managed automatically by the OS kernel.
On the other hand, while HMM promises a distributed-shared-memory approach to
exposing CPU and peripheral memory, applications (drivers and front-ends) that
exploit HMM to provide ergonomic programming models remain fragmented and
narrowly focused. Existing efforts to exploit HMM in Linux focus predominantly
on exposing a global address space abstraction over GPU memory -- a largely
uncoordinated effort spanning both \textit{in-tree} and proprietary code
\cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}.
Little work has been done on incorporating HMM into other variants of
accelerators in various system topologies.

Orthogonally, allocating hardware accelerator resources in a cluster computing
environment becomes difficult when the accelerator resources required by a
workload cannot be easily determined and/or isolated as a ``stage'' of
computation. A cluster may contain a large number of general-purpose worker
nodes and a limited number of hardware-accelerated nodes. Further, it is
possible that every workload on this cluster asks for hardware acceleration
from time to time, but never for a relatively long time. Many job-scheduling
mechanisms within a cluster \emph{move data near computation} by migrating the
entire job/container between general-purpose and accelerator nodes
\cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}.
This style of migration naturally incurs large overhead -- for starters,
accelerator nodes that strictly perform computation on in-memory data, without
ever touching the container's filesystem, should not have to install the entire
filesystem locally. Moreover, must \emph{all} computation be performed near
data? \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast
network interfaces (25 Gbps $\times$ 8), compared to node-local setups, has
negligible impact on tail latencies but a high impact on throughput when
bandwidth is maximized.

This thesis builds upon an ongoing research effort in implementing a tightly
coupled cluster where HMM abstractions allow transparent RDMA access from
accelerator nodes to local data and migration of data near computation,
leveraging different consistency models and coherence protocols to amortize the
communication cost for shared data. More specifically, this thesis explores the
following:

\begin{itemize}
    \item {
        The effect of cache-coherence maintenance, specifically OS-initiated
        maintenance, on RDMA programs.
    }
    \item {
        Implementation of cache coherence in cache-incoherent, kernel-side
        RDMA clients.
    }
    \item {
        Discussion of memory models and coherence protocol designs for a
        single-writer, multi-reader RDMA-based DSM system.
    }
\end{itemize}

The rest of the chapter is structured as follows:
\begin{itemize}
    \item {
        We identify and discuss notable developments in software-implemented
        DSM systems, identifying the key features that differentiate
        contemporary advances in DSM techniques from their predecessors.
    }
    \item {
        We identify alternative (shared memory) programming paradigms and
        compare them with DSM, which seeks to provide a transparent shared
        address space among participating nodes.
    }
    \item {
        We give an overview of coherence protocols and consistency models for
        multi-sharer DSM systems.
    }
    \item {
        We provide a primer on cache coherence in ARM64 systems, which,
        unlike x86 systems, \emph{do not} guarantee cache-coherent DMA
        \cite{Ven.LKML_x86_DMA.2008}.
    }
\end{itemize}

\section{Experiences from Software DSM}
The majority of contributions to software DSM systems come from the 1990s
\cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}
{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These
developments follow from the success of the Stanford DASH project in the late
1980s -- a hardware distributed shared memory (specifically NUMA) implementation
of a multiprocessor that first proposed the \textit{directory-based protocol}
for cache coherence, which stores the ownership information of cache lines to
reduce the unnecessary communication that prevented earlier multiprocessors
from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}.

While developments in hardware DSM materialized into a universal approach to
cache coherence in contemporary many-core processors (e.g., \textit{Ampere
Altra} \cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in
clustered computing languished in favor of loosely coupled nodes performing
data-parallel computation and communicating via message passing. The bandwidth
of late-1990s network interfaces was insufficient to support the high traffic
incurred by DSM and its programming model
\cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}
{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

Newer network interfaces provide much-improved bandwidth and latency compared
to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve
training efficiency sixfold compared to distributed \textit{TensorFlow} via
RPC, scaling positively over non-distributed training
\cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed
for Apache Spark \cite{Lu_etal.Spark_over_RDMA.2014} and SMBDirect
\cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a resurgence of
interest in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
% i.e., we do it in kernel as opposed to in userspace. Accelerator node can access
% local node's shared page like a DMA device do so via HMM.

\subsection{Munin: Multi-Consistency Protocol}
\textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the
earlier software DSM systems. The authors of Munin identify that
\textit{false sharing} -- multiple processors writing to different offsets of
the same page, triggering invalidations -- is strongly detrimental to the
performance of shared-memory systems. To combat this, Munin exposes annotations
as part of its programming model to facilitate multiple consistency protocols
on top of release consistency. A shared memory object that is immutable across
readers, for example, can be safely copied without concern for coherence
between processors. On the other hand, the \textit{write-shared} annotation
states that a memory object is written by multiple processors without
synchronization -- i.e., the programmer guarantees that only false sharing
occurs at this granularity. Annotations such as these explicitly disable
subsets of consistency procedures to reduce communication in the network
fabric, thereby improving the performance of the DSM system.

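To make the false-sharing pathology concrete, the following minimal C program
(our own illustration, not code from the Munin paper) has two threads write to
different offsets of the same page; under a page-granular invalidation-based
protocol, ownership of the page ping-pongs between processors even though the
threads never touch the same bytes:

\begin{verbatim}
#include <pthread.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define ITERS 1000000

static char *page;   /* one shared page */

static void *writer(void *arg)
{
    /* each thread increments a counter at its own offset */
    long *counter = (long *)(page + (long)arg);
    for (int i = 0; i < ITERS; i++)
        (*counter)++;   /* every write dirties the whole page */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    posix_memalign((void **)&page, PAGE_SIZE, PAGE_SIZE);
    /* offsets 0 and 512 never overlap, yet a page-granular
     * protocol sees the two writers as conflicting */
    pthread_create(&t1, NULL, writer, (void *)0);
    pthread_create(&t2, NULL, writer, (void *)512);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    free(page);
    return 0;
}
\end{verbatim}

Munin's \texttt{write-shared} annotation declares such unsynchronized writes
intentional, letting the runtime buffer and merge diffs instead of invalidating
on every write.
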
Perhaps most importantly, experiences from Munin show that \emph{restricting
the flexibility of the programming model can lead to more performant coherence
models}. The same insight underlies the now-foundational \textit{Resilient
Distributed Datasets} paper \cite{Zaharia_etal.RDD.2012} and restricted-model
data processing frameworks such as \textit{Hadoop MapReduce}
\cite{WEB.APACHE..Apache_Hadoop.2023} and \textit{Apache Spark}
\cite{WEB.APACHE..Apache_Spark.2023}, the latter built directly on RDDs. ``To
achieve fault tolerance efficiently, RDDs provide a restricted form of shared
memory [based on]\dots transformations rather than\dots updates to shared
state'' \cite{Zaharia_etal.RDD.2012}. This allows transformation logs to
cheaply synchronize state between unshared address spaces -- a much-desired
property for highly scalable, loosely coupled clustered systems.

\subsection{TreadMarks: Multi-Writer Protocol}
\textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} is a software DSM system
from 1996 featuring an intricate \textit{interval}-based multi-writer protocol
that allows multiple nodes to write to the same page without false sharing. The
system follows a release-consistent memory model, which requires the use of
either locks (via \texttt{acquire}, \texttt{release}) or barriers (via
\texttt{barrier}) to synchronize. Each \textit{interval} represents a time
period delimited by page creation, a \texttt{release} to another processor, or
a \texttt{barrier}; each interval also corresponds to a \textit{write notice},
which is used for page invalidation. Each \texttt{acquire} message is sent to
the statically assigned lock-manager node, which forwards the message to the
last releaser. The last releaser computes the outstanding write notices and
piggy-backs them onto its reply so the acquirer can invalidate its own cached
page entries, thus signifying entry into the critical section. Consistency
information, including write notices, intervals, and page diffs, is routinely
garbage-collected, which forces each node to validate its cached pages.

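The following C sketch (our own simplification; field names and layout are
hypothetical, not taken from the TreadMarks sources) illustrates the
consistency metadata the protocol must track: each interval carries a vector
timestamp and a list of write notices, and each write notice names a page to
invalidate at the next \texttt{acquire}:

\begin{verbatim}
#include <stdint.h>

#define MAX_PROCS 32

/* Hypothetical TreadMarks-style metadata (simplified). */
struct write_notice {
    uint32_t page_id;             /* page to invalidate at next acquire  */
    struct write_notice *next;
};

struct interval {
    uint32_t vtime[MAX_PROCS];    /* vector timestamp of this interval   */
    uint32_t creator;             /* processor that performed the writes */
    struct write_notice *notices; /* pages written during this interval  */
    struct interval *next;
};

/* On acquire, the last releaser piggy-backs every interval the
 * acquirer has not yet seen; the acquirer applies the attached write
 * notices by invalidating the named pages before entering the
 * critical section. */
\end{verbatim}
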
Compared to \textit{TreadMarks}, the system described in this paper uses a
single-writer protocol, thus eliminating the concept of ``intervals'' -- with
regard to synchronization, each page is either in-sync (in which case it can be
safely shared) or out-of-sync (in which case it must be invalidated/updated).
This comes with the following advantages:

\begin{itemize}
    \item Less metadata for consistency-keeping.
    \item Closer adherence to the CPU-accelerator dichotomy model.
    \item A much simpler coherence protocol, which reduces communication cost.
\end{itemize}

In view of the (still) disparate throughput and latency between local and
remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, the
simpler coherence protocol of a single-writer design should provide better
performance on the critical paths of remote memory access.

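The resulting page state machine is correspondingly small; a minimal sketch of
our model in C (our own illustration):

\begin{verbatim}
#include <stdbool.h>

/* Our single-writer, multi-reader model: a shared page is either
 * in-sync (safe to serve locally) or out-of-sync (must be updated or
 * invalidated before use) -- no interval bookkeeping is needed. */
enum page_state { PAGE_IN_SYNC, PAGE_OUT_OF_SYNC };

/* A remote write by the single writer moves readers out-of-sync;
 * receiving the update (or re-fetching the page) moves them back. */
static bool may_read(enum page_state s) { return s == PAGE_IN_SYNC; }
\end{verbatim}
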
% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

\subsection{Hotpot: Single-Writer \& Data Replication}
Newer works such as \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017} apply
distributed shared memory techniques to persistent memory to provide
``transparent memory accesses, data persistence, data reliability, and high
availability'' \cite{Shan_Tsai_Zhang.DSPM.2017}. Leveraging persistent memory
devices allows DSM applications to bypass checkpoints to block-device storage,
ensuring both distributed cache coherence and data reliability at the same time
\cite{Shan_Tsai_Zhang.DSPM.2017}.

We specifically discuss the single-writer portion of its coherence protocol.
The data reliability guarantees proposed by the \textit{Hotpot} system require
each shared page to be replicated to some \textit{degree of replication}. Nodes
that always store the latest replica of a shared page are referred to as
``owner nodes''; they arbitrate which other nodes store additional replicas in
order to reach the degree-of-replication quota. At acquisition time, the
acquiring node asks the access-management node for single-writer access to a
shared page, which the manager grants if no other critical section exists,
alongside a list of current owner nodes. At release time, the releaser first
commits its changes to all owner nodes, each of which in turn commits the
received changes across lesser sharers to achieve the required degree of
replication. Both operations are acknowledged back in reverse order. Once the
releaser has received all acknowledgements from the owner nodes, it tells them
to delete their commit logs and, finally, tells the manager node to exit the
critical section.

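Put together, the release path looks roughly as follows (our paraphrase of the
paper's description; all types and helper functions below are hypothetical):

\begin{verbatim}
#include <stddef.h>

struct node;                      /* handle to a remote node          */
struct shared_page {
    struct node **owners;         /* nodes holding the latest replica */
    size_t n_owners;
};

void send_commit(struct node *n, struct shared_page *p);
void wait_all_acks(struct shared_page *p);
void send_log_delete(struct node *n, struct shared_page *p);
void notify_manager_release(struct shared_page *p);

void release_page(struct shared_page *p)
{
    /* 1. Commit local changes to every owner node; owners forward the
     *    changes to lesser sharers until the degree-of-replication
     *    quota is met, and acknowledgements return in reverse order. */
    for (size_t i = 0; i < p->n_owners; i++)
        send_commit(p->owners[i], p);
    wait_all_acks(p);

    /* 2. Once all acks arrive, ask the owners to delete their commit
     *    logs, then tell the manager node to exit the critical
     *    section. */
    for (size_t i = 0; i < p->n_owners; i++)
        send_log_delete(p->owners[i], p);
    notify_manager_release(p);
}
\end{verbatim}
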
The required degree of replication, together with commit transactions that stay
logged until explicitly deleted, facilitates crash recovery at the expense of
extra release-time I/O. While the study of crash recovery in shared memory
systems is out of the scope of this thesis, this paper provides a good
framework for a \textbf{correct} coherence protocol for a single-writer,
multiple-reader shared memory system, particularly when the protocol needs to
cater to a great variety of nodes, each with its own memory preferences (e.g.,
write-update vs.\ write-invalidate, prefetching, etc.).

\subsection{MENPS: A Return to DSM}
MENPS \cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable
interconnects as a proof of concept that DSM systems and programming models can
be as efficient as \textit{partitioned global address space} (PGAS) on today's
network interfaces. It builds upon the coherence protocol of
\textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} and, crucially, alters it
into a \textit{floating home-based} protocol, based on the insight that diff
transfers across the network are costly compared to RDMA intrinsics -- which
implies a preference for local diff-merging. The home node then acts as the
data supplier for every shared page within the system.

Compared against PGAS and message-passing frameworks (e.g., MPI),
experimentation over a subset of the \textit{NAS Parallel Benchmarks} shows
that MENPS obtains comparable speedup in some of the computation tasks while
achieving much better productivity thanks to DSM's support for transparent
caching \cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up the
authors' claim that DSM systems are at least as viable as traditional
PGAS/message-passing frameworks for scientific computing, a view also
corroborated by the later resurgence of DSM studies
\cite{Masouros_etal.Adrias.2023}.

\section{PGAS and Message Passing}
While the feasibility of transparent DSM systems over multiple machines on a
network has been apparent since the 1980s, the predominant approaches to
``scaling out'' programs over the network rely on message passing
\cite{AST_Steen.Distributed_Systems-3ed.2017}. The reasons are twofold:

\begin{enumerate}
    \item {
        Programmers would rather resort to more intricate but more predictable
        approaches to scaling out programs over the network
        \cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies
        manual/controlled data sharding over nodes, separation of compute and
        communication ``stages'' of computation, etc., which benefit
        performance analysis and engineering.
    }
    \item {
        Enterprise applications value the throughput and uptime of relatively
        computationally inexpensive tasks/resources
        \cite{BOOK.Hennessy_Patterson.CArch.2011}, which requires easy
        scalability of tried-and-true, latency-inexpensive applications.
        Studies in transparent DSM systems mostly require exotic, specifically
        written programs to exploit the global address space, which is
        fundamentally at odds with the reusability and flexibility required.
    }
\end{enumerate}

\subsection{PGAS}
\textit{Partitioned Global Address Space} (PGAS) is a parallel programming
model that (1) exposes a global address space to all machines within a network
and (2) makes the distinction between local and remote memory explicit
\cite{De_Wael_etal.PGAS_Survey.2015}. Oftentimes, message-passing frameworks
such as \textit{OpenMPI}, \textit{OpenFabrics}, and \textit{UCX} are used as
backends to provide the PGAS model over various network interfaces/platforms
(e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}
{WEB.HPE.Chapel_Platforms-v1.33.2023}.

Notably, implementing a \emph{global} address space on top of machines already
equipped with their own \emph{local} address spaces (e.g., cluster nodes
running commercial Linux) necessitates a global addressing mechanism for shared
data objects. DART \cite{Zhou_etal.DART-MPI.2014}, for example, utilizes a
128-bit ``global pointer'' that encodes the global memory object/segment ID and
access flags in the upper 64 bits, and a virtual address in the lower 64 bits,
for each (slice of a) memory object allocated within the PGAS model. A
\textit{non-collective} PGAS object is allocated entirely in the allocating
node's local memory but registered globally; consequently, a single global
pointer is recorded in the runtime, with corresponding permission flags, for
the context of some user-defined group of associated nodes. Comparatively, a
\textit{collective} PGAS object is allocated such that a partition of the
object (i.e., a sub-array of its representation) is stored on each of the
associated nodes -- for a $k$-partitioned object, $k$ global pointers are
recorded in the runtime, each pointing to the same object with a different
offset and an (intuitively) independently chosen virtual address. Note that
this design naturally requires virtual addresses within each node to be
\emph{pinned} -- the allocated object cannot be re-addressed to a different
virtual address, which would spontaneously invalidate the global pointer
recording the local virtual address.

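A C sketch of such a 128-bit global pointer follows from the description above;
the exact field widths within the upper 64 bits are our assumption, not
necessarily DART's actual layout:

\begin{verbatim}
#include <stdint.h>

/* DART-style 128-bit global pointer: the upper 64 bits identify the
 * segment and carry flags; the lower 64 bits hold the pinned virtual
 * address on the owning node.  Field widths are illustrative. */
struct global_ptr {
    uint32_t segid;   /* global memory object/segment ID */
    uint16_t flags;   /* access permission flags         */
    uint16_t unit;    /* owning node (illustrative)      */
    uint64_t vaddr;   /* pinned local virtual address    */
};

/* Dereferencing is only valid on the owning node; remote accesses
 * must go through the runtime (e.g., an RDMA get/put). */
static inline void *local_deref(struct global_ptr gp)
{
    return (void *)(uintptr_t)gp.vaddr;
}
\end{verbatim}
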
Similar schemes can be observed in other PGAS backends/runtimes, though some
opt to use a map-like data structure for addressing instead. In general,
although both PGAS and DSM systems provide memory management over remote nodes,
PGAS frameworks provide no transparent caching and transfer of remote memory
objects accessed by local nodes. The programmer is still expected to handle
data/thread movement manually when working with shared memory over the network
in order to maximize their performance metrics of interest.

\subsection{Message Passing}
\textit{Message passing} remains the predominant programming model for
parallelism between loosely coupled nodes within a computer system, just as it
is ubiquitous in supporting all levels of abstraction within any concurrent
component of a computer system. Specific to cluster computing is the
message-passing programming model, in which parallel programs (or instances of
the same parallel program) on different nodes communicate by exchanging
messages over the network. Such models trade programming-model productivity for
more fine-grained control over the messages passed, as well as a more explicit
separation between the communication and computation stages of a programming
subproblem.

Commonly, message-passing backends function as \textit{middlewares} --
communication runtimes -- that aid distributed software development
\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a backend exposes
facilities for inter-application communication to frontend developers while
transparently providing security, accounting, and fault tolerance, much like an
operating system provides resource management, scheduling, and security to
traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}. This is
the case for implementations of the PGAS programming model, which mostly rely
on common message-passing backends to facilitate orchestrated data manipulation
across distributed nodes. Likewise, message-passing backends, including RDMA
APIs, form the backbone of many research-oriented DSM systems
\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network: the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)
\cite{FreeBSD.man-BPF-4.2021} copies the internal message into the
message-passing runtime/middleware's address space; the receiver's middleware
then inspects the copied message and acts on it, likely also copying slices of
message data into some registered distributed shared memory buffer for the
distributed application to access. Despite being a highly intuitive model of
data manipulation over the network, this poses a fundamental performance issue:
both the receiver's kernel and its userspace must proactively exert CPU time
upon the reception of each message in order to move the received bytes from the
NIC to userspace. Because this happens concurrently with other kernel and
userspace routines, a preemptable kernel may incur significant latency if the
kernel routine for packet filtering is pre-empted by another kernel routine,
userspace, or IRQs.

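For contrast, a minimal two-sided receive loop over POSIX sockets (our
illustration; \texttt{handle\_message} stands in for the middleware's dispatch
logic) makes the per-message CPU cost visible:

\begin{verbatim}
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical middleware dispatch: inspects the message and copies
 * payload slices into the registered DSM buffer (a second copy). */
void handle_message(const char *msg, size_t len,
                    char *dsm_buf, size_t buf_len);

void recv_loop(int sock, char *dsm_buf, size_t buf_len)
{
    char msg[4096];
    for (;;) {
        /* recv() itself costs a kernel-to-user copy, and the loop
         * only makes progress when the receiver's CPU is scheduled. */
        ssize_t n = recv(sock, msg, sizeof msg, 0);
        if (n <= 0)
            break;
        handle_message(msg, (size_t)n, dsm_buf, buf_len);
    }
}
\end{verbatim}
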
Comparatively, a ``one-sided'' message-passing scheme such as RDMA allows the
network interface card to bypass in-kernel packet filters and perform DMA on
registered memory regions. The NIC can then notify the CPU via interrupts,
allowing the kernel and userspace programs to run callbacks at reception time
with reduced latency. Because of this advantage, many recent studies leverage
RDMA APIs to improve distributed data workloads and to build DSM middlewares
\cites{Lu_etal.Spark_over_RDMA.2014}{Jia_etal.Tensorflow_over_RDMA.2018}
{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

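As a concrete sketch, posting a one-sided read with the widely used libibverbs
API looks as follows (a minimal fragment that assumes an already-connected
queue pair and registered memory regions; connection setup and completion
polling are omitted):

\begin{verbatim}
#include <stdint.h>
#include <infiniband/verbs.h>

/* Read `len` bytes from remote memory into `local_buf` without
 * involving the remote CPU: the target NIC serves the request
 * directly from its registered memory region. */
int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,   /* completion via CQ */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}
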
% \subsection{Data to Process, or Process to Data?}
% Hypothetically, instead of moving data back-and-forth between nodes within a
% shared storage domain, nodes could instead opt to perform remote procedure
% calls to other nodes which have access to their own share of data and
% acknowledge its completion at return. In the latter case, nodes connected within
% a network exchange task information -- data necessary to (re)construct the task
% in question on a remote node -- which can lead to significantly smaller packets
% than transmitting data over network. Provided that the time necessary to
% reconstruct the task on a remote node is less than the time necessary to
% transmit the data over network.

% Indeed, RPC have been shown
% (TBD -- The former is costly for data-intensive computation, but the latter may
% be impossible for certain tasks, and greatly hardens the replacement problem.)

% \section{Replacement Policy}

% In general, three variants of replacement strategies have been proposed for either
% generic cache block replacement problems, or specific use-cases where contextual
% factors can facilitate more efficient cache resource allocation:
% \begin{itemize}
%     \item General-Purpose Replacement Algorithms, for example LRU.
%     \item Cost-Model Analysis
%     \item Probabilistic and Learned Algorithms
% \end{itemize}

% \subsection{General-Purpose Replacement Algorithms}
% Practically speaking, in the general case of the cache replacement problem,
% we desire to predict the re-reference interval of a cache block
% \cite{Jaleel_etal.RRIP.2010}. This follows from the Belady's algorithm -- the
% optimal case for the \emph{ideal} replacement problem occurs when, at eviction
% time, the entry with the highest re-reference interval is replaced. Under this
% framework, therefore, the commonly-used LRU algorithm could be seen as a heuristic
% where the re-reference interval for each entry is predicted to be immediate.
% Fortunately, memory access traces of real computer systems agree with this
% tendency due to spatial locality \textbf{[source]}. (Real systems are complex,
% however, and there are other behaviors...) On the other hand, the hypothetical
% LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the
% textbook LFU algorithm suffers from needing to maintain a priority-queue for
% frequency analysis, it was nevertheless useful for keeping recurrent (though
% non-recent) blocks from being evicted from the cache \textbf{[source]}.

% Derivatives from the LRU algorithm attempts to balance between frequency and
% recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

% Advancements in parallel/concurrent systems had led to a rediscovery of the benefits
% of using FIFO-derived replacement policies over their LRU/LFU counterparts, as
% book-keeping operations on the uniform LRU/LFU state proves to be (1) difficult
% for synchronization and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
% \textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
% CLOCK here \dots]}

% Finally, real-life experiences have shown the need to reduce CPU time in practical
% applications, owing from one simple observation -- during the fetch-execution
% cycle, all processors perform blocking I/O on the memory. A cache-unfriendly
% design, despite its hypothetical optimality, could nevertheless degrade the performance
% of a system during low-memory situations. In fact, this proves to be the driving
% motivation behind Linux's transition away from the old LRU-2Q page replacement
% algorithm into the more coarse-grained Multi-generation LRU algorithm, which has
% been mainlined since v6.1.

% \subsection{Cost-Model Analysis}
% The ideal case for the replacement problem fails to account for invalidation of
% cache entries. It also assumes for a uniform, dual-hierarchical cache-store model
% that is insufficient to capture the heterogeneity of today's massively-parallel,
% distributed systems. High-speed network interfaces are capable of exposing RDMA
% interfaces between computer nodes, which amount to almost twice as fast RDMA transfer
% when compared to swapping over the kernel I/O stack, while software that bypass
% the kernel I/O stack is capable of stretching the bandwidth advantage even more
% (source). This creates an interesting network topology between RDMA-enabled nodes,
% where, in addition to swapping at low-memory situations, the node may opt to ``swap''
% or simply drop the physical page in order to lessen the cost of page misses.

% \textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

% Traditionally, replacement policies based on cost-model analysis were utilized in
% content-delivery networks, which had different consistency models compared to
% finer-grained systems. HTTP servers need not pertain to strong consistency models,
% as out-of-date information is considered permissible, and single-writer scenarios
% are common. Consequently, most replacement policies for static content servers,
% while making strong distinction towards network topology, fails to concern for the
% cases where an entry might become invalidated, let along multi-writer protocols.
% One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
% page fault frequency as an indicator of preference towards working set inclusion
% (which I personally think is highly flawed -- to be explained). Another paper
% \cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
% page fault into consideration for eviction, but fails to go beyond the obvious
% implication that pages that have been faulted \emph{must} be evicted.

% The concept of cost models for RDMA and NUMA systems are relatively underdeveloped,
% too. (Expand)

% \subsection{Probabilistic and Learned Algorithms for Cache Replacement}
% Finally, machine learning techniques and low-cost probabilistic approaches have
% been applied on the ideal cache replacement problem with some level of success.
% \textbf{[Talk about LeCaR, CACHEUS here]}.

% XXX: I will be writing about replacement as postfix...

\section{Consistency Model and Cache Coherence}
A consistency model specifies a contract on the allowed behaviors of
multi-processing programs with regard to a shared memory
\cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious
conflict, which consistency models aim to resolve, lies in the interaction
between processor-native programs and multiprocessors, all of which need to
operate on a shared memory with heterogeneous cache topologies. Here, a
well-defined consistency model resolves the conflict at an architectural scope.
Beyond consistency models for bare-metal systems, programming languages
\cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}
{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024}
and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for
parallel access to shared memory on top of program-order guarantees, making
program behavior under shared-memory parallel programming explicit across
underlying implementations.

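As a language-level example, C11's \texttt{<stdatomic.h>} lets the programmer
choose an ordering contract per operation; in the release/acquire pair below,
the acquire load that observes the flag synchronizes with the release store, so
the plain write to the payload is guaranteed to be visible (a minimal sketch):

\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

int payload;                 /* plain data, guarded by the flag */
atomic_bool ready;

void producer(void)
{
    payload = 42;            /* ordered before the release store */
    atomic_store_explicit(&ready, true, memory_order_release);
}

void consumer(void)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        /* the acquire load synchronizes-with the release store,
         * so payload reads 42 here */
        (void)payload;
    }
}
\end{verbatim}
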
\subsection{Consistency Model in DSM}

\subsection{Coherence Protocol}

\subsection{DMA and Cache Coherence}

\subsection{Cache Coherence in ARMv8}

(I need to read more into this. Most of the contribution comes from CPU caches,
less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence
protocol.]}

Consistency and communication protocols naturally affect the cost for each
faulted memory access \dots

\textbf{[Talk about directory, transactional, scope, and library cache coherence,
which allow for multi-casted communications at page fault but all with different
levels of book-keeping.]}

\printbibliography

\end{document} |