\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}

\addbibresource{background_draft.bib}

\begin{document}

Though large-scale cluster systems remain the dominant solution for request-level
and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011},
there has been a resurgence of interest in applying HPC techniques (e.g., DSM) for
more efficient heterogeneous computation, with tightly-coupled heterogeneous nodes
providing (hardware) acceleration for one another \cite{Cabezas_etal.GPU-SM.2015}.
\textcolor{red}{[ADD MORE CITATIONS]} Within the scope of one node,
\emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified
view of the entire memory landscape across attached devices
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would in SMP programming, with the underlying complexities of
memory ownership and locality managed by the OS kernel.

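To illustrate the programming model, the following is a minimal C sketch of what
HMM-style unified memory buys the programmer: an ordinary \texttt{malloc}'d buffer
can be handed directly to an accelerator, with the kernel migrating or mapping
pages on demand. The \texttt{accel\_launch} routine is a hypothetical stand-in for
a real device runtime (e.g., a CUDA kernel launch on an HMM-capable system), not an
actual API; it simply runs on the CPU here so the sketch is self-contained.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device runtime launch (e.g., a CUDA kernel
 * launch on an HMM-capable system); it runs on the CPU here so the sketch
 * compiles and runs on its own. */
static void accel_launch(void (*fn)(double *, size_t), double *buf, size_t n)
{
    fn(buf, n);
}

static void scale_by_two(double *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2.0;        /* device touches pages; HMM faults them in */
}

int main(void)
{
    size_t n = 1 << 20;
    double *buf = malloc(n * sizeof *buf);   /* plain libc allocation */
    if (!buf)
        return 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)i;

    /* No explicit host-to-device copy: under HMM the same pointer is
     * valid on the accelerator, and page ownership migrates on demand. */
    accel_launch(scale_by_two, buf, n);

    printf("buf[42] = %f\n", buf[42]);  /* pages migrate back on CPU access */
    free(buf);
    return 0;
}
\end{verbatim}
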
Nevertheless, while HMM promises a distributed-shared-memory approach to
exposing CPU and peripheral memory, applications (drivers and front-ends) that
exploit HMM to provide ergonomic programming models remain fragmented and
narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus
on exposing a global address space abstraction over GPU memory -- a largely
uncoordinated effort spanning both \textit{in-tree} and proprietary code
\cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}.
Little effort has gone into incorporating HMM into other kinds of accelerators
in various system topologies.

Orthogonally, allocating hardware accelerator resources in a cluster computing
environment becomes difficult when the acceleration resources required by a
workload cannot be easily determined and/or isolated. A cluster may contain a
large number of general-purpose worker nodes and a limited number of
hardware-accelerated nodes. Further, it is possible that every workload on the
cluster wants hardware acceleration from time to time, but never for very long.
Many job-scheduling mechanisms within a cluster
\emph{move data near computation} by migrating the entire job/container between
general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}
{Oh_Kim.Container_Migration.2018}. Migrating this way naturally incurs a
large overhead -- for starters, accelerator nodes that strictly perform in-memory
computing, without ever touching the container's filesystem, should not have to
install the entire filesystem locally. Moreover, must \emph{all} computation be
near data? \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over
fast network interfaces ($25 \times 8$ Gbps) has negligible impact on tail latencies
but a high impact on throughput when bandwidth is saturated.

This thesis builds upon an ongoing research effort to implement a tightly coupled
cluster in which HMM abstractions give accelerator nodes transparent RDMA access
to a home node's data, as well as data migration near computation. It focuses
on the effect of replacement policies on balancing the cost of near-data and
far-data computation between the home node and the accelerator node. \textcolor{red}{
Specifically, this paper explores the possibility of implementing shared page
movement between home and accelerator nodes to enable efficient memory over-commit
without I/O-intensive swapping overhead.}

\textcolor{red}{The rest of the chapter is structured as follows\dots}

\section{Experiences from Software DSM}
The majority of contributions to the study of software DSM systems come from the
1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}
{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These
developments follow from the success of the Stanford DASH project in the late
1980s -- a hardware distributed shared memory (i.e., NUMA) implementation of a
multiprocessor that first proposed the \textit{directory-based protocol} for
cache coherence, which stores the ownership information of cache lines to reduce
the unnecessary communication that prevented SMP processors from scaling out
\cite{Lenoski_etal.Stanford_DASH.1992}.

While developments in hardware DSM matured into the universal approach to
cache coherence in contemporary many-core processors (e.g., the \textit{Ampere
Altra}\cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSM for clustered
computing languished in favor of loosely-coupled nodes performing data-parallel
computation and communicating via message passing. The bandwidth of late-1990s
network interfaces was insufficient to support the high traffic
incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}
{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

Newer network interfaces provide much-improved bandwidth and latency compared to
the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve
training efficiency sixfold over RPC-based distributed TensorFlow, while
scaling positively relative to non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for Spark\cite{Lu_etal.Spark_over_RDMA.2014}
\textcolor{red}{and what?}. Consequently, there has been a resurgence of interest
in software DSM systems and their corresponding programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
% i.e., we do it in kernel as opposed to in userspace. Accelerator node can access
% local node's shared page like a DMA device do so via HMM.

\subsection{Munin: Multiple Consistency Protocols}
\textit{Munin}\cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the earlier
software DSM systems. The authors of Munin identify that
\textit{false sharing} -- multiple processors writing to different offsets of the
same page and thereby triggering invalidations -- is strongly detrimental to the
performance of shared-memory systems. To combat this, Munin exposes annotations
as part of its programming model to facilitate multiple consistency protocols on
top of release consistency. A shared memory object that is immutable across readers,
for example, can be safely replicated without concern for coherence between processors.
The \textit{write-shared} annotation, on the other hand, declares that a memory
object is written by multiple processors without synchronization -- i.e., the
programmer guarantees that only false sharing occurs at this granularity.
Annotations such as these explicitly disable subsets of the consistency procedures
to reduce communication over the network fabric, thereby improving the performance
of the DSM system.

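As a concrete illustration of the false-sharing problem Munin targets (this C
sketch is illustrative only, not Munin's API): two threads repeatedly write to
\emph{different} words that happen to live on the same page (and, here, the same
cache line). No data is actually shared, yet a page-granularity invalidation
protocol would bounce ownership of the page between the two writers on every
update.

\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

/* Two counters that share one page (and one cache line). Neither thread
 * reads the other's counter, yet page- or line-granularity coherence
 * ping-pongs ownership between the writers on every store. */
struct { long a; long b; } shared_page;

static void *writer_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000L; i++)
        shared_page.a++;            /* offset 0 of the page */
    return NULL;
}

static void *writer_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000L; i++)
        shared_page.b++;            /* offset 8 of the same page */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, writer_a, NULL);
    pthread_create(&tb, NULL, writer_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("%ld %ld\n", shared_page.a, shared_page.b);
    return 0;
}
\end{verbatim}

Munin's \textit{write-shared} annotation tells the runtime that such concurrent
writes are intentional and disjoint, so updates can be buffered and merged at
release time instead of invalidating the page on every store.
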
Perhaps most importantly, experience with Munin shows that \emph{restricting the
flexibility of the programming model can lead to more performant coherence models}, as
\textcolor{teal}{corroborated} by the now-foundational
\textit{Resilient Distributed Datasets} (RDD) paper \cite{Zaharia_etal.RDD.2012} --
the abstraction powering \textit{Apache Spark}\cite{WEB.APACHE..Apache_Spark.2023},
a now-popular scalable data processing framework in the lineage of
\textit{Hadoop MapReduce}\cite{WEB.APACHE..Apache_Hadoop.2023}. ``To achieve fault
tolerance efficiently, RDDs provide a restricted form of shared memory
[based on]\dots transformations rather than\dots updates to shared state''
\cite{Zaharia_etal.RDD.2012}. This allows transformation logs to be used to
cheaply synchronize state between unshared address spaces -- a much desired
property for highly scalable, loosely-coupled clustered systems.

\subsection{Treadmarks: Multi-Writer Protocol}
\textit{Treadmarks}\cite{Amza_etal.Treadmarks.1996} is a software DSM system
presented in 1996. It provides shared memory at page granularity on top of
commodity networks, combining \emph{lazy release consistency} with a
\emph{multi-writer protocol}: before a processor writes to a shared page it
creates a \emph{twin} (a copy) of that page, and at synchronization time it
produces a \emph{diff} of the modified page against the twin. Concurrent writers
to the same page exchange and merge these diffs, so false sharing no longer
forces the page to ping-pong between single owners.

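The following is a schematic C sketch of the twin/diff idea (an illustration,
not Treadmarks' actual implementation): a twin is captured before the first
write, and at release time a word-by-word comparison yields a compact diff that
can be shipped to other holders of the page.

\begin{verbatim}
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define WORDS (PAGE_SIZE / sizeof(long))

/* One (offset, new value) pair per modified word of the page. */
struct diff_entry { size_t offset; long value; };

/* Capture a twin before the first write to the page. */
long *make_twin(const long *page)
{
    long *twin = malloc(PAGE_SIZE);
    if (twin)
        memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* At release time, compare the dirty page against its twin and emit only
 * the words that changed; the diff is what gets sent over the wire. */
size_t make_diff(const long *page, const long *twin, struct diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < WORDS; i++) {
        if (page[i] != twin[i]) {
            out[n].offset = i;
            out[n].value  = page[i];
            n++;
        }
    }
    return n;
}

/* A receiver applies the diff to its own copy; diffs from different
 * writers touch disjoint words when only false sharing occurred. */
void apply_diff(long *page, const struct diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        page[d[i].offset] = d[i].value;
}

int main(void)
{
    long page[WORDS] = { 0 };
    long *twin = make_twin(page);
    struct diff_entry diff[WORDS];

    page[3]  = 42;               /* writes after the twin was captured */
    page[70] = 7;

    size_t n = make_diff(page, twin, diff);
    apply_diff(page, diff, n);   /* ship 2 entries, not the whole 4 KiB page */
    free(twin);
    return 0;
}
\end{verbatim}

A real implementation additionally write-protects pages to detect the first
write and garbage-collects old diffs, which the sketch omits.
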
% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

\section{HPC and Partitioned Global Address Space}
Improvements in NIC bandwidth and transfer rate allow for applications that expose
a global address space, as well as RDMA technologies that leverage single-writer
protocols over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
technologies, for example OpenSHMEM, Open MPI, Cray Chapel, etc., that leverage
specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.

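As a minimal taste of the PGAS style (a sketch only, using the standard OpenSHMEM
C API with error handling omitted), each processing element allocates a buffer
from the symmetric heap and writes into a peer's copy with a one-sided put; the
target executes no matching receive call.

\begin{verbatim}
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();     /* this processing element's rank */
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE gets a buffer at the same symmetric
     * address, which is what makes one-sided puts possible. */
    long *slot = shmem_malloc(sizeof *slot);
    *slot = -1;
    shmem_barrier_all();

    /* One-sided write into the right-hand neighbour's memory. */
    long val = 1000 + me;
    shmem_long_put(slot, &val, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d of %d received %ld\n", me, npes, *slot);

    shmem_free(slot);
    shmem_finalize();
    return 0;
}
\end{verbatim}
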
Contemporary work on DSM systems focuses more on leveraging hardware advancements
to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
for example, implements a complex system for memory disaggregation across multiple
compute nodes connected via a \textit{ThymesisFlow}-based RDMA fabric, and observes
significant performance improvements for existing data-intensive processing
frameworks (e.g., Apache Spark, Memcached, and Redis) over non-disaggregated
configurations (i.e., using node-local memory only, as in conventional cluster
computing).

\subsection{Programming Model}

\subsection{Move Data to Process, or Move Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks, and it greatly complicates the replacement problem.)

\section{Replacement Policy}

In general, three families of replacement strategies have been proposed, either
for the generic cache block replacement problem or for specific use-cases where
contextual factors facilitate more efficient cache resource allocation:
\begin{itemize}
\item General-Purpose Replacement Algorithms (e.g., LRU);
\item Cost-Model Analysis;
\item Probabilistic and Learned Algorithms.
\end{itemize}

\subsection{General-Purpose Replacement Algorithms}
Practically speaking, in the general case of the cache replacement problem,
we wish to predict the re-reference interval of a cache block
\cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- in the
\emph{ideal} replacement problem, the optimum is achieved when, at eviction
time, the entry with the longest re-reference interval (i.e., the one reused
furthest in the future) is replaced. Under this framework, the commonly-used LRU
algorithm can be seen as a heuristic that predicts a near-immediate re-reference
interval for every recently-used entry. Fortunately, memory access traces of real
computer systems largely agree with this tendency due to temporal locality
\textbf{[source]}. (Real systems are complex, however, and exhibit other
behaviors as well...) The LFU algorithm, on the other hand, is a heuristic that
captures frequency. \textbf{[\dots]} While the textbook LFU algorithm suffers
from having to maintain a priority queue for frequency bookkeeping, it is
nevertheless useful for keeping recurrent (though not recently used) blocks from
being evicted from the cache \textbf{[source]}.

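To make the ``recency as prediction'' reading concrete, below is a minimal LRU
sketch in C (a toy with $O(n)$ eviction over a small fixed-size cache, not an
optimized implementation): each hit refreshes an entry's timestamp, and the
victim is the entry whose last use is oldest -- i.e., the one LRU predicts will
be re-referenced furthest in the future.

\begin{verbatim}
#include <stdio.h>

#define CACHE_SLOTS 4

struct entry {
    int  key;        /* block identifier; -1 means the slot is empty */
    long last_used;  /* logical timestamp of the most recent access */
};

static struct entry cache[CACHE_SLOTS] = {
    { -1, 0 }, { -1, 0 }, { -1, 0 }, { -1, 0 }
};
static long now = 0;

/* Access one block; returns 1 on a hit, 0 on a miss (with eviction). */
int access_block(int key)
{
    int victim = 0;

    now++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].key == key) {          /* hit: refresh recency */
            cache[i].last_used = now;
            return 1;
        }
        /* Track the least-recently-used (or empty) slot as the victim. */
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    }

    /* Miss: evict the entry whose last use is oldest, i.e., the one
     * predicted to have the longest re-reference interval. */
    cache[victim].key = key;
    cache[victim].last_used = now;
    return 0;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 4, 5, 1, 2 };
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("block %d: %s\n", trace[i],
               access_block(trace[i]) ? "hit" : "miss");
    return 0;
}
\end{verbatim}
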
Derivatives of the LRU algorithm attempt to balance frequency against recency.
\textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

Advancements in parallel/concurrent systems have led to a rediscovery of the benefits
of FIFO-derived replacement policies over their LRU/LFU counterparts, as
bookkeeping operations on the shared LRU/LFU state prove to be (1) difficult
to synchronize and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
\textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
CLOCK here \dots]}

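For reference, here is a minimal CLOCK (second-chance FIFO) sketch in C: a
reference bit approximates recency, and eviction only ever advances a ring
pointer, which is why FIFO-style policies are friendlier to concurrent
implementations than a strict LRU list.

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>

#define FRAMES 4

static int  page[FRAMES] = { -1, -1, -1, -1 };   /* -1 = empty frame */
static bool refbit[FRAMES];                      /* "second chance" bit */
static int  hand = 0;                            /* clock hand position */

/* Access a page; returns 1 on a hit, 0 on a miss (with replacement). */
int clock_access(int p)
{
    for (int i = 0; i < FRAMES; i++) {
        if (page[i] == p) {        /* hit: set the reference bit only */
            refbit[i] = true;
            return 1;
        }
    }

    /* Miss: sweep the hand, clearing reference bits, until a frame
     * without a second chance is found; replace it and move on. */
    while (refbit[hand]) {
        refbit[hand] = false;
        hand = (hand + 1) % FRAMES;
    }
    page[hand]   = p;
    refbit[hand] = true;
    hand = (hand + 1) % FRAMES;
    return 0;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 4, 5, 1, 2 };
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("page %d: %s\n", trace[i],
               clock_access(trace[i]) ? "hit" : "miss");
    return 0;
}
\end{verbatim}
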
Finally, real-life experience has shown the need to reduce the CPU time spent in
the replacement policy itself, owing to one simple observation -- during the
fetch-execute cycle, every processor performs blocking I/O on memory. A
cache-unfriendly design, despite its hypothetical optimality, can therefore
degrade the performance of a system under memory pressure. In fact, this proved
to be the driving motivation behind Linux's transition away from its old 2Q-like
(active/inactive list) page replacement algorithm to the more coarse-grained
Multi-Generational LRU (MGLRU), which has been mainlined since v6.1.

\subsection{Cost-Model Analysis}
The ideal formulation of the replacement problem fails to account for the
invalidation of cache entries. It also assumes a uniform, two-level
cache/backing-store model that is insufficient to capture the heterogeneity of
today's massively-parallel, distributed systems. High-speed network interfaces
can expose RDMA between compute nodes, with transfers almost twice as fast as
swapping over the kernel I/O stack, while software that bypasses the kernel I/O
stack can stretch the bandwidth advantage even further \textbf{[source]}. This
creates an interesting topology among RDMA-enabled nodes where, in addition to
swapping to disk under memory pressure, a node may opt to ``swap'' a page to a
peer's memory or simply drop the physical page in order to lessen the cost of
page misses.

\textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

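As a reference point for the cost-model discussion, below is a compact C sketch
of the classic GreedyDual-Size idea (a generic illustration, not pseudocode from
any particular paper): each entry carries a credit equal to its refetch cost
divided by its size plus a global inflation offset; the cheapest-to-refetch
entry is evicted, and a hit restores the entry's credit, so expensive-to-fetch
pages (e.g., those that must cross the RDMA fabric) survive longer.

\begin{verbatim}
#include <stdio.h>

#define SLOTS 4

struct gd_entry {
    int    key;      /* object identifier; -1 = empty slot */
    double h;        /* remaining credit, initially cost / size */
    double cost;     /* cost to refetch (e.g., remote vs. local) */
    double size;
};

static struct gd_entry cache[SLOTS] = {
    { -1, 0, 0, 1 }, { -1, 0, 0, 1 }, { -1, 0, 0, 1 }, { -1, 0, 0, 1 }
};
static double inflation = 0.0;   /* the "L" offset in the usual formulation */

/* Access an object with a given refetch cost and size. */
void gd_access(int key, double cost, double size)
{
    int victim = 0;

    for (int i = 0; i < SLOTS; i++) {
        if (cache[i].key == key) {
            /* Hit: restore full credit on top of the current inflation. */
            cache[i].h = inflation + cache[i].cost / cache[i].size;
            return;
        }
        if (cache[i].h < cache[victim].h)
            victim = i;          /* remember the cheapest entry so far */
    }

    /* Miss: evict the minimum-credit entry and raise the inflation
     * offset to its credit, so everyone else ages relative to it. */
    if (cache[victim].key != -1)
        inflation = cache[victim].h;

    cache[victim].key  = key;
    cache[victim].cost = cost;
    cache[victim].size = size;
    cache[victim].h    = inflation + cost / size;
}

int main(void)
{
    gd_access(1, 10.0, 1.0);     /* expensive remote page */
    gd_access(2,  1.0, 1.0);     /* cheap local pages */
    gd_access(3,  1.0, 1.0);
    gd_access(4,  1.0, 1.0);
    gd_access(5,  1.0, 1.0);     /* evicts a cheap page, not page 1 */
    for (int i = 0; i < SLOTS; i++)
        printf("slot %d: key %d, h %.2f\n", i, cache[i].key, cache[i].h);
    return 0;
}
\end{verbatim}
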
Traditionally, replacement policies based on cost-model analysis were employed in
content-delivery networks, which have different consistency models from
finer-grained systems. HTTP servers need not adhere to strong consistency models,
as out-of-date information is considered permissible and single-writer scenarios
are common. Consequently, most replacement policies for static content servers,
while drawing strong distinctions based on network topology, fail to account for
cases where an entry might become invalidated, let alone for multi-writer protocols.
One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
page fault frequency as an indicator of preference for working-set inclusion
(which I personally think is highly flawed -- to be explained). Another paper
\cite{Aguilar_Leiss.Coherence-Replacement.2006} explores taking page faults into
consideration for eviction, but fails to go beyond the obvious implication that
pages that have been faulted \emph{must} be evicted.

The concept of cost models for RDMA and NUMA systems is also relatively
underdeveloped. (Expand)

\subsection{Probabilistic and Learned Algorithms for Cache Replacement}
Finally, machine-learning techniques and low-cost probabilistic approaches have
been applied to the cache replacement problem with some success.
\textbf{[Talk about LeCaR, CACHEUS here]}.

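To give a flavour of the approach, the following is a heavily simplified C sketch
of the regret-based weight update at the heart of LeCaR-style policies (an
illustration of the multiplicative-weights idea only, not the published algorithm
in full): the cache keeps weights for two expert policies (LRU and LFU), samples
one of them whenever an eviction is needed, and boosts the weight of whichever
expert would have avoided a miss that was later regretted.

\begin{verbatim}
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Weights over two expert eviction policies. */
static double w_lru = 0.5, w_lfu = 0.5;
static const double learning_rate = 0.45;

/* Pick which expert performs the next eviction, proportionally to weight. */
int pick_expert(void)
{
    double r = (double)rand() / RAND_MAX;
    return r < w_lru / (w_lru + w_lfu) ? 0 /* LRU */ : 1 /* LFU */;
}

/* Called when a missed page is found in one expert's eviction history:
 * that expert made a regrettable decision, so shift weight to the other. */
void apply_regret(int expert_at_fault)
{
    if (expert_at_fault == 0)
        w_lfu *= exp(learning_rate);   /* LRU evicted it; trust LFU more */
    else
        w_lru *= exp(learning_rate);

    double total = w_lru + w_lfu;      /* renormalize */
    w_lru /= total;
    w_lfu /= total;
}

int main(void)
{
    /* Toy run: two regretted LRU evictions shift weight towards LFU. */
    apply_regret(0);
    apply_regret(0);
    printf("w_lru = %.3f, w_lfu = %.3f (next eviction by expert %d)\n",
           w_lru, w_lfu, pick_expert());
    return 0;
}
\end{verbatim}

The published algorithms additionally discount the regret by how long the page
sat in the eviction history, which the sketch omits.
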
\section{Cache Coherence and Consistency in DSM Systems}

(I need to read more into this. Most of the contributions come from CPU caches,
fewer from DSM systems.) \textbf{[Talk about JIAJIA and Treadmarks' coherence
protocols.]}

Consistency and communication protocols naturally affect the cost of each faulted
memory access \dots

\textbf{[Talk about directory, transactional, scope, and library cache coherence,
which allow for multicast communication at page-fault time, but with different
levels of book-keeping.]}

\printbibliography

\end{document}