\documentclass{article}
\usepackage{biblatex}
\addbibresource{background_draft.bib}

\begin{document}

% \chapter{Backgrounds}

Recent studies have shown a reinvigorated interest in the disaggregated/distributed
shared memory systems last seen in the 1990s. While large-scale cluster systems
remain the predominant solution for massively parallel computation, it is known
to \dots

The interplay between (page) replacement policy and the runtime performance of
distributed shared memory systems has not been properly explored.

\section{Overview of Distributed Shared Memory}

A striking feature in the study of distributed shared memory (DSM) systems is the
non-uniformity of the terminology used to describe overlapping research interests.
The majority of contributions to DSM research date from the 1990s, for example
\textbf{[TreadMarks, Millipede, Munin, Shiva, etc.]}. These DSM systems attempted to
leverage kernel system calls to provide user-level DSM over Ethernet NICs. While
these systems provide a strong theoretical basis for today's majority-software
DSM systems and for applications that expose a \emph{(partitioned) global address space},
they were nevertheless constrained by the limitations of NIC transfer rate and
bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

Improvements in NIC bandwidth and transfer rates allow for applications that expose
a global address space, as well as for RDMA technologies that leverage single-writer
protocols over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
technologies, for example OpenSHMEM, Open MPI, Cray Chapel, etc., that leverage
specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
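
As a minimal local illustration of the shared-segment mechanics above (a sketch of
my own, not drawn from any cited system): on Linux, Python's
\texttt{multiprocessing.shared\_memory} backs its segments with files under
\texttt{/dev/shm}, so two processes can map and mutate the same bytes.

\begin{verbatim}
# Hypothetical sketch: a /dev/shm-backed segment shared across processes.
# RDMA fabrics expose a remote analogue of this mapping; this toy is local.
from multiprocessing import Process, shared_memory

def writer(name):
    seg = shared_memory.SharedMemory(name=name)  # attach by name
    seg.buf[:5] = b"hello"       # visible to every process mapping it
    seg.close()

if __name__ == "__main__":
    seg = shared_memory.SharedMemory(create=True, size=4096)
    p = Process(target=writer, args=(seg.name,))
    p.start(); p.join()
    print(bytes(seg.buf[:5]))    # b'hello'
    seg.close(); seg.unlink()    # remove the /dev/shm backing file
\end{verbatim}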

Contemporary work on DSM systems focuses more on leveraging hardware advancements
to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
for example, implements a complex system for memory disaggregation over multiple
compute nodes connected via a \textit{ThymesisFlow}-based RDMA fabric; the authors
observed significant performance improvements for data-intensive processing
frameworks such as Apache Spark, Memcached, and Redis, compared to
no-disaggregation (i.e., node-local memory only, similar to cluster computing)
systems.

\subsection{Move Data to Process, or Move Process to Data?}

(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks, and greatly complicates the replacement problem.)

\section{Replacement Policy}

In general, three variants of replacement strategy have been proposed, either for
the generic cache block replacement problem or for specific use cases where contextual
factors can facilitate more efficient cache resource allocation:
\begin{itemize}
\item General-Purpose Replacement Algorithms, for example LRU.
\item Cost-Model Analysis.
\item Probabilistic and Learned Algorithms.
\end{itemize}

\subsection{General-Purpose Replacement Algorithms}

Practically speaking, in the general case of the cache replacement problem,
we wish to predict the re-reference interval of a cache block
\cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm: the
optimal policy for the \emph{ideal} replacement problem evicts, at replacement
time, the entry with the longest re-reference interval. Under this
framework, the commonly-used LRU algorithm can be seen as a heuristic that
predicts a near-immediate re-reference interval for recently-used entries, so
that the least-recently-used entry carries the longest predicted interval.
Fortunately, memory access traces of real computer systems agree with this
tendency due to temporal locality \textbf{[source]}. (Real systems are complex,
however, and there are other behaviors...) On the other hand, the
LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the
textbook LFU algorithm suffers from having to maintain a priority queue for
frequency book-keeping, it is nevertheless useful for keeping recurrent (though
non-recent) blocks from being evicted from the cache \textbf{[source]}.
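
To make the framing concrete, below is a minimal, hypothetical Python sketch (the
helper names and toy trace are mine, not from \cite{Jaleel_etal.RRIP.2010})
contrasting Belady's oracle, which evicts the resident block with the longest
re-reference interval, with LRU, which approximates that interval by recency.

\begin{verbatim}
# Toy comparison of Belady's MIN (oracle) and LRU on one trace.
# Hypothetical sketch: a cache of `size` block IDs; we count hits.
from collections import OrderedDict

def belady_hits(trace, size):
    cache, hits = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
            continue
        if len(cache) == size:
            def next_use(b):  # longest re-reference interval wins
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float("inf")   # never referenced again
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return hits

def lru_hits(trace, size):
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)   # refresh recency on a hit
            continue
        if len(cache) == size:
            cache.popitem(last=False)  # evict least-recently-used
        cache[block] = None
    return hits

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(belady_hits(trace, 3), lru_hits(trace, 3))   # 5 4
\end{verbatim}

On this trace the oracle scores five hits to LRU's four: at the penultimate access,
LRU evicts block 4 just before its re-reference, while the oracle instead discards
a block that is never referenced again.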

Derivatives of the LRU algorithm attempt to balance frequency and
recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}
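
As a sketch of the flavor of this family (an illustrative toy of my own, not any
paper's reference implementation), LRU-2 ranks pages by their second-most-recent
reference, so pages touched only once rank below recurrent pages at eviction time.

\begin{verbatim}
# Hypothetical LRU-2 sketch: evict the page with the oldest second-
# most-recent reference; pages seen only once (prev == -1) go first.
import itertools

class LRU2:
    def __init__(self, size):
        self.size = size
        self.clock = itertools.count()
        self.hist = {}               # page -> (prev_time, last_time)

    def access(self, page):
        t = next(self.clock)
        if page in self.hist:
            self.hist[page] = (self.hist[page][1], t)
            return True              # hit
        if len(self.hist) >= self.size:
            victim = min(self.hist, key=lambda p: self.hist[p][0])
            del self.hist[victim]
        self.hist[page] = (-1, t)    # -1: no second reference yet
        return False                 # miss
\end{verbatim}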

Advancements in parallel/concurrent systems have led to a rediscovery of the benefits
of using FIFO-derived replacement policies over their LRU/LFU counterparts, as
book-keeping operations on the uniform LRU/LFU state prove to be (1) difficult
to synchronize and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
\textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
CLOCK here \dots]}
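
As a concrete illustration of why this family is friendlier to concurrency (a
minimal sketch of classic CLOCK, not the specific designs named above): a hit
merely sets a per-entry reference bit, so the hot path never reorders shared state.

\begin{verbatim}
# Hypothetical sketch of CLOCK (FIFO with a second chance): a hit only
# sets a reference bit, so no shared list is reordered on the hot path.
class Clock:
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size   # resident block IDs
        self.ref = [False] * size    # "second chance" bits
        self.hand = 0
        self.index = {}              # block ID -> slot

    def access(self, block):
        slot = self.index.get(block)
        if slot is not None:
            self.ref[slot] = True    # hit: flip a bit, no list surgery
            return True
        # Miss: advance the hand, clearing ref bits, until a victim
        # with ref == False is found.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % self.size
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]
        self.slots[self.hand] = block
        self.index[block] = self.hand
        self.ref[self.hand] = False
        self.hand = (self.hand + 1) % self.size
        return False

c = Clock(3)
for b in [1, 2, 3, 1, 4]:
    c.access(b)   # the final miss evicts 2; 1 survives via its ref bit
\end{verbatim}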

Finally, real-life experience has shown the need to reduce CPU time in practical
applications, owing to one simple observation: during the fetch-execute
cycle, all processors perform blocking I/O on memory. A cache-unfriendly
design, despite its hypothetical optimality, can nevertheless degrade the performance
of a system in low-memory situations. In fact, this proved to be the driving
motivation behind Linux's transition away from its old 2Q-like page replacement
algorithm to the more coarse-grained Multi-Generational LRU (MGLRU), which has
been mainlined since v6.1.

\subsection{Cost-Model Analysis}

The ideal case for the replacement problem fails to account for invalidation of
cache entries. It also assumes a uniform, two-level cache-and-store model
that is insufficient to capture the heterogeneity of today's massively-parallel,
distributed systems. High-speed network interfaces are capable of exposing RDMA
interfaces between compute nodes; RDMA transfers can be almost twice as fast as
swapping over the kernel I/O stack, while software that bypasses the kernel I/O
stack can stretch the bandwidth advantage even further \textbf{[source]}.
This creates an interesting network topology between RDMA-enabled nodes,
where, in addition to swapping in low-memory situations, a node may opt to ``swap''
the physical page or simply drop it in order to lessen the cost of page misses.

\textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}
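
As a sketch of the cost-aware flavor (a toy rendering of Young's GreedyDual; the
cost values are invented for illustration): each resident page carries a credit
equal to its miss cost, eviction takes the minimum credit, and the usual
subtract-from-all step is replaced by a global inflation offset.

\begin{verbatim}
# Hypothetical GreedyDual sketch: credits encode asymmetric miss costs
# (e.g., refetch over the RDMA fabric vs. a cheap local drop-and-refault).
import heapq

class GreedyDual:
    def __init__(self, size):
        self.size = size
        self.L = 0.0       # inflation offset (the "clock")
        self.H = {}        # page -> current credit (absolute)
        self.heap = []     # (credit, page), possibly stale

    def access(self, page, cost):
        if page not in self.H and len(self.H) >= self.size:
            self._evict()
        self.H[page] = self.L + cost   # insert, or restore on a hit
        heapq.heappush(self.heap, (self.H[page], page))

    def _evict(self):
        while True:
            h, page = heapq.heappop(self.heap)
            if self.H.get(page) == h:  # skip stale heap entries
                self.L = h             # charge the evicted credit
                del self.H[page]
                return

cache = GreedyDual(2)
cache.access("remote_pg", cost=2.0)  # expensive to refetch over fabric
cache.access("local_pg", cost=1.0)   # cheap to drop and refault
cache.access("new_pg", cost=1.0)     # evicts local_pg: lowest credit
\end{verbatim}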

Traditionally, replacement policies based on cost-model analysis were utilized in
content-delivery networks, which have different consistency models compared to
finer-grained systems. HTTP servers need not adhere to strong consistency models,
as out-of-date information is considered permissible, and single-writer scenarios
are common. Consequently, most replacement policies for static content servers,
while making strong distinctions with respect to network topology, fail to account
for cases where an entry might become invalidated, let alone multi-writer protocols.
One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
page fault frequency as an indicator of preference towards working-set inclusion
(which I personally think is highly flawed -- to be explained). Another paper
\cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
page faults into consideration for eviction, but fails to go beyond the obvious
implication that pages that have been faulted \emph{must} be evicted.

Cost models for RDMA and NUMA systems are likewise relatively underdeveloped.
(Expand)

\subsection{Probabilistic and Learned Algorithms for Cache Replacement}

Finally, machine learning techniques and low-cost probabilistic approaches have
been applied to the ideal cache replacement problem with some level of success.
\textbf{[Talk about LeCaR, CACHEUS here]}.
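
As a sketch of the adaptive idea behind this family (loosely modeled on LeCaR; the
constants and structure here are illustrative, not the published algorithm): two
candidate policies are sampled by weight, and a policy is penalized whenever a
block it evicted is re-referenced.

\begin{verbatim}
# Hypothetical LeCaR-style sketch: sample LRU vs. LFU by weight and
# apply multiplicative "regret" updates on ghost hits. Ghost histories
# are left unbounded here for brevity.
import random
from collections import OrderedDict, Counter

class LeCaRish:
    def __init__(self, size, lr=0.45, decay=0.99):
        self.size, self.lr, self.decay = size, lr, decay
        self.cache = OrderedDict()     # key -> None, in recency order
        self.freq = Counter()          # frequencies for LFU decisions
        self.ghost_lru, self.ghost_lfu = {}, {}  # who evicted what
        self.w = [0.5, 0.5]            # [P(use LRU), P(use LFU)]

    def access(self, key):
        self.freq[key] += 1
        if key in self.cache:
            self.cache.move_to_end(key)
            return True
        self._learn(key)
        if len(self.cache) >= self.size:
            self._evict()
        self.cache[key] = None
        return False

    def _learn(self, key):
        # Shrink the weight of whichever policy evicted this key.
        if key in self.ghost_lru:
            self.w[0] *= self.decay ** self.lr
            del self.ghost_lru[key]
        elif key in self.ghost_lfu:
            self.w[1] *= self.decay ** self.lr
            del self.ghost_lfu[key]
        total = self.w[0] + self.w[1]
        self.w = [self.w[0] / total, self.w[1] / total]

    def _evict(self):
        if random.random() < self.w[0]:
            victim, _ = self.cache.popitem(last=False)  # LRU victim
            self.ghost_lru[victim] = True
        else:
            victim = min(self.cache, key=lambda k: self.freq[k])
            del self.cache[victim]                      # LFU victim
            self.ghost_lfu[victim] = True
\end{verbatim}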
\section{Cache Coherence and Consistency in DSM Systems}

(I need to read more into this. Most of the contributions come from CPU caches,
less so from DSM systems.) \textbf{[Talk about JIAJIA and TreadMarks' coherence
protocol.]}

Consistency and communication protocols naturally affect the cost of each faulted
memory access \dots

\textbf{[Talk about directory, transactional, scope, and library cache coherence,
which allow for multicast communication at page fault but with different
levels of book-keeping.]}
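
As a placeholder illustration of the directory variant (a toy write-invalidate
directory of my own; real DSM protocols such as TreadMarks' are far more involved):
the directory tracks each page's sharer set, a read fault joins the set, and a
write fault multicasts invalidations to the remaining sharers.

\begin{verbatim}
# Toy write-invalidate directory: one entry per page tracks the sharer
# set; writes invalidate every other sharer before taking ownership.
class Directory:
    def __init__(self):
        self.sharers = {}            # page -> set of node IDs
        self.owner = {}              # page -> writing node, if any

    def read_fault(self, page, node):
        self.sharers.setdefault(page, set()).add(node)
        return self.owner.get(page)  # fetch data from owner or home

    def write_fault(self, page, node):
        invalidees = self.sharers.get(page, set()) - {node}
        self.sharers[page] = {node}  # multicast invalidations to these
        self.owner[page] = node
        return invalidees

d = Directory()
d.read_fault("p", node=1); d.read_fault("p", node=2)
print(d.write_fault("p", node=1))   # {2}: invalidate node 2's copy
\end{verbatim}
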
\printbibliography
\end{document} |