commit 8a7d5e7e5a (parent b3c3ec961b)
3 changed files with 194 additions and 111 deletions
@@ -390,3 +390,45 @@
  author={The FreeBSD Project},
  year={2021}
}

@book{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020,
  title={A primer on memory consistency and cache coherence},
  author={Nagarajan, Vijay and Sorin, Daniel J. and Hill, Mark D. and Wood, David A.},
  year={2020},
  publisher={Springer Nature}
}

@misc{ISO/IEC_9899:2011.C11,
  title={BS ISO/IEC 9899:2011: Information technology. Programming languages. C},
  abstract={Edition Status: Withdrawn on 2018-07-13},
  isbn={9780580801655},
  keywords={Data processing; Data representation; Languages used in information technology; Programming; Programming languages; Semantics; Syntax},
  language={eng},
  publisher={British Standards Institute},
  year={2013}
}

@misc{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007,
  title={C++ Atomic Types and Operations},
  author={Boehm, Hans J. and Crowl, Lawrence},
  url={https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html},
  publisher={ISO/IEC JTC 1},
  year={2007}
}

@misc{Rust.core::sync::atomic::Ordering.2024,
  title={Ordering in core::sync::atomic - Rust},
  url={https://doc.rust-lang.org/core/sync/atomic/enum.Ordering.html},
  journal={The Rust Core Library},
  publisher={The Rust Team},
  year={2024}
}

@misc{Manson_Goetz.JSR_133.Java_5.2004,
  title={JSR 133 (Java Memory Model) FAQ},
  author={Manson, Jeremy and Goetz, Brian},
  url={https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html},
  publisher={Department of Computer Science, University of Maryland},
  year={2004}
}
Binary file not shown.
@@ -342,119 +342,160 @@ form the backbone of many research-oriented DSM systems
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message-passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network -- the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)
\cite{FreeBSD.man-BPF-4.2021} copies the internal message into the
message-passing runtime/middleware's address space; the receiver's middleware
inspects the copied message and acts on it accordingly, likely also copying
slices of message data into some registered distributed shared memory buffer
for the distributed application to access. Despite being a highly intuitive
model of data manipulation over the network, this poses a fundamental
performance issue: both the receiver's kernel AND its userspace must spend
CPU time on every incoming message to move the received data from bytes read
off the NIC into userspace. Because this happens concurrently with other
kernel and userspace routines in a concurrent system, a preemptible kernel may
incur significant latency if the kernel routine for packet filtering is
preempted by another kernel routine, userspace, or IRQs.
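
To make the copy chain concrete, the following is a minimal sketch of a
two-sided receive loop over a UDP socket. The names are hypothetical
(\texttt{dsm\_region}, the 8-byte offset header), and the kernel-side work --
copying the packet off the NIC and filtering it -- has already happened by the
time \texttt{recvfrom} returns.

\begin{verbatim}
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical middleware state: a registered DSM buffer. */
extern uint8_t dsm_region[1 << 20];

/* Each message costs at least two more copies after the kernel's own:
 * socket buffer -> msg (recvfrom), then msg payload -> DSM region. */
void recv_loop(int sock)
{
    uint8_t msg[2048];

    for (;;) {
        /* Blocks until the kernel has filtered a packet and copied it
         * out of the NIC ring into a socket buffer, then into msg. */
        ssize_t n = recvfrom(sock, msg, sizeof(msg), 0, NULL, NULL);
        if (n <= 8)
            continue;

        /* Middleware inspects the (hypothetical) header: the first
         * 8 bytes encode the destination offset in the DSM region. */
        uint64_t off;
        memcpy(&off, msg, sizeof(off));
        if (off + (uint64_t)(n - 8) <= sizeof(dsm_region))
            memcpy(dsm_region + off, msg + 8, n - 8);
    }
}
\end{verbatim}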
Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows
the network interface card to bypass in-kernel packet filters and perform DMA
on registered memory regions. The NIC can then notify the CPU via interrupts,
allowing the kernel and userspace programs to perform callbacks at reception
time with reduced latency. Because of this advantage, many recent studies
attempt to leverage RDMA APIs to improve distributed data workloads and to
build DSM middlewares \cites{Lu_etal.Spark_over_RDMA.2014}
{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}
{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
{Kaxiras_etal.DSM-Argos.2015}.
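
As an illustration of the one-sided path, the fragment below posts an RDMA
write through the libibverbs API. It is a sketch, not a complete program: it
assumes an already-connected queue pair \texttt{qp}, a locally registered
memory region \texttt{mr}, and the peer's buffer address and \texttt{rkey},
all of which a real middleware must exchange out-of-band during setup.

\begin{verbatim}
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA_WRITE: the local NIC pushes len bytes from our
 * registered region directly into the peer's memory. No code runs on
 * the remote CPU along the data path. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                    uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,             /* local protection key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;  /* peer VA, sent out-of-band */
    wr.wr.rdma.rkey        = rkey;         /* peer's remote access key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}

Completion is later observed by polling the send completion queue
(\texttt{ibv\_poll\_cq}); the receiver's CPU is involved only if it chooses to
be notified.
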
% \subsection{Data to Process, or Process to Data?}
% Hypothetically, instead of moving data back-and-forth between nodes within a
% shared storage domain, nodes could instead opt to perform remote procedure
% calls to other nodes which have access to their own share of data and
% acknowledge its completion at return. In the latter case, nodes connected
% within a network exchange task information -- data necessary to (re)construct
% the task in question on a remote node -- which can lead to significantly
% smaller packets than transmitting data over the network, provided that the
% time necessary to reconstruct the task on a remote node is less than the
% time necessary to transmit the data over the network.

% Indeed, RPCs have been shown
% (TBD -- The former is costly for data-intensive computation, but the latter
% may be impossible for certain tasks, and greatly hardens the replacement
% problem.)

% \section{Replacement Policy}

% In general, three variants of replacement strategies have been proposed for
% either generic cache block replacement problems, or specific use-cases where
% contextual factors can facilitate more efficient cache resource allocation:
% \begin{itemize}
% \item General-Purpose Replacement Algorithms, for example LRU.
% \item Cost-Model Analysis
% \item Probabilistic and Learned Algorithms
% \end{itemize}

% \subsection{General-Purpose Replacement Algorithms}
% Practically speaking, in the general case of the cache replacement problem,
% we desire to predict the re-reference interval of a cache block
% \cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- the
% optimal case for the \emph{ideal} replacement problem occurs when, at
% eviction time, the entry with the highest re-reference interval is replaced.
% Under this framework, therefore, the commonly-used LRU algorithm could be
% seen as a heuristic where the re-reference interval for each entry is
% predicted to be immediate. Fortunately, memory access traces of real
% computer systems agree with this tendency due to spatial locality
% \textbf{[source]}. (Real systems are complex, however, and there are other
% behaviors...) On the other hand, the hypothetical LFU algorithm is a
% heuristic that captures frequency. \textbf{[\dots]} While the textbook LFU
% algorithm suffers from needing to maintain a priority-queue for frequency
% analysis, it was nevertheless useful for keeping recurrent (though
% non-recent) blocks from being evicted from the cache \textbf{[source]}.

% Derivatives of the LRU algorithm attempt to balance between frequency and
% recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

% Advancements in parallel/concurrent systems have led to a rediscovery of the
% benefits of using FIFO-derived replacement policies over their LRU/LFU
% counterparts, as book-keeping operations on the uniform LRU/LFU state prove
% to be (1) difficult to synchronize and, relatedly, (2) cache-unfriendly
% \cite{Yang_etal.FIFO-LPQD.2023}. \textbf{[Talk about FIFO, FIFO-CLOCK,
% FIFO-CAR, FIFO-QuickDemotion, and Dueling CLOCK here \dots]}

% Finally, real-life experience has shown the need to reduce CPU time in
% practical applications, owing to one simple observation -- during the
% fetch-execute cycle, all processors perform blocking I/O on the memory. A
% cache-unfriendly design, despite its hypothetical optimality, could
% nevertheless degrade the performance of a system during low-memory
% situations. In fact, this proves to be the driving motivation behind Linux's
% transition away from the old LRU-2Q page replacement algorithm to the more
% coarse-grained Multi-generation LRU algorithm, which has been mainlined
% since v6.1.

% \subsection{Cost-Model Analysis}
% The ideal case for the replacement problem fails to account for invalidation
% of cache entries. It also assumes a uniform, dual-hierarchical cache-store
% model that is insufficient to capture the heterogeneity of today's
% massively-parallel, distributed systems. High-speed network interfaces are
% capable of exposing RDMA interfaces between computer nodes, making RDMA
% transfers almost twice as fast as swapping over the kernel I/O stack, while
% software that bypasses the kernel I/O stack can stretch the bandwidth
% advantage even further (source). This creates an interesting network
% topology between RDMA-enabled nodes where, in addition to swapping in
% low-memory situations, a node may opt to ``swap'' or simply drop the
% physical page in order to lessen the cost of page misses.

% \textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

% Traditionally, replacement policies based on cost-model analysis were used
% in content-delivery networks, which had different consistency models
% compared to finer-grained systems. HTTP servers need not adhere to strong
% consistency models, as out-of-date information is considered permissible,
% and single-writer scenarios are common. Consequently, most replacement
% policies for static content servers, while making strong distinctions
% regarding network topology, fail to account for the cases where an entry
% might become invalidated, let alone multi-writer protocols. One early paper
% \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using page fault
% frequency as an indicator of preference towards working set inclusion (which
% I personally think is highly flawed -- to be explained). Another paper
% \cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of
% taking page faults into consideration for eviction, but fails to go beyond
% the obvious implication that pages that have been faulted \emph{must} be
% evicted.

% Cost models for RDMA and NUMA systems are relatively underdeveloped, too.
% (Expand)

% \subsection{Probabilistic and Learned Algorithms for Cache Replacement}
% Finally, machine learning techniques and low-cost probabilistic approaches
% have been applied to the ideal cache replacement problem with some level of
% success. \textbf{[Talk about LeCaR, CACHEUS here]}.

% XXX: I will be writing about replacement as postfix...

\section{Consistency Model and Cache Coherence}

A consistency model specifies a contract on the allowed behaviors of
multi-processing programs with regard to a shared memory
\cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious
conflict that consistency models aim to resolve lies in the interaction
between processor-native programs and multi-processors, all of which need to
operate on a shared memory with heterogeneous cache topologies; a well-defined
consistency model resolves this conflict at the architectural level. Beyond
consistency models for bare-metal systems, programming languages
\cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}
{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024}
and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for
parallel access to shared memory on top of program-order guarantees, making
program behavior under shared-memory parallel programming explicit across
underlying implementations.
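
The language-level contract is visible in, for example, C11's
\texttt{stdatomic.h} \cite{ISO/IEC_9899:2011.C11}. Below is a minimal sketch
of the classic release/acquire message-passing idiom: the ordering arguments,
not the hardware's native memory model, determine which outcomes a conforming
implementation may produce.

\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

int payload;                    /* plain, non-atomic data */
atomic_bool ready = false;      /* synchronization flag */

void producer(void)
{
    payload = 42;
    /* Release store: all earlier writes become visible to any
     * acquire load that observes `true`. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
    /* Acquire load: after reading `true`, the read of payload
     * below must observe 42 on any conforming implementation. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    return payload;
}
\end{verbatim}

Relaxing both orderings to \texttt{memory\_order\_relaxed} would permit the
consumer to return a stale \texttt{payload} on weakly-ordered hardware --
precisely the class of behavior a consistency model exists to delimit.
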
\subsection{Consistency Model in DSM}

\subsection{Coherence Protocol}

\subsection{DMA and Cache Coherence}

\subsection{Cache Coherence in ARMv8}

(I need to read more into this. Most of the contribution comes from CPU caches,
less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence