This commit is contained in:
Zhengyi Chen 2024-02-28 21:51:23 +00:00
parent b3c3ec961b
commit 8a7d5e7e5a
3 changed files with 194 additions and 111 deletions

series = {HPDC '15}
}
@misc{FreeBSD.man-BPF-4.2021,
title={FreeBSD manual pages},
url={https://man.freebsd.org/cgi/man.cgi?query=bpf&manpath=FreeBSD+14.0-RELEASE+and+Ports},
journal={BPF(4) Kernel Interfaces Manual},
publisher={The FreeBSD Project},
author={The FreeBSD Project},
year={2021}
}
@book{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020,
title={A primer on memory consistency and cache coherence},
author={Nagarajan, Vijay and Sorin, Daniel J and Hill, Mark D and Wood, David A},
year={2020},
publisher={Springer Nature}
}
@misc{ISO/IEC_9899:2011.C11,
abstract = {Edition Status: Withdrawn on 2018-07-13},
isbn = {9780580801655},
keywords = {Data processing ; Data representation ; Languages used in information technology ; Programming ; Programming languages ; Semantics ; Syntax},
language = {eng},
publisher = {British Standards Institute},
title = {BS ISO/IEC 9899:2011: Information technology. Programming languages. C},
year = {2013},
}
@misc{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007,
title={C++ Atomic Types and Operations},
url={https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html},
publisher={ISO/IEC JTC 1},
author={Boehm, Hans J and Crowl, Lawrence},
year={2007}
}
@misc{Rust.core::sync::atomic::Ordering.2024,
title={Ordering in core::sync::atomic - Rust},
url={https://doc.rust-lang.org/core/sync/atomic/enum.Ordering.html},
journal={The Rust Core Library},
publisher={the Rust Team},
year={2024}
}
@misc{Manson_Goetz.JSR_133.Java_5.2004,
url={https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html},
journal={JSR 133 (Java Memory Model) FAQ},
publisher={Department of Computer Science, University of Maryland},
author={Manson, Jeremy and Goetz, Brian},
year={2004}
}

form the backbone of many research-oriented DSM systems
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
Message-passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network -- the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)
\cite{FreeBSD.man-BPF-4.2021} copies the internal message
into the message-passing runtime/middleware's address space; the receiver's
middleware inspects the copied message and performs some procedure accordingly,
likely also involving copying slices of message data into a registered
distributed shared memory buffer for the distributed application to access.
Despite being a highly intuitive model of data manipulation over the network,
this poses a fundamental performance issue: because both the receiver's kernel
and its userspace must spend CPU time on each received message, the receiver
node has to proactively move the received data from bytes read off the NIC
into userspace. Because this happens concurrently with other kernel and
userspace routines in a concurrent system, a preemptable kernel may incur
significant latency if the kernel routine for packet filtering is pre-empted
by another kernel routine, userspace, or IRQs.
Comparatively, a ``one-sided'' message-passing scheme, for example RDMA,
allows the network interface card to bypass in-kernel packet filters and
perform DMA on registered memory regions. The NIC can then notify the CPU via
interrupts, allowing the kernel and userspace programs to run callbacks at
reception time with reduced latency. Because of this advantage, many recent
studies leverage RDMA APIs to improve distributed data workloads and to build
DSM middlewares \cites{Lu_etal.Spark_over_RDMA.2014}
{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}
{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
{Kaxiras_etal.DSM-Argos.2015}.
\section{Consistency Model and Cache Coherence}
A consistency model specifies a contract on the allowed behaviors of
multi-processor programs with regard to shared memory
\cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious
conflict, which consistency models aim to resolve, lies in the interaction
between processor-native programs and multi-processors, all of which need to
operate on a shared memory with heterogeneous cache topologies. Here, a
well-defined consistency model resolves the conflict at the architectural
level. Beyond consistency models for bare-metal systems, programming languages
\cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}
{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024}
and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for
parallel access to shared memory on top of program-order guarantees, making
program behavior under shared-memory parallel programming explicit across
underlying implementations.
\subsection{Consistency Model in DSM}
\subsection{Coherence Protocol}
\subsection{DMA and Cache Coherence}
\subsection{Cache Coherence in ARMv8}
\subsection{Data to Process, or Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks and greatly complicates the replacement problem.)
\section{Replacement Policy}
In general, three variants of replacement strategies have been proposed, either
for the generic cache block replacement problem or for specific use-cases where
contextual factors can facilitate more efficient cache resource allocation:
\begin{itemize}
\item General-Purpose Replacement Algorithms, for example LRU.
\item Cost-Model Analysis
\item Probabilistic and Learned Algorithms
\end{itemize}
\subsection{General-Purpose Replacement Algorithms}
Practically speaking, in the general case of the cache replacement problem,
we desire to predict the re-reference interval of a cache block
\cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- the
optimal case for the \emph{ideal} replacement problem occurs when, at eviction
time, the entry with the longest re-reference interval is replaced. Under this
framework, the commonly-used LRU algorithm can be seen as a heuristic where
the re-reference interval of each just-accessed entry is predicted to be
near-immediate. Fortunately, memory access traces of real computer systems
agree with this tendency due to temporal locality \textbf{[source]}. (Real
systems are complex, however, and there are other behaviors...) On the other
hand, the LFU algorithm is a heuristic that captures frequency.
\textbf{[\dots]} While the textbook LFU algorithm suffers from needing to
maintain a priority queue for frequency analysis, it is nevertheless useful
for keeping recurrent (though non-recent) blocks from being evicted from the
cache \textbf{[source]}.
Derivatives of the LRU algorithm attempt to balance frequency against
recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}
Advancements in parallel/concurrent systems have led to a rediscovery of the
benefits of FIFO-derived replacement policies over their LRU/LFU counterparts,
as book-keeping operations on the uniform LRU/LFU state prove (1) difficult
to synchronize and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
\textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
CLOCK here \dots]}
Finally, real-life experience has shown the need to reduce CPU time in practical
applications, owing to one simple observation -- during the fetch-execute
cycle, all processors perform blocking I/O on memory. A cache-unfriendly
design, despite its hypothetical optimality, can nevertheless degrade the
performance of a system in low-memory situations. In fact, this proved to be
the driving motivation behind Linux's transition away from the old LRU-2Q page
replacement algorithm to the more coarse-grained Multi-Generational LRU
(MGLRU) algorithm, which has been mainlined since v6.1.
\subsection{Cost-Model Analysis}
The ideal case for the replacement problem fails to account for invalidation of
cache entries. It also assumes a uniform, dual-hierarchical cache-store model
that is insufficient to capture the heterogeneity of today's massively-parallel,
distributed systems. High-speed network interfaces can expose RDMA interfaces
between compute nodes, delivering transfers almost twice as fast as swapping
over the kernel I/O stack, while software that bypasses the kernel I/O stack
can stretch the bandwidth advantage even further (source). This creates an
interesting network topology between RDMA-enabled nodes, where, in addition to
swapping in low-memory situations, the node may opt to ``swap'' or simply drop
the physical page in order to lessen the cost of page misses.
\textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}
Traditionally, replacement policies based on cost-model analysis were utilized
in content-delivery networks, which had different consistency models compared
to finer-grained systems. HTTP servers need not adhere to strong consistency
models, as out-of-date information is considered permissible and single-writer
scenarios are common. Consequently, most replacement policies for static
content servers, while drawing strong distinctions with respect to network
topology, fail to consider cases where an entry might become invalidated, let
alone multi-writer protocols.
One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
page fault frequency as an indicator of preference towards working set inclusion
(which I personally think is highly flawed -- to be explained). Another paper
\cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
page faults into consideration for eviction, but fails to go beyond the obvious
implication that pages that have been faulted \emph{must} be evicted.
Cost models for RDMA and NUMA systems remain relatively underdeveloped, too.
(Expand)
\subsection{Probabilistic and Learned Algorithms for Cache Replacement}
Finally, machine learning techniques and low-cost probabilistic approaches have
been applied to the ideal cache replacement problem with some success.
\textbf{[Talk about LeCaR, CACHEUS here]}.
\section{Cache Coherence and Consistency in DSM Systems}
(I need to read more into this. Most of the contribution comes from CPU caches,
less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence