commit 8a7d5e7e5a (parent b3c3ec961b)
3 changed files with 194 additions and 111 deletions
@@ -390,3 +390,45 @@
  author={The FreeBSD Project},
  year={2021}
}

@book{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020,
  title={A primer on memory consistency and cache coherence},
  author={Nagarajan, Vijay and Sorin, Daniel J. and Hill, Mark D. and Wood, David A.},
  year={2020},
  publisher={Springer Nature}
}

@misc{ISO/IEC_9899:2011.C11,
  title={BS ISO/IEC 9899:2011: Information technology. Programming languages. C},
  abstract={Edition Status: Withdrawn on 2018-07-13},
  isbn={9780580801655},
  keywords={Data processing; Data representation; Languages used in information technology; Programming; Programming languages; Semantics; Syntax},
  language={eng},
  publisher={British Standards Institute},
  year={2013}
}

@misc{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007,
  title={C++ Atomic Types and Operations},
  author={Boehm, Hans J. and Crowl, Lawrence},
  url={https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html},
  publisher={ISO/IEC JTC 1},
  year={2007}
}

@misc{Rust.core::sync::atomic::Ordering.2024,
  title={Ordering in core::sync::atomic - Rust},
  url={https://doc.rust-lang.org/core/sync/atomic/enum.Ordering.html},
  journal={The Rust Core Library},
  publisher={The Rust Team},
  year={2024}
}

@misc{Manson_Goetz.JSR_133.Java_5.2004,
  title={JSR 133 (Java Memory Model) FAQ},
  author={Manson, Jeremy and Goetz, Brian},
  url={https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html},
  publisher={Department of Computer Science, University of Maryland},
  year={2004}
}
Binary file not shown.
@@ -342,119 +342,160 @@ form the backbone of many research-oriented DSM systems
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message-passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network -- the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)
\cite{FreeBSD.man-BPF-4.2021} copies the internal message into the
message-passing runtime/middleware's address space; the receiver's middleware
inspects the copied message and acts on it accordingly, likely also copying
slices of message data into some registered distributed shared memory buffer
for the distributed application to access. Despite being a highly intuitive
model of data manipulation over the network, this poses a fundamental
performance issue: both the receiver's kernel AND its userspace must spend
CPU time on every incoming message to move the received data from bytes read
off the NIC into userspace. Because this happens concurrently with other
kernel and userspace routines in a concurrent system, a preemptible kernel may
incur significant latency if the kernel routine for packet filtering is
preempted by another kernel routine, userspace, or IRQs.
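
To make the copy chain concrete, the following is a minimal sketch of a
two-sided receive loop over a UDP socket. The names are hypothetical
(\texttt{dsm\_region}, the 8-byte offset header), and the kernel-side work --
copying the packet off the NIC and filtering it -- has already happened by the
time \texttt{recvfrom} returns.

\begin{verbatim}
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical middleware state: a registered DSM buffer. */
extern uint8_t dsm_region[1 << 20];

/* Each message costs at least two more copies after the kernel's own:
 * socket buffer -> msg (recvfrom), then msg payload -> DSM region. */
void recv_loop(int sock)
{
    uint8_t msg[2048];

    for (;;) {
        /* Blocks until the kernel has filtered a packet and copied it
         * out of the NIC ring into a socket buffer, then into msg. */
        ssize_t n = recvfrom(sock, msg, sizeof(msg), 0, NULL, NULL);
        if (n <= 8)
            continue;

        /* Middleware inspects the (hypothetical) header: the first
         * 8 bytes encode the destination offset in the DSM region. */
        uint64_t off;
        memcpy(&off, msg, sizeof(off));
        if (off + (uint64_t)(n - 8) <= sizeof(dsm_region))
            memcpy(dsm_region + off, msg + 8, n - 8);
    }
}
\end{verbatim}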
Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows
the network interface card to bypass in-kernel packet filters and perform DMA
on registered memory regions. The NIC can then notify the CPU via interrupts,
allowing the kernel and userspace programs to perform callbacks at reception
time with reduced latency. Because of this advantage, many recent studies
attempt to leverage RDMA APIs to improve distributed data workloads and to
build DSM middlewares \cites{Lu_etal.Spark_over_RDMA.2014}
{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}
{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
{Kaxiras_etal.DSM-Argos.2015}.
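
As an illustration of the one-sided path, the fragment below posts an RDMA
write through the libibverbs API. It is a sketch, not a complete program: it
assumes an already-connected queue pair \texttt{qp}, a locally registered
memory region \texttt{mr}, and the peer's buffer address and \texttt{rkey},
all of which a real middleware must exchange out-of-band during setup.

\begin{verbatim}
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA_WRITE: the local NIC pushes len bytes from our
 * registered region directly into the peer's memory. No code runs on
 * the remote CPU along the data path. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                    uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,             /* local protection key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;  /* peer VA, sent out-of-band */
    wr.wr.rdma.rkey        = rkey;         /* peer's remote access key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}

Completion is later observed by polling the send completion queue
(\texttt{ibv\_poll\_cq}); the receiver's CPU is involved only if it chooses to
be notified.
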
% \subsection{Data to Process, or Process to Data?}
% Hypothetically, instead of moving data back-and-forth between nodes within a
% shared storage domain, nodes could instead opt to perform remote procedure
% calls to other nodes which have access to their own share of data and
% acknowledge its completion at return. In the latter case, nodes connected
% within a network exchange task information -- data necessary to (re)construct
% the task in question on a remote node -- which can lead to significantly
% smaller packets than transmitting data over the network, provided that the
% time necessary to reconstruct the task on a remote node is less than the
% time necessary to transmit the data over the network.

% Indeed, RPCs have been shown
% (TBD -- The former is costly for data-intensive computation, but the latter
% may be impossible for certain tasks, and greatly hardens the replacement
% problem.)

% \section{Replacement Policy}

% In general, three variants of replacement strategies have been proposed for
% either generic cache block replacement problems, or specific use-cases where
% contextual factors can facilitate more efficient cache resource allocation:
% \begin{itemize}
% \item General-Purpose Replacement Algorithms, for example LRU.
% \item Cost-Model Analysis
% \item Probabilistic and Learned Algorithms
% \end{itemize}

% \subsection{General-Purpose Replacement Algorithms}
% Practically speaking, in the general case of the cache replacement problem,
% we desire to predict the re-reference interval of a cache block
% \cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- the
% optimal case for the \emph{ideal} replacement problem occurs when, at
% eviction time, the entry with the highest re-reference interval is replaced.
% Under this framework, therefore, the commonly-used LRU algorithm could be
% seen as a heuristic where the re-reference interval for each entry is
% predicted to be immediate. Fortunately, memory access traces of real
% computer systems agree with this tendency due to spatial locality
% \textbf{[source]}. (Real systems are complex, however, and there are other
% behaviors...) On the other hand, the hypothetical LFU algorithm is a
% heuristic that captures frequency. \textbf{[\dots]} While the textbook LFU
% algorithm suffers from needing to maintain a priority-queue for frequency
% analysis, it was nevertheless useful for keeping recurrent (though
% non-recent) blocks from being evicted from the cache \textbf{[source]}.

% Derivatives of the LRU algorithm attempt to balance between frequency and
% recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

% Advancements in parallel/concurrent systems have led to a rediscovery of the
% benefits of using FIFO-derived replacement policies over their LRU/LFU
% counterparts, as book-keeping operations on the uniform LRU/LFU state prove
% to be (1) difficult to synchronize and, relatedly, (2) cache-unfriendly
% \cite{Yang_etal.FIFO-LPQD.2023}. \textbf{[Talk about FIFO, FIFO-CLOCK,
% FIFO-CAR, FIFO-QuickDemotion, and Dueling CLOCK here \dots]}

% Finally, real-life experience has shown the need to reduce CPU time in
% practical applications, owing to one simple observation -- during the
% fetch-execute cycle, all processors perform blocking I/O on the memory. A
% cache-unfriendly design, despite its hypothetical optimality, could
% nevertheless degrade the performance of a system during low-memory
% situations. In fact, this proves to be the driving motivation behind Linux's
% transition away from the old LRU-2Q page replacement algorithm to the more
% coarse-grained Multi-generation LRU algorithm, which has been mainlined
% since v6.1.

% \subsection{Cost-Model Analysis}
% The ideal case for the replacement problem fails to account for invalidation
% of cache entries. It also assumes a uniform, dual-hierarchical cache-store
% model that is insufficient to capture the heterogeneity of today's
% massively-parallel, distributed systems. High-speed network interfaces are
% capable of exposing RDMA interfaces between computer nodes, making RDMA
% transfers almost twice as fast as swapping over the kernel I/O stack, while
% software that bypasses the kernel I/O stack can stretch the bandwidth
% advantage even further (source). This creates an interesting network
% topology between RDMA-enabled nodes where, in addition to swapping in
% low-memory situations, a node may opt to ``swap'' or simply drop the
% physical page in order to lessen the cost of page misses.

% \textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

% Traditionally, replacement policies based on cost-model analysis were used
% in content-delivery networks, which had different consistency models
% compared to finer-grained systems. HTTP servers need not adhere to strong
% consistency models, as out-of-date information is considered permissible,
% and single-writer scenarios are common. Consequently, most replacement
% policies for static content servers, while making strong distinctions
% regarding network topology, fail to account for the cases where an entry
% might become invalidated, let alone multi-writer protocols. One early paper
% \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using page fault
% frequency as an indicator of preference towards working set inclusion (which
% I personally think is highly flawed -- to be explained). Another paper
% \cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of
% taking page faults into consideration for eviction, but fails to go beyond
% the obvious implication that pages that have been faulted \emph{must} be
% evicted.

% Cost models for RDMA and NUMA systems are relatively underdeveloped, too.
% (Expand)

% \subsection{Probabilistic and Learned Algorithms for Cache Replacement}
% Finally, machine learning techniques and low-cost probabilistic approaches
% have been applied to the ideal cache replacement problem with some level of
% success. \textbf{[Talk about LeCaR, CACHEUS here]}.

% XXX: I will be writing about replacement as postfix...

\section{Consistency Model and Cache Coherence}

A consistency model specifies a contract on the allowed behaviors of
multi-processing programs with regard to a shared memory
\cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious
conflict that consistency models aim to resolve lies in the interaction
between processor-native programs and multi-processors, all of which need to
operate on a shared memory with heterogeneous cache topologies; a well-defined
consistency model resolves this conflict at the architectural level. Beyond
consistency models for bare-metal systems, programming languages
\cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}
{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024}
and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for
parallel access to shared memory on top of program-order guarantees, making
program behavior under shared-memory parallel programming explicit across
underlying implementations.
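
The language-level contract is visible in, for example, C11's
\texttt{stdatomic.h} \cite{ISO/IEC_9899:2011.C11}. Below is a minimal sketch
of the classic release/acquire message-passing idiom: the ordering arguments,
not the hardware's native memory model, determine which outcomes a conforming
implementation may produce.

\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

int payload;                    /* plain, non-atomic data */
atomic_bool ready = false;      /* synchronization flag */

void producer(void)
{
    payload = 42;
    /* Release store: all earlier writes become visible to any
     * acquire load that observes `true`. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
    /* Acquire load: after reading `true`, the read of payload
     * below must observe 42 on any conforming implementation. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    return payload;
}
\end{verbatim}

Relaxing both orderings to \texttt{memory\_order\_relaxed} would permit the
consumer to return a stale \texttt{payload} on weakly-ordered hardware --
precisely the class of behavior a consistency model exists to delimit.
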
\subsection{Consistency Model in DSM}

\subsection{Coherence Protocol}

\subsection{DMA and Cache Coherence}

\subsection{Cache Coherence in ARMv8}

(I need to read more into this. Most of the contribution comes from CPU caches,
less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence