I'm fighting for my life over here!

2023-11-08 03:31:13 +00:00 · 2023-11-08 03:31:13 +00:00 · 9ce717a313
commit 9ce717a313
parent 44805929f8
4 changed files with 236 additions and 0 deletions
--- a/tex/misc/background_draft.tex
+++ b/tex/misc/background_draft.tex
@@ -0,0 +1,142 @@
+\documentclass{article}
+\usepackage{biblatex}
+
+\addbibresource{background_draft.bib}
+
+\begin{document}
+% \chapter{Backgrounds}
+Recent studies have shown a reinvigorated interest in disaggregated/distributed
+shared memory systems not seen since the 1990s. While large-scale cluster systems
+predominantly make up the mainstream, the interplay between (page) replacement
+policy and runtime performance of distributed shared memory systems has not been
+properly explored.
+
+\section{Overview of Distributed Shared Memory}
+
+A striking feature in the study of distributed shared memory (DSM) systems is the
+non-uniformity of the terminology used to describe overlapping research interests.
+The majority of contributions to DSM research come from the 1990s, for example
+\textbf{[TreadMarks, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
+leverage kernel system calls to allow for user-level DSM over Ethernet NICs. While
+these systems provide a strong theoretical basis for today's majority-software
+DSM systems and applications that expose a \emph{(partitioned) global address space},
+they were nevertheless constrained by the limited transfer rate and bandwidth of
+the NICs of the time, and the concept of DSM failed to take off (relative to
+cluster computing).
+
+Improvements in NIC bandwidth and transfer rate allow for applications that expose
+a global address space, as well as RDMA technologies that leverage single-writer
+protocols over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
+technologies, for example OpenSHMEM, Open MPI, Cray Chapel, etc., that leverage
+specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
+
+
+Contemporary works on DSM systems focus more on leveraging hardware advancements
+to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
+for example, implements a complex system for memory disaggregation over multiple
+compute nodes connected via a \textit{ThymesisFlow}-based RDMA fabric. Its
+authors observed significant performance improvements for existing data-intensive
+processing frameworks, for example Apache Spark, Memcached, and Redis, over
+systems with no disaggregation (i.e., using node-local memory only, similar to
+cluster computing).
+
+\subsection{Move Data to Process, or Move Process to Data?}
+(TBD -- The former is costly for data-intensive computation, but the latter may
+be impossible for certain tasks, and makes the replacement problem much harder.)
+
+\section{Replacement Policy}
+
+In general, three variants of replacement strategies have been proposed for either
+generic cache block replacement problems, or specific use-cases where contextual
+factors can facilitate more efficient cache resource allocation:
+\begin{itemize}
+    \item General-Purpose Replacement Algorithms, for example LRU.
+    \item Cost-Model Analysis
+    \item Probabilistic and Learned Algorithms
+\end{itemize}
+
+\subsection{General-Purpose Replacement Algorithms}
+Practically speaking, in the general case of the cache replacement problem,
+we desire to predict the re-reference interval of a cache block
+\cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- the
+optimal case for the \emph{ideal} replacement problem occurs when, at eviction
+time, the entry with the longest re-reference interval is replaced. Under this
+framework, therefore, the commonly-used LRU algorithm can be seen as a heuristic
+where the re-reference interval of each newly referenced entry is predicted to be
+near-immediate. Fortunately, memory access traces of real computer systems agree
+with this tendency due to temporal locality \textbf{[source]}. (Real systems are
+complex, however, and there are other behaviors...) On the other hand, the hypothetical
+LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the
+textbook LFU algorithm suffers from needing to maintain a priority queue for
+frequency analysis, it is nevertheless useful for keeping recurrent (though
+non-recent) blocks from being evicted from the cache \textbf{[source]}.
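+
+As a hedged formalization of the paragraph above (the symbols $r_t$, $C_t$,
+$n_t$, and $\ell_t$ are notation assumed here for illustration, not taken from
+the cited works), let $r_1, r_2, \dots$ be the reference trace and $C_t$ the set
+of cached blocks at time $t$. Writing the next and the most recent reference
+times of a block $b$ as
+\[
+    n_t(b) = \min \{ t' > t : r_{t'} = b \}, \qquad
+    \ell_t(b) = \max \{ t' \le t : r_{t'} = b \},
+\]
+with $n_t(b) = \infty$ if $b$ is never referenced again, Belady's rule and the
+LRU heuristic select the victims
+\[
+    v^{\mathrm{OPT}}_t = \arg\max_{b \in C_t} n_t(b), \qquad
+    v^{\mathrm{LRU}}_t = \arg\min_{b \in C_t} \ell_t(b).
+\]
+LRU thus substitutes the observable past $\ell_t$ for the unobservable future
+$n_t$, which is accurate only insofar as recently used blocks are reused soon.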
+
+Derivatives of the LRU algorithm attempt to strike a balance between frequency and
+recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}
+
+Advancements in parallel/concurrent systems have led to a rediscovery of the benefits
+of using FIFO-derived replacement policies over their LRU/LFU counterparts, as
+book-keeping operations on the uniform LRU/LFU state prove to be (1) difficult
+to synchronize and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
+\textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
+CLOCK here \dots]}
+
+Finally, real-life experience has shown the need to reduce CPU time in practical
+applications, owing to one simple observation -- during the fetch-execute
+cycle, all processors perform blocking I/O on memory. A cache-unfriendly
+design, despite its hypothetical optimality, could nevertheless degrade the performance
+of a system in low-memory situations. In fact, this proves to be the driving
+motivation behind Linux's transition away from the old LRU-2Q page replacement
+algorithm to the more coarse-grained Multi-Gen LRU algorithm, which has
+been in mainline since v6.1.
+
+\subsection{Cost-Model Analysis}
+The ideal case for the replacement problem fails to account for invalidation of
+cache entries. It also assumes a uniform, two-level cache-store model
+that is insufficient to capture the heterogeneity of today's massively-parallel,
+distributed systems. High-speed network interfaces are capable of exposing RDMA
+interfaces between compute nodes, over which transfers are almost twice as fast
+as swapping over the kernel I/O stack, while software that bypasses
+the kernel I/O stack is capable of stretching the bandwidth advantage even further
+\textbf{[source]}. This creates an interesting network topology between RDMA-enabled nodes,
+where, in addition to swapping in low-memory situations, a node may opt to ``swap''
+the physical page or simply drop it in order to lessen the cost of page misses.
+
+\textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}
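+
+For orientation, a sketch of how the GreedyDual family named above is commonly
+stated (the symbols $H$, $L$, $c$, $s$, and $f$ are notation assumed here):
+each cached entry $p$ carries a priority $H(p)$, and the cache keeps a global
+inflation value $L$, initially $0$. On insertion or hit,
+\[
+    H(p) \leftarrow L + \frac{f(p)\, c(p)}{s(p)},
+\]
+where $c(p)$ is the cost of re-fetching $p$, $s(p)$ its size, and $f(p)$ its
+access count. On eviction, the entry $v = \arg\min_{p} H(p)$ is removed and
+$L \leftarrow H(v)$. GDSF uses the full expression, whereas plain GreedyDual
+corresponds to $f(p) = s(p) = 1$. In a page-granularity DSM setting $s(p)$
+would be constant, leaving $c(p)$ to encode, for example, whether a miss is
+served from a remote node or from local swap; that mapping is an assumption of
+this sketch rather than a result from the literature.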
+
+Traditionally, replacement policies based on cost-model analysis were used in
+content-delivery networks, which have different consistency models compared to
+finer-grained systems. HTTP servers need not adhere to strong consistency models,
+as out-of-date information is considered permissible, and single-writer scenarios
+are common. Consequently, most replacement policies for static content servers,
+while paying close attention to network topology, fail to account for
+cases where an entry might become invalidated, let alone multi-writer protocols.
+One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
+page fault frequency as an indicator of preference towards working set inclusion
+(which I personally think is highly flawed -- to be explained). Another paper
+\cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
+page faults into consideration for eviction, but fails to go beyond the obvious
+implication that pages that have been faulted \emph{must} be evicted.
+
+The concept of cost models for RDMA and NUMA systems is relatively underdeveloped,
+too. (Expand)
+
+\subsection{Probabilistic and Learned Algorithms for Cache Replacement}
+Finally, machine learning techniques and low-cost probabilistic approaches have
+been applied to the ideal cache replacement problem with some level of success.
+\textbf{[Talk about LeCaR, CACHEUS here]}.
+
+\section{Cache Coherence and Consistency in DSM Systems}
+
+(I need to read more into this. Most of the contributions come from CPU caches,
+less so from DSM systems.) \textbf{[Talk about JIAJIA and TreadMarks' coherence
+protocols.]}
+
+Consistency and communication protocols naturally affect the cost for each faulted
+memory access \dots
+
+\textbf{[Talk about directory, transactional, scope, and library cache coherence,
+which allow for multicast communication on page fault, each with a different
+level of book-keeping.]}
+
+\printbibliography
+\end{document}