\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}
\addbibresource{background_draft.bib}
\begin{document}

Though large-scale cluster systems remain the dominant solution for request-level and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence of interest in applying HPC techniques (e.g., DSM) to more efficient heterogeneous computation, with more tightly-coupled heterogeneous nodes providing (hardware) acceleration for one another \cite{Cabezas_etal.GPU-SM.2015}. \textcolor{red}{[ADD MORE CITATIONS]}

Within the scope of one node, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified view of the entire memory landscape across attached devices \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and locality managed by the OS kernel.
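To illustrate the intended ergonomics, the sketch below shows (in C) the programming model HMM enables: the buffer comes from plain \texttt{malloc(3)}, and the same pointer is dereferenced by both CPU and device. \texttt{accel\_launch} is a hypothetical driver entry point, stubbed out here so the sketch compiles -- it stands in for whatever ioctl- or queue-based interface a real HMM-aware driver would expose, and is not a real API.

\begin{verbatim}
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver entry point, stubbed so the sketch compiles.
 * A real HMM-aware driver resolves the CPU virtual address itself,
 * faulting pages over to the device as they are touched. */
static void accel_launch(const char *kernel, float *buf, size_t n)
{
    (void)kernel; (void)buf; (void)n;
}

int main(void)
{
    size_t n = 1u << 20;
    float *buf = malloc(n * sizeof *buf); /* ordinary anonymous memory */
    if (!buf)
        return 1;
    memset(buf, 0, n * sizeof *buf);

    /* The same pointer is handed to the accelerator: no staging
     * allocation, no explicit host-to-device copies. */
    accel_launch("saxpy", buf, n);

    /* The CPU dereferences the same address afterwards; ownership and
     * locality are the kernel's problem, not the programmer's. */
    float first = buf[0];
    free(buf);
    return first == 0.0f ? 0 : 1;
}
\end{verbatim}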
Nevertheless, while HMM promises a distributed shared memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little effort has gone toward incorporating HMM into other variants of accelerators and other system topologies.

Orthogonally, allocating hardware accelerator resources in a cluster computing environment becomes difficult when the hardware acceleration requirements of a workload cannot be easily determined and/or isolated. A cluster may contain a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes; further, it is possible that every workload on the cluster wants hardware acceleration from time to time, but never for very long. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. Migration at this granularity naturally incurs large overhead -- for starters, accelerator nodes that strictly perform in-memory computing, never touching the container's filesystem, should not have to install the entire filesystem locally. Moreover, must \emph{all} computation be near data? \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces ($25\times8$\,Gbps) has negligible impact on tail latencies but a pronounced impact on throughput when bandwidth is maximized.

This thesis builds upon an ongoing research effort to implement a tightly coupled cluster in which HMM abstractions allow both transparent RDMA access from accelerator nodes to local data and data migration near computation, focusing on the effect of replacement policies on balancing the cost of near-data versus far-data computation between home node and accelerator node. \textcolor{red}{Specifically, this paper explores the possibility of implementing shared page movement between home and accelerator nodes to enable efficient memory over-commit without the I/O-intensive swapping overhead.} \textcolor{red}{The rest of the chapter is structured as follows\dots}

\section{Experiences from Software DSM}

The majority of contributions to the study of software DSM systems come from the 1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These developments followed the success of the Stanford DASH project in the late 1980s -- a hardware distributed shared memory (i.e., NUMA) multiprocessor that pioneered the \textit{directory-based protocol} for cache coherence, which tracks per-cache-line ownership to avoid the broadcast traffic that prevented SMP processors from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}. While developments in hardware DSM matured into the universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra} \cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSM in clustered computing languished in favor of loosely-coupled nodes performing data-parallel computation and communicating via message passing. The network interfaces of the late 1990s lacked the bandwidth to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

Newer network interfaces provide much improved bandwidth and latency over the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve training efficiency sixfold compared to distributed TensorFlow via RPC, while scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed for Spark \cite{Lu_etal.Spark_over_RDMA.2014} \textcolor{red}{and what?}. Consequently, there has been a resurgence of interest in software DSM systems and their corresponding programming models \cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
% i.e., we do it in kernel as opposed to in userspace. Accelerator node can access
% local node's shared page like a DMA device do so via HMM.

\subsection{Munin: Multiple Consistency Protocols}

\textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the earlier software DSM systems. The authors of Munin identify \textit{false sharing} -- multiple processors writing to different offsets of the same page, each write triggering an invalidation -- as strongly detrimental to the performance of shared-memory systems. To combat this, Munin exposes annotations as part of its programming model to facilitate multiple consistency protocols on top of release consistency. A shared memory object that is immutable to its readers, for example, can safely be replicated without concern for coherence between processors. On the other hand, the \textit{write-shared} annotation declares that a memory object is written by multiple processors without synchronization -- i.e., the programmer guarantees that only false sharing occurs at this granularity.
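To make the failure mode concrete, the following minimal C sketch (using pthreads as a stand-in for DSM processors, and assuming a 4\,KiB page) exhibits the access pattern that \textit{write-shared} describes: two threads update disjoint words that happen to share a page, so a page-granular single-writer protocol would ping-pong ownership back and forth even though no datum is ever actually shared.

\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

static long shared_page[512];          /* one 4 KiB "page" of longs */

static void *writer(void *arg)
{
    long idx = (long)arg;              /* thread 0 -> word 0, thread 1 -> word 1 */
    for (int i = 0; i < 1000000; i++)
        shared_page[idx]++;            /* disjoint words, same page */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* Correct result despite no locking: the writes never conflict. */
    printf("%ld %ld\n", shared_page[0], shared_page[1]);
    return 0;
}
\end{verbatim}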
Annotations such as these explicitly disable subsets of the consistency procedures to reduce communication in the network fabric, thereby improving the performance of the DSM system. Perhaps most importantly, experience from Munin shows that \emph{restricting the flexibility of the programming model can lead to more performant coherence models}, as \textcolor{teal}{corroborated} by the now-foundational \textit{Resilient Distributed Datasets} paper \cite{Zaharia_etal.RDD.2012} -- which powers the now-popular \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023} and shares its restricted-model philosophy with \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023}. ``To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory [based on]\dots transformations rather than\dots updates to shared state'' \cite{Zaharia_etal.RDD.2012}. This allows transformation logs to cheaply synchronize state between unshared address spaces -- a much-desired property for highly scalable, loosely-coupled clustered systems.

\subsection{TreadMarks: Multi-Writer Protocol}

\textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} is a software DSM system presented in 1996. TreadMarks tackles false sharing with a \emph{multi-writer} protocol on top of lazy release consistency: on the first write to a page, each writer saves a pristine copy (a \textit{twin}) of that page, and at synchronization time the page is diffed against its twin, so that the modifications of multiple concurrent writers to the same page can be merged word-by-word instead of forcing exclusive ownership.

% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

\section{HPC and Partitioned Global Address Space}

Improvements in NIC bandwidth and transfer rate allow for applications that expose a global address space, as well as RDMA technologies that leverage single-writer protocols over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS) technologies, for example OpenSHMEM, OpenMPI, Cray Chapel, etc., that leverage specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}. Contemporary work on DSM systems focuses more on leveraging hardware advancements to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023}, for example, implements memory disaggregation over multiple compute nodes connected via a \textit{ThymesisFlow}-based RDMA fabric, observing significant performance improvements for data-intensive processing frameworks (e.g., APACHE Spark, Memcached, and Redis) relative to systems without disaggregation (i.e., using node-local memory only, as in classic cluster computing).

\subsection{Programming Model}

\subsection{Move Data to Process, or Move Process to Data?}

(TBD -- The former is costly for data-intensive computation, but the latter may be impossible for certain tasks, and greatly hardens the replacement problem.)

\section{Replacement Policy}

In general, three variants of replacement strategies have been proposed, either for the generic cache block replacement problem or for specific use-cases where contextual factors facilitate more efficient cache resource allocation:
\begin{itemize}
  \item General-purpose replacement algorithms, for example LRU.
  \item Cost-model analysis.
  \item Probabilistic and learned algorithms.
\end{itemize}

\subsection{General-Purpose Replacement Algorithms}

Practically speaking, in the general case of the cache replacement problem, we wish to predict the re-reference interval of a cache block \cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm: the \emph{ideal} replacement problem is solved optimally when, at eviction time, the entry with the longest re-reference interval is replaced. Under this framework, the commonly-used LRU algorithm can be seen as a heuristic that predicts a near-immediate re-reference interval for every just-accessed entry. Fortunately, memory access traces of real computer systems agree with this tendency due to temporal locality \textbf{[source]}. (Real systems are complex, however, and there are other behaviors\dots)
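The following is a minimal C sketch of LRU viewed through this predictive lens, assuming a toy fully-associative cache of four slots: every slot records the time of its last access, and eviction picks the slot whose last access is oldest -- precisely the entry LRU predicts to have the longest re-reference interval.

\begin{verbatim}
#include <stdio.h>

#define SLOTS 4

static int  tags[SLOTS];
static long last_used[SLOTS];
static long clock_now = 0;
static int  used = 0;

static void access_block(int tag)
{
    clock_now++;
    for (int i = 0; i < used; i++)
        if (tags[i] == tag) { last_used[i] = clock_now; return; }  /* hit */
    if (used < SLOTS) {               /* cold miss: fill a free slot */
        tags[used] = tag; last_used[used] = clock_now; used++;
        return;
    }
    int victim = 0;                   /* capacity miss: evict oldest */
    for (int i = 1; i < SLOTS; i++)
        if (last_used[i] < last_used[victim]) victim = i;
    printf("evict %d for %d\n", tags[victim], tag);
    tags[victim] = tag; last_used[victim] = clock_now;
}

int main(void)
{
    int trace[] = {1, 2, 3, 4, 1, 5, 2};   /* toy reference trace */
    for (unsigned i = 0; i < sizeof trace / sizeof *trace; i++)
        access_block(trace[i]);
    return 0;
}
\end{verbatim}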
On the other hand, the hypothetical LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the textbook LFU algorithm suffers from needing to maintain a priority queue for frequency bookkeeping, it is nevertheless useful for keeping recurrent (though non-recent) blocks from being evicted from the cache \textbf{[source]}. Derivatives of the LRU algorithm attempt to balance frequency against recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

Advancements in parallel/concurrent systems have led to a rediscovery of the benefits of FIFO-derived replacement policies over their LRU/LFU counterparts, as bookkeeping operations on the uniform LRU/LFU state prove to be (1) difficult to synchronize and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}. \textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling CLOCK here \dots]}

Finally, real-life experience has shown the need to reduce CPU time in practical applications, owing to one simple observation: during the fetch-execute cycle, every processor performs blocking I/O against memory. A cache-unfriendly design, despite its hypothetical optimality, can therefore degrade the performance of a system in low-memory situations. In fact, this proved to be the driving motivation behind Linux's transition away from the old LRU-2Q page replacement algorithm to the more coarse-grained Multi-Generational LRU, which has been mainlined since v6.1.

\subsection{Cost-Model Analysis}

The ideal formulation of the replacement problem fails to account for the invalidation of cache entries. It also assumes a uniform, two-level cache-store model that is insufficient to capture the heterogeneity of today's massively-parallel, distributed systems. High-speed network interfaces are capable of exposing RDMA interfaces between compute nodes; RDMA transfers can be almost twice as fast as swapping over the kernel I/O stack, and software that bypasses the kernel I/O stack stretches the bandwidth advantage even further (source). This creates an interesting network topology between RDMA-enabled nodes, where, in addition to swapping in low-memory situations, a node may opt to ``swap'' a page to a remote node or simply drop the physical page in order to lessen the cost of page misses. \textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}
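As a concrete starting point, the following is a minimal C sketch in the spirit of GreedyDual (a simplification, not a faithful reproduction of any one published variant): each cached page carries a credit equal to its re-fetch cost -- e.g., a cheap local re-read versus an expensive remote fetch -- eviction removes the minimum-credit page, and the global offset \texttt{L} implements the usual aging step without rescanning every entry.

\begin{verbatim}
#include <stdio.h>

#define SLOTS 4

struct entry { int tag; double credit; int valid; };
static struct entry cache[SLOTS];
static double L = 0.0;          /* inflation offset: avoids O(n) decay */

static void access_page(int tag, double cost)
{
    for (int i = 0; i < SLOTS; i++)
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].credit = L + cost;      /* hit: recharge credit */
            return;
        }
    int victim = -1;
    for (int i = 0; i < SLOTS; i++)          /* take a free slot if any */
        if (!cache[i].valid) { victim = i; break; }
    if (victim < 0) {                        /* evict the min-credit page */
        victim = 0;
        for (int i = 1; i < SLOTS; i++)
            if (cache[i].credit < cache[victim].credit) victim = i;
        L = cache[victim].credit;            /* implicitly age the rest */
        printf("evict %d (credit %.1f)\n", cache[victim].tag, L);
    }
    cache[victim] = (struct entry){ tag, L + cost, 1 };
}

int main(void)
{
    access_page(1, 1.0);   /* cheap: local re-read    */
    access_page(2, 8.0);   /* expensive: remote fetch */
    access_page(3, 1.0);
    access_page(4, 1.0);
    access_page(5, 1.0);   /* forces an eviction; page 2 survives */
    return 0;
}
\end{verbatim}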
Traditionally, replacement policies based on cost-model analysis were utilized in content-delivery networks, which have weaker consistency requirements than finer-grained systems. HTTP servers need not maintain strong consistency models, as out-of-date information is considered permissible, and single-writer scenarios are common. Consequently, most replacement policies for static content servers, while drawing strong distinctions with respect to network topology, fail to account for the cases where an entry might become invalidated, let alone for multi-writer protocols. One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using page fault frequency as an indicator of preference towards working-set inclusion (which I personally think is highly flawed -- to be explained). Another paper \cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking page faults into consideration for eviction, but fails to go beyond the obvious implication that pages that have been faulted \emph{must} be evicted. The concept of a cost model for RDMA and NUMA systems is relatively underdeveloped, too. (Expand)

\subsection{Probabilistic and Learned Algorithms for Cache Replacement}

Finally, machine learning techniques and low-cost probabilistic approaches have been applied to the ideal cache replacement problem with some level of success. \textbf{[Talk about LeCaR, CACHEUS here]}.

\section{Cache Coherence and Consistency in DSM Systems}

(I need to read more into this. Most of the contribution comes from CPU caches, less so from DSM systems.) \textbf{[Talk about JIAJIA and TreadMarks' coherence protocols.]} Consistency and communication protocols naturally affect the cost of each faulted memory access \dots \textbf{[Talk about directory, transactional, scope, and library cache coherence, which allow for multi-casted communications at page fault but all with different levels of book-keeping.]}

\printbibliography
\end{document}