\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}
\addbibresource{background_draft.bib}

\begin{document}

Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence of interest in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with tightly coupled nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming; the underlying complexities of memory ownership and data placement are managed automatically by the OS kernel.

On the other hand, while HMM promises a distributed-shared-memory-like approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little work has been done on incorporating HMM into other variants of accelerators in various system topologies.

Separately, allocation of hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster system there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload performed on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. Migrating this way naturally incurs large overhead -- accelerator nodes that strictly perform computation on data in memory, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally, for starters. Moreover, must \emph{all} computation be performed near data? \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, has negligible impact on tail latencies but a high impact on throughput when bandwidth is saturated.

This thesis builds upon an ongoing research effort to implement a tightly coupled cluster in which HMM abstractions allow for transparent RDMA access from accelerator nodes to local data and for migration of data near computation, leveraging different consistency models and coherence protocols to amortize the communication cost of shared data.
More specifically, this thesis explores the following:
\begin{itemize}
  \item The effect of cache coherency maintenance, specifically OS-initiated maintenance, on RDMA programs.
  \item The implementation of cache coherency in cache-incoherent, kernel-side RDMA clients.
  \item A discussion of memory models and coherence protocol designs for a single-writer, multiple-reader RDMA-based DSM system.
\end{itemize}
The rest of the chapter is structured as follows:
\begin{itemize}
  \item We identify and discuss notable developments in software-implemented DSM systems, and from them identify key features of contemporary advancements in DSM techniques that differentiate them from their predecessors.
  \item We identify alternative (shared memory) programming paradigms and compare them with DSM, which seeks to provide a transparent shared address space among participating nodes.
  \item We give an overview of coherence protocols and consistency models for multi-sharer DSM systems.
  \item We provide a primer on cache coherency in ARM64 systems, which \emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems \cite{Ven.LKML_x86_DMA.2008}.
\end{itemize}

\section{Experiences from Software DSM}

A majority of contributions to software DSM systems come from the 1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These developments follow from the success of the Stanford DASH project in the late 1980s -- a hardware distributed shared memory (specifically NUMA) multiprocessor that first proposed the \textit{directory-based protocol} for cache coherence, which stores the ownership information of cache lines in order to reduce the unnecessary communication that prevented earlier multiprocessors from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}.
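To make the directory idea concrete, the sketch below shows one minimal way a per-line directory entry could be represented in C. The field names and sizes are illustrative assumptions only and are not taken from the DASH implementation.

\begin{verbatim}
#include <stdint.h>

/* Illustrative sketch of a directory entry in the spirit of DASH's
 * directory-based coherence: the directory tracks, per cache line,
 * which nodes hold a copy and which node (if any) holds it dirty,
 * so invalidations can be sent point-to-point instead of broadcast. */
enum dir_state {
    DIR_UNCACHED,   /* no node holds a copy                    */
    DIR_SHARED,     /* one or more nodes hold a clean copy     */
    DIR_DIRTY       /* exactly one node holds a modified copy  */
};

struct dir_entry {
    enum dir_state state;
    uint64_t       sharers;  /* bitmap of nodes caching the line   */
    uint16_t       owner;    /* valid only when state == DIR_DIRTY */
};
\end{verbatim}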
While developments in hardware DSM materialized into a universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra} \cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in clustered computing languished in favor of loosely-coupled nodes performing data-parallel computation and communicating via message passing. The bandwidth of late-1990s network interfaces was insufficient to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

Newer network interfaces provide much improved bandwidth and latency compared to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC, scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed for APACHE Spark \cite{Lu_etal.Spark_over_RDMA.2014} and SMBDirect \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a resurgence of interest in software DSM systems and programming models \cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
% i.e., we do it in kernel as opposed to in userspace. Accelerator node can access
% local node's shared page like a DMA device do so via HMM.

\subsection{Munin: Multi-Consistency Protocol}

\textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the earlier developments in software DSM systems. The authors of Munin identify that \textit{false sharing} -- multiple processors writing to different offsets of the same page and thereby triggering invalidations -- is strongly detrimental to the performance of shared-memory systems. To combat this, Munin exposes annotations as part of its programming model to facilitate multiple consistency protocols on top of release consistency. An immutable shared memory object, for example, can be safely copied across readers without concern for coherence between processors. On the other hand, the \textit{write-shared} annotation declares that a memory object is written by multiple processors without synchronization -- i.e., the programmer guarantees that only false sharing occurs within this granularity. Annotations such as these explicitly disable subsets of the consistency procedures to reduce communication in the network fabric, thereby improving the performance of the DSM system.

Perhaps most importantly, experiences from Munin show that \emph{restricting the flexibility of the programming model can lead to more performant coherence models}, as later exhibited by the now-foundational \textit{Resilient Distributed Datasets} (RDD) paper \cite{Zaharia_etal.RDD.2012}, which underpins \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023} and follows the same philosophy as earlier restricted-interface frameworks such as \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023}. ``To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory [based on]\dots transformations rather than\dots updates to shared state'' \cite{Zaharia_etal.RDD.2012}. This allows transformation logs to be used to cheaply synchronize state between unshared address spaces -- a much desired property for highly scalable, loosely-coupled clustered systems.

\subsection{TreadMarks: Multi-Writer Protocol}

\textit{TreadMarks} \cite{Amza_etal.Treadmarks.1996} is a software DSM system developed in 1996 featuring an intricate \textit{interval}-based multi-writer protocol that allows multiple nodes to write to the same page without false sharing. The system follows a release-consistent memory model, which requires the use of either locks (via \texttt{acquire} and \texttt{release}) or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval} represents a time period delimited by page creation, a \texttt{release} to another processor, or a \texttt{barrier}; each interval also corresponds to a \textit{write notice}, which is used for page invalidation. Each \texttt{acquire} message is sent to the statically-assigned lock-manager node, which forwards the message to the last releaser. The last releaser computes the outstanding write notices and piggy-backs them on the reply so that the acquirer can invalidate its own cached page entries, thus signifying entry into the critical section. Consistency information, including write notices, intervals, and page diffs, is routinely garbage-collected, which forces cached pages on each node to become validated.

Compared to \textit{TreadMarks}, the system described in this thesis uses a single-writer protocol, which eliminates the concept of ``intervals'' -- with regard to synchronization, each page is either in-sync (in which case it can be safely shared) or out-of-sync (in which case it must be invalidated or updated), as sketched below. This comes with the following advantages:
\begin{itemize}
  \item Less metadata for consistency-keeping.
  \item Closer adherence to the CPU-accelerator dichotomy model.
  \item A much simpler coherence protocol, which reduces communication cost.
\end{itemize}
In view of the (still) disparate throughput and latency between local and remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, the simpler coherence protocol of a single-writer design should provide better performance on the critical paths of remote memory access.
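The following fragment is a minimal sketch of the in-sync/out-of-sync view described above. The names and transition rules are illustrative assumptions for exposition, not the implementation evaluated later in this thesis.

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of single-writer, multiple-reader page state.
 * Unlike TreadMarks' interval/write-notice machinery, a page here is
 * either in sync with the writer or not; everything else (who the
 * writer is, how invalidations travel) is left to the transport. */
enum page_state {
    PAGE_IN_SYNC,     /* readers may use their local copy freely  */
    PAGE_OUT_OF_SYNC  /* page must be re-fetched from the writer   */
};

struct shared_page {
    uint64_t        gfn;     /* global frame number of the page    */
    enum page_state state;
    uint16_t        writer;  /* the single node allowed to write   */
};

/* Called on every reader when the writer announces a modification. */
static inline void on_remote_write(struct shared_page *pg)
{
    pg->state = PAGE_OUT_OF_SYNC;  /* invalidate; no diffs, no intervals */
}

/* A reader may use its local copy only while the page is in sync. */
static inline bool can_read_locally(const struct shared_page *pg)
{
    return pg->state == PAGE_IN_SYNC;
}
\end{verbatim}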
% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).

\subsection{Hotpot: Single-Writer \& Data Replication}

Newer works such as \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017} apply distributed shared memory techniques to persistent memory to provide ``transparent memory accesses, data persistence, data reliability, and high availability''. Building on persistent memory devices allows DSM applications to bypass checkpoints to block-device storage \cite{Shan_Tsai_Zhang.DSPM.2017}, ensuring both distributed cache coherence and data reliability at the same time \cite{Shan_Tsai_Zhang.DSPM.2017}. We specifically discuss the single-writer portion of its coherence protocol.

The data reliability guarantees proposed by the \textit{Hotpot} system require each shared page to be replicated to some \textit{degree of replication}. Nodes that always store the latest replica of a shared page are referred to as ``owner nodes''; they arbitrate which other nodes store additional replicas in order to meet the degree-of-replication quota. At acquisition time, the acquiring node asks the access-management node for single-writer access to a shared page, which the manager grants -- alongside the list of current owner nodes -- if no other critical section is active. At release time, the releaser first commits its changes to all owner nodes, which in turn commit the received changes across lesser sharers to achieve the required degree of replication; both operations are acknowledged back in reverse order. Once the releaser has received all acknowledgements from the owner nodes, it tells them to delete their commit logs and, finally, tells the manager node that it has exited the critical section. The required degree of replication, together with commit transactions that stay logged until explicitly deleted, facilitates crash recovery at the expense of additional release-time I/O.

While the study of crash recovery in shared memory systems is out of the scope of this thesis, this paper provides a good framework for a \textbf{correct} coherence protocol for a single-writer, multiple-reader shared memory system, particularly when the protocol needs to cater for a great variety of nodes, each with its own memory preferences (e.g., write-update vs. write-invalidate, prefetching, etc.).
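The release path described above can be summarized by the schematic sketch below. It paraphrases the protocol as described in the Hotpot paper; the function and message names are invented for illustration, and error handling, timeouts, and persistence ordering are omitted.

\begin{verbatim}
/* Schematic sketch of a Hotpot-style release ("commit") path for one
 * shared page. Helper names are hypothetical, not Hotpot code. */

struct node_list { int count; int ids[16]; };

/* Hypothetical messaging helpers provided by the transport layer. */
extern void send_commit(int node, const void *diff, int len);
extern void wait_for_acks(const struct node_list *nodes);
extern void send_delete_log(int node);
extern void notify_manager_release(int manager_node);

static void release_page(const struct node_list *owners, int manager_node,
                         const void *diff, int len)
{
    /* 1. Commit local changes to every owner node; each owner then
     *    propagates the data to lesser sharers until the degree of
     *    replication is met (not shown here). */
    for (int i = 0; i < owners->count; i++)
        send_commit(owners->ids[i], diff, len);

    /* 2. Acknowledgements flow back in reverse order of the commits. */
    wait_for_acks(owners);

    /* 3. Only after all acks arrive may the commit logs be discarded. */
    for (int i = 0; i < owners->count; i++)
        send_delete_log(owners->ids[i]);

    /* 4. Finally, tell the manager that the critical section is over. */
    notify_manager_release(manager_node);
}
\end{verbatim}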
\subsection{MENPS: A Return to DSM}

MENPS \cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable interconnects as a proof of concept that DSM systems and programming models can be as efficient as the \textit{partitioned global address space} (PGAS) model on today's network interfaces. It builds upon \textit{TreadMarks}' \cite{Amza_etal.Treadmarks.1996} coherence protocol and crucially alters it into a \textit{floating home-based} protocol, based on the insight that transferring diffs across the network is costly compared to RDMA intrinsics -- which implies a preference for merging diffs locally. The home node then acts as the data supplier for every shared page within the system. Compared to PGAS and message-passing frameworks (e.g., MPI), experimentation over a subset of the \textit{NAS Parallel Benchmarks} shows that MENPS obtains comparable speedup in some of the computation tasks while achieving much better productivity thanks to DSM's support for transparent caching, etc. \cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up their claim that DSM systems are at least as viable as traditional PGAS/message-passing frameworks for scientific computing, as corroborated by the later resurgence of DSM studies \cite{Masouros_etal.Adrias.2023}.

\section{PGAS and Message Passing}

While the feasibility of transparent DSM systems spanning multiple machines on a network has been apparent since the 1980s, the predominant approach to ``scaling out'' programs over the network relies on message passing \cite{AST_Steen.Distributed_Systems-3ed.2017}. The reasons are twofold:
\begin{enumerate}
  \item Programmers would rather resort to more intricate but more predictable approaches to scaling out programs over the network \cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies manual/controlled data sharding over nodes, separation of compute and communication ``stages'' of computation, etc., which benefit performance analysis and engineering.
  \item Enterprise applications value throughput and uptime of relatively computationally inexpensive tasks/resources \cite{BOOK.Hennessy_Patterson.CArch.2011}, which requires easy scalability of tried-and-true, latency-inexpensive applications. Studies of transparent DSM systems mostly require exotic, specifically-written programs to exploit the global address space, which is fundamentally at odds with the reusability and flexibility required.
\end{enumerate}

\subsection{PGAS}

\textit{Partitioned Global Address Space} (PGAS) is a parallel programming model that (1) exposes a global address space to all machines within a network and (2) makes the distinction between local and remote memory explicit \cite{De_Wael_etal.PGAS_Survey.2015}. Oftentimes, message-passing frameworks, for example \textit{OpenMPI}, \textit{OpenFabrics}, and \textit{UCX}, are used as backends to provide the PGAS model over various network interfaces/platforms (e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}{WEB.HPE.Chapel_Platforms-v1.33.2023}. Notably, implementing a \emph{global} address space across machines that are already equipped with their own \emph{local} address spaces (e.g., cluster nodes running commercial Linux) necessitates a global addressing mechanism for shared data objects. DART \cite{Zhou_etal.DART-MPI.2014}, for example, utilizes a 128-bit ``global pointer'' that encodes the global memory object/segment ID and access flags in the upper 64 bits and a virtual address in the lower 64 bits for each (slice of a) memory object allocated within the PGAS model.
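One possible C rendering of such a 128-bit global pointer is sketched below. Only the upper/lower 64-bit split is taken from the description above; the sub-fields of the upper word are assumptions for illustration.

\begin{verbatim}
#include <stdint.h>

/* Illustrative sketch of a DART-style 128-bit global pointer: the upper
 * 64 bits identify the global object/segment and carry access flags,
 * while the lower 64 bits hold the virtual address of the data on the
 * owning node. Field names and widths in the upper word are assumed. */
struct global_ptr {
    /* upper 64 bits */
    uint32_t segment_id;   /* which global object/segment             */
    uint16_t unit_id;      /* which node owns this slice (assumed)    */
    uint16_t flags;        /* access permissions                       */
    /* lower 64 bits */
    uint64_t local_vaddr;  /* virtual address on the owning node       */
};

/* Because local_vaddr is recorded verbatim, the owning node must keep
 * the allocation pinned at that virtual address for the global pointer
 * to remain valid (see the discussion of pinning that follows). */
\end{verbatim}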
A \textit{non-collective} PGAS object is allocated entirely in the allocating node's local memory, but is registered globally. Consequently, a single global pointer is recorded in the runtime, with corresponding permission flags, for the context of some user-defined group of associated nodes. Comparatively, a \textit{collective} PGAS object is allocated such that a partition of the object (i.e., a sub-array of its representation) is stored on each of the associated nodes -- for a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each pointing to the same object with a different offset and an (intuitively) independently-chosen virtual address. Note that this design naturally requires virtual addresses within each node to be \emph{pinned} -- the allocated object cannot be re-addressed to a different virtual address, which prevents the global pointer that records the local virtual address from becoming spontaneously invalidated. Similar schemes can be observed in other PGAS backends/runtimes, although they may opt to use a map-like data structure for addressing instead.

In general, although both PGAS and DSM systems provide memory management over remote nodes, PGAS frameworks provide no transparent caching and transfer of remote memory objects accessed by local nodes. The programmer is still expected to handle data/thread movement manually when working with shared memory over the network in order to maximize the performance metrics of interest.

\subsection{Message Passing}

\textit{Message passing} remains the predominant programming model for parallelism between loosely-coupled nodes within a computer system, much as it is ubiquitous in supporting all levels of abstraction within any concurrent components of a computer system. Specific to cluster computing is the message-passing programming model, in which parallel programs (or instances of the same parallel program) on different nodes within the system communicate by exchanging messages over the network. Such models trade programming-model productivity for more fine-grained control over the messages passed, as well as a more explicit separation between communication and computation stages within a programming subproblem.

Commonly, message-passing backends function as \textit{middlewares} -- communication runtimes -- that aid distributed software development \cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a backend exposes facilities for inter-application communication to frontend developers while transparently providing security, accounting, and fault tolerance, much like how an operating system provides resource management, scheduling, and security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}. This is the case for implementations of the PGAS programming model, which mostly rely on common message-passing backends to facilitate orchestrated data manipulation across distributed nodes. Likewise, message-passing backends, including RDMA APIs, form the backbone of many research-oriented DSM systems \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message passing between network-connected nodes may be \textit{two-sided} or \textit{one-sided}. The former models an intuitive workflow for sending and receiving datagrams over the network: the sender initiates a transfer; the receiver copies a received packet from the network card into a kernel buffer; the receiver's kernel filters the packet and (optionally) \cite{FreeBSD.man-BPF-4.2021} copies the internal message into the message-passing runtime/middleware's address space; the receiver's middleware then inspects the copied message and acts on it accordingly, likely also copying slices of the message data into some registered distributed shared memory buffer for the distributed application to access, as in the sketch below.
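The sketch below illustrates the two-sided receive path with generic POSIX sockets (it is not taken from any particular middleware): every byte is first copied by the kernel from the NIC into socket buffers and then copied again into user space by \texttt{recv()}, with the receiver's CPU involved at every step.

\begin{verbatim}
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Two-sided receive path: the receiver's kernel has already copied the
 * packet from the NIC into its socket buffer; recv() then copies the
 * payload again into the middleware's user-space buffer. Both copies,
 * plus any filtering, consume CPU time on the receiving node. */
static ssize_t receive_into_dsm_buffer(int sock, void *dsm_buf, size_t len)
{
    char staging[4096];
    ssize_t n = recv(sock, staging, sizeof(staging), 0); /* kernel -> user copy */
    if (n <= 0)
        return n;

    if ((size_t)n > len)
        n = (ssize_t)len;
    memcpy(dsm_buf, staging, (size_t)n); /* copy into the shared buffer */
    return n;
}
\end{verbatim}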
Despite being a highly intuitive model of data manipulation over the network, this workflow poses a fundamental performance issue: upon reception of each message, both the receiver's kernel and its userspace must spend CPU time moving the received data from the bytes read off the NIC into userspace buffers. Because this happens concurrently with other kernel and userspace routines, a preemptible kernel may incur significant latency if the kernel routine for packet filtering is preempted by another kernel routine, by userspace, or by IRQs.

Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows the network interface card to bypass in-kernel packet filters and perform DMA directly on registered memory regions. The NIC can then notify the CPU via interrupts, allowing kernel and userspace programs to run their callbacks at reception time with reduced latency, as sketched below. Because of this advantage, many recent studies leverage RDMA APIs to improve distributed data workloads and to build DSM middlewares \cites{Lu_etal.Spark_over_RDMA.2014}{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
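For concreteness, the fragment below sketches how a one-sided read is posted with the InfiniBand verbs API. Queue-pair setup and the out-of-band exchange of the remote address and \texttt{rkey} are assumed to have already happened and are omitted.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA read: the NIC fetches remote memory directly
 * into local_buf without involving the remote CPU. local_mr must have
 * been registered beforehand with ibv_reg_mr(). */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local destination buffer      */
        .length = len,
        .lkey   = local_mr->lkey,         /* from ibv_reg_mr()             */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,   /* one-sided: no receiver CPU    */
        .send_flags = IBV_SEND_SIGNALED,  /* request a completion entry    */
        .wr.rdma = {
            .remote_addr = remote_addr,   /* advertised by the remote node */
            .rkey        = rkey,
        },
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}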
\section{Consistency Model and Cache Coherence}

A consistency model specifies a contract on the allowed behaviors of multiprocessing programs with regard to shared memory \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious conflict, which consistency models aim to resolve, lies in the interaction between processor-native programs and multiprocessors, all of which need to operate on shared memory through heterogeneous cache topologies; here, a well-defined consistency model resolves the conflict at the architectural level. Beyond consistency models for bare-metal systems, programming languages \cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024} and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for parallel access to shared memory on top of program-order guarantees, so as to explicate program behavior under shared-memory parallel programming across underlying implementations.

Related to the definition of a consistency model is the coherence problem, which arises whenever multiple actors hold copies of some datum that must be kept synchronized in the face of write accesses \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. While less relevant to programming language design, coherence must be maintained via a coherence protocol \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020} in systems of both microarchitectural and network scales. For DSM systems, the design of a correct and performant coherence protocol is of especially high priority and is a major part of many studies of DSM systems throughout history \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Pinto_etal.Thymesisflow.2020}{Endo_Sato_Taura.MENPS_DSM.2020}{Couceiro_etal.D2STM.2009}.

\subsection{Common Consistency Models}

Commonly encountered models range from strong models such as sequential consistency, through processor-centric models such as total store order, to weaker models such as release consistency \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}; release consistency in particular underlies several of the DSM systems discussed above \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}.
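To make the contract concrete, the generic C11 fragment below (not code from any of the systems cited above) shows the classic message-passing idiom: the release store and the acquire load order the access to \texttt{payload} across threads, whereas with relaxed ordering the reader could legally observe \texttt{flag} set but a stale \texttt{payload}.

\begin{verbatim}
#include <stdatomic.h>

int payload;            /* ordinary (non-atomic) data */
atomic_int flag = 0;

/* Writer thread: publish the payload, then release-store the flag. */
void producer(void)
{
    payload = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Reader thread: once the acquire load observes flag == 1, the write to
 * payload is guaranteed to be visible -- the happens-before edge that a
 * release-consistent model provides at synchronization points. */
int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                /* spin until published */
    return payload;      /* guaranteed to read 42 */
}
\end{verbatim}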
\subsection{Consistency Model in DSM}

While a distributed shared memory system with node-local caching naturally implies the existence of a corresponding memory model, only a subset of DSM studies (cites\dots) characterize their systems against one of the well-known memory models. Notably, \dots
% about Munin to Spark to access-pattern vs. consistency

\subsection{Coherence Protocol}

\subsection{DMA and Cache Coherence}

\subsection{Cache Coherence in ARMv8}

% I need to read more into this. Most of the contribution comes from CPU caches,
% less so for DSM systems.
\textbf{[Talk about JIAJIA and TreadMarks' coherence protocol.]} Consistency and communication protocols naturally affect the cost of each faulted memory access \dots \textbf{[Talk about directory, transactional, scope, and library cache coherence, which allow for multi-casted communications at page fault but all with different levels of book-keeping.]}

\printbibliography
\end{document}