Added ftrace visualization code

Zhengyi Chen 2024-02-17 22:39:46 +00:00
parent bab62e1974
commit 518f7d1bf5
14 changed files with 1070 additions and 41 deletions


@ -9,10 +9,11 @@
Though large-scale cluster systems remain the dominant solution for request and
data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011},
there has been a resurgence of interest in applying HPC techniques (e.g., DSM) for more
efficient heterogeneous computation, with more tightly coupled heterogeneous nodes
providing (hardware) acceleration for one another
\cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}.
Orthogonally, within the scope of one motherboard,
\emph{heterogeneous memory management (HMM)} enables the use of
an OS-controlled, unified memory view across both main memory and device memory
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would with SMP programming, the underlying complexities of
@ -50,11 +51,46 @@ This thesis paper builds upon an ongoing research effort in implementing a
tightly coupled cluster where HMM abstractions allow for transparent RDMA access
from accelerator nodes to local data and migration of data near computation,
leveraging different consistency models and coherency protocols to amortize the
communication cost for shared data. More specifically, this thesis explores the
following:
\begin{itemize}
\item {
The effect of cache-coherency maintenance, specifically OS-initiated
maintenance, on RDMA programs.
}
\item {
Implementation of cache coherency in cache-incoherent, kernel-side RDMA
clients; a minimal sketch of the OS-initiated maintenance involved follows
this list.
}
\item {
Discussion of memory models and coherence protocol designs for a
single-writer, multi-reader RDMA-based DSM system.
}
\end{itemize}
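As a minimal sketch (not the system's actual implementation) of what OS-initiated
coherency maintenance can look like in a kernel-side RDMA client, the fragment below
brackets a device write with the generic Linux DMA API; real RDMA clients typically
reach the same code paths through the \texttt{ib\_dma\_*} wrappers. The function and
buffer names are ours, and error paths are trimmed.
\begin{verbatim}
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/*
 * Hypothetical kernel-side receive path: the RNIC DMA-writes into buf,
 * after which the CPU must not read stale cache lines.
 */
static int rdma_recv_into(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        /* Map for device access; on a cache-incoherent architecture this
         * cleans/invalidates the covering cache lines as needed. */
        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... post the RDMA work request and wait for its completion ... */

        /* Hand ownership back to the CPU: on a cache-incoherent host this
         * invalidates the buffer's lines so loads see the DMA'd data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        /* The CPU may now safely parse buf. */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
}
\end{verbatim}
On a cache-coherent x86 host the sync call is effectively a no-op, whereas on a
cache-incoherent ARM64 host it performs the cache-line maintenance studied in the
remainder of this thesis.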
The rest of the chapter is structured as follows:
\begin{itemize}
\item {
We identify and discuss notable developments in software-implemented
DSM systems, highlighting the key features that differentiate
contemporary DSM techniques from their predecessors.
}
\item {
We identify alternative (shared memory) programming paradigms and
compare them with DSM, which sought to provide a transparent shared
address space among participating nodes.
}
\item {
We give an overview of coherency protocols and consistency models for
multi-sharer DSM systems.
}
\item {
We provide a primer on cache coherency in ARM64 systems, which
\emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems
\cite{Ven.LKML_x86_DMA.2008}; a short example of the required cache
maintenance follows this list.
}
\end{itemize}
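To preview that primer, the fragment below sketches the kind of by-VA cache
maintenance that the kernel's DMA-mapping layer performs on cache-incoherent
systems. The 64-byte line size and the helper name are assumptions for
illustration; production code derives the line size from \texttt{CTR\_EL0}.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: clean and invalidate [buf, buf + len) to the Point
 * of Coherency so that a non-coherent DMA master and the CPU agree on
 * the buffer's contents. Assumes 64-byte cache lines.
 */
static inline void clean_inval_to_poc(const void *buf, size_t len)
{
        const uintptr_t line = 64;
        uintptr_t addr = (uintptr_t)buf & ~(line - 1);
        uintptr_t end  = (uintptr_t)buf + len;

        for (; addr < end; addr += line)
                asm volatile("dc civac, %0" : : "r"(addr) : "memory");

        /* Wait for the maintenance to complete before the device (or the
         * CPU) touches the buffer again. */
        asm volatile("dsb sy" : : : "memory");
}
\end{verbatim}
x86, by contrast, snoops DMA traffic in hardware, which is why no analogous
maintenance appears in its DMA paths \cite{Ven.LKML_x86_DMA.2008}.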
\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s
@ -81,9 +117,9 @@ New developments in network interfaces provides much improved bandwidth and late
compared to ethernet in the 1990s. RDMA-capable NICs have been shown to improve
the training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC,
scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for APACHE Spark \cite{Lu_etal.Spark_over_RDMA.2014}
and SMBDirect \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a
resurgence of interest in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.
% Different from DSM-over-RDMA, we try to expose RDMA as a device with HMM capability
@ -108,11 +144,11 @@ of the DSM system.
Perhaps most importantly, experiences from Munin show that \emph{restricting the
flexibility of the programming model can lead to more performant coherence models}, as
exhibited by the now-foundational \textit{Resilient Distributed Datasets} (RDD) paper
\cite{Zaharia_etal.RDD.2012}, which underpins the now-popular scalable data
processing framework \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023},
itself a successor to \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023}.
``To achieve fault
tolerance efficiently, RDDs provide a restricted form of shared memory
[based on]\dots transformations rather than\dots updates to shared state''
\cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to
@ -227,7 +263,7 @@ network has been made apparent since the 1980s, predominant approaches to
\cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies
manual/controlled data sharding over nodes, separation of compute and
communication ``stages'' of computation, etc., which benefit performance
analysis and engineering.
}
\item {
Enterprise applications value throughput and uptime of relatively
@ -250,7 +286,6 @@ as backends to provide the PGAS model over various network interfaces/platforms
(e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}
{WEB.HPE.Chapel_Platforms-v1.33.2023}.
Notably, implementing a \emph{global} address space on top of machines already
equipped with their own \emph{local} address spaces (e.g., cluster nodes running
commercial Linux) necessitates a global addressing
@ -263,7 +298,7 @@ allocating node's memory, but registered globally. Consequently, a single global
pointer is recorded in the runtime with corresponding permission flags for the
context of some user-defined group of associated nodes. Comparatively, a
\textit{collective} PGAS object is allocated such that a partition of the object
(i.e., a sub-array of the repr) is stored in each of the associated nodes -- for
a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each
pointing to the same object with different offsets and independently-chosen
virtual addresses. Note that this design naturally requires
@ -272,33 +307,36 @@ cannot be re-addressed to a different virtual address i.e., the global pointer
that records the local virtual address cannot be auto-invalidated.
Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
opt to use a map-like data structure for addressing instead. In general, although
both PGAS and DSM systems provide memory management over remote nodes, PGAS
frameworks provide no transparent caching and transfer of remote memory objects
accessed by local nodes. The programmer is still expected to handle data/thread
movement manually when working with shared memory over the network to maximize
their performance metrics of interest.
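To make the addressing scheme concrete, the sketch below shows one plausible
runtime representation of a global pointer and of a $k$-partitioned collective
object, together with the lookup a PGAS backend might perform; the struct and
function names are illustrative and not taken from any particular runtime.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Illustrative global pointer: a node (rank) identifier plus a virtual
 * address that is only meaningful in that node's local address space. */
struct global_ptr {
        uint32_t  node;
        uintptr_t local_addr;
};

/* A k-partitioned collective object keeps one global pointer per
 * partition; partitions are equally sized, contiguous blocks. */
struct collective_obj {
        size_t nelems;             /* total number of elements         */
        size_t elem_sz;            /* size of one element in bytes     */
        size_t k;                  /* number of partitions             */
        struct global_ptr part[8]; /* one entry per partition (cap: 8) */
};

/* Resolve a global element index to (owning node, local address). */
static struct global_ptr resolve(const struct collective_obj *obj, size_t idx)
{
        size_t per_part = (obj->nelems + obj->k - 1) / obj->k;  /* ceil */
        struct global_ptr gp = obj->part[idx / per_part];

        gp.local_addr += (idx % per_part) * obj->elem_sz;
        return gp;
}
\end{verbatim}
A dereference through such a pointer still has to be turned into an explicit RDMA
or message operation by the backend; nothing is cached locally on behalf of the
programmer, which is precisely the contrast with DSM drawn above.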
\dots
% Improvement in NIC bandwidth and transfer rate benefits DSM applications that expose
% global address space, and those that leverage single-writer capabilities over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
% technologies for example Openshmem, OpenMPI, Cray Chapel, etc. that leverage
% specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
\subsection{Message Passing}
% Contemporary works on DSM systems focus more on leveraging hardware advancements
% to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
% for example, implements a complex system for memory disaggregation over multiple
% compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
% they observed significant performance improvements over existing data-intensive
% processing frameworks, for example APACHE Spark, Memcached, and Redis, over
% no-disaggregation (i.e., using node-local memory only, similar to cluster computing)
% systems.
% \dots
% \subsection{Programming Model}
\subsection{Data to Process, or Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks and greatly complicates the replacement problem.)