Added ftrace visualization code

Zhengyi Chen 2024-02-17 22:39:46 +00:00
parent bab62e1974
commit 518f7d1bf5
14 changed files with 1070 additions and 41 deletions


@ -9,10 +9,11 @@
Though large-scale cluster systems remain the dominant solution for request and
data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011},
there has been a resurgence of interest in applying HPC techniques (e.g., DSM) for more
efficient heterogeneous computation, with more tightly coupled heterogeneous nodes
providing (hardware) acceleration for one another
\cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}.
Orthogonally, within the scope of one motherboard,
\emph{heterogeneous memory management (HMM)} enables the use of
an OS-controlled, unified memory view across both main memory and device memory
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would with SMP programming, the underlying complexities of
@ -50,11 +51,46 @@ This thesis paper builds upon an ongoing research effort in implementing a
tightly coupled cluster where HMM abstractions allow for transparent RDMA access
from accelerator nodes to local data and migration of data near computation,
leveraging different consistency models and coherency protocols to amortize the
communication cost for shared data. More specifically, this thesis explores the
following:
\begin{itemize}
\item {
The effect of cache-coherency maintenance, specifically OS-initiated
maintenance, on RDMA programs.
}
\item {
Implementation of cache coherency in cache-incoherent, kernel-side RDMA
clients; a minimal sketch of the OS-initiated maintenance involved follows
this list.
}
\item {
Discussion of memory models and coherence protocol designs for a
single-writer, multi-reader RDMA-based DSM system.
}
\end{itemize}
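As a minimal sketch (not the system's actual implementation) of what OS-initiated
coherency maintenance can look like in a kernel-side RDMA client, the fragment below
brackets a device write with the generic Linux DMA API; real RDMA clients typically
reach the same code paths through the \texttt{ib\_dma\_*} wrappers. The function and
buffer names are ours, and error paths are trimmed.
\begin{verbatim}
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/*
 * Hypothetical kernel-side receive path: the RNIC DMA-writes into buf,
 * after which the CPU must not read stale cache lines.
 */
static int rdma_recv_into(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        /* Map for device access; on a cache-incoherent architecture this
         * cleans/invalidates the covering cache lines as needed. */
        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... post the RDMA work request and wait for its completion ... */

        /* Hand ownership back to the CPU: on a cache-incoherent host this
         * invalidates the buffer's lines so loads see the DMA'd data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        /* The CPU may now safely parse buf. */

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
}
\end{verbatim}
On a cache-coherent x86 host the sync call is effectively a no-op, whereas on a
cache-incoherent ARM64 host it performs the cache-line maintenance studied in the
remainder of this thesis.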
The rest of the chapter is structured as follows:
\begin{itemize}
\item {
We identify and discuss notable developments in software-implemented
DSM systems, highlighting the key features that differentiate
contemporary DSM techniques from their predecessors.
}
\item {
We identify alternative (shared memory) programming paradigms and
compare them with DSM, which sought to provide a transparent shared
address space among participating nodes.
}
\item {
We give an overview of coherency protocols and consistency models for
multi-sharer DSM systems.
}
\item {
We provide a primer on cache coherency in ARM64 systems, which
\emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems
\cite{Ven.LKML_x86_DMA.2008}; a short example of the required cache
maintenance follows this list.
}
\end{itemize}
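To preview that primer, the fragment below sketches the kind of by-VA cache
maintenance that the kernel's DMA-mapping layer performs on cache-incoherent
systems. The 64-byte line size and the helper name are assumptions for
illustration; production code derives the line size from \texttt{CTR\_EL0}.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: clean and invalidate [buf, buf + len) to the Point
 * of Coherency so that a non-coherent DMA master and the CPU agree on
 * the buffer's contents. Assumes 64-byte cache lines.
 */
static inline void clean_inval_to_poc(const void *buf, size_t len)
{
        const uintptr_t line = 64;
        uintptr_t addr = (uintptr_t)buf & ~(line - 1);
        uintptr_t end  = (uintptr_t)buf + len;

        for (; addr < end; addr += line)
                asm volatile("dc civac, %0" : : "r"(addr) : "memory");

        /* Wait for the maintenance to complete before the device (or the
         * CPU) touches the buffer again. */
        asm volatile("dsb sy" : : : "memory");
}
\end{verbatim}
x86, by contrast, snoops DMA traffic in hardware, which is why no analogous
maintenance appears in its DMA paths \cite{Ven.LKML_x86_DMA.2008}.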
\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s
@ -81,9 +117,9 @@ New developments in network interfaces provides much improved bandwidth and late
compared to ethernet in the 1990s. RDMA-capable NICs have been shown to improve
the training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC,
scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for APACHE Spark \cite{Lu_etal.Spark_over_RDMA.2014}
and SMBDirect \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a
resurgence of interest in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.
% Different from DSM-over-RDMA, we try to expose RDMA as a device with HMM capability
@ -108,11 +144,11 @@ of the DSM system.
Perhaps most importantly, experiences from Munin show that \emph{restricting the
flexibility of the programming model can lead to more performant coherence models}, as
exhibited by the now-foundational \textit{Resilient Distributed Datasets} (RDD) paper
\cite{Zaharia_etal.RDD.2012}, which underpins the now-popular scalable data
processing framework \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023},
itself a successor to \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023}.
``To achieve fault
tolerance efficiently, RDDs provide a restricted form of shared memory
[based on]\dots transformations rather than\dots updates to shared state''
\cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to
@ -227,7 +263,7 @@ network has been made apparent since the 1980s, predominant approaches to
\cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies
manual/controlled data sharding over nodes, separation of compute and
communication ``stages'' of computation, etc., which benefit performance
analysis and engineering.
}
\item {
Enterprise applications value throughput and uptime of relatively
@ -250,7 +286,6 @@ as backends to provide the PGAS model over various network interfaces/platforms
(e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}
{WEB.HPE.Chapel_Platforms-v1.33.2023}.
Notably, implementing a \emph{global} address space on top of machines already
equipped with their own \emph{local} address spaces (e.g., cluster nodes running
commercial Linux) necessitates a global addressing
@ -263,7 +298,7 @@ allocating node's memory, but registered globally. Consequently, a single global
pointer is recorded in the runtime with corresponding permission flags for the
context of some user-defined group of associated nodes. Comparatively, a
\textit{collective} PGAS object is allocated such that a partition of the object
(i.e., a sub-array of the repr) is stored in each of the associated nodes -- for
a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each
pointing to the same object with different offsets and independently-chosen
virtual addresses. Note that this design naturally requires
@ -272,33 +307,36 @@ cannot be re-addressed to a different virtual address i.e., the global pointer
that records the local virtual address cannot be auto-invalidated.
Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
opt to use a map-like data structure for addressing instead. In general, although
both PGAS and DSM systems provide memory management over remote nodes, PGAS
frameworks provide no transparent caching and transfer of remote memory objects
accessed by local nodes. The programmer is still expected to handle data/thread
movement manually when working with shared memory over the network to maximize
their performance metrics of interest.
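To make the addressing scheme concrete, the sketch below shows one plausible
runtime representation of a global pointer and of a $k$-partitioned collective
object, together with the lookup a PGAS backend might perform; the struct and
function names are illustrative and not taken from any particular runtime.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Illustrative global pointer: a node (rank) identifier plus a virtual
 * address that is only meaningful in that node's local address space. */
struct global_ptr {
        uint32_t  node;
        uintptr_t local_addr;
};

/* A k-partitioned collective object keeps one global pointer per
 * partition; partitions are equally sized, contiguous blocks. */
struct collective_obj {
        size_t nelems;             /* total number of elements         */
        size_t elem_sz;            /* size of one element in bytes     */
        size_t k;                  /* number of partitions             */
        struct global_ptr part[8]; /* one entry per partition (cap: 8) */
};

/* Resolve a global element index to (owning node, local address). */
static struct global_ptr resolve(const struct collective_obj *obj, size_t idx)
{
        size_t per_part = (obj->nelems + obj->k - 1) / obj->k;  /* ceil */
        struct global_ptr gp = obj->part[idx / per_part];

        gp.local_addr += (idx % per_part) * obj->elem_sz;
        return gp;
}
\end{verbatim}
A dereference through such a pointer still has to be turned into an explicit RDMA
or message operation by the backend; nothing is cached locally on behalf of the
programmer, which is precisely the contrast with DSM drawn above.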
\dots
% Improvement in NIC bandwidth and transfer rate benefits DSM applications that expose
% global address space, and those that leverage single-writer capabilities over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
% technologies for example Openshmem, OpenMPI, Cray Chapel, etc. that leverage
% specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
\subsection{Message Passing}
% Contemporary works on DSM systems focus more on leveraging hardware advancements
% to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
% for example, implements a complex system for memory disaggregation over multiple
% compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
% they observed significant performance improvements over existing data-intensive
% processing frameworks, for example APACHE Spark, Memcached, and Redis, over
% no-disaggregation (i.e., using node-local memory only, similar to cluster computing)
% systems.
% \dots
% \subsection{Programming Model}
\subsection{Data to Process, or Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks and greatly complicates the replacement problem.)