Added ftrace visualization code
parent bab62e1974, commit 518f7d1bf5
14 changed files with 1070 additions and 41 deletions
@ -9,10 +9,11 @@
Though large-scale cluster systems remain the dominant solution for request- and
data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011},
there has been a resurgence in applying HPC techniques (e.g., DSM) towards more
efficient heterogeneous computation, with more tightly-coupled heterogeneous nodes
providing (hardware) acceleration for one another
\cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}.
Orthogonally, within the scope of one motherboard,
\emph{heterogeneous memory management (HMM)} enables the use of an
OS-controlled, unified memory view across both main memory and device memory
\cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc
function calls as one would with SMP programming, the underlying complexities of
@ -50,11 +51,46 @@ This thesis paper builds upon an ongoing research effort in implementing a
tightly coupled cluster where HMM abstractions allow for transparent RDMA access
from accelerator nodes to local data and migration of data near computation,
leveraging different consistency models and coherency protocols to amortize the
communication cost for shared data. More specifically, this thesis explores the
following:

\begin{itemize}
    \item {
        The effect of cache coherency maintenance, specifically OS-initiated
        maintenance, on RDMA programs (see the sketch following this list).
    }
    \item {
        Implementation of cache coherency in cache-incoherent, kernel-side RDMA
        clients.
    }
    \item {
        Discussion of memory models and coherence protocol designs for a
        single-writer, multi-reader RDMA-based DSM system.
    }
\end{itemize}
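
As a concrete illustration of what ``OS-initiated'' maintenance can look like,
the following is a minimal, hypothetical sketch (the structure and helper names
are illustrative, not taken from the system studied here) of a kernel-side RDMA
client bracketing accesses to a DMA buffer with the Linux streaming-DMA
synchronization calls; it assumes the buffer has already been mapped with
\texttt{dma\_map\_single()}.

\begin{verbatim}
#include <linux/dma-mapping.h>
#include <linux/device.h>

struct rdma_buf {
    struct device *dev;   /* device performing the RDMA transfers */
    void          *cpu;   /* kernel virtual address of the buffer */
    dma_addr_t     dma;   /* bus address handed to the NIC        */
    size_t         len;
};

/* Before the CPU reads data a remote peer wrote via RDMA: sync "for cpu"
 * so stale cache lines are not returned to the reader. */
static void rdma_buf_begin_cpu_read(struct rdma_buf *b)
{
    dma_sync_single_for_cpu(b->dev, b->dma, b->len, DMA_FROM_DEVICE);
}

/* After the CPU fills the buffer, before posting an RDMA send/write:
 * sync "for device" so the NIC observes the CPU's stores. */
static void rdma_buf_end_cpu_write(struct rdma_buf *b)
{
    dma_sync_single_for_device(b->dev, b->dma, b->len, DMA_TO_DEVICE);
}
\end{verbatim}

On cache-coherent platforms these calls degenerate to (almost) no-ops; on
cache-incoherent platforms they perform the kind of OS-initiated cache
maintenance whose effect on RDMA programs is examined here.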

The rest of the chapter is structured as follows:
\begin{itemize}
    \item {
        We identify and discuss notable developments in software-implemented
        DSM systems, highlighting the key features of contemporary advancements
        in DSM techniques that differentiate them from their predecessors.
    }
    \item {
        We survey alternative (shared-memory) programming paradigms and
        compare them with DSM, which seeks to provide a transparent shared
        address space among participating nodes.
    }
    \item {
        We give an overview of coherency protocols and consistency models for
        multi-sharer DSM systems.
    }
    \item {
        We provide a primer on cache coherency in ARM64 systems, which
        \emph{do not} guarantee cache-coherent DMA,
        as opposed to x86 systems \cite{Ven.LKML_x86_DMA.2008}
        (a minimal sketch of the required maintenance follows this list).
    }
\end{itemize}
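
To make the last point concrete: on ARM64, keeping caches consistent with a DMA
master requires explicit cache-maintenance instructions by virtual address. The
following is a minimal, hypothetical sketch of a clean-and-invalidate routine,
roughly what the portable DMA-synchronization calls shown earlier reduce to on
cache-incoherent arm64 platforms. It is not taken from any particular codebase;
it hard-codes a 64-byte cache line for brevity (real implementations derive the
line size from \texttt{CTR\_EL0}) and assumes a kernel-like execution context
where these instructions are permitted.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64UL   /* assumption: 64-byte cache lines */

/* Clean and invalidate [addr, addr + len) to the Point of Coherency so the
 * CPU and a DMA master (e.g., an RDMA NIC) agree on the buffer contents. */
static void clean_inval_range(void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        asm volatile("dc civac, %0" : : "r"(p) : "memory");

    /* Wait for the maintenance to complete before any later access. */
    asm volatile("dsb sy" : : : "memory");
}
\end{verbatim}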

\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s
@ -81,9 +117,9 @@ New developments in network interfaces provides much improved bandwidth and late
compared to Ethernet in the 1990s. RDMA-capable NICs have been shown to improve
training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC,
scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for \textit{APACHE Spark} \cite{Lu_etal.Spark_over_RDMA.2014}
and SMBDirect \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a
resurgence of interest in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.

% Different to DSM-over-RDMA, we try to expose RDMA as device with HMM capability
@ -108,11 +144,11 @@ of the DSM system.

Perhaps most importantly, experiences from Munin show that \emph{restricting the
flexibility of the programming model can lead to more performant coherence models},
as exhibited by the now-foundational \textit{Resilient Distributed Datasets} paper
\cite{Zaharia_etal.RDD.2012}, which underpins the now-popular
\textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023} (a similarly
restricted model underlies \textit{Hadoop MapReduce}
\cite{WEB.APACHE..Apache_Hadoop.2023}). ``To achieve fault
tolerance efficiently, RDDs provide a restricted form of shared memory
[based on]\dots transformations rather than\dots updates to shared state''
\cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to
@ -227,7 +263,7 @@ network has been made apparent since the 1980s, predominant approaches to
\cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies
manual/controlled data sharding over nodes, separation of compute and
communication ``stages'' of computation, etc., which benefit performance
analysis and engineering.
    }
    \item {
        Enterprise applications value throughput and uptime of relatively
@ -250,7 +286,6 @@ as backends to provide the PGAS model over various network interfaces/platforms
(e.g., Ethernet and Infiniband)\cites{WEB.LBNL.UPC_man_1_upcc.2022}
{WEB.HPE.Chapel_Platforms-v1.33.2023}.

Notably, implementing a \emph{global} address space on top of machines already
equipped with their own \emph{local} address space (e.g.,
cluster nodes running commercial Linux) necessitates a global addressing
@ -263,7 +298,7 @@ allocating node's memory, but registered globally. Consequently, a single global
pointer is recorded in the runtime with corresponding permission flags for the
context of some user-defined group of associated nodes. Comparatively, a
\textit{collective} PGAS object is allocated such that a partition of the object
(i.e., a sub-array of the representation) is stored in each of the associated
nodes -- for a $k$-partitioned object, $k$ global pointers are recorded in the
runtime, each pointing to the same object, with different offsets and (naturally)
independently-chosen virtual addresses. Note that this design naturally requires
@ -272,33 +307,36 @@ cannot be re-addressed to a different virtual address i.e., the global pointer
that records the local virtual address cannot be auto-invalidated.
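
To make the addressing scheme above concrete, the following is a minimal,
hypothetical sketch (the names and fields are illustrative assumptions, not any
particular PGAS runtime's API) of how a runtime might record the $k$ global
pointers of a collective object. Because each record bakes in the owner's local
virtual address at registration time, the buffer cannot later be transparently
re-addressed, exactly as noted above.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t  node;      /* rank that owns this partition                  */
    uint32_t  perms;     /* permission flags within the associated group   */
    uint64_t  offset;    /* offset of this partition within the object     */
    uintptr_t local_va;  /* owner's virtual address, fixed at registration */
} global_ptr_t;

/* A k-partitioned collective object is k such records naming the same
 * logical object: same object id, different {node, offset, local_va}. */
typedef struct {
    uint64_t     object_id;
    size_t       nparts;
    global_ptr_t part[];   /* one entry per associated node */
} collective_obj_t;

/* Map a global byte index to (owning node, address on that node), assuming
 * equally-sized partitions of part_size bytes; the caller must keep the
 * index within the object's bounds. */
static global_ptr_t locate(const collective_obj_t *o, uint64_t index,
                           uint64_t part_size)
{
    global_ptr_t g = o->part[index / part_size];
    g.local_va += index % part_size;   /* only dereferenceable on g.node */
    return g;
}
\end{verbatim}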

Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
opt to use a map-like data structure for addressing instead. In general, although
both PGAS and DSM systems provide memory management over remote nodes, PGAS
frameworks provide no transparent caching and transfer of remote memory objects
accessed by local nodes. The programmer is still expected to handle data/thread
movement manually when working with shared memory over the network to maximize
their performance metrics of interest.

\dots

Improvement in NIC bandwidth and transfer rate benefits DSM applications that
expose a global address space, and those that leverage single-writer capabilities
over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
technologies, for example OpenSHMEM, OpenMPI, and Cray Chapel, that leverage
specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA
access]}.

\subsection{Message Passing}

Contemporary works on DSM systems focus more on leveraging hardware advancements
to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
for example, implements a complex system for memory disaggregation over multiple
compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
they observed significant performance improvements for existing data-intensive
processing frameworks (for example, APACHE Spark, Memcached, and Redis) over
no-disaggregation (i.e., using node-local memory only, similar to cluster
computing) systems.
% \dots
\subsection{Programming Model}
\subsection{Data to Process, or Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks, and makes the replacement problem considerably
harder.)