Finished abstract, intro
This commit is contained in:
parent 2a48463944
commit 37728272e8
2 changed files with 17 additions and 28 deletions
Binary file not shown.
@@ -49,7 +49,7 @@
\begin{document}
\begin{preliminary}
\title{Analysis of Software-Initiated Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\title{Analysis of Software-Maintained Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\author{Zhengyi Chen}
@@ -77,7 +77,7 @@
\date{\today}
\abstract{
\textcolor{red}{[TODO] \dots}
Advancements in network interface hardware and operating system capabilities sometimes render historically unpopular computer system architectures feasible again. One unusual example is software DSM, which may regain its relevance as a unique solution to sharing hardware acceleration resources beyond hypervisor-level allocation, made feasible by exploiting existing Linux kernel features such as \textit{heterogeneous memory management} alongside RDMA-capable network interfaces. However, building such a DSM system between compute nodes of different ISAs is far from trivial. We note in particular that, unlike x86, many RISC ISAs (e.g., ARMv8, RISC-V) do not guarantee cache coherence between the CPU and DMA engines in hardware. Instead, such architectures define cache coherence operations at the instruction level, which nevertheless consume CPU time. To better advise DSM design for such systems, we measure the latency of software-initiated cache coherency operations on an emulated ARMv8 processor without hardware cache coherence across a variety of scenarios, focusing solely on Linux's source-code implementation and instrumentation mechanisms. We find that the latency of software-initiated cache coherency operations grows with the size of the address subspace on which the coherency operation is performed, though the relationship between contiguous allocation size and latency is non-linear.
}
\maketitle
@@ -123,25 +123,14 @@ from the Informatics Research Ethics committee.
\end{preliminary}
\chapter{Introduction}
\dots
This thesis builds upon an ongoing research effort in implementing a tightly coupled cluster where HMM abstractions allow for transparent RDMA access from accelerator nodes to local data and for migration of data near computation, leveraging different consistency models and coherency protocols to amortize the communication cost of shared data. More specifically, this thesis explores the following:
\begin{itemize}
\item {
The effect of cache coherency maintenance, specifically when OS-initiated, on RDMA programs; a hedged sketch of the kernel calls involved is given after this list.
}
\item {
Discussion of memory models and coherence protocol designs for a single-writer, multi-reader RDMA-based DSM system.
}
\end{itemize}
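
To make the first point above concrete, the following sketch (a minimal illustration under stated assumptions, not the measurement code used in this thesis) shows where OS-initiated cache maintenance typically occurs around a DMA transfer when a Linux driver uses the streaming DMA-mapping API; on platforms without hardware CPU--DMA coherence, such as many ARMv8 boards, the map and sync calls translate into data-cache clean and invalidate operations over the mapped range. The device pointer, buffer, and length are placeholders supplied by the surrounding driver.

\begin{verbatim}
/* Hedged sketch (not the thesis's measurement code): where OS-initiated
 * cache maintenance occurs around a DMA transfer using Linux's streaming
 * DMA-mapping API.  dev, buf and len are placeholders supplied by the
 * surrounding driver. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int dsm_dma_roundtrip(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        /* Map for device access; on a non-hardware-coherent ARMv8 system
         * this cleans (writes back) the CPU cache lines covering buf. */
        handle = dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... the device performs (R)DMA into/out of buf here ... */

        /* Hand ownership back to the CPU; on the same system this
         * invalidates stale cache lines so the CPU sees the DMA'd data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_BIDIRECTIONAL);

        /* ... the CPU reads or modifies buf ... */

        /* Hand ownership back to the device before any further DMA. */
        dma_sync_single_for_device(dev, handle, len, DMA_BIDIRECTIONAL);

        dma_unmap_single(dev, handle, len, DMA_BIDIRECTIONAL);
        return 0;
}
\end{verbatim}

On hardware-coherent platforms the same calls amount to little more than bookkeeping, which is precisely the asymmetry this thesis sets out to quantify.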
\chapter{Background}\label{chapter:background}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement automatically managed by the OS kernel. However, while HMM promises a distributed shared memory approach towards exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts in exploiting HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Limited effort has been made to incorporate HMM into other variants of accelerators in various system topologies.
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within one cluster node, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017} -- all while using the same \textit{libc} function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement automatically managed by the OS kernel. However, while HMM promises a distributed shared memory approach towards exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts in exploiting HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Limited effort has been made to incorporate HMM into other variants of accelerators in various system topologies.
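
As a hedged illustration of the programming model described above (not code from any particular HMM driver), the fragment below obtains a buffer from plain \texttt{malloc()}, hands it to an accelerator, and then touches it again from the CPU without any explicit host-to-device copies; \texttt{accel\_submit()} is a hypothetical placeholder for a driver-specific submission interface.

\begin{verbatim}
/* Hedged user-space illustration of the HMM-style programming model
 * described above.  accel_submit() is a hypothetical placeholder for a
 * driver-specific submission interface; the point is that the buffer
 * comes from ordinary malloc(), with no explicit host/device copies. */
#include <stdlib.h>
#include <string.h>

int accel_submit(int accel_fd, void *buf, size_t len);  /* hypothetical */

int run_on_shared_buffer(int accel_fd, size_t len)
{
        char *buf = malloc(len);     /* plain libc allocation          */
        if (!buf)
                return -1;
        memset(buf, 0, len);         /* CPU touches the pages first    */

        /* The accelerator dereferences the same virtual addresses; the
         * kernel mirrors or migrates the backing pages on demand.      */
        int ret = accel_submit(accel_fd, buf, len);

        buf[0] ^= 1;                 /* CPU touches the results again  */
        free(buf);
        return ret;
}
\end{verbatim}

Under HMM, the kernel services the accelerator's accesses to these pages by mirroring or migrating them on demand, which is what lets the application code read like ordinary SMP code.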
Orthogonally, allocation of hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster there may be a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload performed on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job-scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. Such migration naturally incurs a large overhead -- accelerator nodes that strictly perform computation on data in memory, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally, for starters. Moreover, must \emph{all} computations be performed near data? \textit{Adrias}~\cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, has negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.
The rest of the chapter is structured as follows:
This thesis hence builds upon an ongoing research effort in implementing an in-kernel DSM system on top of a tightly coupled cluster, where \textit{HMM} (\textit{Heterogeneous Memory Management}) abstractions allow for transparent RDMA access from accelerator nodes to local data and for migration of data near computation. More specifically, this thesis explores the latency incurred by the OS-initiated software cache coherency maintenance procedures common to all (R)DMA programs. The findings in this thesis, obtained under a simple, reusable testing framework, are expected to inform the software coherence protocol and consistency model design of the in-kernel DSM system for accelerator-sharing purposes.
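
For concreteness, the sketch below shows, in a hedged form, what such software cache coherency maintenance amounts to at the instruction level on ARMv8-A: a per-cache-line clean-and-invalidate walk over the affected virtual address range, followed by a barrier. It is modelled loosely on the style of Linux's arm64 cache routines rather than reproducing the kernel's actual implementation, and the 64-byte line size is an assumption (real code derives it from \texttt{CTR\_EL0}).

\begin{verbatim}
/* Hedged sketch of ARMv8-A software cache maintenance over a buffer.
 * Assumes a 64-byte cache line; real code derives the line size from
 * CTR_EL0.  Modelled loosely on the style of Linux's arm64 cache
 * routines, not copied from them. */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

static inline void clean_and_invalidate_range(void *addr, size_t len)
{
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHE_LINE) {
                /* DC CIVAC: clean and invalidate the line holding this
                 * virtual address to the Point of Coherency. */
                asm volatile("dc civac, %0" : : "r"(p) : "memory");
        }

        /* Ensure the maintenance completes before any subsequent DMA. */
        asm volatile("dsb sy" : : : "memory");
}
\end{verbatim}

Because the walk proceeds line by line, the time spent in these procedures grows with the size of the maintained range, which is the relationship the measurements in this thesis set out to characterize.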
\chapter{Background}\label{chapter:background}
In this chapter, we introduce the following aspects pertaining to the in-kernel DSM project:
\begin{itemize}
\item {
We identify and discuss notable developments in software-implemented DSM systems, highlighting the key features of contemporary advancements in DSM techniques that differentiate them from their predecessors.