Background mostly done?
This commit is contained in:
parent 54b3b3064a
commit 0d78e11a97
3 changed files with 261 additions and 9 deletions
@ -594,4 +594,30 @@
  publisher={LWN.net},
  author={Corbet, Jonathan},
  year={2021}
}

@misc{Parris.AMBA_4_ACE-Lite.2013,
  title={Extended system coherency: Cache coherency fundamentals},
  url={https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/extended-system-coherency---part-1---cache-coherency-fundamentals},
  publisher={Arm Community Blogs},
  author={Parris, Neil},
  year={2013}
}

@misc{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024,
  title={Dynamic DMA mapping Guide},
  url={https://www.kernel.org/doc/html/v6.7/core-api/dma-api-howto.html},
  journal={The Linux Kernel},
  author={Miller, David S. and Henderson, Richard and Jelinek, Jakub},
  year={2024}
}

@misc{many.MSFTLearn-SMBDirect.2024,
  title={SMB Direct},
  url={https://learn.microsoft.com/en-us/windows-server/storage/file-server/smb-direct},
  journal={Microsoft Learn},
  publisher={Microsoft},
  author={{Microsoft Learn contributors}},
  year={2024}
}
Binary file not shown.
@ -7,11 +7,30 @@
\usepackage[justification=centering]{caption}
\usepackage{hyperref}
\usepackage{amsthm}
\usepackage{csquotes}
% \usepackage{listings}
% \usepackage{xcolor}
\usepackage{minted}

\addbibresource{background_draft.bib}
\theoremstyle{definition}
\newtheorem{definition}{Definition}

% Code listings
\usemintedstyle{vs}
% \definecolor{code-comment}{rgb}{0.5, 0.5, 0.4}
% \definecolor{code-background}{rgb}{0.96, 0.96, 0.96}
% \lstset{
%     backgroundcolor=\color{code-background},
%     keywordstyle=\color{magenta},
%     commentstyle=\color{code-comment},
%     stringstyle=\color{purple},
%     basicstyle=\ttfamily\footnotesize,
%     emphstyle=\underline,
%     numbers=left,
%     tabsize=4
% }

\begin{document}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) to more efficient heterogeneous computation, with tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement managed automatically by the OS kernel. However, while HMM promises a distributed-shared-memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global-address-space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little work has been done on incorporating HMM into other variants of accelerators in various system topologies.
@ -56,8 +75,8 @@ New developments in network interfaces provide much improved bandwidth and latency
compared to ethernet in the 1990s. RDMA-capable NICs have been shown to improve
training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC,
scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}.
Similar results have been observed for \textit{Apache Spark} \cite{Lu_etal.Spark_over_RDMA.2014}
and \textit{SMBDirect} \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a
resurgence of interest in software DSM systems and programming models
\cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.
@ -569,8 +588,8 @@ We identify that home-based protocols are conceptually straightforward compared
\subsection{DMA and Cache Coherence}
The advent of high-speed RDMA-capable network interfaces introduces opportunities for designing more performant DSM systems over RDMA (as established in \ref{sec:msg-passing}). Orthogonally, RDMA-capable NICs fundamentally perform direct memory access to main memory to achieve one-sided RDMA operations, reducing the effect of OS jitter on RDMA latencies. For modern computer systems with cached multiprocessors, this poses a potential cache coherence problem at the local level: RDMA operations happen concurrently with memory accesses by CPUs, which store copies of memory data in cache lines that may \cites{Kjos_etal.HP-HW-CC-IO.1996}{Ven.LKML_x86_DMA.2008} or may not \cites{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018}{Corbet.LWN-NC-DMA.2021} be kept fully coherent by the DMA mechanism; consequently, any DMA operation performed by the RDMA NIC may be incoherent with the cached copy of the same data inside the CPU caches (as is the case for accelerators, etc.). This issue is of particular concern to the kernel development community, which needs to ensure that the behavior of DMA operations remains identical across architectures regardless of support for cache-coherent DMA \cite{Corbet.LWN-NC-DMA.2021}. Like existing RDMA implementations, which make heavy use of architecture-specific DMA memory allocation, implementing RDMA-based DSM systems in the kernel also requires careful use of kernel API functions that ensure cache coherency as necessary.
\subsection{Cache Coherence in ARMv8-A}
We specifically focus on the implementation of cache coherence in ARMv8-A. Unlike x86, which guarantees cache-coherent DMA \cites{Ven.LKML_x86_DMA.2008}{Corbet.LWN-NC-DMA.2021}, the ARMv8-A architecture (and many other popular ISAs, for example \textit{RISC-V}) \emph{does not} guarantee cache coherency of DMA operations across vendor implementations. ARMv8 defines a hierarchical model for coherency organization to support \textit{heterogeneous} and \textit{asymmetric} multi-processing systems \cite{ARM.ARMv8-A.v1.0.2015}.

\begin{definition}[cluster]
A \textit{cluster} defines a minimal cache-coherent region for Cortex-A53 and Cortex-A57 processors. Each cluster usually comprises one or more cores as well as a shared last-level cache.
@ -579,17 +598,22 @@ We specifically focus on the implementation of cache coherence in ARMv8. Unlike
\begin{definition}[sharable domain]
A \textit{sharable domain} defines a vendor-defined cache-coherent region. Sharable domains can be \textit{inner} or \textit{outer}, which limits the scope of broadcast coherence messages to the \textit{point-of-unification} and the \textit{point-of-coherence}, respectively.

Usually, the \textit{inner} sharable domain defines the domain of all (closely-coupled) processors inside a heterogeneous multiprocessing system (see \ref{def:het-mp}), while the \textit{outer} sharable domain defines the largest memory-sharing domain for the system (e.g., inclusive of the DMA bus).
\end{definition}

\begin{definition}[Point-of-Unification]\label{def:pou}
The \textit{point-of-unification} (\textit{PoU}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{inner} sharable domain see the same copy of data.

Consequently, \textit{PoU} defines a point at which every core of an ARMv8-A processor sees the same (i.e., a \emph{unified}) copy of a memory location regardless of whether it is accessed via instruction caches, data caches, or the TLB.
\end{definition}

\begin{definition}[Point-of-Coherence]\label{def:poc}
The \textit{point-of-coherence} (\textit{PoC}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{outer} sharable domain see the same copy of data.

Consequently, \textit{PoC} defines a point at which all \textit{observers} of memory (e.g., cores, DSPs, DMA engines) will observe the same copy of a memory location.
\end{definition}
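The PoU/PoC distinction maps directly onto the A64 cache-maintenance instructions. The following summary sketch (mnemonics from the ARMv8-A ISA; the register \texttt{Xn} holds the virtual address to maintain; not an exhaustive list) relates the definitions above to concrete operations:

\begin{minted}{asm}
// A64 cache maintenance by virtual address (summary sketch)
DC  CVAU,  Xn   // Clean data cache line to PoU
IC  IVAU,  Xn   // Invalidate instruction cache line to PoU
DC  CVAC,  Xn   // Clean data cache line to PoC
DC  IVAC,  Xn   // Invalidate data cache line to PoC
DC  CIVAC, Xn   // Clean and invalidate data cache line to PoC
\end{minted}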
\subsubsection{Addendum: \textit{Heterogeneous} \& \textit{Asymmetric} Multiprocessing}
Using these definitions, a vendor could build \textit{heterogeneous} and \textit{asymmetric} multi-processing systems as follows:
\begin{definition}[Heterogeneous Multiprocessing]\label{def:het-mp}
A \textit{heterogeneous multiprocessing} system incorporates ARMv8 processors of diverse microarchitectures that are fully coherent with one another, running the same system image.

@ -599,6 +623,208 @@ Using these definitions, a vendor could build \textit{heterogeneous} and \textit
An \textit{asymmetric multiprocessing} system need not contain fully coherent processors. For example, a system-on-a-chip may contain a non-coherent co-processor for secure computing purposes \cite{ARM.ARMv8-A.v1.0.2015}.
\end{definition}

\subsection{ARMv8-A Software Cache Coherence in the Linux Kernel}
Because the architecture does not guarantee hardware DMA coherency (though such support exists \cite{Parris.AMBA_4_ACE-Lite.2013}), programmers need to invoke architecture-specific cache-coherency instructions when porting DMA hardware support across a diverse range of ARMv8 microarchitectures, often encapsulated in problem-specific subroutines.
Notably, kernel (driver) programming demands programmer attention to software-maintained coherency, since userspace programmers downstream expect data flow interspersed between CPU and DMA operations to follow program order and (driver vendor) specifications. One such example arises in the Linux kernel's DMA memory-management API \cite{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024}\footnote[1]{Based on Linux kernel v6.7.0.}:
\begin{definition}[DMA Mappings]
The Linux kernel DMA memory allocation API, imported via
\begin{minted}[linenos]{c}
#include <linux/dma-mapping.h>
\end{minted}
defines two variants of DMA mappings:

\begin{itemize}
\item {\label{def:consistent-dma-map}
\textit{Consistent} DMA mappings:

They are guaranteed to be coherent between concurrent CPU/DMA accesses without explicit software flushing.
\footnote[2]{
However, this does not preclude CPU store reordering, so memory barriers remain necessary in a multiprocessing context.
}
}
\item {
\textit{Streaming} DMA mappings:

They provide no coherency guarantee between concurrent CPU/DMA accesses. Programmers need to apply coherency-maintenance subroutines manually for synchronization.
}
\end{itemize}
\end{definition}
Consistent DMA mappings can be created trivially by allocating non-cacheable memory, which guarantees \textit{PoC} for all memory observers (though system-specific fastpaths exist).
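As a sketch of the allocation call (a driver-context fragment, not standalone code; the device pointer \texttt{dev} and the one-page size are placeholders), a consistent mapping is typically obtained with \texttt{dma\_alloc\_coherent}:

\begin{minted}[linenos]{c}
/* Driver-context sketch: obtain a consistent (coherent) DMA mapping.
 * `dev` is the device's struct device *. */
dma_addr_t dma_handle;              /* bus address seen by the device */
void *cpu_addr = dma_alloc_coherent(dev, PAGE_SIZE,
                                    &dma_handle, GFP_KERNEL);
if (!cpu_addr)
        return -ENOMEM;

/* CPU accesses via cpu_addr and device DMA via dma_handle are kept
 * coherent by the API; no dma_sync_*() calls are required.         */

dma_free_coherent(dev, PAGE_SIZE, cpu_addr, dma_handle);
\end{minted}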
On the other hand, streaming DMA mappings require manual synchronization around programmed CPU/DMA accesses. Take single-buffer synchronization on the CPU after DMA access, for example:
\begin{minted}[linenos, mathescape]{c}
/* In kernel/dma/mapping.c $\label{code:dma_sync_single_for_cpu}$*/
void dma_sync_single_for_cpu(
    struct device *dev,          // kernel repr for DMA device
    dma_addr_t addr,             // DMA address
    size_t size,                 // Synchronization buffer size
    enum dma_data_direction dir  // Data-flow direction
) {
    /* Translate DMA address to physical address */
    phys_addr_t paddr = dma_to_phys(dev, addr);

    if (!dev_is_dma_coherent(dev)) {
        arch_sync_dma_for_cpu(paddr, size, dir);
        arch_sync_dma_for_cpu_all(); // MIPS quirks...
    }

    /* Miscellaneous cases... */
}
\end{minted}
\begin{minted}[linenos]{c}
/* In arch/arm64/mm/dma-mapping.c */
void arch_sync_dma_for_cpu(
    phys_addr_t paddr,
    size_t size,
    enum dma_data_direction dir
) {
    /* Translate physical address to (kernel) virtual address */
    unsigned long start = (unsigned long)phys_to_virt(paddr);

    /* Early exit for DMA read: no action needed for CPU */
    if (dir == DMA_TO_DEVICE)
        return;

    /* ARM64-specific: invalidate CPU cache to PoC */
    dcache_inval_poc(start, start + size);
}
\end{minted}
This call chain, as well as its mirror case, which maintains cache coherency for the DMA device after CPU access: \mint[breaklines=true]{c}|dma_sync_single_for_device(struct device *, dma_addr_t, size_t, enum dma_data_direction)|, calls into the following procedures, respectively:

\begin{minted}[linenos]{c}
/* Exported @ arch/arm64/include/asm/cacheflush.h */
/* Defined @ arch/arm64/mm/cache.S */
/* All functions accept virtual start, end addresses. */

/* Invalidate data cache region [start, end) to PoC.
 *
 * Invalidate CPU cache entries that intersect with [start, end),
 * such that data from external writers becomes visible to the CPU.
 */
extern void dcache_inval_poc(
    unsigned long start, unsigned long end
);

/* Clean data cache region [start, end) to PoC.
 *
 * Write back CPU cache entries that intersect with [start, end),
 * such that data from the CPU becomes visible to external observers.
 */
extern void dcache_clean_poc(
    unsigned long start, unsigned long end
);
\end{minted}
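Putting the pieces together, a driver using a streaming mapping typically follows a map/sync/unmap lifecycle. The fragment below is a hedged driver-context sketch (the device pointer \texttt{dev}, buffer \texttt{buf}, and length \texttt{len} are placeholders), not code from the kernel tree:

\begin{minted}[linenos]{c}
/* Driver-context sketch: streaming DMA, device-to-memory transfer. */
dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, handle))
        return -ENOMEM;

/* ... program the device to DMA-write `len` bytes to `handle` ... */

/* After the DMA completes: invalidate CPU caches (to PoC on arm64)
 * so the CPU observes the DMA-written data, then read `buf`.       */
dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
/* ... CPU reads buf ... */

dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
\end{minted}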
\subsubsection{Addendum: \texttt{enum dma\_data\_direction}}

The Linux kernel defines 4 direction \texttt{enum} values for fine-tuning synchronization behaviors:
\begin{minted}[linenos]{c}
/* In include/linux/dma-direction.h */
enum dma_data_direction {
    DMA_BIDIRECTIONAL = 0, // data transfer direction uncertain.
    DMA_TO_DEVICE = 1,     // data from main memory to device.
    DMA_FROM_DEVICE = 2,   // data from device to main memory.
    DMA_NONE = 3,          // invalid repr for runtime errors.
};
\end{minted}

These values allow certain fast paths to be taken at runtime. For example, \texttt{DMA\_TO\_DEVICE} implies that the device reads data from memory without modifying it, and hence software coherence instructions can be skipped when synchronizing for the CPU after a DMA operation.
% TODO: Move to addendum section.
\subsubsection{Use-case: Kernel-space \textit{SMBDirect} Driver}
\textit{SMBDirect} is an extension of the \textit{SMB} (\textit{Server Message Block}) protocol for opportunistically establishing SMB communication over RDMA-capable network interfaces \cite{many.MSFTLearn-SMBDirect.2024}.

We focus on two procedures inside the in-kernel SMBDirect implementation:

\paragraph{Before send: \texttt{smbd\_post\_send}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static int smbd_post_send(
    struct smbd_connection *info, // SMBDirect transport context
    struct smbd_request *request  // SMBDirect request context
) // ...
\end{minted}

\texttt{smbd\_post\_send} sits downstream of \texttt{smbd\_send}, which sends the SMBDirect payload over the network. Payloads are constructed and batched to maximize bandwidth, then \texttt{smbd\_post\_send} is called to signal the RDMA NIC for transport.
The function body is roughly as follows:
\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
    struct ib_send_wr send_wr; // Send work request for entire payload
    int rc, i;

    /* For each message in batched payload */
    for (i = 0; i < request->num_sge; i++) {
        /* Log to kmesg ring buffer... */

        /* RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ $\label{code:ib_dma_sync_single_for_device}$*/
        ib_dma_sync_single_for_device(
            info->id->device,       // struct ib_device *
            request->sge[i].addr,   // u64 (as dma_addr_t)
            request->sge[i].length, // size_t
            DMA_TO_DEVICE           // enum dma_data_direction
        );
    }

    /* Populate `request`, `send_wr`... */

    rc = ib_post_send(
        info->id->qp, // struct ib_qp * ("Queue Pair")
        &send_wr,     // const struct ib_send_wr *
        NULL          // const struct ib_send_wr ** (err handling)
    );

    /* Error handling... */

    return rc;
}
\end{minted}

Line \ref{code:ib_dma_sync_single_for_device} writes back CPU cache lines to be visible to the RDMA NIC, in preparation for DMA operations when the posted \textit{send request} is worked on.
\paragraph{Upon reception: \texttt{recv\_done}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static void recv_done(
    struct ib_cq *cq, // "Completion Queue"
    struct ib_wc *wc  // "Work Completion"
) // ...
\end{minted}

Called when the RDMA subsystem completes work on a payload received over RDMA. Mirroring the case for \texttt{smbd\_post\_send}, it invalidates CPU cache lines so that the DMA-ed data becomes visible to the CPU cores:

\begin{minted}[linenos, firstnumber=last]{c}
{
    struct smbd_data_transfer *data_transfer;
    struct smbd_response *response = container_of(
        wc->wr_cqe,           // ptr: pointer to member
        struct smbd_response, // type: type of container struct
        cqe                   // name: name of member in struct
    ); // Cast member of struct into containing struct (C magic)
    struct smbd_connection *info = response->info;
    int data_length = 0;

    /* Logging, error handling... */

    /* Likewise, RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ */
    ib_dma_sync_single_for_cpu(
        wc->qp->device,
        response->sge.addr,
        response->sge.length,
        DMA_FROM_DEVICE
    );

    /* ... */
}
\end{minted}
% TODO: lead to cache coherence mechanism in Linux kernel

% Experiment: ...
% Discussion: (1) Linux and DMA and RDMA (2) replacement and other ideas...