Added (most of) 2.2.2

Zhengyi Chen 2024-03-18 21:36:11 +00:00
parent 172197f638
commit 417bdc115b
4 changed files with 220 additions and 11 deletions

Binary file not shown.


@@ -0,0 +1,45 @@
\documentclass[tikz]{standalone}
\usepackage[utf8]{inputenc}
\usepackage{tikz}
\usetikzlibrary{calc,trees,positioning,arrows,fit,shapes,calc}
\begin{document}
\begin{tikzpicture}
\def\rectwidth{2}
\def\rectheight{1}
\draw[thick, <->] (\rectwidth / 2, -0.1) -- (\rectwidth / 2, -0.9) node[below] {\texttt{NULL}};
% Bottom alloc
\foreach \i in {0, ..., 0} {
\draw (0, \i) rectangle (\rectwidth, \i + \rectheight);
\draw[fill] (-3, \i + 0.5 * \rectheight) circle (2pt) node[left] {PFN: \i};
\draw[thick, ->] (-0.1, \i + \rectheight * 0.5) -- +(-2.8, 0);
}
\draw[thick, <->] (\rectwidth / 2, \rectheight + 0.1) -- (\rectwidth / 2, \rectheight + 0.9);
% Top alloc
\foreach \i in {2, ..., 5} {
% Draw 4 rectangles
\draw (0, \i) rectangle (\rectwidth, \i + \rectheight);
\draw[fill] (-3, \i + 0.5 * \rectheight) circle (2pt) node[left] {PFN: \i};
\draw[thick, dotted, ->] (-0.1, \i + \rectheight * 0.5) -- +(-2.8, 0);
}
\draw[thick, ->] (-0.1, 2 + \rectheight * 0.5) -- +(-2.8, 0) node[below, xshift=1.5cm] {\texttt{struct page *}};
\draw[thick, <->] (\rectwidth / 2, 6.1) -- (\rectwidth / 2, 6.9) node[above] {\texttt{NULL}};
% Userspace
\draw[thick] (5, 8) -- (5, -2) node[right, xshift=0.1cm, yshift=0.9cm] {Userspace};
\draw[thick] (5 + \rectwidth, 8) -- (5 + \rectwidth, -2);
\foreach \i in {1, ..., 3} {
\draw (5, \i) rectangle (5 + \rectwidth, \i + \rectheight) node[right, yshift=-1cm] {\texttt{0xBEEF000\i}};
}
\draw[thick, ->] (2.1, 0 + \rectheight * 0.5) -- +(2.8, 1);
\draw[thick, ->] (2.1, 2 + \rectheight * 0.5) -- +(2.8, 0);
\draw[thick, dotted, ->] (2.1, 3 + \rectheight * 0.5) -- +(2.8, 0) node[above, xshift=-1.5cm, red] {Fault!};
\end{tikzpicture}
\end{document}

Binary file not shown.


@@ -288,21 +288,21 @@ In particular, we note that DSM studies tend to conform to either release consis
\caption{
Coherence Protocol vs. Consistency Model in Selected Disaggregated Memory Studies. ``Float'' short for ``floating home''. Studies selected for clearly described consistency model and coherence protocol.
}
\label{table:consistency-vs-coherency}
\end{table}
We especially note the role of balancing productivity and performance when selecting the ideal consistency model for a system. Weaker consistency models are harder to program against, but imply fewer coherence communications and hence better overall throughput: provided the programmer can guarantee correctness, a weaker consistency model requires less invalidation of node-local cache entries, allowing multiple nodes to compute in parallel on (likely) outdated local copies of data while the result of the computation remains semantically correct with regard to the program. This point was made explicit in \textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991}, which (to reiterate) introduces consistency ``protocol parameters'' to annotate shared memory access patterns, in order to reduce the amount of coherence communication necessary between nodes computing on distributed shared memory. For example, a DSM object (a memory object accounted for by the DSM system) can be annotated with ``delayed operations'' to defer coherence operations beyond any write access, or shared without the ``write'' annotation to disable write access across sharing nodes, thereby disabling all coherence operations for that DSM object. Via programmer annotation of DSM objects, the Munin DSM system makes explicit the effect of weaker consistency on the amount of synchronization overhead necessary among shared memory nodes. To our knowledge, no more recent DSM work has explored this interaction between consistency and coherence costs on DSM objects, though relatedly \textit{Resilient Distributed Dataset (RDD)} \cite{Zaharia_etal.RDD.2012} also highlights the performance and flexibility benefits of opting for an immutable data representation over network-disaggregated memory when compared to contemporary DSM approaches.
\subsection{Coherence Protocol}
Coherence protocols hence become the means by which DSM systems implement their consistency model guarantees. As table \ref{table:consistency-vs-coherency} shows, DSM studies tend to implement write-invalidate coherence under a \textit{home-based} or \textit{directory-based} protocol framework, while a subset of DSM studies seek to reduce communication overheads and/or improve data persistence by offering write-update protocol extensions \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Shan_Tsai_Zhang.DSPM.2017}.
\subsubsection{Home-Based Protocols}
\textit{Home-based} protocols assign each shared memory object a corresponding ``home'' node, under the assumption that a many-node network distributes home-node ownership of shared memory objects across all hosts \cite{Hu_Shi_Tang.JIAJIA.1999}. On top of home-node ownership, each mutable shared memory object may additionally be cached by other nodes within the network, creating the coherence problem. To our knowledge, in addition to table \ref{table:consistency-vs-coherency}, this protocol and its derivatives have been adopted by \cites{Fleisch_Popek.Mirage.1989}{Schaefer_Li.Shiva.1989}{Hu_Shi_Tang.JIAJIA.1999}{Nelson_etal.Grappa_DSM.2015}{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}.
We identify that home-based protocols are conceptually more straightforward than directory-based protocols, centering communication on the storage of global metadata (in this case, the ownership of each shared memory object). This leads to greater flexibility in implementing coherence protocols: a shared memory object may, at its creation, be made known globally via broadcast, or made known to only a subset of nodes (zero or more) via multicast. Likewise, metadata storage may be cached locally at each node and invalidated alongside object invalidation, or fetched from a fixed node per object. This implementation flexibility is further exploited in \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017}, which refines the ``home node'' concept into an \textit{owner node} to provide replication and persistence, in addition to adopting a dynamic home protocol similar to that of \cite{Endo_Sato_Taura.MENPS_DSM.2020}.
\subsubsection{Directory-Based Protocols}
\textit{Directory-based} protocols instead take a shared-database approach, denoting each shared memory object with a globally shared entry describing its ownership and sharing status. In its non-distributed form (e.g., \cite{Wang_etal.Concordia.2021}), a global, central directory holds ownership information for all nodes in the network: the directory hence becomes a bottleneck, imposing latency and bandwidth constraints on parallel processing systems. Comparatively, a distributed directory scheme may delegate responsibilities across all nodes in the network, mostly in accordance with a sharded address space \cites{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}. Though theoretically sound, this scheme performs no dynamic load-balancing for commonly shared memory objects, and in the worst case functions exactly like a non-distributed directory coherence scheme. To our knowledge, in addition to table \ref{table:consistency-vs-coherency}, this protocol and its derivatives have been adopted by \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Schoinas_etal.Sirocco.1998}{Eisley_Peh_Shang.In-net-coherence.2006}{Hong_etal.NUMA-to-RDMA-DSM.2019}.
\subsection{DMA and Cache Coherence}
The advent of high-speed RDMA-capable network interfaces introduces opportunities for designing more performant DSM systems over RDMA (as established in \ref{sec:msg-passing}). Orthogonally, RDMA-capable NICs fundamentally perform direct memory access on main memory to achieve one-sided RDMA operations, reducing the effect of OS jitter on RDMA latencies. For modern computer systems with cached multiprocessors, this poses a potential cache coherence problem at the local level: RDMA operations happen concurrently with memory accesses by CPUs, which store copies of memory data in cache lines that may \cites{Kjos_etal.HP-HW-CC-IO.1996}{Ven.LKML_x86_DMA.2008} or may not \cites{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018}{Corbet.LWN-NC-DMA.2021} be kept fully coherent by the DMA mechanism, so any DMA operation performed by the RDMA NIC may be incoherent with the cached copy of the same data inside the CPU caches (as is the case for accelerators, etc.). This issue is of particular concern to the kernel development community, which needs to ensure that the behavior of DMA operations remains identical across architectures regardless of support for cache-coherent DMA \cite{Corbet.LWN-NC-DMA.2021}. Like existing RDMA implementations, which make heavy use of architecture-specific DMA memory allocation, implementing an RDMA-based DSM system in the kernel also requires careful use of the kernel API functions that ensure cache coherency where necessary.
@@ -461,7 +461,7 @@ An example of cache-coherent in-kernel RDMA networking module over heterogeneous
We focus on two procedures inside the in-kernel SMBDirect implementation:
\paragraph*{Before send: \texttt{smbd\_post\_send}}
\texttt{smbd\_post\_send} is a function downstream of the call-chain of \texttt{smbd\_send}, which sends SMBDirect payload for transport over network. Payloads are constructed and batched for maximized bandwidth, then \texttt{smbd\_post\_send} is called to signal the RDMA NIC for transport.
The function body is roughly as follows:
@@ -503,7 +503,7 @@ static int smbd_post_send(
Line \ref{code:ib_dma_sync_single_for_device} writes back CPU cache lines to be visible for RDMA NIC in preparation for DMA operations when the posted \textit{send request} is worked upon.
\paragraph*{Upon reception: \texttt{recv\_done}}
\texttt{recv\_done} is called when the RDMA subsystem works on the received payload over RDMA.
Mirroring the case for \texttt{smbd\_post\_send}, it invalidates CPU cache lines for DMA-ed data to be visible at CPU cores prior to any operations on received data:
@@ -722,9 +722,33 @@ To implement the features as specified, \texttt{my\_shmem} exposes itself as a c
Additionally, the parameter \texttt{max\_contiguous\_alloc\_order} is exposed as a writable parameter file inside \textit{sysfs} to manually control the number of contiguous pages allocated per module allocation.
\paragraph*{Data Structures} \label{para:data-structs}
The primary function of \texttt{my\_shmem} is to provide correct accounting of current allocations via the kernel module, in addition to allocating on demand. Hence, to represent an in-kernel allocation of a multi-page contiguous buffer, define \texttt{struct my\_shmem\_alloc} as follows:
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
struct my_shmem_alloc {
    struct page *page;     // GFP alloc repr, points to HEAD page
    ulong alloc_order;     // alloc buffer length: $2^{\texttt{alloc\_order}}$
    struct list_head list; // kernel repr of doubly linked list
};
\end{minted}
\texttt{.list} embeds the Linux kernel's implementation of an element of a generically-typed doubly linked list, such that multiple allocations can be kept during the lifetime of the module. The corresponding linked list is defined as follows:
\begin{minted}[bgcolor=code-bg]{c}
static LIST_HEAD(my_shmem_allocs);
\end{minted}
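For illustration, the intrusive-list idiom behind \texttt{struct list\_head} can be sketched in plain userspace C. The following is a minimal re-implementation under simplifying assumptions (the \texttt{alloc} type and the local \texttt{container\_of}/\texttt{list\_add\_tail} definitions are illustrative stand-ins, not the kernel's headers):
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head */
struct list_head { struct list_head *next, *prev; };
#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Recover the enclosing struct from its embedded list_head */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Illustrative stand-in for struct my_shmem_alloc */
struct alloc {
    unsigned long alloc_order;
    struct list_head list;
};

int main(void)
{
    struct list_head allocs = LIST_HEAD_INIT(allocs);
    struct alloc a = { .alloc_order = 2 }, b = { .alloc_order = 3 };
    list_add_tail(&a.list, &allocs);
    list_add_tail(&b.list, &allocs);

    /* Sum 2^order pages over all allocations, as the module does */
    unsigned long total_pages = 0;
    for (struct list_head *p = allocs.next; p != &allocs; p = p->next)
        total_pages += 1UL << container_of(p, struct alloc, list)->alloc_order;

    assert(total_pages == 12); /* 4 + 8 pages */
    printf("%lu\n", total_pages);
    return 0;
}
\end{minted}
Traversal via \texttt{container\_of} is essentially what the kernel's \texttt{list\_for\_each\_entry} macro expands to, which the module later uses when walking \texttt{my\_shmem\_allocs}.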
To book-keep the real number of pages allocated during the module's lifetime, define:
\begin{minted}[bgcolor=code-bg]{c}
static size_t my_shmem_page_count;
\end{minted}
Finally, to ensure mutual exclusion of the module's critical sections while running inside an \textit{SMP} (\textit{Symmetric Multi-Processing}) kernel, define a mutex:
\begin{minted}[bgcolor=code-bg]{c}
static DEFINE_MUTEX(my_shmem_allocs_mtx);
\end{minted}
This protects all read/write operations to \texttt{my\_shmem\_allocs} and \texttt{my\_shmem\_page\_count} against concurrent module function calls.
\paragraph*{File Operations} \label{para:file_operations}
The Linux kernel defines \textit{file operations} as a series of module-specific callbacks invoked whenever userspace issues a corresponding syscall on the (character) device file. These callbacks may be declared inside a \texttt{file\_operations} struct \cite{Corbet_Rubini_K-Hartman.LDD3.2005}, which provides an interface for modules on file-related syscalls:
\begin{minted}[linenos, bgcolor=code-bg, mathescape]{c}
/* In include/linux/fs.h */
@@ -764,7 +788,7 @@ Implementation of \texttt{.open} is simple. It suffices to install the module-sp
Likewise for \texttt{.release}, which does nothing except to print a debug message into the kernel ring buffer.
To implement \texttt{.mmap}, the kernel module attempts to \emph{re-map as many allocations into the given \texttt{struct vm\_area\_struct} as possible without making any allocation}. This centralizes allocation logic into the page fault handler, which is described later in \ref{para:vm_operations_struct}:
\begin{minted}[linenos, bgcolor=code-bg, mathescape]{c}
static int my_shmem_fops_mmap(
struct file *filp,
@@ -818,9 +842,149 @@ static int my_shmem_fops_mmap(
}
\end{minted}
\paragraph*{VM Operations}\label{para:vm_operations_struct}
On \texttt{mmap}, the Linux kernel installs a new \textit{VMA} (\textit{Virtual Memory Area}) as the internal representation of the corresponding mapping in the process address space \cite{Corbet_Rubini_K-Hartman.LDD3.2005}. As with file operations, kernel modules may implement callbacks in \texttt{vm\_operations\_struct} to define module-specific operations upon userspace accesses to the VMA:
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
/* In include/linux/mm.h */
struct vm_operations_struct {
    /* ... */
    void (*close)(struct vm_area_struct * area);
    /* ... */
    vm_fault_t (*fault)(
        struct vm_fault *vmf // Page fault descriptor
    ); // Page fault handler
    /* ... */
};
\end{minted}
The corresponding structure for the particular module is hence defined as follows:
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
/* In my_shmem.c */
static const struct vm_operations_struct my_shmem_vmops = {
    .close = my_shmem_vmops_close,
    .fault = my_shmem_vmops_fault,
};
\end{minted}
Function \texttt{.fault} is implemented such that allocations are performed lazily until the number of pages allocated inside the module exceeds the faulting page offset with respect to its mapping. A simple implementation is, whenever the number of pages allocated is insufficient to service the page fault, to continuously allocate until this condition holds:
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
static vm_fault_t my_shmem_vmops_fault(struct vm_fault *vmf)
{
    vm_fault_t ret = VM_FAULT_NOPAGE; // See $\ref{quirk:VM_FAULT_NOPAGE}$
    ulong tgt_offset = vmf->vma->vm_pgoff + vmf->pgoff;
    /* Lock mutex... */
    for (;;) {
        /* When we already allocated enough, remap */
        if (tgt_offset < my_shmem_page_count)
            return __my_shmem_fault_remap(vmf); // See $\ref{quirk:__my_shmem_fault_remap}$
        /* Otherwise, allocate $2^{order}$ pages and retry */
        struct my_shmem_alloc *new_alloc_handle = kzalloc(
            sizeof(struct my_shmem_alloc),
            GFP_KERNEL // kernel-only allocation rule flag
        );
        /* if (!new_alloc_handle) goto error handling... */
        struct page *new_alloc_pg = alloc_pages(
            GFP_USER, // user-remapped kernel alloc rule flag
            max_contiguous_alloc_order
        ); // Alloc $2^{order}$ pages
        /* if (!new_alloc_pg) goto error handling... */
        /* Fill in handle data */
        new_alloc_handle->page = new_alloc_pg;
        new_alloc_handle->alloc_order = max_contiguous_alloc_order;
        /* Add `new_alloc_handle` to `my_shmem_allocs`... */
        /* Prepare for next iteration */
        my_shmem_page_count +=
            ORDER_TO_PAGE_NR(new_alloc_handle->alloc_order);
    }
    /* Error handling... */
}
\end{minted}
Several implementation quirks that warrant attention are as follows:
\begin{enumerate}
\item {\label{quirk:VM_FAULT_NOPAGE}
\texttt{my\_shmem\_vmops\_fault} returns \texttt{VM\_FAULT\_NOPAGE} on success. This is due to the need to support multi-page contiguous allocation inside the kernel module for performance analysis purposes.
Usually, the \texttt{vm\_operations\_struct} API expects its \texttt{.fault} implementations to assign \texttt{struct page *} to \texttt{vmf->page} on return. Here, \texttt{vmf->page} represents the page-aligned allocation that is to be installed into the faulting process's page table, thereby resolving the page fault.
However, this expectation conflicts with the module's ability to perform multi-page contiguous allocations while mapping the underlying allocations at page granularity (whatever the size of the allocation). Because the \textit{GFP} family of page allocators uses \texttt{struct page} as the representation of the \emph{entire} allocation (no matter the number of pages actually allocated), it is incorrect to install the \texttt{struct page} representation of a multi-page contiguous allocation for a given page fault when the fault offset is misaligned with the alignment of the allocation (an example of such a case is shown in figure \ref{fig:misaligned-remap}).
\begin{figure}[h]
\centering
\includegraphics[scale=0.8]{graphics/tikz-misaligned-remap.pdf}
\caption{Misaligned Kernel Page Remap. Left column represents physical memory (addressed by PFN); center column represents in-module accounting of allocations; right column represents process address space.}
\label{fig:misaligned-remap}
\end{figure}
}
\item {\label{quirk:__my_shmem_fault_remap}
\texttt{\_\_my\_shmem\_fault\_remap} implements the inner logic for when the outer page fault handling (allocation) logic deems that a sufficient number of pages exists to handle the current page fault. As its name suggests, it finds and remaps the correct allocation into the page fault's parent VMA (assuming, of course, that such an allocation exists).
The logic of this function is similar to \hyperref[para:file_operations]{\texttt{my\_shmem\_fops\_mmap}}. For a complete listing, refer to \textcolor{red}{???}.
}
\end{enumerate}
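The offset arithmetic underlying both quirks can be illustrated with a small userspace sketch (hypothetical allocation orders; the walk mirrors, but is not, the module's \texttt{\_\_my\_shmem\_fault\_remap}): given a faulting page offset, subtract each allocation's page count until the offset falls within one allocation, then index the correct page inside it.
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <assert.h>
#include <stdio.h>

/* Hypothetical allocations of 2^2 = 4 pages each (12 pages total) */
static const unsigned long orders[] = { 2, 2, 2 };
#define NR_ALLOCS (sizeof orders / sizeof orders[0])

/* Find which allocation backs page offset `pgoff`; on success,
 * store the page index within that allocation. Returns -1 when
 * not enough pages are allocated (the fault path would then
 * allocate more and retry). */
static int locate(unsigned long pgoff, unsigned long *pg_in_alloc)
{
    for (unsigned i = 0; i < NR_ALLOCS; i++) {
        unsigned long nr = 1UL << orders[i];
        if (pgoff < nr) {
            *pg_in_alloc = pgoff;
            return (int)i;
        }
        pgoff -= nr;
    }
    return -1;
}

int main(void)
{
    unsigned long pg;
    /* Offset 6 lands on page 2 of allocation 1: installing that
     * allocation's HEAD page here is the misalignment pitfall. */
    assert(locate(6, &pg) == 1 && pg == 2);
    /* Offset 12 is past all 12 backed pages: allocate first */
    assert(locate(12, &pg) == -1);
    printf("ok\n");
    return 0;
}
\end{minted}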
Function \texttt{.close} emulates synchronization behavior whenever a VMA is removed from a process's address space (e.g., due to \texttt{munmap}). Given the removed VMA as argument, it computes the intersecting allocations and invokes \hyperref[code:dcache_clean_poc]{\texttt{dcache\_clean\_poc}} on each such allocation. While this yields a conservative approximation of the cache entries to clean, it is nevertheless useful for instrumentation purposes, as the number of pages cleaned per invocation remains invariant with respect to how the VMA was remapped -- a misaligned VMA will not result in fewer pages being flushed for a given allocation.
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
static void my_shmem_vmops_close(struct vm_area_struct *vma)
{
    size_t vma_pg_count =
        (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
    size_t vma_pg_off = vma->vm_pgoff;
    /* Lock mutex... */
    struct my_shmem_alloc *entry;
    list_for_each_entry(entry, &my_shmem_allocs, list) {
        const ulong entry_pg_count =
            ORDER_TO_PAGE_NR(entry->alloc_order);
        /* Loop till entry intersects with start of VMA */
        if (vma_pg_off >= entry_pg_count) {
            vma_pg_off -= entry_pg_count;
            continue;
        }
        /* All of VMA cleaned: exit */
        if (!vma_pg_count)
            break;
        /* entry intersects with VMA -- emulate clean */
        struct page *pg = entry->page;
        ulong kvaddr_bgn = (ulong) page_address(pg);
        ulong kvaddr_end =
            kvaddr_bgn + entry_pg_count * PAGE_SIZE;
        __dcache_clean_poc(kvaddr_bgn, kvaddr_end); // See $\ref{code:dcache_clean_poc}$
        put_page(pg); // decrement refcount
        /* Prepare for next iteration */
        vma_pg_count -= min(
            entry_pg_count - vma_pg_off,
            vma_pg_count
        );
        if (vma_pg_off != 0) // ~ first intersection
            vma_pg_off = 0;
    }
    /* cleanup... */
}
\end{minted}
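The conservative accounting above can be modeled in userspace C (hypothetical allocation orders; each intersecting allocation counts as cleaned in full, mirroring the handler's per-entry \texttt{\_\_dcache\_clean\_poc} over the whole entry):
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <assert.h>
#include <stdio.h>

static const unsigned long orders[] = { 2, 2, 2 }; /* 4+4+4 pages */
#define NR_ALLOCS (sizeof orders / sizeof orders[0])

/* Count allocations intersecting VMA [pg_off, pg_off + pg_count);
 * each hit stands for one whole-entry cache clean. */
static unsigned cleaned_entries(unsigned long pg_off, unsigned long pg_count)
{
    unsigned hits = 0;
    for (unsigned i = 0; i < NR_ALLOCS; i++) {
        unsigned long nr = 1UL << orders[i];
        if (pg_off >= nr) { /* entry wholly before VMA */
            pg_off -= nr;
            continue;
        }
        if (!pg_count) /* VMA fully covered */
            break;
        hits++; /* whole entry cleaned, however small the overlap */
        unsigned long overlap = nr - pg_off;
        pg_count -= overlap < pg_count ? overlap : pg_count;
        pg_off = 0;
    }
    return hits;
}

int main(void)
{
    /* VMA over pages [2, 6) straddles allocations 0 and 1:
     * two whole entries (8 pages) cleaned for a 4-page VMA. */
    assert(cleaned_entries(2, 4) == 2);
    /* An aligned 4-page VMA cleans exactly one entry. */
    assert(cleaned_entries(4, 4) == 1);
    printf("ok\n");
    return 0;
}
\end{minted}
This makes the invariance concrete: the misaligned VMA triggers strictly more cleaning than the aligned one, never less.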
\paragraph*{\textit{sysfs} Parameter} \label{para:sysfs-param}
Finally, \texttt{my\_shmem} exposes a tunable \textit{sysfs} parameter for adjusting the number of pages allocated per allocation in \texttt{my\_shmem\_vmops\_fault}. The parameter, \texttt{max\_contiguous\_alloc\_order}, defines the order $o$ of each allocation from the page allocator such that, per allocation, $2^o$ contiguous pages are allocated at once.
To adjust the parameter (for example, to set $o \leftarrow 2$), one may run the following in an sh-compatible shell:
\begin{minted}[bgcolor=code-bg]{sh}
$ echo 2 > \
/sys/module/my_shmem/parameters/max_contiguous_alloc_order
\end{minted}
Consequently, all allocations occurring after this change will be made with a 4-page contiguous granularity.
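As a sanity check, the order-to-granularity arithmetic is simply $2^o$ pages per allocation; a few lines of userspace C confirm it (assuming the common 4\,KiB page size):
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const unsigned long page_size = 4096; /* assumed 4 KiB pages */
    for (unsigned order = 0; order <= 3; order++) {
        unsigned long pages = 1UL << order; /* 2^order pages */
        printf("order %u -> %lu pages (%lu KiB)\n",
               order, pages, pages * page_size / 1024);
    }
    assert((1UL << 2) == 4); /* order 2 -> 4 contiguous pages */
    return 0;
}
\end{minted}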
\subsection{Instrumentation: \texttt{ftrace} and \textit{eBPF}}
\subsection{Userspace Programs}