diff --git a/tex/misc/background_draft.pdf b/tex/misc/background_draft.pdf
index 0fc1043..8e2b276 100644
Binary files a/tex/misc/background_draft.pdf and b/tex/misc/background_draft.pdf differ
diff --git a/tex/misc/background_draft.tex b/tex/misc/background_draft.tex
index 32536da..dbb4720 100644
--- a/tex/misc/background_draft.tex
+++ b/tex/misc/background_draft.tex
@@ -303,8 +303,8 @@ a $k$-partitioned object, $k$ global pointers are recorded in the runtime
 each pointing to the same object, with different offsets and (intuitively)
 independently-chosen virtual addresses. Note that this design naturally
 requires virtual addresses within each node to be \emph{pinned} -- the allocated object
-cannot be re-addressed to a different virtual address, thus preventing the
-global pointer that records the local virtual address from becoming
+cannot be re-addressed to a different virtual address, which prevents the
+global pointer that records the local virtual address from being
 spontaneously invalidated.
 
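+For illustration only, a record of this kind might look roughly like the
+sketch below; the structure and field names are hypothetical and are not
+drawn from any particular PGAS runtime:
+\begin{verbatim}
+#include <stdint.h>
+
+/* Hypothetical sketch: one record per partition of a k-partitioned object. */
+typedef struct global_ptr {
+    int      node;         /* rank of the node that owns this partition     */
+    uint64_t local_vaddr;  /* pinned virtual address of the partition there */
+    uint64_t offset;       /* offset of this partition within the object    */
+} global_ptr_t;
+
+/* Because local_vaddr is copied into every remote record, the owning node
+ * must keep the allocation pinned at that address; re-addressing it would
+ * silently invalidate every global pointer that embeds the old address.   */
+\end{verbatim}
+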
 Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
@@ -316,54 +316,55 @@ movement manually when working with shared memory over network to maximize
 their performance metrics of interest.
 
 \subsection{Message Passing}
-\textit{Message Passing} remains the predominant programming model for
-parallelism between loosely-coupled nodes within a computer system, much as it
-is ubiquitous in supporting all levels of abstraction within any concurrent
-components of a computer system. Specific to cluster computing systems is the
-message-passing programming model, where parallel programs (or instances of
-the same parallel program) on different nodes within the system communicate
-via exchanging messages over network between these nodes. Such models exchange
-programming model productivity for more fine-grained control over the messages
-passed, as well as more explicit separation between communication and computation
-stages within a programming subproblem.
+\textit{Message Passing} remains the predominant programming model for
+parallelism between loosely-coupled nodes within a computer system, much as it
+is ubiquitous in supporting concurrency at every level of abstraction in such
+systems. In cluster computing systems specifically, parallel programs (or
+instances of the same parallel program) on different nodes communicate by
+exchanging messages over the network between these nodes. Such models trade
+programming productivity for finer-grained control over the messages passed,
+as well as a more explicit separation between the communication and computation
+stages of a programming subproblem.
 
-Commonly, message-passing backends function as \textit{middlewares} --
+Commonly, message-passing backends function as \textit{middlewares} --
 communication runtimes -- to aid distributed software development
-\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend
-expose facilities for inter-application communication to frontend developers
-while transparently providing security, accounting, and fault-tolerance, much
-like how an operating system may provide resource management, scheduling, and
-security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}.
-This is the case for implementing the PGAS programming model, which mostly rely
-on common message-passing backends to facilitate orchestrated data manipulation
-across distributed nodes. Likewise, message-passing backends, including RDMA API,
-form the backbone of many research-oriented DSM systems
-\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
+\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend
+exposes facilities for inter-application communication to frontend developers
+while transparently providing security, accounting, and fault tolerance, much
+like how an operating system provides resource management, scheduling, and
+security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}.
+This is the case for implementations of the PGAS programming model, which mostly
+rely on common message-passing backends to facilitate orchestrated data
+manipulation across distributed nodes. Likewise, message-passing backends,
+including RDMA APIs, form the backbone of many research-oriented DSM systems
+\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}
+{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
 
-Message-passing between network-connected nodes may be \textit{two-sided} or
-\textit{one-sided}. The former models an intuitive workflow to sending and receiving
-datagrams over the network -- the sender initiates a transfer; the receiver
-copies a received packet from the network card into a kernel buffer; the
+Message-passing between network-connected nodes may be \textit{two-sided} or
+\textit{one-sided}. The former models an intuitive workflow for sending and receiving
+datagrams over the network -- the sender initiates a transfer; the receiver
+copies a received packet from the network card into a kernel buffer; the
 receiver's kernel filters the packet and (optionally)\cite{FreeBSD.man-BPF-4.2021}
-copies the internal message
-into the message-passing runtime/middleware's address space; the receiver's
-middleware inspects the copied message and performs some procedures accordingly,
-likely also involving copying slices of message data to some registered distributed
-shared memory buffer for the distributed application to access. Despite it
-being a highly intuitive model of data manipulation over the network, this
-poses a fundamental performance issue: because the process requires the receiver's
-kernel AND userspace to exert CPU-time, upon reception of each message, the
-receiver node needs to proactively exert CPU-time to move the received data
-from bytes read from NIC devices to userspace. Because this happens concurrently
-with other kernel and userspace routines in a multi-processing system, a
-preemptable kernel may incur significant latency if the kernel routine for
+copies the internal message
+into the message-passing runtime/middleware's address space; the receiver's
+middleware then inspects the copied message and acts on it accordingly, likely
+also copying slices of the message data into some registered distributed
+shared-memory buffer for the distributed application to access. Although this
+is a highly intuitive model of data manipulation over the network, it poses a
+fundamental performance issue: upon reception of each message, both the
+receiver's kernel and its userspace must spend CPU time moving the received
+data from the bytes read off the NIC into the application's buffers. Because
+this processing happens concurrently with other kernel and userspace routines
+in a multi-processing system, a preemptible kernel may incur significant
+latency if the kernel routine for
 packet filtering is pre-empted by another kernel routine, userspace, or IRQs.
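+
+As a deliberately simplified illustration of the two-sided model, the sketch
+below uses MPI as a representative message-passing backend; it is not the
+exact code path described above, and the buffer size and tag are arbitrary:
+\begin{verbatim}
+/* Two-sided exchange: run with at least two ranks, e.g. mpirun -np 2. */
+#include <mpi.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+    int rank, buf[256] = {0};
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+    if (rank == 0) {
+        for (int i = 0; i < 256; i++) buf[i] = i;
+        /* The sender initiates the transfer. */
+        MPI_Send(buf, 256, MPI_INT, 1, 0, MPI_COMM_WORLD);
+    } else if (rank == 1) {
+        /* The receiver must actively post a matching receive; the runtime
+         * (with kernel involvement on commodity networks) copies the
+         * incoming payload into buf before this call returns. */
+        MPI_Recv(buf, 256, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+        printf("last element received: %d\n", buf[255]);
+    }
+
+    MPI_Finalize();
+    return 0;
+}
+\end{verbatim}
+Note that the payload only reaches the application's buffer once the receiver
+has itself spent CPU time inside \texttt{MPI\_Recv}.
+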
-Comparatively, a ``one-sided'' message-passing scheme, notably \textit{RDMA},
-allows the network interface card to bypass in-kernel packet filters and
-perform DMA on registered memory regions. The NIC can hence notify the CPU via
-interrupts, thus allowing the kernel and the userspace programs to perform
-callbacks at reception time with reduced latency. Because of this advantage,
+In comparison, a ``one-sided'' message-passing scheme, notably \textit{RDMA},
+allows the network interface card to bypass in-kernel packet filters and to
+perform DMA directly on registered memory regions. The NIC can then notify the
+CPU via interrupts, allowing the kernel and userspace programs to run their
+callbacks at reception time with reduced latency. Because of this advantage,
 many recent studies attempt to leverage RDMA APIs \dots
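+
+To make the contrast with the two-sided flow concrete, the sketch below uses
+MPI's one-sided (RMA) interface as a portable stand-in for RDMA-style
+operations; it is illustrative only, and the window size and synchronization
+scheme are arbitrary choices:
+\begin{verbatim}
+/* One-sided write: run with at least two ranks, e.g. mpirun -np 2. */
+#include <mpi.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+    int rank;
+    int window_buf[256] = {0};   /* memory the target exposes (registers) */
+    MPI_Win win;
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+    /* Every rank exposes window_buf; on RDMA-capable networks this is where
+     * registration of the memory region with the NIC would take place. */
+    MPI_Win_create(window_buf, sizeof(window_buf), sizeof(int),
+                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
+
+    MPI_Win_fence(0, win);
+    if (rank == 0) {
+        int payload[256];
+        for (int i = 0; i < 256; i++) payload[i] = i;
+        /* Rank 1 never posts a receive; the data lands directly in its
+         * exposed window. */
+        MPI_Put(payload, 256, MPI_INT, 1, 0, 256, MPI_INT, win);
+    }
+    MPI_Win_fence(0, win);   /* closes the epoch; the write is now visible */
+
+    if (rank == 1)
+        printf("window_buf[255] = %d\n", window_buf[255]);
+
+    MPI_Win_free(&win);
+    MPI_Finalize();
+    return 0;
+}
+\end{verbatim}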