Compiles on laptop, weird

Zhengyi Chen 2024-02-28 19:07:58 +00:00
parent a6d78ffc04
commit b3c3ec961b
2 changed files with 45 additions and 44 deletions

Binary file not shown.


@@ -303,8 +303,8 @@ a $k$-partitioned object, $k$ global pointers are recorded in the runtime each
pointing to the same object, with different offsets and (intuitively)
independently-chosen virtual addresses. Note that this design naturally requires
virtual addresses within each node to be \emph{pinned} -- the allocated object
cannot be re-addressed to a different virtual address, thus preventing the
global pointer that records the local virtual address from becoming
spontaneously invalidated.
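A minimal sketch of such a global pointer, with illustrative field names
rather than those of any particular runtime, pairs the owning node's rank with
the pinned local virtual address and the partition's offset within the object:
\begin{verbatim}
#include <stdint.h>

/* Hypothetical PGAS global pointer: one such record exists per partition
 * of a k-partitioned object. Field names are illustrative only. */
struct global_ptr {
    uint32_t node;    /* rank of the node owning this partition     */
    uint64_t vaddr;   /* pinned local virtual address on that node  */
    uint64_t offset;  /* offset of this partition within the object */
};
\end{verbatim}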
Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
@@ -316,54 +316,55 @@ movement manually when working with shared memory over network to maximize
their performance metrics of interest.
\subsection{Message Passing}
\textit{Message Passing} remains the predominant programming model for
parallelism between loosely-coupled nodes within a computer system, much as it
is ubiquitous in supporting concurrent components at every level of abstraction
within a computer system. Specific to cluster computing systems is the
message-passing programming model, in which parallel programs (or instances of
the same parallel program) on different nodes within the system communicate by
exchanging messages over the network between these nodes. Such models trade
programming productivity for more fine-grained control over the messages
passed, as well as a more explicit separation between the communication and
computation stages within a programming subproblem.
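As a minimal illustration, using MPI purely as a familiar example of such a
model, an explicit exchange between two processes might look like the
following sketch:
\begin{verbatim}
#include <mpi.h>

/* Minimal two-process exchange: rank 0 sends one integer, rank 1 receives it.
 * Both sides must explicitly take part in the transfer. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
\end{verbatim}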
Commonly, message-passing backends function as \textit{middlewares} --
communication runtimes -- to aid distributed software development
\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend
exposes facilities for inter-application communication to frontend developers
while transparently providing security, accounting, and fault-tolerance, much
like how an operating system may provide resource management, scheduling, and
security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}.
This is the case for implementations of the PGAS programming model, which mostly
rely on common message-passing backends to facilitate orchestrated data
manipulation across distributed nodes. Likewise, message-passing backends,
including RDMA APIs, form the backbone of many research-oriented DSM systems
\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}
{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
Message-passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network -- the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)\cite{FreeBSD.man-BPF-4.2021}
copies the internal message
into the message-passing runtime/middleware's address space; the receiver's
middleware inspects the copied message and performs some procedures accordingly,
likely also involving copying slices of message data to some registered distributed
shared memory buffer for the distributed application to access. Although this is
a highly intuitive model of data manipulation over the network, it poses a
fundamental performance issue: upon reception of each message, both the
receiver's kernel and its userspace must proactively spend CPU time moving the
received data from the bytes read off the NIC up into userspace. Because this
happens concurrently with other kernel and userspace routines in a
multi-processing system, a preemptible kernel may incur significant latency if
the kernel routine for packet filtering is preempted by another kernel routine,
userspace, or IRQs.
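The userspace half of this copy chain can be sketched with plain POSIX sockets
and hypothetical names; by the time \texttt{recv} returns, the kernel has
already copied the packet off the NIC and filtered it, and the remaining
copies all consume the receiver's CPU time:
\begin{verbatim}
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Illustrative receive path of a two-sided middleware (names are hypothetical).
 * recv() copies the payload from the kernel into the middleware's staging
 * buffer; the middleware then copies the relevant slice into the registered
 * shared-memory region that the distributed application reads. */
void middleware_poll(int sock, char *staging, size_t staging_len,
                     char *registered_region) {
    ssize_t n = recv(sock, staging, staging_len, 0); /* kernel -> middleware */
    if (n <= 0)
        return;
    /* ... inspect the message header here ... */
    memcpy(registered_region, staging, (size_t)n);   /* middleware -> DSM buffer */
}
\end{verbatim}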
Comparatively, a ``one-sided'' message-passing scheme, notably \textit{RDMA},
allows the network interface card to bypass in-kernel packet filters and
perform DMA on registered memory regions directly. The NIC can then notify the
CPU via interrupts, allowing the kernel and userspace programs to run their
callbacks at reception time with reduced latency. Because of this advantage,
many recent studies attempt to leverage RDMA APIs \dots
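A rough sketch of such a one-sided transfer with \texttt{libibverbs}, assuming
an already-connected queue pair and a remote address and key exchanged out of
band (error handling elided), could look like:
\begin{verbatim}
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a one-sided RDMA write. The peer's CPU is not involved: the NIC
 * writes the data directly into the registered (and pinned) remote region. */
int rdma_write_sketch(struct ibv_pd *pd, struct ibv_qp *qp,
                      void *local_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey) {
    /* Register (and thereby pin) the local buffer so the NIC may DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no receiver-side CPU work */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}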