Draft done

Zhengyi Chen 2024-03-26 21:15:31 +00:00
parent 37728272e8
commit dd3d3c82db
2 changed files with 139 additions and 45 deletions

Binary file not shown.


@ -111,11 +111,11 @@ from the Informatics Research Ethics committee.
\begin{acknowledgements}
\textcolor{red}{[TODO]:} I would like to acknowledge, first, the guidance and education from my supervisor, \textit{Amir Noohi}. Without him, this thesis could not have come to fruition.
\textcolor{red}{For unbounded peace and happiness among all peoples of the world.} % Secondly, I would like to acknowledge my mother. It had been a long way, but hopefully this means something, at least.
\textcolor{red}{May we, one day, be able to see each other as equals.} Finally, cats, chicken shawarmas, \textit{Roberto Bolaño}, bus route 45, and Gaza.
\end{acknowledgements}
@ -380,7 +380,7 @@ void dma_sync_single_for_cpu(
    struct device *dev,          // kernel repr for DMA device
    dma_addr_t addr,             // DMA address
    size_t size,                 // Synchronization buffer size
    enum dma_data_direction dir  // Data-flow direction -- see $\ref{appendix:enum_dma_data_direction}$
) {
    /* Translate DMA address to physical address */
    phys_addr_t paddr = dma_to_phys(dev, addr);
@ -439,29 +439,13 @@ extern void dcache_clean_poc(
);
\end{minted}
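For illustration only -- a hypothetical call site, not an excerpt from the kernel or from our module -- and assuming the arm64 convention that the routine takes kernel virtual start/end addresses of a linear-mapped buffer, cleaning a DMA buffer to the Point of Coherency might look as follows:
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* Hypothetical call site -- assumes `paddr` and `size` as computed
 * above and a buffer that lives in the kernel linear mapping. */
unsigned long vaddr = (unsigned long)phys_to_virt(paddr);
dcache_clean_poc(vaddr, vaddr + size); // clean [vaddr, vaddr + size) to PoC
\end{minted}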
\subsubsection{Use-case: Kernel-space \textit{SMBDirect} Driver}
An example of a cache-coherent in-kernel RDMA networking module over heterogeneous ISAs can be found in the Linux implementation of \textit{SMBDirect}. \textit{SMBDirect} is an extension of the \textit{SMB} (\textit{Server Message Block}) protocol that opportunistically establishes communication over RDMA-capable network interfaces \cite{many.MSFTLearn-SMBDirect.2024}.
We focus on two procedures inside the in-kernel SMBDirect implementation:
\paragraph*{Before send: \texttt{smbd\_post\_send}}
\texttt{smbd\_post\_send} is a function downstream of the call-chain of \texttt{smbd\_send}, which sends SMBDirect payloads for transport over the network. Payloads are constructed and batched for maximized bandwidth, then \texttt{smbd\_send} calls \texttt{smbd\_post\_send} to signal the RDMA NIC for transport.
The function body is roughly as follows:
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
@ -928,7 +912,7 @@ Several implementation quirks that warrant attention are as follows:
\item {\label{quirk:__my_shmem_fault_remap}
\texttt{\_\_my\_shmem\_fault\_remap} serves as the inner logic for when the outer page-fault handling (allocation) logic deems that a sufficient number of pages exist for handling the current page fault. As its name suggests, it finds and remaps the correct allocation into the page fault's parent VMA (assuming, of course, that such an allocation exists).
The logic of this function is similar to \hyperref[para:file_operations]{\texttt{my\_shmem\_fops\_mmap}}. For a code excerpt listing, refer to Appendix \ref{appendix:__my_shmem_fault_remap}.
}
\end{enumerate}
@ -1005,7 +989,7 @@ Because we do not inline \texttt{\_\_dcache\_clean\_poc}, we are able to include
\texttt{bcc-tools}, on the other hand, provides an array of handy instrumentation tools that are compiled just-in-time into \textit{BPF} programs and run inside an in-kernel virtual machine. How BPF programs are parsed and run inside the Linux kernel is documented in the kernel documentation \cite{N/A.Kernelv6.7-libbpf.2023}. The ability of \texttt{bcc}/\texttt{libbpf} programs to interface with both userspace and kernelspace function-tracing mechanisms makes \texttt{bcc-tools} ideal as an easy tracing interface for both userspace and kernelspace tracing.
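As a hypothetical illustration (not one of the scripts used in this work), the kernel-side portion of a \texttt{bcc} program that counts invocations of a kernel symbol such as \texttt{dcache\_clean\_poc} could be written in \texttt{bcc}'s restricted-C dialect along the following lines; the program text is compiled and attached to a kprobe by a small Python frontend.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* Hypothetical bcc restricted-C program: count calls to a kprobed symbol. */
BPF_HASH(call_count, u64, u64);  // BPF hash map: key -> counter

int count_calls(struct pt_regs *ctx) {
    u64 key = 0;
    call_count.increment(key);   // atomically bump the counter for `key`
    return 0;
}
\end{minted}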
\subsection{Userspace Programs}
Finally, two simple userspace programs are written to invoke the corresponding kernelspace callback operations -- namely, allocation and cleaning of kernel buffers for simulating DMA behaviors. To achieve this, each simply \texttt{mmap}s the number of pages passed in as an argument and either reads or writes the entirety of the buffer (which is what differentiates the two programs). A listing of their logic is at Appendix \ref{appendix:userspace}.
\section{Results}\label{sec:sw-coherency-results}
\subsection{Controlled Allocation Size; Variable Allocation Count}
@ -1214,9 +1198,9 @@ The main contribution of this thesis had swayed significantly since the beginnin
\paragraph*{Cache/Page Replacement Policies wrt. DSM Systems} Much like how this thesis proposed that \emph{2 coherence domains exist for a DSM system -- inter-node and intra-node}, the cache replacement problem also exhibits a (theoretical) duality:
\begin{itemize}
\item {
\textbf{Intra-node} cache replacement problem -- i.e., the \emph{page replacement problem} inside the running OS kernel -- is made complex by the existence of remote ramdisks as possible swap targets:
Consider, for example, that \texttt{kswapd} scans some page for replacement. We may instead establish swap files over RDMA-reachable resources such that, at placement time, we have the following options:
\begin{enumerate}
\item {
intra-node \texttt{zram}\footnotemark[7]
@ -1255,33 +1239,143 @@ The main contribution of this thesis had swayed significantly since the beginnin
% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix
% \chapter{Terminologies}
% This chapter provides a listing of all terminologies used in this thesis that may be of interest or warrant a quick-reference entry during reading.
% \begin{tabular*}{@{}c|c@{}}
% NUMA & {
% Short for \textit{Non-Uniform Memory Access}.
% A \textit{NUMA}-architecture machine describes a machine where theoretically processors access memory with different latencies. Consequently, processors have \textit{affinity} to memory -- performance is maximized when each processor accesses the ``closest'' memory with regards to the defined topology.
% } \\
% \end{tabular*}
\chapter{More on The Linux Kernel}
This chapter provides some extra background information on the Linux kernel that may have been mentioned or implied but bears insufficient significance to be explained in the \hyperref[chapter:background]{Background} chapter of this thesis.
\section{Processor Context}
The Linux kernel defines 3 contexts that the CPU could be running in at any time:
\begin{itemize}
\item Hardware Interrupt (IRQ)
\item Softirq / tasklet
\item Process context (userspace or kernelspace)
\end{itemize}
The ordering between the contexts is top-to-bottom: hardware interrupt code can preempt softirq or process-context code, and softirq code can preempt only process-context code.
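As a hypothetical illustration (not code from this project), kernel code can query which of these contexts it is currently executing in via the macros from \texttt{<linux/preempt.h>}; note that \texttt{in\_softirq()} also reports true in regions that merely have bottom halves disabled:
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <linux/preempt.h>
#include <linux/printk.h>

/* Hypothetical helper: report the current processor context. */
static void report_context(void) {
    if (in_hardirq())        // hardware interrupt (IRQ) context
        pr_info("in hardirq context\n");
    else if (in_softirq())   // softirq / tasklet (or BH-disabled) context
        pr_info("in softirq context\n");
    else                     // process context (userspace or kernelspace task)
        pr_info("in process context\n");
}
\end{minted}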
\section{\texttt{enum dma\_data\_direction}}\label{appendix:enum_dma_data_direction}
The Linux kernel defines 4 direction \texttt{enum} values for fine-tuning synchronization behaviors:
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* In include/linux/dma-direction.h */
enum dma_data_direction {
    DMA_BIDIRECTIONAL = 0, // data transfer direction uncertain.
    DMA_TO_DEVICE = 1,     // data from main memory to device.
    DMA_FROM_DEVICE = 2,   // data from device to main memory.
    DMA_NONE = 3,          // invalid repr for runtime errors.
};
\end{minted}
These values allow for certain fast-paths to be taken at runtime. For example, asserting \texttt{DMA\_TO\_DEVICE} implies that the device reads data from memory without modifying it, and hence precludes software coherence instructions from being run when synchronizing for the CPU after a DMA operation.
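As a simplified sketch of such a fast path -- modelled on the arm64 \texttt{arch\_sync\_dma\_for\_cpu} routine; exact names and structure vary across architectures and kernel versions -- the sync-for-CPU step can skip cache invalidation entirely when the device could only have read the buffer:
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* Simplified sketch: no cache maintenance is needed when the device
 * could not have written to the buffer (DMA_TO_DEVICE). */
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                           enum dma_data_direction dir) {
    unsigned long start = (unsigned long)phys_to_virt(paddr);

    if (dir == DMA_TO_DEVICE)
        return;

    dcache_inval_poc(start, start + size); // invalidate stale lines to PoC
}
\end{minted}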
\chapter{Extra}
This chapter provides a brief summary of some work that was done during the writing of the thesis, but the author decided against formal inclusion in the submitted work.
\section{Listing: \texttt{\_\_my\_shmem\_fault\_remap}}\label{appendix:__my_shmem_fault_remap}
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
static vm_fault_t __my_shmem_fault_remap(struct vm_fault *vmf) {
    vm_fault_t ret = VM_FAULT_NOPAGE;
    const ulong fault_addr = vmf->address;
    ulong remap_addr = fault_addr;
    const pgoff_t vma_pgoff = vmf->vma->vm_pgoff;
    pgoff_t vmf_pgoff = vma_pgoff + vmf->pgoff;
    /* either remap all alloced or remap entire vma */
    ulong remaining_remappable_pgs = min(
        my_shmem_page_count - vmf_pgoff,
        vma_pgoff + NR_PAGE_OF_VMA(vmf->vma) - vmf_pgoff
    );
    struct my_shmem_alloc *curr;
    pgoff_t curr_pg_off = 0;  // `curr` as page ID
    pgoff_t next_pg_off;      // next of `curr` as page ID
    list_for_each_entry(curr, &my_shmem_allocs, list) {
        next_pg_off =
            curr_pg_off + ORDER_TO_PAGE_NR(curr->alloc_order);
        if (next_pg_off > vmf_pgoff) { // curr remappable
            get_page(curr->page);
            /* Compute head offset */
            pgoff_t off_from_alloc_head = vmf_pgoff - curr_pg_off;
            /* Compute nr of pages from head to remap */
            ulong remap_range_pgs = min(
                next_pg_off - curr_pg_off - off_from_alloc_head,
                remaining_remappable_pgs
            );
            ulong remap_range_bytes = remap_range_pgs * PAGE_SIZE;
            ulong remap_pfn =
                page_to_pfn(curr->page) + off_from_alloc_head;
            /* Remap */
            int remap_ret = remap_pfn_range(
                vmf->vma,
                remap_addr,
                remap_pfn,
                remap_range_bytes,
                vmf->vma->vm_page_prot
            );
            /* if (remap_ret) goto error... */
            /* Prepare for next iteration */
            vmf_pgoff = next_pg_off;
            remaining_remappable_pgs -= remap_range_pgs;
            remap_addr += remap_range_bytes;
            if (remaining_remappable_pgs == 0) {
                /* goto ok... */
            }
        } else { // curr not in remap range
            curr_pg_off = next_pg_off;
        }
    }
    /* ... */
}
\end{minted}
\section{Listing: Userspace}\label{appendix:userspace}
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
int main(int argc, char *argv[]) {
    /* Set write & alloc amount */
    size_t page_count;
    /* parse_argument(argc, argv, &page_count); */
    const long PAGE_SIZE = sysconf(_SC_PAGESIZE);
    const size_t WRITE_AMNT = PAGE_SIZE * page_count;
    /* Open device file w/ RW perms */
    FILE *fp = fopen(DEVICE_PATH, "r+");
    /* if (!fp) error... */
    int fd = fileno(fp);
    /* if (fd == -1) error... */
    /* mmap device */
    void *buf = mmap(
        NULL,                   // addr to map to
        WRITE_AMNT,             // size_t len
        PROT_READ | PROT_WRITE, // int prot
        MAP_SHARED,             // int flags
        fd,                     // int fildes
        0                       // off_t off
    );
    /* if (buf == MAP_FAILED) error... */
    /* Write to mmap-ed device */
    char *curr_buf = buf;
    unsigned char to_write[4] = {0xca, 0xfe, 0xbe, 0xef};
    while (curr_buf < (char *)buf + WRITE_AMNT) {
        memcpy(curr_buf, to_write, 4);
        curr_buf += 4;
    }
    /* Unmap device */
    munmap(buf, WRITE_AMNT);
    /* Close device */
    fclose(fp);
    exit(EXIT_SUCCESS);
}
\end{minted}
% Any appendices, including any required ethics information, should be included % Any appendices, including any required ethics information, should be included
% after the references. % after the references.