Zhengyi Chen 2024-03-17 23:50:44 +00:00
parent 6bed890643
commit 6111e79686
4 changed files with 918 additions and 13 deletions


@ -13,10 +13,10 @@
% \usepackage{natbib} % recommended for citations % but I have no experience with natbib...
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{hyperref}
\usepackage[justification=centering]{caption}
\usepackage{graphicx}
\usepackage[english]{babel}
\usepackage{float}
% -> biblatex
\usepackage{biblatex}
\addbibresource{mybibfile.bib}
@ -36,6 +36,9 @@
% -> draw textbook-style frames
\usepackage{mdframed}
% <- frames
% -> href (LOAD LAST)
\usepackage{hyperref}
% <- href
\begin{document}
\begin{preliminary}
@ -102,7 +105,9 @@ from the Informatics Research Ethics committee.
\begin{acknowledgements}
Jordanian River to the Mediterranean Sea, maybe\dots
\textcolor{red}{For unbounded peace and happiness among all peoples of the world.}
\textcolor{red}{May we, one day, be able to see each other as equals.}
\end{acknowledgements}
@ -425,7 +430,7 @@ extern void dcache_inval_poc(
unsigned long start, unsigned long end
);
/* Clean data cache region [start, end) to PoC. $\ref{code:dcache_clean_poc}$
/* Clean data cache region [start, end) to PoC. $\label{code:dcache_clean_poc}$
*
* Write-back CPU cache entries that intersect with [start, end),
* such that data from CPU becomes visible to external writers.
@ -554,14 +559,16 @@ The rest of the chapter is structured as follows:
\end{itemize}
\section{Experiment Setup}\label{sec:sw-coherency-setup}
\subsection{QEMU-over-x86: \texttt{star}}
The primary source of experimental data come from a virtualized machine: a virtualized guest running a lightly-customized Linux v6.7.0 preemptive kernel with standard non-graphical Debian 12 distribution installed to provide userspace support. The specifics of this QEMU-emulated ARM64 test-bench, running atop of an x86-64 host PC, is at \ref{table:2}.
\subsection{QEMU-over-x86: \texttt{star}}\label{subsec:spec-star}
The primary source of experimental data is a virtualized machine: a guest running a lightly modified Linux v6.7.0 preemptive kernel, with a standard non-graphical Debian 12 distribution installed to provide userspace support. Table \ref{table:star} describes the specifics of the QEMU-emulated ARM64 test-bench, while table \ref{table:starhost} describes the specifics of its host.
\begin{table}[h]
\centering
\begin{tabular}{|c|c|}
\hline
Processors & 3x QEMU virt-8.2 (2-way SMT; emulates Cortex-A76) \\
Processors & QEMU virt-8.2 (3 $\times$ 2-way SMT; emulates Cortex-A76) \\
\hline
Frequency & 2.0 GHz (\textit{sic}\footnotemark[3]) \\
\hline
CPU Flags &
\begin{tabular}{@{}cccccc@{}}
@ -571,31 +578,113 @@ The primary source of experimental data come from a virtualized machine: a virtu
asimdrdm & lrcpc & dcpop & asimddp & & \\
\end{tabular} \\
\hline
NUMA Nodes & 1: $\{P_0, \dots, P_5\}$ \\
NUMA Topology & 1: $\{P_0,\ \dots,\ P_5\}$ \\
\hline
Memory & 4GiB \\
Memory & 1: 4GiB \\
\hline
Kernel & Linux 6.7.0 (modified) SMP Preemptive \\
\hline
Distribution & Debian 12 (bookworm) \\
\hline
\end{tabular}
\caption{Specification of \texttt{star}}
\label{table:2}
\label{table:star}
\end{table}
\footnotetext[3]{As reported by \texttt{lscpu}.}
\begin{table}[h]
\centering
\begin{tabular}{|c|c|}
\hline
Processors & AMD Ryzen 7 4800HS (8-core, 2-way SMT) \\
\hline
Frequency & 2.9 GHz (4.2 GHz Turbo) \\
\hline
NUMA Topology & 1: $\{P_0,\ \dots,\ P_{15}\}$ \\
\hline
Cache Structure &
\begin{tabular}{@{}c|c@{}}
L3 & $P_0 \dots P_7$: 4MiB, $P_8 \dots P_{15}$: 4MiB \\
L2 & Per core\footnotemark[4]: 512KiB \\
L1 & Per core: d-cache 32KiB, i-cache 32KiB \\
\end{tabular} \\
\hline
Memory & 1: 40 GiB DDR4-3200 SO-DIMM \\
\hline
Filesystem & ext4 on Samsung SSD 970 EVO Plus \\
\hline
Kernel & Linux 6.7.9 (arch1-1) SMP Preemptive \\
\hline
Distribution & Arch Linux \\
\hline
\end{tabular}
\caption{Specification of Host}
\label{table:3}
\label{table:starhost}
\end{table}
\subsection{\textit{Neoverse N1}: \texttt{rose}}
% - QEMU-over-x86; preemptive-on-preemptive
% - Native server-ready ARM64 (preemptive), which I didn't run for long ngl
\footnotetext[4]{That is, per two hardware threads. For example, $P_0$ and $P_1$ together comprise one core.}
\subsection{\textit{Ampere Altra}: \texttt{rose}}\label{subsec:spec-rose}
\begin{table}[H] % suboptimal, but otherwise gets placed in next sec...
\centering
\begin{tabular}{|c|c|}
\hline
Processors & Ampere Altra (32 core; Neoverse N1 microarch.) \\
\hline
Frequency & 1.7 GHz (3.0 GHz max) \\
\hline
NUMA Topology & 1: $\{P_0,\ \dots,\ P_{31}\}$ \\
\hline
Cache Structure &
\begin{tabular}{@{}c|c@{}}
L2 & Per core: 1MiB \\
L1 & Per core: d-cache 64KiB, i-cache 64KiB \\
\end{tabular} \\
\hline
Memory & 1: 256 GiB DDR4-3200 DIMM ECC \\
\hline
Kernel & Linux 6.7.0 (modified) SMP Preemptive \\
\hline
Distribution & Ubuntu 22.04 LTS (Jammy Jellyfish) \\
\hline
\end{tabular}
\caption{Specification of \texttt{rose}}
\label{table:rose}
\end{table}
In addition to the virtualized test-bench, I have had the honor of accessing \texttt{rose}, an ARMv8 rack server hosted by the \textcolor{red}{Systems Group} at the \textit{Informatics Forum}, through the invaluable assistance of my primary advisor, \textit{Amir Noohi}. This allows similar experimental setups to be instrumented on a server-grade bare-metal system.
The specifications of \texttt{rose} are listed in table \ref{table:rose}.
\section{Methodology}\label{sec:sw-coherency-method}
\subsection{Exporting \texttt{dcache\_clean\_poc}}
As established in subsection \ref{subsec:armv8a-swcoherency}, software cache-coherence maintenance operations (e.g., \texttt{dcache\_[clean|inval]\_poc}) are wrapped behind DMA API function calls and are hence unavailable for direct use in drivers. Moreover, instrumenting assembly routines is non-trivial compared to instrumenting C function symbols, likely because assembly symbols are stripped automatically during kernel linkage. Consequently, the existing instrumentation tools available in the Linux kernel (e.g., \texttt{ftrace}) cannot readily be used to trace these assembly routines.
In order to convert \texttt{dcache\_clean\_poc} to a traceable equivalent, a wrapper function \texttt{\_\_dcache\_clean\_poc} is created as follows:
\begin{minted}[mathescape, linenos, bgcolor=code-bg]{c}
/* In arch/arm64/mm/flush.c */
#include <asm/cacheflush_extra.h>
/* ... */
void __dcache_clean_poc(ulong start, ulong end)
{
    dcache_clean_poc(start, end); // see $\ref{code:dcache_clean_poc}$
}
EXPORT_SYMBOL(__dcache_clean_poc);
\end{minted}
Correspondingly, the header \texttt{arch/arm64/include/asm/cacheflush\_extra.h} is created to expose the symbol \texttt{\_\_dcache\_clean\_poc} to kernel modules. This has the additional benefit of creating a corresponding \texttt{ftrace} target, allowing the symbol to be instrumented using existing Linux instrumentation mechanisms. In total, the modifications made to the in-tree v6.7.0 kernel amount to a 44-line patch file (inclusive of metadata, context, etc.). Introducing the additional symbol is expected to increase the function latency by at least the time necessary to fetch the additional instructions, but this overhead is expected to be minuscule compared to the cost of the cache coherency operations themselves.
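As an illustration only, a minimal sketch of what such a header could contain is given below. Only the file path and the function prototype follow from the wrapper definition above; the include-guard name and the \texttt{linux/types.h} include are assumptions made for this sketch, not a reproduction of the actual patch.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* arch/arm64/include/asm/cacheflush_extra.h (illustrative sketch) */
#ifndef __ASM_CACHEFLUSH_EXTRA_H /* guard name is an assumption */
#define __ASM_CACHEFLUSH_EXTRA_H

#include <linux/types.h> /* assumed include: provides the ulong typedef */

/*
 * Traceable C wrapper around dcache_clean_poc;
 * defined (and exported) in arch/arm64/mm/flush.c.
 */
extern void __dcache_clean_poc(ulong start, ulong end);

#endif /* __ASM_CACHEFLUSH_EXTRA_H */
\end{minted}
A kernel module would then include \texttt{<asm/cacheflush\_extra.h>} and call \texttt{\_\_dcache\_clean\_poc(start, end)} directly over a kernel virtual address range.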
\subsection{Kernel Module: \texttt{my\_shmem}}
To simulate module-initiated cache coherence behavior over allocated kernel buffers, a kernel module, \texttt{my\_shmem}, is written such that specially written userspace programs can cause the kernel to invoke \texttt{\_\_dcache\_clean\_poc} at will.
\subsubsection{\texttt{my\_shmem}: Design}
The \texttt{my\_shmem} module is a utility for (lazily) allocating one or more kernel-space pages, remapping them into userspace for read/write operations, and, on unmap, invoking cache-coherency operations \emph{as if} the pages had been accessed via DMA.
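From the userspace side, the map, write, unmap cycle described above could be exercised roughly as in the following sketch. The helper name \texttt{map\_write\_unmap} and its parameters are hypothetical, introduced purely for illustration. The sketch also assumes that the module exposes a file that can be opened and passed to \texttt{mmap}, which is not specified here. It uses only standard POSIX calls, so it behaves identically on any regular file.
\begin{minted}[linenos, bgcolor=code-bg]{c}
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of the file at `path` as shared memory, fill them with
 * `fill`, then unmap. Returns 0 on success, -1 on failure.
 *
 * Assumption: with my_shmem, `path` would name whatever file the module
 * exposes, and munmap() is the point at which the module is described as
 * invoking its cache-coherency operation.
 */
int map_write_unmap(const char *path, size_t len, unsigned char fill)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Dirty every mapped byte so there is data to write back. */
    memset(buf, fill, len);

    /* Unmapping ends the userspace view of the pages. */
    int ret = munmap(buf, len);
    close(fd);
    return ret;
}
\end{minted}
For instance, a call such as \texttt{map\_write\_unmap(path, 4096, 0xAB)} would dirty a single 4KiB page before unmapping it, where \texttt{path} stands for the (here unspecified) file exposed by the module.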
\subsubsection{\texttt{my\_shmem}: Implementation}
\subsection{Instrumentation: \texttt{ftrace} and \textit{eBPF}}
\subsection{Userspace Programs}