Zhengyi Chen 2024-03-26 14:26:00 +00:00
parent bd5bb2564a
commit 83504633b6
5 changed files with 55 additions and 10 deletions

system('source /afs/inf.ed.ac.uk/user/s20/s2018374/Git/00-UOE/unnamed_ba_thesis/tex/draft/pyenv/bin/activate');
$pdflatex = 'lualatex --shell-escape';

@misc{ARM.SystemReady_SR.2024,
title={SystemReady SR},
url={https://www.arm.com/architecture/system-architectures/systemready-certification-program/sr},
author={{Arm Ltd.}},
year={2024}
}
@misc{ARM.SBSAv7.1.2022,
title={Arm Server Base System Architecture 7.1},
url={https://developer.arm.com/documentation/den0029/h},
author={{Arm Ltd.}},
year={2022}
}
@misc{Rockchip.RK3588.2022,
title={RK3588},
url={https://www.rock-chips.com/a/en/products/RK35_Series/2022/0926/1660.html},
author={{Rockchip Electronics Co., Ltd.}},
year={2022}
}
@misc{Raspi.Rpi5-datasheet.2023,
title={Raspberry Pi 5 Product Brief},
url={https://datasheets.raspberrypi.com/rpi5/raspberry-pi-5-product-brief.pdf},
publisher={{Raspberry Pi Ltd.}},
year={2023}
}


\subsection{Reflection}
We identify the following weaknesses in our experimental setup that undermine the generalizability of our work.
\paragraph*{What About \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies incurred by \texttt{dcache\_inval\_poc}, which is called whenever the DMA driver prepares the CPU to access data modified by the DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror the case for \texttt{dcache\_clean\_poc} listed above.
\paragraph*{Do Instrumented Statistics Reflect Real Latency?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
For one, we deliberately opt not to disable IRQs when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which:
\begin{enumerate}
\item {
is (at least) called under process context.
}
\item {
does not disable IRQ downstream.
}
\end{enumerate}
Similar context is also observed for upstream function calls, for example \\ \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, hence preventing early returns. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
% [XXX] that \\ has to be here, else texttt simply refuses to wrap
On the other hand, it may be argued that analyzing software coherency operation latency at the hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of the \texttt{clflush} family of instructions on x86 chipsets, measured in clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than the microsecond-grade function call latencies of any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call the exposed driver API functions than issue individual instructions -- i.e., it would not write inline assembly that ``reinvents the wheel'' -- instrumenting these relatively low-level, synchronous procedure calls is more crucial than instrumenting individual instructions.
\paragraph*{Lack of Hardware Diversity} The majority of data gathered throughout the experiments comes from a single, virtualized setup, which may not be reflective of real latencies incurred by software coherency maintenance operations. While similar experiments have been conducted on bare-metal systems such as \texttt{rose}, we note that \texttt{rose}'s \textit{Ampere Altra} is certified \textit{SystemReady SR} by ARM \cite{ARM.SystemReady_SR.2024} and hence supports hardware-coherent DMA access (by virtue of the \textit{ARM Server Base System Architecture}, which stipulates hardware-coherent memory access as implemented via the MMU) \cite{ARM.SBSAv7.1.2022}; it may therefore not reflect the real latencies incurred by coherency maintenance.
On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature PCIe interfaces as part of their I/O provisions, for example \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which have been selected for use in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU--DMA interoperation on such non-\textit{SystemReady SR} systems.
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA implements a hardware-level guarantee of DMA cache coherence, if no other constraints exist, it may be possible for a ``loose'' emulation of the ARMv8-A ISA to define \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without cross-correlation against \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-ops would likely not produce latency magnitudes that differ across variable-sized contiguous allocations.
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize \dots. We deduce this is due to one important variable that we failed to control across all experiments -- the power supply to the host machine.
% \paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
% Bad visualization work on 2, arguably more instructive to DSM design. This is because ex.2 is an afterthought to ex.1 and was conducted without sufficient time for proper data analysis -- ftrace takes time to analyze and visualize, notably. Maybe add an ftrace-d max-min etc. table!
% Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
% Note the difference in magnitudes in latency. This may be because of whether laptop is plugged or not. Admit your mistake and lament that you should really really really used a separate hardware with reliable energy source for these data. Note on the otherhand that the growth rate remains consistent whether plugged or not.
% Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
\chapter{Conclusion}
\section{Summary}