Zhengyi Chen 2024-03-26 14:26:00 +00:00
parent bd5bb2564a
commit 83504633b6
5 changed files with 55 additions and 10 deletions

system('source /afs/inf.ed.ac.uk/user/s20/s2018374/Git/00-UOE/unnamed_ba_thesis/tex/draft/pyenv/bin/activate');
$pdflatex = 'lualatex --shell-escape';

@misc{ARM.SystemReady_SR.2024,
title={SystemReady SR},
url={https://www.arm.com/architecture/system-architectures/systemready-certification-program/sr},
author={{Arm Ltd.}},
year={2024}
}
@misc{ARM.SBSAv7.1.2022,
title={Arm Server Base System Architecture 7.1},
url={https://developer.arm.com/documentation/den0029/h},
author={{Arm Ltd.}},
year={2022}
}
@misc{Rockchip.RK3588.2022,
title={RK3588},
url={https://www.rock-chips.com/a/en/products/RK35_Series/2022/0926/1660.html},
author={{Rockchip Electronics Co., Ltd.}},
year={2022}
}
@misc{Raspi.Rpi5-datasheet.2023,
title={Raspberry Pi 5 Product Brief},
url={https://datasheets.raspberrypi.com/rpi5/raspberry-pi-5-product-brief.pdf},
publisher={{Raspberry Pi Ltd.}},
year={2023}
}


\subsection{Reflection}
We identify the following weaknesses in our experimental setup that undermine the generalizability of our work.
\paragraph*{What About \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies incurred by \texttt{dcache\_inval\_poc}, which is called whenever the DMA driver prepares the CPU to access data modified by the DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror the case for \texttt{dcache\_clean\_poc} listed above.
\paragraph*{Do Instrumented Statistics Reflect Real Latency?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
For one, we deliberately opt not to disable IRQs when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which:
\begin{enumerate}
\item {
is (at least) called under process context.
}
\item {
does not disable IRQ downstream.
}
\end{enumerate}
Similar context is also observed for upstream function calls, for example \\ \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, hence preventing early returns. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
% [XXX] that \\ has to be here, else texttt simply refuses to wrap
On the other hand, it may be argued that analyzing software coherency operation latency at the hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of the \texttt{clflush} family of instructions on x86 chipsets, measured in clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than the microsecond-grade function call latencies of any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call the exposed driver API functions than issue individual instructions -- i.e., it would not write inline assembly that ``reinvents the wheel'' -- instrumenting these relatively low-level, synchronous procedure calls is more crucial than instrumenting individual instructions.
\paragraph*{Lack of Hardware Diversity} The majority of data gathered throughout the experiments comes from a single, virtualized setup, which may not be reflective of real latencies incurred by software coherency maintenance operations. While similar experiments have been conducted on bare-metal systems such as \texttt{rose}, we note that \texttt{rose}'s \textit{Ampere Altra} is certified \textit{SystemReady SR} by ARM \cite{ARM.SystemReady_SR.2024} and hence supports hardware-coherent DMA access (by virtue of the \textit{ARM Server Base System Architecture}, which stipulates hardware-coherent memory access as implemented via the MMU) \cite{ARM.SBSAv7.1.2022}; it may therefore not reflect the real latencies incurred by coherency maintenance.
On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature PCIe interfaces as part of their I/O provisions, for example \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which have been selected for use in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU--DMA interoperation on such non-\textit{SystemReady SR} systems.
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA implements a hardware-level guarantee of DMA cache coherence, if no other constraints exist, it may be possible for a ``loose'' emulation of the ARMv8-A ISA to define \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without cross-correlation against \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-ops would likely not produce latency magnitudes that differ across variable-sized contiguous allocations.
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize \dots. We deduce this is due to one important variable that we failed to control across all experiments -- the power supply to the host machine.
% \paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
% Bad visualization work on 2, arguably more instructive to DSM design. This is because ex.2 is an afterthought to ex.1 and was conducted without sufficient time for proper data analysis -- ftrace takes time to analyze and visualize, notably. Maybe add an ftrace-d max-min etc. table!
% Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
% Note the difference in magnitudes in latency. This may be because of whether laptop is plugged or not. Admit your mistake and lament that you should really really really used a separate hardware with reliable energy source for these data. Note on the otherhand that the growth rate remains consistent whether plugged or not.
% Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
\chapter{Conclusion}
\section{Summary}