diff --git a/tex/.gitignore b/tex/.gitignore
index 407b60e..7dbcb66 100644
--- a/tex/.gitignore
+++ b/tex/.gitignore
@@ -303,3 +303,6 @@ TSWLatexianTemp*
 # minted
 **/_minted-skeleton
 **/pyenv
+
+# .latexmkrc is environment-specific; keep it out of version control
+**/.latexmkrc
\ No newline at end of file
diff --git a/tex/draft/.latexmkrc b/tex/draft/.latexmkrc
deleted file mode 100644
index 9920651..0000000
--- a/tex/draft/.latexmkrc
+++ /dev/null
@@ -1,2 +0,0 @@
-system('source /afs/inf.ed.ac.uk/user/s20/s2018374/Git/00-UOE/unnamed_ba_thesis/tex/draft/pyenv/bin/activate');
-$pdflatex = 'lualatex --shell-escape';
\ No newline at end of file
diff --git a/tex/draft/mybibfile.bib b/tex/draft/mybibfile.bib
index 79c61d8..bcce99b 100644
--- a/tex/draft/mybibfile.bib
+++ b/tex/draft/mybibfile.bib
@@ -690,3 +690,33 @@
 year={2018}
 }
 
+@misc{ARM.SystemReady_SR.2024,
+  title={SystemReady SR},
+  url={https://www.arm.com/architecture/system-architectures/systemready-certification-program/sr},
+  journal={Arm},
+  author={{Arm Ltd.}},
+  year={2024}
+}
+
+@misc{ARM.SBSAv7.1.2022,
+  title={Arm Server Base System Architecture 7.1},
+  url={https://developer.arm.com/documentation/den0029/h},
+  journal={Arm Developer Documentation},
+  author={{Arm Ltd.}},
+  year={2022}
+}
+
+@misc{Rockchip.RK3588.2022,
+  title={RK3588},
+  url={https://www.rock-chips.com/a/en/products/RK35_Series/2022/0926/1660.html},
+  journal={Rockchip},
+  author={{Rockchip Electronics Co., Ltd.}},
+  year={2022}
+}
+
+@misc{Raspi.Rpi5-datasheet.2023,
+  title={Raspberry Pi 5 Product Brief},
+  url={https://datasheets.raspberrypi.com/rpi5/raspberry-pi-5-product-brief.pdf},
+  publisher={Raspberry Pi Ltd.},
+  year={2023}
+}
diff --git a/tex/draft/skeleton.pdf b/tex/draft/skeleton.pdf
index d4e0a5f..15fd636 100644
Binary files a/tex/draft/skeleton.pdf and b/tex/draft/skeleton.pdf differ
diff --git a/tex/draft/skeleton.tex b/tex/draft/skeleton.tex
index 9a394c5..26b8852 100644
--- a/tex/draft/skeleton.tex
+++ b/tex/draft/skeleton.tex
@@ -1172,19 +1172,33 @@ Further studies are necessary to examine the latency after coherency maintenance
 \subsection{Reflection}
 We identify the following weaknesses within our experiment setup that undermines the generalizability of our work.
-\paragraph*{Where is \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies posed by \texttt{dcache\_inval\_poc}, which will be called whenever the DMA driver attempts to prepare the CPU to access data modified by DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror the case for \texttt{dcache\_clean\_poc} listed above.
+\paragraph*{What About \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies incurred by \texttt{dcache\_inval\_poc}, which is called whenever the DMA driver prepares the CPU to access data modified by the DMA engine. Exposing \texttt{dcache\_inval\_poc} for similar instrumentation in further studies should be straightforward, as the necessary steps mirror those taken for \texttt{dcache\_clean\_poc} above.
 
-\paragraph*{Do Instrumented Statistics Reflect Reality?} It remains debateable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically via exporting \texttt{dcache\_clean\_poc} to driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
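+For illustration, such an exposure may look like the following sketch; the wrapper name is hypothetical, and the prototype of \texttt{dcache\_inval\_poc} (a start and an end virtual address, as in recent arm64 kernels) is an assumption that should be checked against the target kernel version:
+\begin{minted}{c}
+/* Hypothetical sketch: expose dcache_inval_poc through an exported,
+ * non-inlined wrapper so that ftrace can attach to it, mirroring the
+ * treatment of dcache_clean_poc described in this work. */
+#include <linux/compiler.h>
+#include <linux/export.h>
+#include <linux/types.h>
+#include <asm/cacheflush.h>
+
+noinline void traced_dcache_inval_poc(void *vaddr, size_t size)
+{
+        unsigned long start = (unsigned long)vaddr;
+
+        /* Invalidate the range to the Point of Coherency. */
+        dcache_inval_poc(start, start + size);
+}
+EXPORT_SYMBOL_GPL(traced_dcache_inval_poc);
+\end{minted}
+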
+\paragraph*{Do Instrumented Statistics Reflect Real Latency?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
 
-For one, we specifically opt not to disable IRQ when running \texttt{\_\_dcahce\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which (1) is called under process context and (2) does not disable IRQ downstream. Similar context is also observed for upstream function calls, for example \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, hence preventing early returns. The effect of this on tail latencies have been discussed in section \ref{subsec:long-tailed}.
+For one, we specifically opt not to disable IRQs when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which:
+\begin{enumerate}
+    \item {
+        is (at least) called under process context.
+    }
+    \item {
+        does not disable IRQs downstream.
+    }
+\end{enumerate}
+Similar context is also observed for upstream function calls, for example \\ \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts can preempt the cache coherency operations, preventing them from returning early. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
+% [XXX] that \\ has to be here, else texttt simply refuses to wrap
 
-On the other hand, it might be argued that analyzing software coherency operation latency on a hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of \texttt{clflush}-family of instructions performed on x86 chipsets measured in units of clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018} amount to around 250 cycles -- significantly less than microsecond-grade function call latencies for any GHz-capable CPUs. We argue that because an in-kernel implementation of a DSM system would more likely call into the exposed driver API function calls as opposed to individual instructions -- i.e., not writing inline assemblies that ``reinvent the wheel'' -- instrumentation of relatively low-level and synchronous procedure calls is more crucial than instrumenting individual instructions.
+On the other hand, it may be argued that analyzing software coherency operation latency at the hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of the \texttt{clflush} family of instructions on x86 chipsets, measured in clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- roughly 0.1 to 0.25 microseconds at clock rates between 1 and 2.5 GHz, and thus significantly less than the microsecond-grade function call latencies observed for any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call into the exposed driver API functions than issue individual instructions -- i.e., it would not hand-write inline assembly to ``reinvent the wheel'' -- instrumenting these relatively low-level, synchronous procedure calls is more crucial than instrumenting individual instructions.
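+
+To make this point concrete, a DSM-style consumer would typically go through the kernel's DMA synchronization calls (such as the \texttt{dma\_sync\_single\_for\_*} family mentioned above) rather than hand-written cache-maintenance instructions; the function below is a hypothetical sketch of that usage, not code from our implementation:
+\begin{minted}{c}
+/* Hypothetical sketch: hand a device-written buffer back to the CPU
+ * via the generic DMA API. On a non-hardware-coherent arm64 system
+ * this path reaches arch_sync_dma_for_cpu() and therefore the PoC
+ * maintenance whose call latency is instrumented in this work. */
+#include <linux/dma-mapping.h>
+
+static void dsm_consume_remote_page(struct device *dev,
+                                    dma_addr_t dma_handle, size_t len)
+{
+        dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);
+
+        /* ... the CPU may now safely read the buffer contents ... */
+}
+\end{minted}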
 
-\paragraph*{Lack of hardware diversity}
+\paragraph*{Lack of Hardware Diversity}
 The majority of data gathered throughout the experiments come from a single, virtualized setup which may not be reflective of real latencies incurred by software coherency maintenance operations. While similar experiments have been conducted in bare-metal systems such as \texttt{rose}, we note that \texttt{rose}'s \textit{Ampere Altra} is certified \textit{SystemReady SR} by ARM \cite{ARM.SystemReady_SR.2024} and hence supports hardware-coherent DMA access (by virture of \textit{ARM Server Base System Architecture} which stipulates hardware-coherent memory access as implemented via MMU) \cite{ARM.SBSAv7.1.2022}, and hence may not be reflective of any real latencies incurred via coherency maintenance.
 
-\paragraph*{Inconsistent latency magnitudes across experiments} We recognize \dots. We deduce this is due to one important variable across all experiments that we failed to control -- power supply to host machine.
+On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature PCIe interfaces as part of their I/O provisions, for example \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which have been adopted in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU-DMA interoperation on such non-\textit{SystemReady SR} systems.
 
-\paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
+Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA implements a hardware-level guarantee of DMA cache coherence, if no other constraints exist, it may be possible for a ``loose'' emulation of the ARMv8-A ISA to define \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be ascertained without cross-correlation against \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-ops would likely not produce differing latency magnitudes across variable-sized contiguous allocations.
+
+\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize \dots. We deduce this is due to one important variable across all experiments that we failed to control -- the power supply to the host machine.
+
+% \paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
 % Bad visualization work on 2, arguably more instructive to DSM design. THis is because ex.2 is an afterthought to ex.1 and is conducted without sufficient time for proper data analysis -- ftrace takes time to analyze and visualize, notably. Maybe add a ftraced max-min etc. table!
 % Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
 
@@ -1193,7 +1207,7 @@ On the other hand, it might be argued that analyzing software coherency operatio
 
 % Note the difference in magnitudes in latency. This may be because of whether laptop is plugged or not. Admit your mistake and lament that you should really really really used a separate hardware with reliable energy source for these data. Note on the otherhand that the growth rate remains consistent whether plugged or not.
 
-Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
+% Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
 
 \chapter{Conclusion}
 \section{Summary}