...
This commit is contained in:
parent
bd5bb2564a
commit
83504633b6
5 changed files with 55 additions and 10 deletions
|
|
@ -1,2 +0,0 @@
|
|||
system('source /afs/inf.ed.ac.uk/user/s20/s2018374/Git/00-UOE/unnamed_ba_thesis/tex/draft/pyenv/bin/activate');
|
||||
$pdflatex = 'lualatex --shell-escape';
|
||||
|
|
@ -690,3 +690,33 @@
|
|||
year={2018}
|
||||
}
|
||||
|
||||
@misc{ARM.SystemReady_SR.2024,
|
||||
title={SystemReady SR},
|
||||
url={https://www.arm.com/architecture/system-architectures/systemready-certification-program/sr},
|
||||
journal={Arm},
|
||||
author={Ltd., Arm},
|
||||
year={2024}
|
||||
}
|
||||
|
||||
@misc{ARM.SBSAv7.1.2022,
|
||||
title={Arm Server Base System Architecture 7.1},
|
||||
url={https://developer.arm.com/documentation/den0029/h},
|
||||
journal={Documentation – arm developer},
|
||||
author={Ltd., Arm},
|
||||
year={2022}
|
||||
}
|
||||
|
||||
@misc{Rockchip.RK3588.2022,
|
||||
title={RK3588},
|
||||
url={https://www.rock-chips.com/a/en/products/RK35_Series/2022/0926/1660.html},
|
||||
journal={Rockchip-瑞芯微电子股份有限公司},
|
||||
author={Co., Ltd, Rockchip Electronics},
|
||||
year={2022}
|
||||
}
|
||||
|
||||
@misc{Raspi.Rpi5-datasheet.2023,
|
||||
url={https://datasheets.raspberrypi.com/rpi5/raspberry-pi-5-product-brief.pdf},
|
||||
journal={Raspberry pi 5},
|
||||
publisher={Raspberry Pi Ltd.},
|
||||
year={2023}
|
||||
}
|
||||
|
|
|
|||
Binary file not shown.
|
|
@ -1172,19 +1172,33 @@ Further studies are necessary to examine the latency after coherency maintenance
|
|||
\subsection{Reflection}
|
||||
We identify the following weaknesses within our experiment setup that undermine the generalizability of our work.
|
||||
|
||||
\paragraph*{Where is \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies posed by \texttt{dcache\_inval\_poc}, which will be called whenever the DMA driver attempts to prepare the CPU to access data modified by the DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror the case for \texttt{dcache\_clean\_poc} listed above.
|
||||
\paragraph*{What About \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies posed by \texttt{dcache\_inval\_poc}, which will be called whenever the DMA driver attempts to prepare the CPU to access data modified by the DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror the case for \texttt{dcache\_clean\_poc} listed above.
|
||||
|
||||
\paragraph*{Do Instrumented Statistics Reflect Reality?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically via exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
|
||||
\paragraph*{Do Instrumented Statistics Reflect Real Latency?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically via exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
|
||||
|
||||
For one, we specifically opt not to disable IRQ when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which (1) is called under process context and (2) does not disable IRQ downstream. Similar context is also observed for upstream function calls, for example \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, hence preventing early returns. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
|
||||
For one, we specifically opt not to disable IRQ when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which:
|
||||
\begin{enumerate}
|
||||
\item {
|
||||
is (at least) called under process context.
|
||||
}
|
||||
\item {
|
||||
does not disable IRQ downstream.
|
||||
}
|
||||
\end{enumerate}
|
||||
Similar context is also observed for upstream function calls, for example \\ \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, hence preventing early returns. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
|
||||
% [XXX] that \\ has to be here, else texttt simply refuses to wrap
|
||||
|
||||
On the other hand, it might be argued that analyzing software coherency operation latency on a hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of the \texttt{clflush} family of instructions performed on x86 chipsets, measured in units of clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than microsecond-grade function call latencies for any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call into the exposed driver API functions as opposed to individual instructions -- i.e., not writing inline assembly that ``reinvents the wheel'' -- instrumentation of relatively low-level and synchronous procedure calls is more crucial than instrumenting individual instructions.
|
||||
On the other hand, it may be argued that analyzing software coherency operation latency on a hardware level better reveals the ``real'' latency incurred by coherency maintenance operations during runtime. Indeed, latencies of the \texttt{clflush} family of instructions performed on x86 chipsets, measured in units of clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than microsecond-grade function call latencies for any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call into the exposed driver API functions as opposed to individual instructions -- i.e., not writing inline assembly that ``reinvents the wheel'' -- instrumentation of relatively low-level and synchronous procedure calls is more crucial than instrumenting individual instructions.
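To make the comparison above concrete, a rough conversion from clock cycles to wall-clock time may help. The clock frequencies $f$ used below are illustrative assumptions rather than values measured on our setup:
\[
t = \frac{N_{\mathrm{cycles}}}{f}, \qquad
\frac{250}{1\,\mathrm{GHz}} = 250\,\mathrm{ns}, \qquad
\frac{250}{3\,\mathrm{GHz}} \approx 83\,\mathrm{ns}.
\]
Under these assumptions, a single flush instruction of around 250 cycles completes in well under one microsecond.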
|
||||
|
||||
\paragraph*{Lack of hardware diversity}
|
||||
\paragraph*{Lack of Hardware Diversity} The majority of data gathered throughout the experiments comes from a single, virtualized setup, which may not reflect the real latencies incurred by software coherency maintenance operations. While similar experiments have been conducted on bare-metal systems such as \texttt{rose}, we note that \texttt{rose}'s \textit{Ampere Altra} is certified \textit{SystemReady SR} by ARM \cite{ARM.SystemReady_SR.2024} and hence supports hardware-coherent DMA access (by virtue of the \textit{ARM Server Base System Architecture}, which stipulates hardware-coherent memory access as implemented via MMU) \cite{ARM.SBSAv7.1.2022}. Measurements taken on \texttt{rose} therefore may not reflect any real latencies incurred via coherency maintenance either.
|
||||
|
||||
\paragraph*{Inconsistent latency magnitudes across experiments} We recognize \dots. We deduce this is due to one important variable across all experiments that we failed to control -- the power supply to the host machine.
|
||||
On the other hand, we note that a growing number of non-hardware-coherent ARM systems with DMA-capable interfaces (e.g., PCIe) are quickly becoming mainstream. Newer generations of embedded SoCs are starting to feature PCIe interfaces as part of their I/O provisions, for example \textit{Rockchip}'s \textit{RK3588} \cite{Rockchip.RK3588.2022} and \textit{Broadcom}'s \textit{BCM2712} \cite{Raspi.Rpi5-datasheet.2023}, both of which were selected for use in embedded and single-board systems, though (at the time of writing) with incomplete kernel support. Moreover, desktop-grade ARM CPUs and SoCs are also becoming increasingly common, spearheaded by \textit{Apple}'s \textit{M}-series processors as well as \textit{Qualcomm}'s equivalent products, all of which, to the author's knowledge, \textbf{do not} implement hardware coherence with their PCIe peripherals. Consequently, it is of interest to evaluate the performance of software-initiated cache coherency operations commonly applied in CPU-DMA interoperations on such non-\textit{SystemReady SR} systems.
|
||||
|
||||
\paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
|
||||
Orthogonally, even though the \textit{virt} emulated platform does not explicitly support hardware-based cache coherency operations, the underlying implementation of its emulation on x86 hosts is not explored in this study. Because (as established) the x86 ISA implements a hardware-level guarantee of DMA cache coherence, if no other constraints exist, it may be possible for a ``loose'' emulation of the ARMv8-A ISA to define \textit{PoC} and \textit{PoU} operations as no-ops instead, though this theory cannot be confirmed without cross-referencing \textit{virt}'s source code. Figure \ref{fig:coherency-op-multi-page-alloc} also disputes this theory, as a mapping from ARMv8-A \textit{PoC} instructions to x86 no-op instructions would likely not cause differing latency magnitudes over variable-sized contiguous allocations.
|
||||
|
||||
\paragraph*{Inconsistent Latency Magnitudes Across Experiments} We recognize \dots. We deduce this is due to one important variable across all experiments that we failed to control -- the power supply to the host machine.
|
||||
|
||||
% \paragraph*{Lack of expanded work from \ref{sec:experiment-var-alloc-cnt}}
|
||||
% Bad visualization work on ex.2, which is arguably more instructive to DSM design. This is because ex.2 is an afterthought to ex.1 and was conducted without sufficient time for proper data analysis -- ftrace output takes time to analyze and visualize, notably. Maybe add an ftrace max-min etc. table!
|
||||
|
||||
% Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
|
||||
|
|
@ -1193,7 +1207,7 @@ On the other hand, it might be argued that analyzing software coherency operatio
|
|||
|
||||
% Note the difference in magnitudes in latency. This may be because of whether laptop is plugged or not. Admit your mistake and lament that you should really really really used a separate hardware with reliable energy source for these data. Note on the otherhand that the growth rate remains consistent whether plugged or not.
|
||||
|
||||
Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
|
||||
% Primarily, time constraints limit the ability of the author to effectively resolve and mitigate the aforementioned issues.
|
||||
|
||||
\chapter{Conclusion}
|
||||
\section{Summary}
|
||||
|
|
|
|||