Zhengyi Chen 2024-03-25 00:00:48 +00:00
parent 0c2c3a045a
commit bd5bb2564a
3 changed files with 52 additions and 8 deletions


@@ -672,4 +672,21 @@
url={https://www.kernel.org/doc/html/v6.7/admin-guide/mm/transhuge.html},
journal={The Linux Kernel documentation},
year={2023}
}
@inproceedings{Kim_Han_Baek.MARF.2023,
title={MARF: A Memory-Aware CLFLUSH-Based Intra-and Inter-CPU Side-Channel Attack},
author={Kim, Sowoong and Han, Myeonggyun and Baek, Woongki},
booktitle={European Symposium on Research in Computer Security},
pages={120--140},
year={2023},
organization={Springer}
}
@article{Fog.Instr-table-x86.2018,
title={Instruction tables},
author={Fog, Agner},
journal={Technical University of Denmark},
year={2018}
}

Binary file not shown.


@@ -1019,7 +1019,7 @@ Because we do not inline \texttt{\_\_dcache\_clean\_poc}, we are able to include
Finally, two simple userspace programs are written to invoke the corresponding kernelspace callback operations -- namely, allocation and cleaning of kernel buffers for simulating DMA behaviors. Each program simply \texttt{mmap}s the number of pages passed as an argument and then either reads or writes the entire buffer (which is what differentiates the two programs). A listing of their logic is at \textcolor{red}{[TODO] Appendix ???}.
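Until that appendix listing is filled in, the following is a minimal sketch of the \emph{write} variant of such a program; the device node path (\texttt{/dev/dsm\_coherency}) and the page-count argument convention are illustrative assumptions, not the actual interface used in our setup.
\begin{verbatim}
/* Hypothetical sketch of the "write" variant: mmap <npages> pages backed by
 * the kernel module's buffer, then touch every byte.  The device path and
 * argument convention below are assumptions for illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t npages = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1;
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);

    int fd = open("/dev/dsm_coherency", O_RDWR);    /* hypothetical node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0xA5, len);   /* write variant: touch the entire buffer   */
    /* The read variant instead reads every byte into a volatile sink.    */

    munmap(buf, len);
    close(fd);
    return 0;
}
\end{verbatim}
The read variant is identical except that the \texttt{memset} is replaced by a loop that reads every byte of the buffer.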
\section{Results}\label{sec:sw-coherency-results}
\subsection{Controlled Allocation Size; Variable Allocation Count}
Experiments are first conducted on software coherency operation latencies over variable \texttt{mmap}-ed memory area sizes while keeping the underlying allocation size at 4KiB (i.e., single-page allocations). All experiments are conducted on \texttt{star}, over \texttt{mmap} memory areas ranging from 16KiB to 1GiB, and in each case we control the number of sampled coherency operations to 1000. Data gathering is performed using the \texttt{trace-cmd} front-end for \texttt{ftrace}. The results of the experiments are shown in figure \ref{fig:coherency-op-per-page-alloc}.
\begin{figure}[h]
@@ -1052,14 +1052,14 @@ Additionally, we also obtain the latencies of TLB flushes due to userspace progr
\label{fig:coherency-op-tlb}
\end{figure}
\subsubsection*{Notes on Long-Tailed Distribution} \label{subsec:long-tailed}
We identify that a long-tailed distribution of latencies exists in both figures (\ref{fig:coherency-op-per-page-alloc}, \ref{fig:coherency-op-multi-page-alloc}). For software coherency operations, we identify this to be partially due to \textit{softirq} preemption (notably, RCU maintenance), which takes higher precedence than ``regular'' kernel routines. A brief description of the \textit{processor contexts} defined in the Linux kernel is given in \textcolor{red}{Appendix ???}.
For TLB operations, we identify the cluster of long-runtime TLB flush operations (e.g., around $10^4\,\mu$s) to be interference from \texttt{mm} cleanup on process exit.
Moreover, latencies of software coherency operations are highly system-specific. On \texttt{rose}, data gathered from similar experiments shows latencies around $1/10$-th of those gathered on \texttt{star}, which (coincidentally) reduces the likelihood of long-tailed distributions forming due to RCU \textit{softirq} preemption.
\subsection{Controlled Allocation Count; Variable Allocation Size} \label{sec:experiment-var-alloc-cnt}
We also conduct experiments on software coherency operation latencies over fixed \texttt{mmap}-ed memory area sizes while varying the underlying allocation size. This is achieved by varying the allocation order -- while an order-0 allocation allocates $2^0 = 1$ page per allocation, an order-8 allocation allocates $2^8 = 256$ contiguous pages per allocation. All experiments are conducted on \texttt{star}. The results for all experiments are gathered using \texttt{bcc-tools}, which provides utilities for injecting \textit{BPF}-based tracing routines. The results of these experiments are visualized in figure \ref{fig:coherency-op-multi-page-alloc}, with $N \ge 64$ per experiment.
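To make the order arithmetic concrete, the snippet below is a kernel-side sketch; the wrapper name and GFP flags are illustrative and not taken from our module, but \texttt{alloc\_pages} itself does return $2^{\mathit{order}}$ physically contiguous pages.
\begin{verbatim}
#include <linux/gfp.h>
#include <linux/mm.h>

/* An order-n allocation returns 2^n physically contiguous pages:
 *   order 0 ->   1 page   (  4 KiB with 4 KiB pages)
 *   order 8 -> 256 pages  (  1 MiB of contiguous memory)
 * Wrapper name and flags are illustrative only. */
static struct page *grab_block(unsigned int order)
{
    return alloc_pages(GFP_KERNEL, order);   /* NULL on failure */
}

/* e.g. grab_block(0) for the single-page case, grab_block(8) for 256 pages;
 * release with __free_pages(page, order). */
\end{verbatim}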
\begin{figure}[h]
@@ -1154,7 +1154,7 @@ We identify \textit{transparent hugepage} support as one possible direction to i
Furthermore, further studies remain necessary to check whether the use of (transparent) hugepages significantly benefits a real implementation of an in-kernel DSM system. The current implementation of \texttt{alloc\_pages} does not allow for allocation of hugepages even when the allocation order is sufficiently large. Consequently, future studies need to examine alternative implementations that incorporate transparent hugepages into the DSM system. One candidate that could allow for hugepage allocation, for example, is to directly use \texttt{alloc\_pages\_mpol} instead of \texttt{alloc\_pages}, as is the case for the current implementation of \textit{shmem} in the kernel.
\subsection{Access Latency Post-\textit{PoC}}
This chapter solely explores latencies due to software cache coherency operations. In practice, it may be equally important to explore the latency incurred by read/write accesses after \textit{PoC} is reached, which is almost always the case for any inter-operation between the CPU and DMA engines.
Recall from section \ref{subsec:armv8a-swcoherency} that ARMv8-A defines the \textit{Point} of Coherency/Unification within its coherency domains. In practice, it often implies an actual, physical \emph{point} to which cached data is evicted:
@@ -1170,6 +1170,21 @@ Recall from section \ref{subsec:armv8a-swcoherency} that ARMv8-A defines \textit
Further studies are necessary to examine the latency after coherency maintenance operations on ARMv8 architectures on various systems, including access from the DMA engine vs. access from the CPU, etc.
\subsection{Reflection}
We identify the following weaknesses within our experimental setup that undermine the generalizability of our work.
\paragraph*{Where is \texttt{dcache\_inval\_poc}?} Due to time constraints, we were unable to explore the latencies incurred by \texttt{dcache\_inval\_poc}, which is called whenever the DMA driver prepares the CPU to access data modified by the DMA engine. Further studies that expose \texttt{dcache\_inval\_poc} for similar instrumentation should be trivial, as the steps necessary should mirror those for \texttt{dcache\_clean\_poc} listed above.
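Assuming \texttt{dcache\_inval\_poc} were exported for module use in the same manner as \texttt{dcache\_clean\_poc}, the instrumented call site could look like the following sketch; the wrapper name and buffer handling are hypothetical.
\begin{verbatim}
/* Hypothetical sketch (arm64): invalidate a kernel buffer to the Point of
 * Coherency before the CPU reads data written by the DMA engine.  Assumes
 * dcache_inval_poc() has been exported to the driver namespace in the same
 * way as dcache_clean_poc(). */
#include <linux/types.h>
#include <asm/cacheflush.h>

static void prepare_cpu_read(void *buf, size_t len)
{
    unsigned long start = (unsigned long)buf;

    /* Discard potentially stale cache lines covering [start, start + len). */
    dcache_inval_poc(start, start + len);
}
\end{verbatim}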
\paragraph*{Do Instrumented Statistics Reflect Reality?} It remains debatable whether the method portrayed in section \ref{sec:sw-coherency-method}, specifically exporting \texttt{dcache\_clean\_poc} to the driver namespace as a traceable target, is a good candidate for instrumenting the ``actual'' latencies incurred by software coherency operations.
For one, we specifically opt not to disable IRQs when running \texttt{\_\_dcache\_clean\_poc}. This mirrors the implementation of \texttt{arch\_sync\_dma\_for\_cpu}, which (1) is called under process context and (2) does not disable IRQs downstream. A similar context is also observed for upstream function calls, for example \texttt{dma\_sync\_single\_for\_device}. As a consequence, kernel routines running inside IRQ/\textit{softirq} contexts are capable of preempting the cache coherency operations, thereby delaying their return. The effect of this on tail latencies has been discussed in section \ref{subsec:long-tailed}.
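For reference, a typical process-context call of this kind in a driver might look as follows. This is only a sketch: the function and variable names are illustrative, and the buffer is assumed to have already been mapped with the streaming DMA API.
\begin{verbatim}
#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Sketch: hand a CPU-written buffer to a (non-coherent) device.  On arm64,
 * dma_sync_single_for_device() runs in process context and eventually
 * reaches the d-cache clean to PoC; no IRQ disabling occurs on this path. */
static void hand_buffer_to_device(struct device *dev, dma_addr_t handle,
                                  void *cpu_buf, size_t len)
{
    memset(cpu_buf, 0xAB, len);                  /* CPU fills the buffer   */

    /* Make the CPU's writes visible to the DMA engine.                    */
    dma_sync_single_for_device(dev, handle, len, DMA_TO_DEVICE);

    /* ... start the DMA transfer here ...                                 */
}
\end{verbatim}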
On the other hand, it might be argued that analyzing software coherency operation latency at the hardware level better reveals the ``real'' latency incurred by coherency maintenance operations at runtime. Indeed, latencies of the \texttt{clflush} family of instructions on x86 chipsets, measured in clock cycles \cites{Kim_Han_Baek.MARF.2023}{Fog.Instr-table-x86.2018}, amount to around 250 cycles -- significantly less than microsecond-grade function-call latencies for any GHz-capable CPU. We argue that because an in-kernel implementation of a DSM system would more likely call into the exposed driver API functions than issue individual instructions -- i.e., it would not write inline assembly that ``reinvents the wheel'' -- instrumenting relatively low-level, synchronous procedure calls is more crucial than instrumenting individual instructions.
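For comparison, instruction-level measurements of this kind are typically obtained with a cycle-counter microbenchmark along the following lines. This is an x86 sketch assuming an \texttt{rdtscp}-capable CPU; it is not part of our experimental setup.
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Time a single CLFLUSH of a cached line, in (reference) cycles. */
static uint64_t clflush_cycles(void *p)
{
    unsigned int aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    _mm_clflush(p);
    _mm_mfence();
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

int main(void)
{
    static char line[64] __attribute__((aligned(64)));
    line[0] = 1;                              /* pull the line into cache */
    printf("clflush: ~%llu cycles\n",
           (unsigned long long)clflush_cycles(line));
    return 0;
}
\end{verbatim}
Even at a few hundred cycles per flush, such figures sit well below the microsecond-scale function-call latencies we instrument, consistent with the argument above.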
\paragraph*{Lack of hardware diversity} All primary experiments were conducted on \texttt{star}, with only limited corroborating data gathered on \texttt{rose}; as noted in section \ref{subsec:long-tailed}, latency magnitudes differ substantially between the two systems, so our results may not generalize to other ARMv8-A platforms.
\paragraph*{Inconsistent latency magnitudes across experiments} We recognize that the absolute magnitudes of the measured latencies are not consistent across experiments, even though their growth trends are. We deduce this is due to one important variable that we failed to control across all experiments -- the power supply to the host machine.
\paragraph*{Lack of expanded work on section \ref{sec:experiment-var-alloc-cnt}} The experiment of section \ref{sec:experiment-var-alloc-cnt}, arguably more instructive to DSM design, was conducted as a follow-up to the first and without sufficient time for proper data analysis and visualization.
% Bad visualization work on 2, arguably more instructive to DSM design. This is because ex.2 is an afterthought to ex.1 and is conducted without sufficient time for proper data analysis -- ftrace takes time to analyze and visualize, notably. Maybe add a ftraced max-min etc. table!
% Bad analysis on whether this really emulates anything. It may be of no significance right now (as we are solely concerned w/ software latency)
@@ -1178,6 +1193,7 @@ Further studies are necessary to examine the latency after coherency maintenance
% Note the difference in magnitudes in latency. This may be because of whether the laptop is plugged in or not. Admit your mistake and lament that you should really really really have used separate hardware with a reliable energy source for these data. Note on the other hand that the growth rate remains consistent whether plugged in or not.
Primarily, time constraints limited the author's ability to effectively resolve and mitigate the aforementioned issues.
\chapter{Conclusion}
\section{Summary}
@@ -1188,25 +1204,36 @@ Further studies are necessary to examine the latency after coherency maintenance
% \bibliographystyle{plain}
% \bibliographystyle{plainnat}
% \bibliography{mybibfile}
\printbibliography[heading=bibintoc]
% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix
\chapter{Terminologies}
This chapter provides a listing of terminology used in this thesis that may be of interest or warrant a quick-reference entry during reading.
% \begin{tabular}{@{}l|p{0.8\linewidth}@{}}
%   NUMA &
%     Short for \textit{Non-Uniform Memory Access}.
%     A \textit{NUMA}-architecture machine describes a machine where, theoretically, processors access memory with different latencies. Consequently, processors have \textit{affinity} to memory -- performance is maximized when each processor accesses the ``closest'' memory with regard to the defined topology. \\
% \end{tabular}
\chapter{More on The Linux Kernel}
This chapter provides some extra background information on the Linux kernel that may have been mentioned or implied but bears insufficient significance to be explained in the \hyperref[chapter:background]{Background} chapter of this thesis.
\section{Processor Context}
\section{\texttt{enum dma\_data\_direction}}
\section{Use case for \texttt{dcache\_clean\_poc}: \textit{smbdirect}}
\chapter{Cut \& Extra Work}
This chapter provides a brief summary of some work that was done during the writing of this thesis but that the author decided against including in the submitted work. It also explains some assumptions made with regard to the title of this thesis that the author, on second thought, finds to have weaknesses.
\section{Replacement Policy}
\section{Coherency Protocol}
\section{\texttt{enum dma\_data\_direction}}
\section{Use case for \texttt{dcache\_clean\_poc}: \textit{smbdirect}}
\section{Listing: Userspace}
\section{\textit{Why did you do \texttt{*}?}}