Added idea for discussion

This commit is contained in:
Zhengyi Chen 2024-03-20 15:00:34 +00:00
parent eef0ac7635
commit 71072102f1
2 changed files with 20 additions and 4 deletions

@@ -42,10 +42,9 @@
% -> subfigures
\usepackage{subcaption}
% <- subfigures
% -> inconsolata texttt
\usepackage{inconsolata}
% -> font fix
\usepackage[T1]{fontenc}
% <- font fix
% <- inconsolata
\begin{document}
\begin{preliminary}
@@ -1050,15 +1049,32 @@ Finally, two simple userspace programs are written to invoke the corresponding k
\end{figure}
\subsection{Controlled Page Count; Variable Allocation Size}
\textcolor{red}{[TODO] The graphs are not made yet\dots Run a bcc-tools capture and draw a gnuplot.}
\textcolor{red}{[TODO] See Figure~\ref{fig:coherency-op-multi-page-alloc}.}
\begin{figure}[h]
\centering
\includegraphics[width=.8\textwidth]{graphics/var_alloc_size.pdf}
\caption{Average coherency op latency of variable-order contiguous allocation}
\label{fig:coherency-op-multi-page-alloc}
\end{figure}
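As a sketch for the plotting TODO above: assuming the bcc-tools capture is post-processed into a two-column data file \texttt{var\_alloc\_size.dat} holding allocation order and average latency in microseconds (both the file name and the column layout are assumptions, not the actual pipeline), a minimal gnuplot script producing the referenced PDF could look like:

```gnuplot
# Hypothetical input: var_alloc_size.dat with columns
#   <allocation order (log2 pages)> <avg coherency op latency (us)>
set terminal pdfcairo size 12cm,8cm
set output 'graphics/var_alloc_size.pdf'
set xlabel 'Allocation order (log2 pages)'
set ylabel 'Average coherency op latency (us)'
set logscale y
plot 'var_alloc_size.dat' using 1:2 with linespoints \
     title 'coherency op latency'
```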
\section{Discussion}\label{sec:sw-coherency-discuss}
\textcolor{red}{[TODO] Expand on the following points:}
\begin{itemize}
\item {
Unsurprisingly, coherency maintenance operation latency is unrelated to the number of pages allocated (which may be read as how frequently the operation is performed), but is correlated with the size of the single allocation being maintained.
}
\item {
That said, latency does not grow linearly with allocation size. On the other hand, total runtime should still be smaller for large allocations, since the latency of the additional allocation operations (incurred when allocating in smaller units) generally overwhelms the coherency operation latency, which becomes quantitatively less prevalent.
}
\item {
The results are implementation-specific: running similar experiments on bare-metal, server-ready implementations reduces per-page latency by roughly 10x. I did not have the chance to test variable-order allocation latency there.
}
\item {
In general, larger allocations are better. Linux has hugetlbfs and transparent huge page support, but it is unclear how to plug them into the RDMA mechanism (nor have I investigated whether the RDMA driver already uses them, however sparingly). Simply put, that requires a deeper dive into the RDMA code than I have had time for.
}
\end{itemize}
% - We should also measure access latency after a coherency operation, though this is implementation-specific (e.g., one vendor may place the Point of Coherency (PoC) at a shared L2 cache snooped by the DMA engine, so flushing to L2 suffices for a PoC; another may just as well make main memory the PoC, depending on the implementation).
\chapter{DSM System Design}