On Discussion
parent 71072102f1
commit 037592488d
3 changed files with 132 additions and 25 deletions
@@ -644,3 +644,25 @@
journal={The Linux Kernel documentation},
year={2023}
}

@inproceedings{Yang_Izraelevitz_Swanson.FileMR-RDMA.2020,
title={{FileMR}: Rethinking {RDMA} Networking for Scalable Persistent Memory},
author={Yang, Jian and Izraelevitz, Joseph and Swanson, Steven},
booktitle={17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)},
pages={111--125},
year={2020}
}

@misc{N/A.Kernelv6.7-hugetlb.2023,
title={HugeTLB Pages},
url={https://www.kernel.org/doc/html/v6.7/admin-guide/mm/hugetlbpage.html},
journal={The Linux Kernel documentation},
year={2023}
}

@misc{N/A.Kernelv6.7-arm64-hugetlb.2023,
title={HugeTLBpage on ARM64},
url={https://www.kernel.org/doc/html/v6.7/arch/arm64/hugetlbpage.html},
journal={The Linux Kernel documentation},
year={2023}
}
Binary file not shown.

@@ -136,7 +136,7 @@ This thesis paper builds upon an ongoing research effort in implementing a tight
}
\end{itemize}

\chapter{Background}\label{chapter:background}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement managed automatically by the OS kernel. However, while HMM promises a distributed shared memory approach towards exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts in exploiting HMM in Linux predominantly focus on exposing a global address space abstraction to GPU memory -- a largely non-coordinated effort surrounding both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Limited effort has gone into incorporating HMM into other variants of accelerators in various system topologies.

Separately, allocation of hardware accelerator resources in a cluster computing environment becomes difficult when the required hardware accelerator resources of one workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster system there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload performed on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. This way of migration naturally incurs large overhead -- accelerator nodes which strictly perform computation on data in memory, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally, for starters. Moreover, must \emph{all} computations be performed near data? \textit{Adrias} \cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, results in negligible impact on tail latencies but high impact on throughput when bandwidth is maximized.
@@ -1018,7 +1018,7 @@ Finally, two simple userspace programs are written to invoke the corresponding k
\section{Results}\label{sec:sw-coherency-results}
\subsection{Controlled Allocation Size; Variable Page Count}
Experiments are first conducted on software coherency operation latencies over variable \texttt{mmap}-ed memory area sizes while keeping the underlying allocation size at 4KiB (i.e., single-page allocation). All experiments are conducted on \texttt{star}, over \texttt{mmap} memory areas ranging from 16KiB to 1GiB, with the number of sampled coherency operations controlled at 1000. Data gathering is performed using the \texttt{trace-cmd} front-end for \texttt{ftrace}. The results of these experiments are shown in figure \ref{fig:coherency-op-per-page-alloc}.
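For illustration, the following is a minimal sketch of the kind of userspace driver program used to exercise this path: it maps a buffer exported by a test kernel module, dirties every page, then asks the module to run one traced coherency maintenance pass. The device node name (\texttt{/dev/coherency\_test}) and the \texttt{ioctl} request code are hypothetical placeholders, not the actual interface of the module used in this thesis.

\begin{minted}[bgcolor=code-bg]{c}
/* Hypothetical userspace driver: map a buffer from a test kernel
 * module, dirty every 4KiB page, then trigger one software
 * coherency maintenance pass over the mapping. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define COHERENCY_TEST_FLUSH 0x1234 /* placeholder ioctl request code */

int main(int argc, char **argv)
{
    size_t len = (argc > 1) ? strtoull(argv[1], NULL, 0) : (16 << 10);
    int fd = open("/dev/coherency_test", O_RDWR); /* hypothetical node */
    if (fd < 0) { perror("open"); return 1; }

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch every 4KiB page so there is dirty data to maintain. */
    for (size_t off = 0; off < len; off += 4096)
        buf[off] = (char)off;

    /* Ask the module to perform one traced coherency maintenance pass. */
    if (ioctl(fd, COHERENCY_TEST_FLUSH, len) < 0)
        perror("ioctl");

    munmap(buf, len);
    close(fd);
    return 0;
}
\end{minted}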
\begin{figure}[h]
\centering
@@ -1030,10 +1030,12 @@ Finally, two simple userspace programs are written to invoke the corresponding k
\centering
\includegraphics[width=\textwidth]{graphics/out-log-new.pdf}
\end{subfigure}
\caption{Coherency operation latency, per-page allocation. Vertical lines represent the 25th, 50th, and 75th percentiles, respectively.}
\label{fig:coherency-op-per-page-alloc}
\end{figure}

Additionally, we obtain the latencies of the TLB flushes caused by the userspace programs, as shown in figure \ref{fig:coherency-op-tlb}.

\begin{figure}[h]
\centering
\begin{subfigure}{.8\textwidth}
@@ -1044,61 +1046,144 @@ Finally, two simple userspace programs are written to invoke the corresponding k
\centering
\includegraphics[width=\textwidth]{graphics/tlb-log.pdf}
\end{subfigure}
\caption{TLB operation latency, per-page allocation. Vertical lines represent the 25th, 50th, and 75th percentiles, respectively.}
\label{fig:coherency-op-tlb}
\end{figure}

\subsubsection*{Notes on Long-Tailed Distribution}

We identify that a long-tailed distribution of latencies exists in both figures (\ref{fig:coherency-op-per-page-alloc}, \ref{fig:coherency-op-multi-page-alloc}). For software coherency operations, we identify this to be partially due to \textit{softirq} preemption (notably, RCU maintenance), which takes higher precedence than ``regular'' kernel routines. A brief description of the \textit{processor contexts} defined in the Linux kernel is given in \textcolor{red}{Appendix ???}.

For TLB operations, we identify the cluster of long-running TLB flush operations (e.g., around $10^4~\mu$s) as interference from \texttt{mm} cleanup on process exit.

Moreover, software coherency operation latencies are highly system-specific. On \texttt{rose}, data gathered from similar experiments shows latencies roughly one tenth of those gathered on \texttt{star}, which (coincidentally) reduces the likelihood of long-tailed distributions forming due to RCU \textit{softirq} preemption.

\subsection{Controlled Page Count; Variable Allocation Size}
We also conduct experiments on software coherency operation latencies with fixed \texttt{mmap}-ed memory area sizes while varying the underlying allocation sizes. This is achieved by varying the allocation order -- while a 0-order allocation allocates $2^0 = 1$ page per allocation, an 8-order allocation allocates $2^8 = 256$ contiguous pages per allocation. All experiments are conducted on \texttt{star}. The results for all experiments are gathered using \texttt{bcc-tools}, which provides utilities for injecting \textit{BPF}-based tracing routines. The results of these experiments are visualized in figure \ref{fig:coherency-op-multi-page-alloc}, with $N \ge 64$ per experiment.
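For concreteness, the following kernel-side sketch shows the kind of loop body such an experiment traces: allocate one order-$k$ contiguous buffer with \texttt{alloc\_pages}, map it for streaming DMA, and time a single coherency maintenance pass. The use of \texttt{dma\_sync\_single\_for\_device} and the timing scaffolding are illustrative assumptions rather than the exact code of the module used for these measurements.

\begin{minted}[bgcolor=code-bg]{c}
/* Illustrative kernel-side sketch (not the exact experiment module):
 * allocate one order-`order` contiguous buffer, map it for streaming
 * DMA, and time one CPU-to-device coherency maintenance pass. */
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/ktime.h>

static u64 time_one_coherency_op(struct device *dev, unsigned int order)
{
	struct page *pages = alloc_pages(GFP_KERNEL, order);
	size_t len = PAGE_SIZE << order;
	dma_addr_t handle;
	u64 t0, t1;

	if (!pages)
		return 0;

	handle = dma_map_page(dev, pages, 0, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, handle)) {
		__free_pages(pages, order);
		return 0;
	}

	t0 = ktime_get_ns();
	/* Software coherency maintenance over the whole allocation
	 * (a clean/writeback pass for DMA_TO_DEVICE). */
	dma_sync_single_for_device(dev, handle, len, DMA_TO_DEVICE);
	t1 = ktime_get_ns();

	dma_unmap_page(dev, handle, len, DMA_TO_DEVICE);
	__free_pages(pages, order);
	return t1 - t0;
}
\end{minted}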
\begin{figure}[h]
\centering
\includegraphics[width=.8\textwidth]{graphics/var_alloc_size.pdf}
\caption{Average coherency op latency of variable-order contiguous allocation.}
\label{fig:coherency-op-multi-page-alloc}
\end{figure}

\section{Discussion}\label{sec:sw-coherency-discuss}
Figures \ref{fig:coherency-op-per-page-alloc} and \ref{fig:coherency-op-multi-page-alloc} show that, in general, coherency maintenance operation latency is \textbf{unrelated to the size of the mapped memory area} and \textbf{correlated with how large a single contiguous allocation is}. We especially note that the runtime of each software-initiated coherency maintenance operation \textbf{does not grow linearly with allocation size}. Given that both axes of figure \ref{fig:coherency-op-multi-page-alloc} are on a log scale, with the ``order'' axis interpretable as a ${\log}_2$ scale of the number of contiguous 4K pages, a perfect linear correlation between allocation size and latency would appear as a roughly straight line through the data points. This is clearly not the case for figure \ref{fig:coherency-op-multi-page-alloc}, which sees software coherency operation latency increase drastically once order $\ge$ 6 (i.e., 64 contiguous pages) while remaining roughly comparable for smaller orders.

On the other hand, the increasing coherency operation latencies exhibited for higher-order allocations have their runtime amortized by two factors:
\begin{enumerate}
\item {
The exponentially decreasing number of buffers (allocations) made in the underlying kernel module.
}
\item {
The latency of contiguous allocation operations (i.e., \texttt{alloc\_pages}) \textbf{does not} grow in relation to the size of the allocation.
}
\end{enumerate}

Due to both factors, it remains more economical to make larger contiguous allocations for DMA pages that are subject to frequent cache coherency maintenance operations than to apply a ``scatter-gather'' paradigm to the underlying allocations.
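To illustrate the amortization, consider a rough cost model (an illustrative sketch under simplifying assumptions, not a fitted model). For a mapped area of $S$ bytes backed by order-$k$ allocations, the number of buffers to maintain is $n(k) = S / (2^{k} \cdot 4\,\mathrm{KiB})$; if one maintenance pass over an order-$k$ buffer costs $c(k)$, the total maintenance cost over the area is
\[
C(k) = n(k)\,c(k) = \frac{S}{2^{k} \cdot 4\,\mathrm{KiB}}\,c(k).
\]
Figure \ref{fig:coherency-op-multi-page-alloc} suggests that $c(k)$ stays roughly flat for orders below 6, so $C(k)$ falls geometrically with $k$ in that range; for larger orders, the growth of $c(k)$ is still partially offset by the geometrically shrinking $n(k)$.
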
\subsection{\textit{Hugepages} and RDMA-based DSM}
\textit{Hugepages} are an architectural feature that allows an aligned, larger-than-page-size contiguous memory region to be represented by a single TLB entry. x86-64, for example, supports (huge)page sizes of 4KiB, 2MiB, or 1GiB \cite{N/A.Kernelv6.7-hugetlb.2023}. ARM64 supports a more involved implementation of TLB entries, allowing it to represent a wider variety of page sizes in one TLB entry (up to 16GiB!) \cite{N/A.Kernelv6.7-arm64-hugetlb.2023}. Hypothetically, using hugepages as the backing store for very large RDMA buffers reduces address translation overhead, either by relieving TLB pressure or through reduced page table indirections \cite{Yang_Izraelevitz_Swanson.FileMR-RDMA.2020}.

The use of hugepages could be transparently enabled through \textit{transparent hugepages}, as supported by contemporary Linux kernels and shown in the implementation of \texttt{alloc\_pages\_mpol} (as called from \texttt{alloc\_pages}):
\begin{minted}[linenos, mathescape, bgcolor=code-bg]{c}
/* In mm/mempolicy.c */
struct page *alloc_pages_mpol(
    gfp_t gfp, uint order,
    struct mempolicy *pol, // memory policy wrt. NUMA
    pgoff_t ilx,           // index for "interleave mempolicy"$\footnotemark[5]$
    int nid                // preferred NUMA node
) {
    /* Get nodemask for filtering NUMA nodes */
    nodemask_t *nodemask =
        policy_nodemask(gfp, pol, ilx, &nid);
    /* ... */
    if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
        order == HPAGE_PMD_ORDER &&
        ilx != NO_INTERLEAVE_INDEX
    ) {
        /* tl;dr: if using hugepages */
        if (pol->mode != MPOL_INTERLEAVE &&
            (!nodemask || node_isset(nid, *nodemask))
        ) {
            /* Try to allocate from the current/preferred node */
            page = __alloc_pages_node(
                nid,
                gfp | __GFP_THISNODE | __GFP_NORETRY,
                order
            );
            if (page || !(gfp & __GFP_DIRECT_RECLAIM))
                /* i.e., success or no-reclaim on alloc */
                return page;
        }
    }
    /* ... */
}
\end{minted}

% Further studies needed to check if this is beneficial for performance, etc. CONFIG_THP is enabled on star.

\footnotetext[5]{
When the NUMA \texttt{mempolicy} of a given virtual memory area is set to \textit{interleave}, the system allocator tries to spread page allocations across each node allowed by that policy.
}

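As a complementary illustration of explicit (non-transparent) hugepage use, the following userspace sketch backs an RDMA buffer with hugepages via \texttt{mmap(MAP\_HUGETLB)} and registers it with \texttt{ibv\_reg\_mr}. This is a hypothetical usage sketch, not part of the system implemented in this thesis; it assumes a preconfigured hugepage pool and an already-allocated protection domain \texttt{pd}.

\begin{minted}[bgcolor=code-bg]{c}
/* Hypothetical sketch: back an RDMA buffer with explicit hugepages
 * and register it as a memory region. Assumes hugepages have been
 * reserved (e.g., via /proc/sys/vm/nr_hugepages) and that `pd` is an
 * already-allocated protection domain. */
#define _GNU_SOURCE
#include <infiniband/verbs.h>
#include <stdio.h>
#include <sys/mman.h>

static struct ibv_mr *reg_hugepage_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return NULL;
    }

    /* Backing the region with hugepages means fewer, larger
     * translations for the registered memory region to cover. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        munmap(buf, len);
        return NULL;
    }
    return mr;
}
\end{minted}
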
\subsection{Reflection}

% \textcolor{red}{[TODO] Idk, something about as follows:}
% \begin{itemize}
% \item {
% Obviously, coherency maintenance operation latency is unrelated with the number of pages allocated (which may be interpreted as how frequent the operation is performed), but correlated with how large the single allocation to be maintained is.
% }
% \item {
% That said, runtime does not grow linearly with allocation size. However, on the other hand total runtime for large allocations should be smaller, as latency from more allocation operations generally overwhelm coherency operation latencies (which quantitatively becomes less prevalent).
% }
% \item {
% The results are implementation-specific, as running similar experiments in bare-metal, server-ready implementations reduce per-page latency by around 10x. Did not have chance to test variable-order allocation latency.
% }
% \item {
% In general, bigger allocation is better. Linux have hugetlbfs and transparent hugepage support but not sure how to utilize them into RDMA mechanism (also not investigated whether they are already used in RDMA driver, regardless how sparingly). This takes a deeper dive into RDMA code which I have not had the time for, simply speaking.
% }
% \end{itemize}

% - you should also measure the access latency after coherency operation, though this is impl-specific (e.g., one vendor can have a simple PoC mechanism where e.g. you have a shared L2-cache that is snooped by DMA engine, hence flush to L2-cache and call it a day for PoC; but another can just as well call main mem the PoC, dep. on impl.)

\chapter{DSM System Design}


\chapter{Conclusion}
\section{Summary}

\section{Future Work}


% \bibliographystyle{plain}
% \bibliographystyle{plainnat}
% \bibliography{mybibfile}
\printbibliography


% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix

\chapter{Terminologies}
This chapter provides a listing of the terms used in this thesis that may be of interest or warrant a quick-reference entry during reading.

\chapter{More on The Linux Kernel}
This chapter provides some extra background information on the Linux kernel that is mentioned or implied in the main text but bears insufficient significance to be explained in the \hyperref[chapter:background]{Background} chapter of this thesis.

\section{Processor Context}

\chapter{Cut \& Extra Work}
This chapter provides a brief summary of some work that was done during the writing of this thesis but that the author decided against including in the submitted work. It also explains some assumptions made with regard to the title of this thesis that the author finds, on second thought, to be weak.

\section{Replacement Policy}
\section{Coherency Protocol}
\section{\texttt{enum dma\_data\_direction}}
\section{Use case for \texttt{dcache\_clean\_poc}: \textit{smbdirect}}
\section{Listing: Userspace}
\section{\textit{Why did you do \texttt{*}?}}

% Any appendices, including any required ethics information, should be included
% after the references.

% Markers do not have to consider appendices. Make sure that your contributions
% are made clear in the main body of the dissertation (within the page limit).

% \chapter{Participants' information sheet}