...
This commit is contained in:
parent
037592488d
commit
8e4c770ff6
3 changed files with 38 additions and 60 deletions
@@ -666,3 +666,10 @@
  journal={The Linux Kernel documentation},
  year={2023}
}

@misc{N/A.Kernelv6.7-transparent-hugepage.2023,
  title={Transparent Hugepage Support},
  url={https://www.kernel.org/doc/html/v6.7/admin-guide/mm/transhuge.html},
  journal={The Linux Kernel documentation},
  year={2023}
}
Binary file not shown.
@@ -1072,81 +1072,52 @@ Figures \ref{fig:coherency-op-per-page-alloc}, \ref{fig:coherency-op-multi-page-

On the other hand, the linearly increasing coherency operation latencies exhibited by higher-order allocations have their total runtime cost amortized by two factors:
\begin{enumerate}
    \item { \label{factor:1}
        An exponentially decreasing number of buffers (allocations) is made in the underlying kernel module, which corresponds to fewer memory allocation calls during runtime.
    }
    \item { \label{factor:2}
        The latency of contiguous allocation operations (i.e., \texttt{alloc\_pages}) \textbf{does not} grow significantly in relation to the size of the allocation.
    }
\end{enumerate}

Due to both factors, it remains more economical to make larger contiguous allocations for DMA pages that are subject to frequent cache coherency maintenance operations than to apply a ``scatter-gather'' paradigm to the underlying allocations.
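
As a concrete (if simplified) illustration of this trade-off, the following minimal sketch contrasts the two strategies using the standard kernel page allocator and streaming DMA API. It is not taken from our benchmark module: \texttt{dev} is assumed to be a DMA-capable \texttt{struct device}, and error handling is abbreviated.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/*
 * Sketch: maintain 2MiB of DMA-visible memory either as (a) one order-9
 * contiguous allocation or (b) 512 order-0 pages ("scatter-gather" style).
 * Strategy (a) pays one allocation call and one sync per maintenance pass;
 * strategy (b) pays 512 of each.
 */
#include <linux/gfp.h>
#include <linux/dma-mapping.h>

static int contiguous_vs_scatter(struct device *dev)
{
    struct page *big;
    dma_addr_t big_dma;
    int i;

    /* (a) Single higher-order allocation covering 2MiB. */
    big = alloc_pages(GFP_KERNEL, 9);
    if (!big)
        return -ENOMEM;
    big_dma = dma_map_page(dev, big, 0, PAGE_SIZE << 9, DMA_BIDIRECTIONAL);
    dma_sync_single_for_device(dev, big_dma, PAGE_SIZE << 9, DMA_TO_DEVICE);
    dma_unmap_page(dev, big_dma, PAGE_SIZE << 9, DMA_BIDIRECTIONAL);
    __free_pages(big, 9);

    /* (b) Same footprint as 512 base pages: 512x the allocation and
     * maintenance calls, even though each individual sync is cheaper. */
    for (i = 0; i < 512; i++) {
        struct page *pg = alloc_pages(GFP_KERNEL, 0);
        dma_addr_t d;

        if (!pg)
            return -ENOMEM;
        d = dma_map_page(dev, pg, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        dma_sync_single_for_device(dev, d, PAGE_SIZE, DMA_TO_DEVICE);
        dma_unmap_page(dev, d, PAGE_SIZE, DMA_BIDIRECTIONAL);
        __free_pages(pg, 0);
    }
    return 0;
}
\end{minted}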

\subsection{\textit{Hugepages} and RDMA-based DSM}
\textit{Hugepage} is an architectural feature that allows an aligned, larger-than-page-size contiguous memory region to be represented by a single TLB entry. x86-64, for example, supports (huge)page sizes of 4KiB, 2MiB, or 1GiB \cite{N/A.Kernelv6.7-hugetlb.2023}. ARM64 supports a more involved implementation of TLB entries, allowing it to represent more varied page sizes with one TLB entry (up to 16GiB) \cite{N/A.Kernelv6.7-arm64-hugetlb.2023}. Hypothetically, using hugepages as the backing store for very large RDMA buffers reduces address translation overhead, either by relieving TLB pressure or through reduced page table indirections \cite{Yang_Izraelevitz_Swanson.FileMR-RDMA.2020}.

Specifically, the kernel developers identify the following factors that allow \textit{hugepages} to speed up programs with large working sets \cite{N/A.Kernelv6.7-transparent-hugepage.2023}:
\begin{enumerate}
    \item {
        TLB misses run faster.
    }
    \item {
        A single TLB entry corresponds to a much larger section of virtual memory, thereby reducing the miss rate.
    }
\end{enumerate}
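
For a sense of scale (the TLB geometry here is an assumed example rather than a measured one): a 64-entry data TLB spans only 64 $\times$ 4KiB = 256KiB of virtual address space with base pages, but 64 $\times$ 2MiB = 128MiB with 2MiB hugepages -- a 512-fold increase in reach for the same number of entries.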

In general, performance-critical computing applications dealing with large memory working sets run on top of \textit{hugetlbfs} -- the hugepage mechanism exposed by the Linux kernel to userspace \cite{N/A.Kernelv6.7-transparent-hugepage.2023}. Alternatively, the use of hugepages can be dynamically and transparently enabled and disabled in userspace using \textit{transparent hugepages}, as supported by contemporary Linux kernels \cite{N/A.Kernelv6.7-transparent-hugepage.2023}. This enhances programmer productivity in userspace programs relying on a hypothetical \textit{transparent hugepage}-enabled in-kernel DSM system for heterogeneous data processing tasks on variable-sized buffers, though few in-kernel mechanisms actually incorporate \textit{transparent hugepage} support -- at the time of writing, only anonymous \textit{vma}s (e.g., stack, heap, etc.) and \textit{tmpfs/shmem} incorporate \textit{transparent hugepages} \cite{N/A.Kernelv6.7-transparent-hugepage.2023}.
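
For illustration, a minimal userspace sketch of both interfaces follows. The mapping size and the fallback behaviour noted in the comments are assumptions of this example rather than guarantees: \texttt{MAP\_HUGETLB} requires hugepages to have been reserved (e.g., via \texttt{/proc/sys/vm/nr\_hugepages}), and the \texttt{madvise} hint only takes effect when transparent hugepages are enabled in \texttt{madvise} or \texttt{always} mode.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* Userspace sketch: explicit hugetlb-backed mapping vs. a transparent
 * hugepage hint on an ordinary anonymous mapping. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30; /* 1GiB working set */

    /* Explicit hugepages: fails unless enough hugepages are reserved. */
    void *huge = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    /* Transparent hugepages: plain mapping plus a hint; the kernel may
     * still fall back to base pages if no contiguous memory is found. */
    void *thp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED)
        madvise(thp, len, MADV_HUGEPAGE);

    /* ... populate and use the buffers ... */
    if (huge != MAP_FAILED)
        munmap(huge, len);
    if (thp != MAP_FAILED)
        munmap(thp, len);
    return EXIT_SUCCESS;
}
\end{minted}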

We identify \textit{transparent hugepage} support as one possible direction for improving in-kernel DSM system performance. Traditionally, userspace programs that wish to allocate hugepages rely on \textit{libhugetlbfs} as the interface to the Linux kernel's \textit{hugetlbfs} mechanism. These techniques remain heavily reliant on programmer discretion, which is fundamentally at odds with what the parent project of this paper envisions: a remote compute node is exposed as a DMA-capable accelerator to another, whereby two compute nodes can transparently perform computation on each other's memory via heterogeneous memory management mechanisms. Because this process is transparent to the userspace programmer (who only has access to, e.g., \texttt{/dev/my\_shmem}), ideally the underlying kernel handler for \texttt{/dev/my\_shmem} should abstract away the need for hugepages for very large allocations (since this is not handled by \textit{libhugetlbfs}). Furthermore, transparent hugepage support would hypothetically also allow shared pages to be promoted and demoted at ownership transfer time, thereby allowing for dynamically-grained memory sharing while maximizing address translation performance.

Further studies also remain necessary to check whether the use of (transparent) hugepages significantly benefits a real implementation of an in-kernel DSM system. The current implementation of \texttt{alloc\_pages} does not take the hugepage allocation path even when the allocation order is sufficiently large. Consequently, future studies need to examine alternative implementations that incorporate transparent hugepages into the DSM system. One candidate that could allow for hugepage allocation, for example, is to directly use \texttt{alloc\_pages\_mpol} instead of \texttt{alloc\_pages}, as is the case for the current implementation of \textit{shmem} in the kernel; its hugepage path is sketched below.
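
For reference, the hugepage branch of \texttt{alloc\_pages\_mpol} (in \texttt{mm/mempolicy.c}, as called from \texttt{alloc\_pages} in kernel v6.7) is abridged below; the comments are explanatory and the surrounding fallback paths are omitted.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* In mm/mempolicy.c (abridged) */
struct page *alloc_pages_mpol(
    gfp_t gfp, unsigned int order,
    struct mempolicy *pol, // memory policy wrt. NUMA
    pgoff_t ilx,           // index for "interleave" mempolicy: allocations are
                           // spread across the nodes the policy allows;
                           // plain alloc_pages() passes NO_INTERLEAVE_INDEX
    int nid                // preferred NUMA node
) {
    /* Get nodemask for filtering NUMA nodes */
    nodemask_t *nodemask =
        policy_nodemask(gfp, pol, ilx, &nid);
    struct page *page;

    /* ... */
    if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
        order == HPAGE_PMD_ORDER &&
        ilx != NO_INTERLEAVE_INDEX // i.e., not called via plain alloc_pages()
    ) {
        /* tl;dr: if this is a (transparent) hugepage allocation */
        if (pol->mode != MPOL_INTERLEAVE &&
            (!nodemask || node_isset(nid, *nodemask))
        ) {
            /* Try allocating from the current/preferred node only */
            page = __alloc_pages_node(
                nid,
                gfp | __GFP_THISNODE | __GFP_NORETRY,
                order
            );
            if (page || !(gfp & __GFP_DIRECT_RECLAIM))
                /* i.e., success, or direct reclaim not allowed */
                return page;
        }
    }
    /* ... */
}
\end{minted}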

\subsection{Access Latency After \textit{PoC}}
This chapter solely explores latencies due to software cache coherency operations. In practice, it may be equally important to explore the latency incurred by read/write accesses after the \textit{PoC} is reached, which is almost always the case for any inter-operation between the CPU and DMA engines.

Recall from section \ref{subsec:armv8a-swcoherency} that ARMv8-A defines a \textit{Point} of Coherency/Unification within its coherency domains. In practice, this often implies an actual, physical \emph{point} to which cached data is evicted:
\begin{itemize}
    \item {
        Consider an ARMv8-A system design with a shared L2/lowest-level cache that is also snooped by the DMA engine. Here, the \textit{Point-of-Coherency} could be defined as the shared L2 cache, to which higher-level cache entries are cleaned or invalidated.
    }
    \item {
        Alternatively, a DMA engine may be capable of snooping all processor caches. The \textit{Point-of-Coherency} could then be defined merely as the L1 cache, with some overhead depending on how the DMA engine accesses these caches.
    }
\end{itemize}

Further studies are necessary to examine the access latency after coherency maintenance operations on various ARMv8 systems, including access from the DMA engine versus access from the CPU.
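
Such a study could start from a sketch along the following lines. This is illustrative only: it assumes an ARMv8-A kernel context in which \texttt{buf} is a cached kernel virtual address of a DMA buffer, and a real experiment would repeat the measurement over many cache lines and subtract the counter-read overhead.
\begin{minted}[linenos, bgcolor=code-bg]{c}
/* Sketch (ARMv8-A): clean+invalidate one line to the Point of Coherency,
 * then time the first CPU read of that line afterwards. Where that read is
 * served from (shared L2/LLC, DRAM, or a snoop) depends on where the PoC
 * physically lies on the given system. */
#include <linux/types.h>

static inline u64 read_cntvct(void)
{
    u64 t;

    asm volatile("isb; mrs %0, cntvct_el0" : "=r"(t));
    return t;
}

static u64 time_read_after_poc_maintenance(volatile u8 *buf)
{
    u64 start, end;

    /* Clean and invalidate the line holding *buf to the PoC. */
    asm volatile("dc civac, %0" :: "r"(buf) : "memory");
    asm volatile("dsb sy" ::: "memory");

    start = read_cntvct();
    (void)*buf;              /* first access after the maintenance op */
    end = read_cntvct();

    return end - start;      /* in generic-timer ticks */
}
\end{minted}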

\subsection{Reflection}

\chapter{Conclusion}
\section{Summary}