Reorganization + Added intro to skeleton

Zhengyi Chen 2024-03-15 23:19:54 +00:00
parent 0d78e11a97
commit 9bef473315
32 changed files with 788 additions and 2781 deletions

28
tex/draft/README.md Normal file
View file

@ -0,0 +1,28 @@
# Template for Informatics UG final-year projects
Please base your project report on `skeleton.tex`, reading the instructions in
that example file carefully.
To compile the `skeleton.pdf` report, with all cross-references resolved:
```
pdflatex skeleton.tex
bibtex skeleton.aux
pdflatex skeleton.tex
pdflatex skeleton.tex
```
Many TeX distributions have (or can install) a `latexmk` command that will
automatically compile everything that is needed:
```
latexmk -pdf skeleton.tex
```
If the logo causes compilation problems (errors related to `eushield`), it isn't
essential: you may remove the `logo` option from the first line of code.
Check first, though, that you are using `pdflatex` or the `-pdf` option above.
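For reference, with the `logo` option removed the first line of `skeleton.tex` would read as follows (a minimal sketch; leave the remaining options unchanged):
```
\documentclass[bsc,singlespacing,parskip]{infthesis}
```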
As directed in `skeleton.tex`, do not change other template or layout options.
Occasionally LaTeX gets really confused by errors, and intermediate files need
to be deleted before the report will compile again. We strongly recommend that
you keep your files in version control so that you can unpick any problems.
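If you use `latexmk`, it can also remove these intermediate files for you; this is a
standard `latexmk` option, not something specific to this template:
```
latexmk -c skeleton.tex
```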
Remember also to keep off-site backups.

View file

@ -1,28 +0,0 @@
@inproceedings{!BGW.2010.CDN,
title={Distributed caching algorithms for content distribution networks},
author={Borst, Sem and Gupta, Varun and Walid, Anwar},
booktitle={2010 Proceedings IEEE INFOCOM},
pages={1--9},
year={2010},
organization={IEEE}
}
@article{KD.2002.Akamai_CoordCacheRepl,
title={Coordinated placement and replacement for large-scale distributed caches},
author={Korupolu, Madhukar R. and Dahlin, Michael},
journal={IEEE Transactions on Knowledge and Data Engineering},
volume={14},
number={6},
pages={1317--1329},
year={2002},
publisher={IEEE}
}
@misc{Z.2022.Linux_LRU_GEN,
title={Multi-Gen LRU},
url={https://www.kernel.org/doc/html/v6.6-rc5/mm/multigen_lru.html},
journal={The Linux Kernel documentation},
author={Zhao, Yu},
editor={Alumbaugh, T. J.},
year={2022}
}

View file

@ -1,10 +0,0 @@
> A High-Performance Framework for Dynamic Cache-Replacement-Strategy-Selection in Distributed Shared Memory Systems
# Background
> Various Kinds of (Distributed) Systems (What makes a system "distributed", anyways?) $\rightarrow$
> (Distributed) Cache Replacement Algorithms (Strategies) $\rightarrow$
> Limitations to common distributed cache replacement practices in extremely time-sensitive scenarios (like ours) $\rightarrow$
> Variables that need to be accounted for in cache replacement problems $\rightarrow$
> Need for dynamic manipulation to cache replacement strategy, which implies probing & measurement & comparison, etc. $\rightarrow$
> Framework for such a thing, which is what we explore in this paper.

Binary file not shown.

View file

@ -1,63 +0,0 @@
\documentclass{article}
\usepackage{biblatex}
\title{Thesis Background}
\author{Zhengyi Chen}
\date{\today}
\addbibresource{../main.bib}
\addbibresource{background.bib}
\begin{document}
\maketitle
% Phil Karlton's famous quote about the 2 hard problems in CS here, maybe.
The problem of cache replacement is general to computer systems of all scales and topologies:
topologically massive systems, such as cellular stations\cite{GWHSZ.2014.CacheReplAsMDP-QLearning}
and CDNs\cites{EHOFK.2020.IBM-LRUvsFIFO}{!BGW.2010.CDN}{KD.2002.Akamai_CoordCacheRepl}, and
data-path-level implementations in processors\cites{QJPSE.2007.DIP}{JTSE.2010.RRIP}{SYS.2021.RLR}
alike require good solutions to maintain and maximize application performance
at various levels of granularity. On the other hand, the set of feasible and performant solutions
(i.e., cache replacement policies) for one system may or may not inform performance
improvements on another system with a different scale, objectives, and tasks, constrained by a
(mostly) different context of available inputs, metadata, etc.
We propose a framework for dynamic cache-replacement-strategy selection that balances computation
cost, optimality, and working-set estimation for each strategy while incurring minimal performance
penalties for a shared-kernel cooperative Distributed Shared Memory system. (We identify \dots)
\section{Existing Cache Replacement Strategies}
\subsection{LRU-derived Algorithms}
\subsection{FIFO-derived Algorithms}
\subsection{Cache Replacement in Processors}
\subsection{Machine Learning and Heuristics}
\section{The Cache Replacement Problem}
\section{Page Replacement in (SMP or?) Linux}
%-- But LRU_GEN is interop-ed with an array of other systems,
% how could we trivially implement alternative page replacement algorithms with maximum feature
% compliance?
%
% Cache replacement strategies local to their own resources, for example CPU cache-line replacement strategies, may not optimally perform cache eviction and
% replacement for CDNs, which (1) center \textit{frequency} over \textit{recency} and (2) could
% cooperate to utilize a nearby cache with small additional transfer cost\cite{KD.2002.Akamai_CoordCacheRepl}.
% Orthogonally, cache replacement strategies that perform well on one task might perform less well on
% another, as implied by \cite{SYS.2021.RLR} among others.
% this is the case for Linux's \textit{multi-gen LRU} page replacement algorithm which
% by default prioritizes memory access via page table to be stored in cache over those via file
% descriptors (though it dynamically self-adjusts)\cite{Z.2022.Linux_LRU_GEN} -- the kernel developers
% assume that the former is costlier upon page fault. This is well and good for programs with
% This is not to say that some amount of "technological transfer" from cache replacement strategies
% intended for one specific setting could not be
% A performant cache replacement strategy, relative to its hosting
% system, needs to strike balance between optimality and the necessary computation needed to make a
% replacement/eviction decision.
\printbibliography
\end{document}

Binary file not shown.

54
tex/draft/eushield.sty Normal file
View file

@ -0,0 +1,54 @@
%% eushield.sty -- commands to manipulate the inclusion of the EU shield
%% graphic.
%%
%% Version 1.0 [2000/11/23] -- initial version
%% Version 1.1 [2006/08/28] -- fixed PDF detection for teTeX 3
%%
%% Mary Ellen Foster <M.E.Foster@ed.ac.uk>
\def\filedate{2006/08/28}
\def\fileversion{1.1}
\ProvidesPackage{eushield}[\filedate\ v\fileversion\
Commands for including the EU shield graphic]
\RequirePackage{graphics}
\RequirePackage{ifpdf}
%% Possible values for shieldtype:
%% 0: regular monochrome
%% 1: monochrome with no background lines
%% 2: reverse monochrome
%% 3: two colours: navy and red
%% 4: full colour
\newcommand{\eushield}{}
\newcommand{\@endspecial}{}
\newcommand{\shieldtype}[1]{%
\def\@shieldtype{#1}
\ifpdf
\ifnum\@shieldtype=0
\renewcommand{\eushield}{eushield-normal}
\else\ifnum\@shieldtype=1
\renewcommand{\eushield}{eushield-noback}
\else\ifnum\@shieldtype=2
\renewcommand{\eushield}{eushield-reversed}
\else\ifnum\@shieldtype=3
\renewcommand{\eushield}{eushield-twocolour}
\else\ifnum\@shieldtype=4
\renewcommand{\eushield}{eushield-fullcolour}
\fi\fi\fi\fi\fi
\else
\renewcommand{\eushield}{eushield}
\renewcommand{\@endspecial}{%
\special{!/crestversion #1 def}}
\fi
}
\shieldtype{0}
\newcommand{\includeshield}{%
\includegraphics{\eushield}}
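%% Example (illustrative only, not part of this package): a class or document
%% could select a different crest and place it, scaled, on a title page with
%%   \shieldtype{4}   % full colour
%%   \resizebox{30mm}{!}{\includeshield}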
\ifpdf
\else
\AtBeginDocument{\@endspecial}
\fi

698
tex/draft/infthesis.cls Normal file
View file

@ -0,0 +1,698 @@
%%
%% File : infthesis.cls (LaTeX2e class file)
%% Author : Version 3.8 by Sharon Goldwater <sgwater@inf.ed.ac.uk>
%% Version 3.7 updated by Jennifer Oxley <joxley@inf.ed.ac.uk>
%% Version 3.6 by Charles Sutton <csutton@inf.ed.ac.uk>
%% Version 3.5 by Vasilis Vasaitis <v.vasaitis@sms.ed.ac.uk>
%% Version 3.4.1 updated by Tiejun Ma (t.j.ma@ed.ac.uk)
%% Version 3.0 by Mary Ellen Foster <mef@cogsci.ed.ac.uk>
%% Original version by Martin Reddy (mxr@dcs.ed.ac.uk)
%% Version : 3.8
%% Updates : 1.0 [9/11/95] - initial release.
%% 1.1 [24/4/96] - fixed bibliography bug caused by new report.cls
%% 1.2 [13/5/96] - \dedication & \thesiscaption[]
%% 1.3 [28/5/96] - abbrevs, parskip, minitoc fix, \headfootstyle
%% 1.4 [12/7/96] - appendices okay now, \cleardoublepage's added
%% 1.45 [6/8/96] - added space between chapter & numb on toc
%% 1.5 [13/8/96] - tailmargin was too small by 0.7cm!!
%% 2.0 [20/9/96] - \SetPrinter for margin settings (default=hp24)
%% no header, more abbreviations
%% 3.0 [16/10/2000] - Changed name and some formatting to become
%% "infthesis" instead
%% 3.1 [20/10/2000] - Added sans-serif running heads
%% Added pslatex by default (unless "cmfonts")
%% Added back in the code to create empty pages
%% on cleardoublepage (from titlesec.sty)
%% 3.2 [13/11/2000] - Changed name of font-setting commands
%% - Changed name of shield input file
%% 3.3 [23/11/2000] - Use the new and improved shield graphic
%% 3.4 [23/11/2000] - Political changes... also fixed a problem
%% with the margins on two-sided documents.
%% 3.4.1[09/03/2006] -add [a4paper] parameter for geometry package
%% 3.5 [02/02/2011] - fix double-sided margins with modern
%% versions of the geometry package
%% 3.6 [15/03/2012] - addition to support the MInf degree
%% 3.7 [07/02/2013] - fix MInf definition
%% 3.8 [06/06/2019] - added support for MSc degrees:
%% adi, datasci, di, cyber
%% This file contains a class definition, infthesis, for the LaTeX2e
%% system which defines the layout of theses which are submitted in
%% the School of Informatics at the University of Edinburgh.
%%
%% For information on this class, please refer to "texdoc infthesis"
%%
\NeedsTeXFormat{LaTeX2e}[1994/12/01]
\ProvidesClass{infthesis}[2019/06/06 v3.8 School of Informatics Thesis Class]
%%
%% --- Initial Code ---
%%
%% Required packages:
%% - ifthen for conditionals
%% - geometry for margin-setting
%% - graphics for including the (scaled) logo on the front page
%% - xspace if abbreviations are used
\usepackage[paper=a4paper]{geometry}
\RequirePackage{ifthen,%
graphics,%
xspace,
eushield}
%%%% The following packages are also used but are included further on rather
%%%% than here, either because their options depend on the class options or
%%%% because they only need to be loaded in certain cases.
%% - parskip for alternate formatting of paragraphs
%% - tocbibind to put LOF and LOT into table of contents
%% - sectsty to change format of section headers
%% - caption to change format of captions
%% - pslatex to set default fonts (if they don't specify notimes)
%% Default values for various fields
\newcommand{\@degreetext}{}
\newcommand{\@infschool}{School of Informatics}
\newcommand{\@university}{University of Edinburgh}
\newcommand{\@chapteralignment}{\centering}
\newcommand{\@chapterfont}{}
\newcommand{\@thesisside}{}
\newcommand{\@thesisopen}{}
\newcommand{\@thesispoints}{}
\newcommand{\@draftmessage}{}
%% Lots of boolean things to keep track of options
\newboolean{draftthesis}
\newboolean{usequotemarks}
\newboolean{usesinglespacing}
\newboolean{usedoublespacing}
\newboolean{usefullspacing}
\newboolean{usedeptreport}
\newboolean{useabbrevs}
\newboolean{sansheadings}
\newboolean{ltoc}
\newboolean{romanpre}
\newboolean{logo}
\newboolean{frontabs}
\newboolean{strict}
\newboolean{timesfonts}
%% Choose the monochrome crest for the front page (if crests used)
\shieldtype{0}
%%
%% --- Options ---
%%
%% Current options: phd, mphil, msc, mscres, bsc
%% deptreport
%% draft
%% usequotes
%% singlespacing, doublespacing, fullspacing
%% cent{er,re}chapter, leftchapter, rightchapter,
%% + all report.cls options
%%
%% Commands to set course and project (for 4th year -- too many possibilities
%% to use options).
\let\@course\@empty
\newcommand{\course}[1]{\gdef\@course{#1}}
\newcommand{\@project}{Fourth Year Project Report}
\newcommand{\project}[1]{\gdef\@project{#1}}
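%% For example (illustrative only), a BSc report preamble might contain
%%   \documentclass[bsc,logo]{infthesis}
%%   \course{Computer Science}
%%   \project{Fourth Year Project Report}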
%% Options to specify school and/or degree
% MSc degree: default is none specified
\let\@mscdegree\@empty
\DeclareOption{adi}{\gdef\@mscdegree{Advanced Design Informatics}}
\DeclareOption{ai}{\gdef\@mscdegree{Artificial Intelligence}}
\DeclareOption{cogsci}{\gdef\@mscdegree{Cognitive Science}}
\DeclareOption{cs}{\gdef\@mscdegree{Computer Science}}
\DeclareOption{cyber}{\gdef\@mscdegree{Cyber Security, Privacy and Trust}}
\DeclareOption{datasci}{\gdef\@mscdegree{Data Science}}
\DeclareOption{di}{\gdef\@mscdegree{Design Informatics}}
\DeclareOption{dsti}{\gdef\@mscdegree{Data Science, Technology, and Innovation}}
\DeclareOption{inf}{\gdef\@mscdegree{Informatics}}
% Institute
\let\@institute\@empty
\DeclareOption{aiai}{\gdef\@institute{Artificial Intelligence
Applications Institute}}
\DeclareOption{cisa}{\gdef\@institute{Centre for Intelligent Systems
and their Applications}}
\DeclareOption{icsa}{\gdef\@institute{Institute of Computing Systems
Architecture}}
\DeclareOption{ianc}{\gdef\@institute{Institute for Adaptive and
Neural Computation}}
\DeclareOption{ilcc}{\gdef\@institute{Institute for Language, Cognition and
Computation}}
\DeclareOption{ipab}{\gdef\@institute{Institute of Perception, Action
and Behaviour}}
\DeclareOption{lfcs}{\gdef\@institute{Laboratory for Foundations of
Computer Science}}
% Degree
\def\@researchdegree#1{%
\renewcommand{\@degreetext}{#1 \\
\ifx\@empty\@institute
\PackageWarning{infthesis}{No institute specified for research
degree}
\else
\@institute \\
\fi
\@infschool \\ \@university}
\setboolean{strict}{true}}
\DeclareOption{phdproposal}{\@researchdegree{PhD Proposal}}
\DeclareOption{phd}{\@researchdegree{Doctor of Philosophy}}
\DeclareOption{mphil}{\@researchdegree{Master of Philosophy}}
\DeclareOption{mscres}{\@researchdegree{Master of Science by Research}}
\def\@taughtdegree#1#2{%
\renewcommand{\@degreetext}{#1 \\
\ifx\@empty#2
% \PackageWarning{infthesis}{No course/school specified for taught
% degree}
\else
#2 \\
\fi
\@infschool \\ \@university}%
\setboolean{strict}{false}}
\DeclareOption{msc}{\@taughtdegree{Master of Science}{\@mscdegree}}
\DeclareOption{minf}{\@taughtdegree{Master of Informatics}{~}}
\DeclareOption{bsc}{\@taughtdegree{\@project}{\@course}}
%% Chapter header alignment, font of headings
\DeclareOption{centerchapter,centrechapter}
{\renewcommand{\@chapteralignment}{\centering}}
\DeclareOption{leftchapter}
{\renewcommand{\@chapteralignment}{\raggedright}}
\DeclareOption{rightchapter}
{\renewcommand{\@chapteralignment}{\raggedleft}}
\DeclareOption{sansheadings}{%
\setboolean{sansheadings}{true}}
\DeclareOption{normalheadings}{%
\setboolean{sansheadings}{false}}
%% Sidedness, openright-ness, and font size (so that the draft option can
%% override them as needed).
\DeclareOption{twoside}{\renewcommand{\@thesisside}{twoside}}
\DeclareOption{oneside}{\renewcommand{\@thesisside}{oneside}}
\DeclareOption{openany}{\renewcommand{\@thesisopen}{openany}}
\DeclareOption{openright}{\renewcommand{\@thesisopen}{openright}}
\DeclareOption{10pt}{\renewcommand{\@thesispoints}{10pt}}
\DeclareOption{11pt}{\renewcommand{\@thesispoints}{11pt}}
\DeclareOption{12pt}{\renewcommand{\@thesispoints}{12pt}}
%% Font options.
\DeclareOption{notimes}{\setboolean{timesfonts}{false}}
\DeclareOption{timesfonts}{\setboolean{timesfonts}{true}}
%% Whether it's a draft (if so, single space and wider margins to fit more)
\DeclareOption{draft}{
\setboolean{draftthesis}{true}
\ExecuteOptions{10pt,openany,twoside}
\renewcommand{\@draftmessage}{(Draft Copy)}}
%% Use quotation marks in quotation environment?
\DeclareOption{usequotes}{\setboolean{usequotemarks}{true}}
%% Use useful abbreviations?
\DeclareOption{abbrevs}{\setboolean{useabbrevs}{true}}
%% Select spacing (default: fullspacing)
\DeclareOption{singlespacing}{\setboolean{usesinglespacing}{true}}
\DeclareOption{doublespacing}{\setboolean{usedoublespacing}{true}}
\DeclareOption{fullspacing}{\setboolean{usefullspacing}{true}}
%% Options to control the format of the cover page
\DeclareOption{deptreport}{\setboolean{usedeptreport}{true}}
\DeclareOption{logo}{%
\setboolean{logo}{true}}
\DeclareOption{frontabs}{%
\setboolean{frontabs}{true}}
%% Use parskip formatting of paragraphs
\DeclareOption{parskip}{\AtEndOfClass{\RequirePackage{parskip}}}
%% Whether to put list of figures and tables into TOC (default: no)
\DeclareOption{listsintoc}{%
\setboolean{ltoc}{true}}
%% Pre pages can be numbered separately or with the rest of the thesis.
\DeclareOption{romanprepages}{\setboolean{romanpre}{true}}
\DeclareOption{plainprepages}{\setboolean{romanpre}{false}}
%% Pass all other options directly to report class
\DeclareOption*{\PassOptionsToClass{\CurrentOption}{report}}
%% Set default options and process the ones we were given.
\ExecuteOptions{phd,centerchapter,romanprepages,%
sansheadings,openright,oneside,12pt,timesfonts}
\ProcessOptions
%%
%% --- Class Loading (built ontop of report.cls) ---
%%
\LoadClass[a4paper,\@thesispoints,\@thesisside,\@thesisopen]{report}
\ifthenelse{\boolean{draftthesis}}
{}
{\ifthenelse{\boolean{strict}}
{\if@twoside
\if@openright
\else
\PackageWarning{infthesis}{A two-sided PhD or MPhil thesis must
not use the "openany" option}
\fi
\fi}
{}}
%%
%% --- Main Code ---
%%
\newboolean{isspecialchapter}
\setboolean{isspecialchapter}{false}
%%
%% First we will sort out the page layout. The following is a brief
%% summary of the university typesetting regulations:
%% Printed on A4 paper, entirely on rectos (single-sided) or on both sides
%% but with chapters starting on even pages
%% 4cm binding margin
%% 2cm head margin
%% 2.5cm fore-edge margin
%% 4cm tail margin
%% spacing: not less than 1.5 spacing (18pt leading)
%% quotations & footnotes in single spacing
%% bibliography may be in single spacing
%% character size: not exceed 12pt for body text (at least 10pt)
%% style: a serif font should be used, with a sans-serif for headings
%% hyphenation should be avoided if possible
%% Try to set up the margins according to the specifications in the thesis
%% regulations. I have removed all of the old code (ca. 1996) that attempted
%% to compensate for particular printers in the Department of Computer
%% Science.
\ifthenelse{\boolean{draftthesis}}
{\geometry{a4paper,margin=2cm,twoside}}
{\if@twoside
\geometry{a4paper,left=4cm,top=2cm,right=2.5cm,bottom=4cm,twoside}
\else
\geometry{a4paper,left=4cm,top=2cm,right=2.5cm,bottom=4cm}
\fi}
%% We should make pages created by "cleardoublepage" be
%% really empty. Taken from titlesec.sty
\def\cleardoublepage{%
\clearpage{\ps@empty\if@twoside\ifodd\c@page\else
\hbox{}\newpage\if@twocolumn\hbox{}\newpage\fi\fi\fi}}
%%
%% Hack to make minitoc work with csthesis. We declare a new chapter
%% variable called starchapter to be used by \addcontentsline when we
%% add contents lines for List of Figures/Tables. If we don't, then
%% minitoc treats the LOF/LOT sections as chapters of the thesis.
%%
\@ifundefined{chapter}{}{\let\l@starchapter\l@chapter}
%%
%% This bit will set up the header format for the thesis.
%% This currently uses a "headings" style showing the pagenumber
%% and chapter number/title. (in slanted text) If the document is two-sided,
%% the right-hand page will show the section number and title instead.
%%
\newcommand{\headfootstyle}{\normalsize} % font size of headers and footers
% This will have \sffamily added to it if "sansheadings" is specified.
%% Set up different headers for right and left-hand pages. Those \defs are a
%% bit magic, but they seem to do the trick... :-)
%% Adapted from Francois Pitt's ut-thesis class from the University of
%% Toronto.
\if@twoside % if two-sided printing
\renewcommand{\ps@headings}{
\let\@mkboth\markboth
\def\@oddfoot{}
\let\@evenfoot\@oddfoot
\def\@oddhead{{\headfootstyle\itshape \rightmark}\hfil \headfootstyle\thepage}%
\def\@evenhead{\headfootstyle\thepage\hfil
{\headfootstyle\itshape\leftmark}}%
\def\chaptermark##1{\markboth
{\ifnum\c@secnumdepth >\m@ne
\@chapapp\ \thechapter. \ \fi ##1}{}}%
\def\sectionmark##1{\markright
{\ifnum\c@secnumdepth >\z@
\thesection. \ \fi ##1}}%
}%ps@headings
\else % if one-sided printing
\renewcommand{\ps@headings}{
\let\@mkboth\markboth
\def\@oddfoot{}
\def\@oddhead{{\headfootstyle\itshape\rightmark}\hfil
\headfootstyle\thepage}%
\def\chaptermark##1{\markright
{\ifnum\c@secnumdepth >\m@ne
\@chapapp\ \thechapter. \ \fi ##1}}%
}%ps@headings
\fi%@twoside
\renewcommand{\ps@plain}{
\renewcommand{\@oddfoot}{\hfil\headfootstyle\thepage\hfil}
\renewcommand{\@evenfoot}{\hfil\headfootstyle\thepage\hfil}
\renewcommand{\@evenhead}{}
\renewcommand{\@oddhead}{}
}
%%
%% And now setup that headings style as default
%%
\newcommand{\@textpagenumbering}{arabic}
\newcommand{\@preamblepagenumbering}{roman}
\newcommand{\@textpagestyle}{headings}
\newcommand{\@preamblepagestyle}{plain}
\pagestyle{\@textpagestyle}
\setcounter{secnumdepth}{6}
%%
%% Set up the default names for the various chapter headings
%%
\renewcommand{\contentsname}{Table of Contents}
\renewcommand{\listfigurename}{List of Figures}
\renewcommand{\listtablename}{List of Tables}
\renewcommand{\bibname}{Bibliography}
\renewcommand{\indexname}{Index}
\renewcommand{\abstractname}{Abstract}
%%
%% Some sundry commands which are generally useful...
%%
\ifthenelse{\boolean{useabbrevs}}
{\newcommand{\NB}{N.B.\@\xspace}
\newcommand{\eg}{e.g.\@\xspace}
\newcommand{\Eg}{E.g.\@\xspace}
\newcommand{\ie}{i.e.\@\xspace}
\newcommand{\Ie}{I.e.\@\xspace}
\newcommand{\etc}{etc.\@\xspace}
\newcommand{\etal}{{\em et al}.\@\xspace}
\newcommand{\etseq}{{\em et seq}.\@\xspace}
\newcommand{\precis}{pr\'ecis\xspace}
\newcommand{\Precis}{Pr\'ecis\xspace}
\newcommand{\role}{r\^ole\xspace}
\newcommand{\Role}{R\^ole\xspace}
\newcommand{\naive}{na\"\i ve\xspace}
\newcommand{\Naive}{Na\"\i ve\xspace}
\newcommand{\tm}{\raisebox{1ex}{\tiny TM}\xspace}
\newcommand{\cpright}{\raisebox{1ex}{\tiny\copyright}\xspace}
\newcommand{\degrees}{\raisebox{1.2ex}{\tiny\ensuremath{\circ}}\xspace}}
{}
%%
%% Set up the double spacing and provide commands to alter the
%% spacing for the subsequent text. By default, 1.5 spacing will be
%% used. This can be modified through the singlespacing, doublespacing
%% or draft class options.
%%
\newcommand{\doublespace}{%
\renewcommand{\baselinestretch}{1.66}\normalsize}
\newcommand{\oneandahalfspace}{%
\renewcommand{\baselinestretch}{1.33}\normalsize}
\newcommand{\singlespace}{%
\renewcommand{\baselinestretch}{1}\normalsize}
\ifthenelse{\boolean{draftthesis}}
{\AtBeginDocument{\singlespace}}% \SetPrinterDraft}
{\ifthenelse{\boolean{usesinglespacing}}
{\AtBeginDocument{\singlespace}%
\ifthenelse{\boolean{strict}}
{\PackageWarning{infthesis}{Single spacing is not permitted in the
regulations for PhD and MPhil theses}}
{}}
{\ifthenelse{\boolean{usedoublespacing}}
{\AtBeginDocument{\doublespace}}
{\AtBeginDocument{\oneandahalfspace}}
}
}
%%
%% We must ensure that the thesis ends on a left-hand page. We
%% do this by issuing a \cleardoublepage at the end of the document.
%% MEF: deleted this, there's no point -- it'll print the other side of that
%% page anyway!
%%
% \AtEndDocument{\cleardoublepage}
%%
%% A couple of commands for figures/captions
%%
\newcommand{\thesiscaption}[3][]{
\ifthenelse{\equal{#1}{}}
{\parbox{5in}{\caption{{\em #2\/}}\label{#3}}}
{\parbox{5in}{\caption[#1]{{\em #2\/}}\label{#3}}}
}
%%
%% Quotations are supposed to be in single-space, so we will
%% explicitly redefine the quotation env. to support this.
%% And introduce a quotetext env. which can add an attribution.
%%
\let\old@quote\quote
\let\old@endquote\endquote
\renewcommand{\quote}{\old@quote\singlespace\ifthenelse{\boolean{usequotemarks}}{``}{}}
\renewcommand{\endquote}{\ifthenelse{\boolean{usequotemarks}}{''}{}\old@endquote}
\let\old@quotation\quotation
\let\old@endquotation\endquotation
\renewcommand{\quotation}{\old@quotation\singlespace}
\renewcommand{\endquotation}{\old@endquotation}
% \renewenvironment{quote}
% {\old@quote\singlespace
% \ifthenelse{\boolean{usequotemarks}}{``}{}}
% {\ifthenelse{\boolean{usequotemarks}}{''}{}\end{quote}}
\newenvironment{iquote}
{\begin{quote}\it}
{\rm\end{quote}}
\newcommand{\quotationname}{}
\newenvironment{quotetext}[1]
{\renewcommand{\quotationname}{#1}\begin{iquote}\singlespace
\ifthenelse{\boolean{usequotemarks}}{``}{}\it}
{\ifthenelse{\boolean{usequotemarks}}{\rm''}{}
\hspace*{\fill}\nolinebreak[1]\hspace*{\fill}
\rm (\quotationname)\end{iquote}}
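%% Example usage (illustrative only):
%%   \begin{quotetext}{Donald Knuth}
%%     Premature optimization is the root of all evil.
%%   \end{quotetext}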
%%
%% Footnotes should also be single-spaced.
%%
\let\tmp@footnotetext=\@footnotetext
\renewcommand{\@footnotetext}[1]%
{{\singlespace\tmp@footnotetext{#1}}}
%% "preliminary" environment to control numbering of pages between title and
%% first chapter. This will only kick in if romanprepages is specified (the
%% default).
%% Based on Francois Pitt's ut-thesis.cls from University of Toronto.
\newenvironment{preliminary}
{\ifthenelse{\boolean{romanpre}}%
{\pagestyle{plain}\pagenumbering{roman}}
{\pagestyle{empty}}}%
{\cleardoublepage%
\ifthenelse{\boolean{romanpre}}%
{\pagenumbering{arabic}}%
{}}
%%
%% Let's have a dedication page so I can thank my mummy.
%%
\newcommand{\dedication}[1]
{\vspace*{\fill}
\begin{center}#1\end{center}
\vspace*{\fill}}
%% A generic "frontmatter" environment, for use with abstract, dedication etc.
%% You specify the title of the environment and the font size to use (so that
%% both normal abstracts and those on the front page can be accommodated.)
\newenvironment{frontenv}[2]
{\vspace{1cm}
\begin{center}
\@chapterfont\bfseries \LARGE#1
\end{center}}
{\par\vfil}
%% You specify the abstract with the \abstract command; it gets automatically
%% inserted into the document where appropriate (title page or first main
%% page).
\def\@abs{}
\renewcommand{\abstract}[1]{\gdef\@abs{#1}}
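%% Example (illustrative only): in the preamble, before \maketitle, write
%%   \abstract{This report describes ...}
%% and the text is then placed on the title page or the first abstract page.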
\newenvironment{mainabs}
{\begin{frontenv}{\abstractname}{\LARGE}}
{\end{frontenv}\newpage}
\newenvironment{frontabs}
{\begin{frontenv}{\abstractname}{\large}
\begin{quotation}\rm}
{\end{quotation}\end{frontenv}}
%%
%% Based upon the above abstract env., provide wrappers for
%% an acknowledgements and declaration env.
%%
\newenvironment{acknowledgements}
{\renewcommand{\abstractname}{Acknowledgements}\begin{mainabs}}
{\end{mainabs}\renewcommand{\abstractname}{Abstract}}
\newenvironment{declaration}
{\renewcommand{\abstractname}{Declaration}\begin{mainabs}}
{\end{mainabs}\renewcommand{\abstractname}{Abstract}}
\newcommand{\standarddeclaration}{
\begin{declaration}
I declare that this thesis was composed by myself,
that the work contained herein is my own
except where explicitly stated otherwise in the text,
and that this work has not been submitted for any other degree or
professional qualification except as specified.\par
\vspace{1in}\raggedleft({\em \@author\/})
\end{declaration}
}
%%
%% Now let's look at the format for the title page of the
%% thesis. This is done by redefining \maketitle, and allowing
%% some extra input options: \submityear and \graduationdate
%%
\def\submityear#1{\gdef\@submityear{#1}}
\gdef\@submityear{\the\year}
\def\graduationdate#1{\gdef\@graduationdate{#1}}
\gdef\@graduationdate{}
%% If usedeptreport is specified, then none of the other funky options kick
%% in. If not, then if frontabs is specified then it is used; otherwise,
%% the logo is inserted if required. If the abstract is not put on the front
%% page, then \maketitle will also create the abstract page as the first page
%% of the actual document.
\ifthenelse{\boolean{usedeptreport}}{
\renewcommand{\maketitle}{
\begin{titlepage}
\addtolength{\oddsidemargin}{-0.75cm}
\begin{center}
\null\vskip 6.1cm
\begin{minipage}[t][7.6cm]{10.5cm}
\begin{center}
{\LARGE\bfseries \@chapterfont \@title \par
\ifthenelse{\boolean{draftthesis}}{\large \@draftmessage}{}
}\vfil
{\Large\itshape \@author \par}
\end{center}
\end{minipage}\\
{\large \@degreetext \par \@submityear \par}
\ifthenelse{\equal{\@graduationdate}{}}{}
{\vskip 1cm {\large \ttfamily (Graduation date: \@graduationdate)}}
\end{center}
\end{titlepage}\cleardoublepage
\begin{mainabs}\@abs\end{mainabs}
}}{\ifthenelse{\boolean{frontabs}}{
\ifthenelse{\boolean{strict}}
{\PackageWarning{infthesis}{The regulations for PhD and MPhil theses
do not permit the abstract on the front page}}
{}
\renewcommand{\maketitle}{
\begin{titlepage}\begin{center}
{\LARGE\bfseries \@chapterfont \@title \par
\ifthenelse{\boolean{draftthesis}}{\large (Draft Copy)}{}
}\vspace{3cm}
{\Large\itshape \@author \par}\vspace{3cm}
{\large \@degreetext \par \@submityear \par}
\vskip 1cm
\ifthenelse{\equal{\@graduationdate}{}}{}
{{\large \ttfamily (Graduation date: \@graduationdate)}}
\end{center}
\begin{frontabs}\@abs\end{frontabs}
\end{titlepage}
}}{
\renewcommand{\maketitle}{
\begin{titlepage}\begin{center}
\null\vfil\vskip 60\p@
{\LARGE\bfseries \@chapterfont \@title \par
\ifthenelse{\boolean{draftthesis}}{\large (Draft Copy)}{}
}\vfill
{\Large\itshape \@author \par}\vskip 1cm\vfill
\ifthenelse{\boolean{logo}}%
{\resizebox{30mm}{!}{\includeshield}\\\vfill}
{}
{\large \@degreetext \par \@submityear \par}
\vskip 1cm
\ifthenelse{\equal{\@graduationdate}{}}{}
{{\large \ttfamily (Graduation date: \@graduationdate)}}
\end{center}
\end{titlepage}\cleardoublepage
\begin{mainabs}\@abs\end{mainabs}
}}}
%% If requested, put the list of figures and list of tables into the table of
%% contents.
\ifthenelse{\boolean{ltoc}}
{\RequirePackage[nottoc,notbib]{tocbibind}}
{}
%% Use the "pslatex" fonts unless they told us not to
\ifthenelse{\boolean{timesfonts}}
{\RequirePackage{pslatex}}
{}
%% ALWAYS put the bibliography into the TOC.
%% Thanks to Peter Wilson <peter.r.wilson@boeing.com> for pointing me in the
%% right direction, and to Heiko Oberdiek <oberdiek@ruf.uni-freiburg.de>
%% and Michael J Downes <epsmjd@ams.org> for together coming up with this
%% solution.
%%
%% However the bibliography is defined, this will append the \addcontentsline
%% statement to it. Also, it will put the bibname into the page headers.
\AtBeginDocument{%
\expandafter\def\expandafter\thebibliography\expandafter
#\expandafter1\expandafter{\thebibliography{#1}%
{\addcontentsline{toc}{chapter}{\bibname}}
\markboth{\bibname}{\bibname}}}
%% Do what is requested with headings and captions... can't include these
%% packages above because sectsty won't work except from within report.cls
%% itself. Must save \@chapterfont because the front environments (abstract
%% etc) also need to use it.
\RequirePackage{sectsty,caption}
\ifthenelse{\boolean{sansheadings}}
{\allsectionsfont{\sffamily}
\renewcommand{\@chapterfont}{\sffamily}
\renewcommand{\captionfont}{\sffamily}
\renewcommand{\headfootstyle}{\normalsize\sffamily}}
{}
%% To make sure we get chapters aligned correctly, we set it here instead.
\chapterfont{\@chapterfont\@chapteralignment}
%%
%% EOF: infthesis.cls
%%

623
tex/draft/mybibfile.bib Normal file
View file

@ -0,0 +1,623 @@
@article{Aguilar_Leiss.Coherence-Replacement.2006,
title = {A Coherence-Replacement Protocol For Web Proxy Cache Systems},
author = {J. Aguilar and E.L. Leiss},
year = 2006,
journal = {International Journal of Computers and Applications},
publisher = {Taylor \& Francis},
volume = 28,
number = 1,
pages = {12--18},
doi = {10.1080/1206212X.2006.11441783},
url = {https://doi.org/10.1080/1206212X.2006.11441783},
eprint = {https://doi.org/10.1080/1206212X.2006.11441783}
}
@article{Amza_etal.Treadmarks.1996,
title = {Treadmarks: Shared memory computing on networks of workstations},
author = {Amza, Cristiana and Cox, Alan L and Dwarkadas, Sandhya and Keleher, Pete and Lu, Honghui and Rajamony, Ramakrishnan and Yu, Weimin and Zwaenepoel, Willy},
journal = {Computer},
volume = {29},
number = {2},
pages = {18--28},
year = {1996},
publisher = {IEEE}
}
@misc{ARM.ARMv8-A.v1.0.2015,
title = {ARM® Cortex®-A Series Programmer's Guide for ARMv8-A},
url = {https://developer.arm.com/documentation/den0024/a},
journal = {Documentation - arm developer},
publisher = {ARM},
author = {ARM},
year = {2015}
}
@book{AST_Steen.Distributed_Systems-3ed.2017,
title = {Distributed systems},
author = {Van Steen, Maarten and Tanenbaum, Andrew S},
year = {2017},
publisher = {Maarten van Steen Leiden, The Netherlands}
}
@article{Bell_Gray.HPC_is_Cluster.2002,
title = {What's next in high-performance computing?},
author = {Bell, Gordon and Gray, Jim},
journal = {Communications of the ACM},
volume = {45},
number = {2},
pages = {91--95},
year = {2002},
publisher = {ACM New York, NY, USA}
}
@book{BOOK.Hennessy_Patterson.CArch.2011,
title = {Computer architecture: a quantitative approach},
author = {Hennessy, John L and Patterson, David A},
year = 2011,
publisher = {Elsevier}
}
@inproceedings{Cabezas_etal.GPU-SM.2015,
title = {GPU-SM: shared memory multi-GPU programming},
author = {Cabezas, Javier and Jord{\`a}, Marc and Gelado, Isaac and Navarro, Nacho and Hwu, Wen-mei},
year = 2015,
booktitle = {Proceedings of the 8th Workshop on General Purpose Processing using GPUs},
pages = {13--24}
}
@article{Cai_etal.Distributed_Memory_RDMA_Cached.2018,
title = {Efficient distributed memory management with RDMA and caching},
author = {Cai, Qingchao and Guo, Wentian and Zhang, Hao and Agrawal, Divyakant and Chen, Gang and Ooi, Beng Chin and Tan, Kian-Lee and Teo, Yong Meng and Wang, Sheng},
journal = {Proceedings of the VLDB Endowment},
volume = {11},
number = {11},
pages = {1604--1617},
year = {2018},
publisher = {VLDB Endowment}
}
@article{Carter_Bennett_Zwaenepoel.Munin.1991,
title = {Implementation and performance of Munin},
author = {Carter, John B and Bennett, John K and Zwaenepoel, Willy},
journal = {ACM SIGOPS Operating Systems Review},
volume = {25},
number = {5},
pages = {152--164},
year = {1991},
publisher = {ACM New York, NY, USA}
}
@inproceedings{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991,
author = {Chaiken, David and Kubiatowicz, John and Agarwal, Anant},
title = {LimitLESS directories: A scalable cache coherence scheme},
year = {1991},
isbn = {0897913809},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/106972.106995},
doi = {10.1145/106972.106995},
booktitle = {Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {224--234},
numpages = {11},
location = {Santa Clara, California, USA},
series = {ASPLOS IV}
}
@misc{Corbet.LWN-NC-DMA.2021,
url = {https://lwn.net/Articles/855328/},
journal = {Noncoherent DMA mappings},
publisher = {LWN.net},
author = {Corbet, Jonathan},
year = {2021}
}
@inproceedings{Couceiro_etal.D2STM.2009,
title = {D2STM: Dependable distributed software transactional memory},
author = {Couceiro, Maria and Romano, Paolo and Carvalho, Nuno and Rodrigues, Lu{\'\i}s},
booktitle = {2009 15th IEEE Pacific Rim International Symposium on Dependable Computing},
pages = {307--313},
year = {2009},
organization = {IEEE}
}
@article{De_Wael_etal.PGAS_Survey.2015,
title = {Partitioned global address space languages},
author = {De Wael, Mattias and Marr, Stefan and De Fraine, Bruno and Van Cutsem, Tom and De Meuter, Wolfgang},
journal = {ACM Computing Surveys (CSUR)},
volume = {47},
number = {4},
pages = {1--27},
year = {2015},
publisher = {ACM New York, NY, USA}
}
@inproceedings{Ding.vDSM.2018,
author = {Ding, Zhuocheng},
booktitle = {2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS)},
title = {vDSM: Distributed Shared Memory in Virtualized Environments},
year = {2018},
volume = {},
number = {},
pages = {1112-1115},
keywords = {Virtual machine monitors;Optimization;Protocols;Virtualization;Operating systems;Stress;Analytical models;component;distributed shared memory;virtualization;low-latency network},
doi = {10.1109/ICSESS.2018.8663720}
}
@inproceedings{Eisley_Peh_Shang.In-net-coherence.2006,
title = {In-network cache coherence},
author = {Eisley, Noel and Peh, Li-Shiuan and Shang, Li},
booktitle = {2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)},
pages = {321--332},
year = {2006},
organization = {IEEE}
}
@inproceedings{Endo_Sato_Taura.MENPS_DSM.2020,
title = {MENPS: a decentralized distributed shared memory exploiting RDMA},
author = {Endo, Wataru and Sato, Shigeyuki and Taura, Kenjiro},
booktitle = {2020 IEEE/ACM Fourth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM)},
pages = {9--16},
year = {2020},
organization = {IEEE}
}
@article{Fleisch_Popek.Mirage.1989,
title = {Mirage: A coherent distributed shared memory design},
author = {Fleisch, Brett and Popek, Gerald},
journal = {ACM SIGOPS Operating Systems Review},
volume = {23},
number = {5},
pages = {211--223},
year = {1989},
publisher = {ACM New York, NY, USA}
}
@misc{FreeBSD.man-BPF-4.2021,
title = {FreeBSD manual pages},
url = {https://man.freebsd.org/cgi/man.cgi?query=bpf&manpath=FreeBSD+14.0-RELEASE+and+Ports},
journal = {BPF(4) Kernel Interfaces Manual},
publisher = {The FreeBSD Project},
author = {The FreeBSD Project},
year = {2021}
}
@inproceedings{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018,
title = {NoC-based support of heterogeneous cache-coherence models for accelerators},
author = {Giri, Davide and Mantovani, Paolo and Carloni, Luca P},
booktitle = {2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS)},
pages = {1--8},
year = {2018},
organization = {IEEE}
}
@book{Holsapple.DSM64.2012,
title = {DSM64: A Distributed Shared Memory System in User-Space},
author = {Holsapple, Stephen Alan},
year = {2012},
publisher = {California Polytechnic State University}
}
@article{Hong_etal.NUMA-to-RDMA-DSM.2019,
title = {Scaling out NUMA-aware applications with RDMA-based distributed shared memory},
author = {Hong, Yang and Zheng, Yang and Yang, Fan and Zang, Bin-Yu and Guan, Hai-Bing and Chen, Hai-Bo},
journal = {Journal of Computer Science and Technology},
volume = {34},
pages = {94--112},
year = {2019},
publisher = {Springer}
}
@inproceedings{Hu_Shi_Tang.JIAJIA.1999,
title = {JIAJIA: A software DSM system based on a new cache coherence protocol},
author = {Hu, Weiwu and Shi, Weisong and Tang, Zhimin},
booktitle = {High-Performance Computing and Networking: 7th International Conference, HPCN Europe 1999 Amsterdam, The Netherlands, April 12--14, 1999 Proceedings 7},
pages = {461--472},
year = {1999},
organization = {Springer}
}
@misc{ISO/IEC_9899:2011.C11,
abstract = {Edition Status: Withdrawn on 2018-07-13},
isbn = {9780580801655},
keywords = {Data processing ; Data representation ; Languages used in information technology ; Programming ; Programming languages ; Semantics ; Syntax},
language = {eng},
publisher = {British Standards Institute},
title = {BS ISO/IEC 9899:2011: Information technology. Programming languages. C},
year = {2013}
}
@misc{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007,
title = {C++ Atomic Types and Operations},
url = {https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html},
journal = {C++ atomic types and operations},
publisher = {ISO/IEC JTC 1},
author = {Boehm, Hans J and Crowl, Lawrence},
year = {2007}
}
@article{Itzkovitz_Schuster_Shalev.Millipede.1998,
title = {Thread migration and its applications in distributed shared memory systems},
author = {Itzkovitz, Ayal and Schuster, Assaf and Shalev, Lea},
journal = {Journal of Systems and Software},
volume = {42},
number = {1},
pages = {71--87},
year = {1998},
publisher = {Elsevier}
}
@article{Jaleel_etal.RRIP.2010,
title = {High performance cache replacement using re-reference interval prediction (RRIP)},
author = {Jaleel, Aamer and Theobald, Kevin B and Steely Jr, Simon C and Emer, Joel},
year = 2010,
journal = {ACM SIGARCH computer architecture news},
publisher = {ACM New York, NY, USA},
volume = 38,
number = 3,
pages = {60--71}
}
@article{Jia_etal.Tensorflow_over_RDMA.2018,
title = {Improving the performance of distributed tensorflow with RDMA},
author = {Jia, Chengfan and Liu, Junnan and Jin, Xu and Lin, Han and An, Hong and Han, Wenting and Wu, Zheng and Chi, Mengxian},
journal = {International Journal of Parallel Programming},
volume = {46},
pages = {674--685},
year = {2018},
publisher = {Springer}
}
@inproceedings{Kaxiras_etal.DSM-Argos.2015,
author = {Kaxiras, Stefanos and Klaftenegger, David and Norgren, Magnus and Ros, Alberto and Sagonas, Konstantinos},
title = {Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory},
year = {2015},
isbn = {9781450335508},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2749246.2749250},
doi = {10.1145/2749246.2749250},
abstract = {A coherent global address space in a distributed system enables shared memory programming in a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical section execution, we propose a DSM system with a novel coherence protocol, and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.},
booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing},
pages = {3-14},
numpages = {12},
location = {Portland, Oregon, USA},
series = {HPDC '15}
}
@inproceedings{Khawaja_etal.AmorphOS.2018,
title = {Sharing, Protection, and Compatibility for Reconfigurable Fabric with {AmorphOS}},
author = {Khawaja, Ahmed and Landgraf, Joshua and Prakash, Rohith and Wei, Michael and Schkufza, Eric and Rossbach, Christopher J},
booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
pages = {107--127},
year = {2018}
}
@article{Khokhar_etal.HetComputingVision.1993,
title = {Heterogeneous computing: Challenges and opportunities},
author = {Khokhar, Ashfaq A. and Prasanna, Viktor K. and Shaaban, Muhammad E. and Wang, C-L},
year = 1993,
journal = {Computer},
publisher = {IEEE},
volume = 26,
number = 6,
pages = {18--27}
}
@inproceedings{Kim_etal.DeX-upon-Linux.2020,
author = {Kim, Sang-Hoon and Chuang, Ho-Ren and Lyerly, Robert and Olivier, Pierre and Min, Changwoo and Ravindran, Binoy},
booktitle = {2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)},
title = {DeX: Scaling Applications Beyond Machine Boundaries},
year = {2020},
volume = {},
number = {},
pages = {864-876},
keywords = {Protocols;Instruction sets;Linux;Prototypes;Distributed databases;Programming;Kernel;Thread migration;distributed execution;distributed memory;RDMA},
doi = {10.1109/ICDCS47774.2020.00021}
}
@misc{Kjos_etal.HP-HW-CC-IO.1996,
copyright = {Copyright 2006 Elsevier B.V., All rights reserved.},
issn = {0018-1153},
journal = {Hewlett-Packard journal},
keywords = {Computer Science ; Computer Science, Hardware & Architecture ; Engineering ; Engineering, Electrical & Electronic ; Instruments & Instrumentation ; Science & Technology ; Technology},
language = {eng},
number = {1},
pages = {52-59},
publisher = {Hewlett-Packard Co},
abstract = {Hardware cache coherent I/O is a new feature of the PA-RISC architecture that involves the I/O hardware in ensuring cache coherence, thereby reducing CPU and memory overhead and increasing performance.},
author = {Kjos, Toddj and Nusbaum, Helen and Traynor, Michaelk and Voge, Brendana},
address = {PALO ALTO},
title = {Hardware cache coherent input/output},
volume = {47},
year = {1996}
}
@article{LaRowe_Ellis.Repl_NUMA.1991,
title = {Page placement policies for NUMA multiprocessors},
author = {Richard P. LaRowe and Carla Schlatter Ellis},
year = 1991,
journal = {Journal of Parallel and Distributed Computing},
volume = 11,
number = 2,
pages = {112--129},
doi = {https://doi.org/10.1016/0743-7315(91)90117-R},
issn = {0743-7315},
url = {https://www.sciencedirect.com/science/article/pii/074373159190117R},
abstract = {In many parallel applications, the size of the program's data exceeds even the very large amount of main memory available on large-scale multiprocessors. Virtual memory, in the sense of a transparent management of the main/secondary memory hierarchy, is a natural solution. The replacement, fetch, and placement policies used in uniprocessor paging systems need to be reexamined in light of the differences in the behavior of parallel computations and in the memory architectures of multiprocessors. In particular, we investigate the impact of page placement in nonuniform memory access time (NUMA) shared memory MIMD machines. We experimentally evaluate several paging algorithms that incorporate different approaches to the placement issue. Under certain workload assumptions, our results show that placement algorithms that are strongly biased toward local frame allocation but are able to borrow remote frames can reduce the number of page faults over strictly local allocation. The increased cost of memory operations due to the extra remote accesses is more than compensated for by the savings resulting from the reduction in demand fetches, effectively reducing the computation completion time for these programs without having adverse effects on the performance of “typical” NUMA programs. We also discuss some early results obtained from an actual kernel implementation of one of our page placement algorithms.}
}
@article{Lenoski_etal.Stanford_DASH.1992,
title = {The stanford dash multiprocessor},
author = {Lenoski, Daniel and Laudon, James and Gharachorloo, Kourosh and Weber, W-D and Gupta, Anoop and Hennessy, John and Horowitz, Mark and Lam, Monica S.},
journal = {Computer},
volume = {25},
number = {3},
pages = {63--79},
year = {1992},
publisher = {IEEE}
}
@inproceedings{Li_etal.RelDB_RDMA.2016,
title = {Accelerating relational databases by leveraging remote memory and RDMA},
author = {Li, Feng and Das, Sudipto and Syamala, Manoj and Narasayya, Vivek R},
booktitle = {Proceedings of the 2016 International Conference on Management of Data},
pages = {355--370},
year = {2016}
}
@inproceedings{Lu_etal.MPI_vs_DSM_over_cluster.1995,
title = {Message passing versus distributed shared memory on networks of workstations},
author = {Lu, Honghui and Dwarkadas, Sandhya and Cox, Alan L and Zwaenepoel, Willy},
booktitle = {Supercomputing'95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing},
pages = {37--37},
year = {1995},
organization = {IEEE}
}
@inproceedings{Lu_etal.Spark_over_RDMA.2014,
title = {Accelerating spark with RDMA for big data processing: Early experiences},
author = {Lu, Xiaoyi and Rahman, Md Wasi Ur and Islam, Nusrat and Shankar, Dipti and Panda, Dhabaleswar K},
booktitle = {2014 IEEE 22nd Annual Symposium on High-Performance Interconnects},
pages = {9--16},
year = {2014},
organization = {IEEE}
}
@inproceedings{Ma_etal.SHM_FPGA.2020,
title = {A hypervisor for shared-memory FPGA platforms},
author = {Ma, Jiacheng and Zuo, Gefei and Loughlin, Kevin and Cheng, Xiaohe and Liu, Yanqiang and Eneyew, Abel Mulugeta and Qi, Zhengwei and Kasikci, Baris},
booktitle = {Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {827--844},
year = {2020}
}
@misc{Manson_Goetz.JSR_133.Java_5.2004,
url = {https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html},
journal = {JSR 133 (Java Memory Model) FAQ},
publisher = {Department of Computer Science, University of Maryland},
author = {Manson, Jeremy and Goetz, Brian},
year = {2004}
}
@misc{many.MSFTLearn-SMBDirect.2024,
title = {SMB Direct},
url = {https://learn.microsoft.com/en-us/windows-server/storage/file-server/smb-direct},
journal = {Microsoft Learn},
publisher = {Microsoft},
author = {Xelu86 and ManikaDhiman and dknappettmsft and v-alje and nedpyle and eross-msft and SubodhBhargava and JasonGerend and lizap and Heidilohr},
year = {2024}
}
@inproceedings{Masouros_etal.Adrias.2023,
title = {Adrias: Interference-Aware Memory Orchestration for Disaggregated Cloud Infrastructures},
author = {Masouros, Dimosthenis and Pinto, Christian and Gazzetti, Michele and Xydis, Sotirios and Soudris, Dimitrios},
year = 2023,
booktitle = {2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)},
pages = {855--869},
organization = {IEEE}
}
@misc{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024,
title = {Dynamic DMA mapping Guide},
url = {https://www.kernel.org/doc/html/v6.7/core-api/dma-api-howto.html},
journal = {The Linux Kernel},
author = {Miller, David S and Henderson, Richard and Jelinek, Jakub},
year = {2024}
}
@book{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020,
title = {A primer on memory consistency and cache coherence},
author = {Nagarajan, Vijay and Sorin, Daniel J and Hill, Mark D and Wood, David A},
year = {2020},
publisher = {Springer Nature}
}
@inproceedings{narayanan2020heterogeneity,
title = {{Heterogeneity-Aware} cluster scheduling policies for deep learning workloads},
author = {Narayanan, Deepak and Santhanam, Keshav and Kazhamiaka, Fiodar and Phanishayee, Amar and Zaharia, Matei},
year = 2020,
booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
pages = {481--498}
}
@inproceedings{Nelson_etal.Grappa_DSM.2015,
title = {{Latency-Tolerant} software distributed shared memory},
author = {Nelson, Jacob and Holt, Brandon and Myers, Brandon and Briggs, Preston and Ceze, Luis and Kahan, Simon and Oskin, Mark},
booktitle = {2015 USENIX Annual Technical Conference (USENIX ATC 15)},
pages = {291--305},
year = {2015}
}
@inproceedings{Oh_Kim.Container_Migration.2018,
title = {Stateful Container Migration employing Checkpoint-based Restoration for Orchestrated Container Clusters},
author = {Oh, SeungYong and Kim, JongWon},
year = 2018,
booktitle = {2018 International Conference on Information and Communication Technology Convergence (ICTC)},
volume = {},
number = {},
pages = {25--30},
doi = {10.1109/ICTC.2018.8539562}
}
@misc{Parris.AMBA_4_ACE-Lite.2013,
title = {Extended system coherency: Cache Coherency Fundamentals},
url = {https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/extended-system-coherency---part-1---cache-coherency-fundamentals},
journal = {Extended System Coherency: Cache Coherency Fundamentals - Architectures and Processors blog - Arm Community blogs - Arm Community},
publisher = {ARM Community Blogs},
author = {Parris, Neil},
year = {2013}
}
@inproceedings{Pinto_etal.Thymesisflow.2020,
title = {Thymesisflow: A software-defined, hw/sw co-designed interconnect stack for rack-scale memory disaggregation},
author = {Pinto, Christian and Syrivelis, Dimitris and Gazzetti, Michele and Koutsovasilis, Panos and Reale, Andrea and Katrinis, Kostas and Hofstee, H Peter},
booktitle = {2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
pages = {868--880},
year = {2020},
organization = {IEEE}
}
@article{Rodriguez_etal.HPC_Cluster_Migration.2019,
title = {Job migration in hpc clusters by means of checkpoint/restart},
author = {Rodr{\'\i}guez-Pascual, Manuel and Cao, Jiajun and Mor{\'\i}{\~n}igo, Jos{\'e} A and Cooperman, Gene and Mayo-Garc{\'\i}a, Rafael},
year = 2019,
journal = {The Journal of Supercomputing},
publisher = {Springer},
volume = 75,
pages = {6517--6541}
}
@misc{Rust.core::sync::atomic::Ordering.2024,
title = {Ordering in core::sync::atomic - Rust},
url = {https://doc.rust-lang.org/core/sync/atomic/enum.Ordering.html},
journal = {The Rust Core Library},
publisher = {the Rust Team},
year = {2024}
}
@article{Schaefer_Li.Shiva.1989,
title = {Shiva: An operating system transforming a hypercube into a shared-memory machine},
author = {Li, Kai and Schaefer, Richard},
year = {1989}
}
@inproceedings{Schoinas_etal.Sirocco.1998,
title = {Sirocco: Cost-effective fine-grain distributed shared memory},
author = {Schoinas, Ioannis and Falsafi, Babak and Hill, Mark D and Larus, James R and Wood, David A},
booktitle = {Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. 98EX192)},
pages = {40--49},
year = {1998},
organization = {IEEE}
}
@inproceedings{Shan_Tsai_Zhang.DSPM.2017,
title = {Distributed Shared Persistent Memory},
author = {Shan, Yizhou and Tsai, Shin-Yeh and Zhang, Yiying},
year = 2017,
booktitle = {Proceedings of the 2017 Symposium on Cloud Computing},
location = {Santa Clara, California},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
series = {SoCC '17},
pages = {323--337},
doi = {10.1145/3127479.3128610},
isbn = 9781450350280,
url = {https://doi.org/10.1145/3127479.3128610},
abstract = {Next-generation non-volatile memories (NVMs) will provide byte addressability, persistence, high density, and DRAM-like performance. They have the potential to benefit many datacenter applications. However, most previous research on NVMs has focused on using them in a single machine environment. It is still unclear how to best utilize them in distributed, datacenter environments.We introduce Distributed Shared Persistent Memory (DSPM), a new framework for using persistent memories in distributed data-center environments. DSPM provides a new abstraction that allows applications to both perform traditional memory load and store instructions and to name, share, and persist their data.We built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability. The key ideas of Hotpot are to integrate distributed memory caching and data replication techniques and to exploit application hints. We implemented Hotpot in the Linux kernel and demonstrated its benefits by building a distributed graph engine on Hotpot and porting a NoSQL database to Hotpot. Our evaluation shows that Hotpot outperforms a recent distributed shared memory system by 1.3\texttimes{} to 3.2\texttimes{} and a recent distributed PM-based file system by 1.5\texttimes{} to 3.0\texttimes{}.},
numpages = 15,
keywords = {distributed shared memory, persistent memory}
}
@misc{Ven.LKML_x86_DMA.2008,
title = {Background on ioremap, cacheing, cache coherency on x86},
url = {https://lkml.org/lkml/2008/4/29/480},
journal = {lkml.org},
author = {Ven, Arjan van de},
year = {2008}
}
@inproceedings{Wang_etal.Concordia.2021,
author = {Qing Wang and Youyou Lu and Erci Xu and Junru Li and Youmin Chen and Jiwu Shu},
title = {Concordia: Distributed Shared Memory with {In-Network} Cache Coherence},
booktitle = {19th USENIX Conference on File and Storage Technologies (FAST 21)},
year = {2021},
isbn = {978-1-939133-20-5},
pages = {277--292},
url = {https://www.usenix.org/conference/fast21/presentation/wang},
publisher = {USENIX Association},
month = feb
}
@misc{WEB.Ampere..Ampere_Altra_Datasheet.2023,
url = {https://uawartifacts.blob.core.windows.net/upload-files/Altra_Max_Rev_A1_DS_v1_15_20230809_b7cdce449e_424d129849.pdf},
journal = {Ampere Altra Max Rev A1 64-Bit Multi-Core Processor Datasheet},
publisher = {Ampere Computing}
}
@misc{WEB.APACHE..Apache_Hadoop.2023,
url = {https://hadoop.apache.org/},
journal = {Apache Hadoop},
publisher = {The APACHE Software Foundation}
}
@misc{WEB.APACHE..Apache_Spark.2023,
url = {https://spark.apache.org/},
journal = {Apache SparkTM - Unified Engine for large-scale data analytics},
publisher = {The APACHE Software Foundation}
}
@misc{WEB.HPE.Chapel_Platforms-v1.33.2023,
title = {Platform-Specific Notes},
url = {https://chapel-lang.org/docs/platforms/index.html#},
journal = {Chapel Documentation 1.33},
publisher = {Hewlett Packard Enterprise Development LP.},
year = {2023}
}
@misc{WEB.LBNL.UPC_man_1_upcc.2022,
title = {upcc.1},
url = {https://upc.lbl.gov/docs/user/upcc.html},
journal = {Manual Reference Pages - UPCC (1)},
publisher = {Lawrence Berkeley National Laboratory},
year = {2022}
}
@misc{WEB.LWN.Corbet.HMM_GPL_woes.2018,
title = {Heterogeneous memory management meets EXPORT\_SYMBOL\_GPL()},
author = {Corbet, Jonathan},
year = 2018,
journal = {LWN.net},
publisher = {LWN.net},
url = {https://lwn.net/Articles/757124/}
}
@misc{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017,
title = {Unified memory for cuda beginners},
author = {Harris, Mark},
year = 2017,
journal = {Unified Memory for CUDA Beginners},
publisher = {NVIDIA},
url = {https://developer.nvidia.com/blog/unified-memory-cuda-beginners/}
}
@misc{WEB.Phoronix..HMM_Search_Results.2023,
journal = {Heterogeneous Memory Management - Phoronix},
publisher = {Phoronix},
url = {https://www.phoronix.com/search/Heterogeneous%20Memory%20Management}
}
@inproceedings{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003,
title = {A Performance Comparison of {DSM}, {PVM}, and {MPI}},
author = {Werstein, Paul and Pethick, Mark and Huang, Zhiyi},
booktitle = {Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies},
pages = {476--482},
year = {2003},
organization = {IEEE}
}
@inproceedings{Yang_etal.FIFO-LPQD.2023,
title = {FIFO can be Better than LRU: the Power of Lazy Promotion and Quick Demotion},
author = {Yang, Juncheng and Qiu, Ziyue and Zhang, Yazhuo and Yue, Yao and Rashmi, KV},
year = 2023,
booktitle = {Proceedings of the 19th Workshop on Hot Topics in Operating Systems},
pages = {70--79}
}
@inproceedings{Zaharia_etal.RDD.2012,
author = {Matei Zaharia and Mosharaf Chowdhury and Tathagata Das and Ankur Dave and Justin Ma and Murphy McCauly and Michael J. Franklin and Scott Shenker and Ion Stoica},
title = {Resilient Distributed Datasets: A {Fault-Tolerant} Abstraction for {In-Memory} Cluster Computing},
booktitle = {9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)},
year = {2012},
isbn = {978-931971-92-8},
address = {San Jose, CA},
pages = {15--28},
url = {https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia},
publisher = {USENIX Association},
month = apr
}
@inproceedings{Zhang_etal.GiantVM.2020,
title = {{GiantVM}: A Type-II Hypervisor Implementing Many-to-One Virtualization},
author = {Zhang, Jin and Ding, Zhuocheng and Chen, Yubin and Jia, Xingguo and Yu, Boshi and Qi, Zhengwei and Guan, Haibing},
booktitle = {Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
pages = {30--44},
year = {2020}
}
@inproceedings{Zhou_etal.DART-MPI.2014,
title = {DART-MPI: An MPI-based implementation of a PGAS runtime system},
author = {Zhou, Huan and Mhedheb, Yousri and Idrees, Kamran and Glass, Colin W and Gracia, Jos{\'e} and F{\"u}rlinger, Karl},
booktitle = {Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models},
pages = {1--11},
year = {2014}
}

BIN
tex/draft/skeleton.pdf Normal file

Binary file not shown.

576
tex/draft/skeleton.tex Normal file
View file

@ -0,0 +1,576 @@
% UG project example file, February 2022
% A minor change in citation, September 2023 [HS]
% Do not change the first two lines of code, except you may delete "logo," if causing problems.
% Understand any problems and seek approval before assuming it's ok to remove ugcheck.
\documentclass[logo,bsc,singlespacing,parskip]{infthesis}
\usepackage{ugcheck}
% Include any packages you need below, but don't include any that change the page
% layout or style of the dissertation. By including the ugcheck package above,
% you should catch most accidental changes of page layout though.
\usepackage{microtype} % recommended, but you can remove if it causes problems
% \usepackage{natbib} % recommended for citations
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{hyperref}
\usepackage[justification=centering]{caption}
\usepackage{graphicx}
\usepackage[english]{babel}
% -> biblatex
\usepackage{biblatex} % full of mischief
\addbibresource{mybibfile.bib}
% <- biblatex
% -> nice definition listings
\usepackage{csquotes}
\usepackage{amsthm}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
% <- definition
% -> code listing
% [!] Requires external program: pypi:pygment
\usepackage{minted}
\usemintedstyle{vs}
% <- code listing
\begin{document}
\begin{preliminary}
\title{Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}
\author{Zhengyi Chen}
% CHOOSE YOUR DEGREE a):
% please leave just one of the following un-commented
% \course{Artificial Intelligence}
%\course{Artificial Intelligence and Computer Science}
%\course{Artificial Intelligence and Mathematics}
%\course{Artificial Intelligence and Software Engineering}
%\course{Cognitive Science}
\course{Computer Science}
%\course{Computer Science and Management Science}
%\course{Computer Science and Mathematics}
%\course{Computer Science and Physics}
%\course{Software Engineering}
%\course{Master of Informatics} % MInf students
% CHOOSE YOUR DEGREE b):
% please leave just one of the following un-commented
%\project{MInf Project (Part 1) Report} % 4th year MInf students
%\project{MInf Project (Part 2) Report} % 5th year MInf students
\project{4th Year Project Report} % all other UG4 students
\date{\today}
\abstract{
This skeleton demonstrates how to use the \texttt{infthesis} style for
undergraduate dissertations in the School of Informatics. It also emphasises the
page limit, and that you must not deviate from the required style.
The file \texttt{skeleton.tex} generates this document and should be used as a
starting point for your thesis. Replace this abstract text with a concise
summary of your report.
}
\maketitle
\newenvironment{ethics}
{\begin{frontenv}{Research Ethics Approval}{\LARGE}}
{\end{frontenv}\newpage}
\begin{ethics}
% \textbf{Instructions:} \emph{Agree with your supervisor which
% statement you need to include. Then delete the statement that you are not using,
% and the instructions in italics.\\
% \textbf{Either complete and include this statement:}}\\ % DELETE THESE INSTRUCTIONS
% %
% % IF ETHICS APPROVAL WAS REQUIRED:
% This project obtained approval from the Informatics Research Ethics committee.\\
% Ethics application number: ???\\
% Date when approval was obtained: YYYY-MM-DD\\
% %
% \emph{[If the project required human participants, edit as appropriate, otherwise delete:]}\\ % DELETE THIS LINE
% The participants' information sheet and a consent form are included in the appendix.\\
% %
% IF ETHICS APPROVAL WAS NOT REQUIRED:
% \textbf{\emph{Or include this statement:}}\\ % DELETE THIS LINE
This project was planned in accordance with the Informatics Research
Ethics policy. It did not involve any aspects that required approval
from the Informatics Research Ethics committee.
\standarddeclaration
\end{ethics}
\begin{acknowledgements}
Any acknowledgements go here\dots
\end{acknowledgements}
\tableofcontents
\end{preliminary}
\chapter{Introduction}
Though large-scale cluster systems remain the dominant solution for request and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would in SMP programming; the underlying complexities of memory ownership and data placement are managed automatically by the OS kernel. However, while HMM promises a distributed-shared-memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little work has been done on incorporating HMM into other kinds of accelerators across different system topologies.
Orthogonally, allocating hardware accelerator resources in a cluster computing environment becomes difficult when the accelerator resources required by a workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. Migrating this way naturally incurs a large overhead -- accelerator nodes which strictly perform computation on data in memory, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally, for starters. Moreover, must \emph{all} computation be performed near data? \textit{Adrias}\cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, has negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.
This thesis builds upon an ongoing research effort to implement a tightly coupled cluster in which HMM abstractions allow transparent RDMA access from accelerator nodes to local data and the migration of data near computation, leveraging different consistency models and coherence protocols to amortize the communication cost of shared data. More specifically, this thesis explores the following:
\begin{itemize}
\item {
The effect of cache coherency maintenance, specifically OS-initiated, on RDMA programs.
}
\item {
Discussion of memory models and coherence protocol designs for a single-writer, multi-reader RDMA-based DSM system.
}
\end{itemize}
The rest of the chapter is structured as follows:
\begin{itemize}
\item {
We identify and discuss notable developments in software-implemented DSM systems, and thus identify key features of contemporary advancements in DSM techniques that differentiate them from their predecessors.
}
\item {
We identify alternative (shared memory) programming paradigms and compare them with DSM, which sought to provide transparent shared address space among participating nodes.
}
\item {
We give an overview of coherency protocol and consistency models for multi-sharer DSM systems.
}
\item {
We provide a primer to cache coherency in ARM64 systems, which \emph{do not} guarantee cache-coherent DMA, as opposed to x86 systems \cite{Ven.LKML_x86_DMA.2008}.
}
\end{itemize}
\section{Experiences from Software DSM}
A majority of contributions to software DSM systems come from the 1990s \cites{Amza_etal.Treadmarks.1996}{Carter_Bennett_Zwaenepoel.Munin.1991}{Itzkovitz_Schuster_Shalev.Millipede.1998}{Hu_Shi_Tang.JIAJIA.1999}. These developments follow from the success of the Stanford DASH project in the late 1980s -- a hardware distributed shared memory (specifically NUMA) implementation of a multiprocessor that first proposed the \textit{directory-based protocol} for cache coherence, which stores the ownership information of cache lines to reduce unnecessary communication that prevented previous multiprocessors from scaling out \cite{Lenoski_etal.Stanford_DASH.1992}.
While developments in hardware DSM materialized into a universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra}\cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in clustered computing languished in favor of loosely coupled nodes performing data-parallel computation and communicating via message passing. The bandwidth of late-1990s network interfaces was insufficient to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.
New developments in network interfaces provide much-improved bandwidth and latency compared to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC, scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed for \textit{APACHE Spark} \cite{Lu_etal.Spark_over_RDMA.2014} and \textit{SMBDirect} \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a resurgence of interest in software DSM systems and programming models \cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.
\subsection{Munin: Multi-Consistency Protocol}
\textit{Munin}\cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the older developments in software DSM systems. The authors of Munin identify that \textit{false sharing} -- invalidations triggered by multiple processors writing to different offsets of the same page -- is strongly detrimental to the performance of shared-memory systems. To combat this, Munin exposes annotations as part of its programming model to facilitate multiple consistency protocols on top of release consistency. A shared memory object that is immutable across readers, for example, can be safely copied without concern for coherence between processors. On the other hand, the \textit{write-shared} annotation states that a memory object is written by multiple processors without synchronization -- i.e., the programmer guarantees that only false sharing occurs within this granularity. Annotations such as these explicitly disable subsets of the consistency procedures to reduce communication in the network fabric, thereby improving the performance of the DSM system.
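For illustration, access-pattern annotations of this kind could surface to the programmer roughly as in the sketch below. The enumeration, function names, and the \texttt{malloc}-backed stub are hypothetical and do not reproduce Munin's actual interface; the sketch only shows how an annotation travels with an allocation.
\begin{minted}[linenos]{c}
/* Hypothetical sketch of Munin-style consistency annotations; none of
 * these names correspond to Munin's real API. */
#include <stddef.h>
#include <stdlib.h>

enum dsm_annotation {
    DSM_READ_ONLY,    /* immutable after init: replicas need no coherence    */
    DSM_WRITE_SHARED, /* writers touch disjoint offsets (false sharing only) */
    DSM_CONVENTIONAL  /* default release-consistent protocol                 */
};

/* Stand-in allocator: a real runtime would record the annotation and
 * register the object with the DSM layer; here we only allocate locally. */
static void *dsm_shared_alloc(size_t size, enum dsm_annotation kind)
{
    (void)kind;
    return malloc(size);
}

void annotate_example(void)
{
    /* Written once, then only read: may be replicated without coherence. */
    double *table = dsm_shared_alloc(4096 * sizeof(double), DSM_READ_ONLY);
    /* Per-worker counters packed into one page: invalidations caused
     * purely by false sharing can be suppressed. */
    long *counters = dsm_shared_alloc(64 * sizeof(long), DSM_WRITE_SHARED);
    free(counters);
    free(table);
}
\end{minted}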
Perhaps most importantly, experiences from Munin show that \emph{restricting the flexibility of the programming model can lead to more performant coherence models}, as exhibited by the now-foundational \textit{Resilient Distributed Datasets} paper \cite{Zaharia_etal.RDD.2012}, which powers many now-popular scalable data-processing frameworks such as \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023} and \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023}. ``To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory [based on]\dots transformations rather than\dots updates to shared state'' \cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to cheaply synchronize states between unshared address spaces -- a much-desired property for highly scalable, loosely coupled clustered systems.
\subsection{Treadmarks: Multi-Writer Protocol}
\textit{Treadmarks}\cite{Amza_etal.Treadmarks.1996} is a software DSM system developed in 1996, featuring an intricate \textit{interval}-based multi-writer protocol that allows multiple nodes to write to the same page without suffering from false sharing. The system follows a release-consistent memory model, which requires the use of either locks (via \texttt{acquire}, \texttt{release}) or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval} represents a time period in-between page creation, a \texttt{release} to another processor, or a \texttt{barrier}; each interval also corresponds to a \textit{write notice}, which is used for page invalidation. Each \texttt{acquire} message is sent to the statically assigned lock-manager node, which forwards the message to the last releaser. The last releaser computes the outstanding write notices and piggy-backs them on its reply, allowing the acquirer to invalidate its own cached page entries and thereby enter the critical section. Consistency information, including write notices, intervals, and page diffs, is routinely garbage-collected, which forces cached pages on each node to become validated.
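A minimal usage sketch of this release-consistent model follows. The \texttt{Tmk\_} primitive names mirror those used in the TreadMarks paper, but their signatures, the lock identifier, and the shared counter are assumptions made for illustration only.
\begin{minted}[linenos]{c}
/* Sketch of release-consistent sharing in the style of TreadMarks.
 * The Tmk_* declarations below are assumed, not a verified API. */
#define LOCK_COUNTER 0

extern void Tmk_lock_acquire(unsigned id);
extern void Tmk_lock_release(unsigned id);
extern void Tmk_barrier(unsigned id);

extern long *shared_counter;   /* assumed to live on the DSM heap */

void worker(void)
{
    /* acquire: write notices outstanding since the last release are
     * piggy-backed on the reply and applied by invalidating local pages. */
    Tmk_lock_acquire(LOCK_COUNTER);
    *shared_counter += 1;   /* first write after invalidation creates a twin/diff */
    Tmk_lock_release(LOCK_COUNTER);

    /* barrier: all processors exchange write notices for the epoch. */
    Tmk_barrier(0);
}
\end{minted}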
Compared to \textit{Treadmarks}, the system described in this paper uses a single-writer protocol, thus eliminating the concept of ``intervals'' -- with regard to synchronization, each page can be either in-sync (in which case it can be safely shared) or out-of-sync (in which case it must be invalidated/updated). This comes with the following advantages:
\begin{itemize}
\item Less metadata for consistency-keeping.
\item More adherent to the CPU-accelerator dichotomy model.
\item Much simpler coherence protocol, which reduces communication cost.
\end{itemize}
In view of the (still) substantial throughput and latency gap between local and remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, the simpler coherence protocol of a single-writer design should provide better performance on the critical paths of remote memory access.
\subsection{Hotpot: Single-Writer \& Data Replication}
Newer works such as \textit{Hotpot}\cite{Shan_Tsai_Zhang.DSPM.2017} apply distributed shared memory techniques to persistent memory to provide ``transparent memory accesses, data persistence, data reliability, and high availability''. Leveraging persistent memory devices allows DSM applications to bypass checkpointing to block-device storage \cite{Shan_Tsai_Zhang.DSPM.2017}, ensuring both distributed cache coherence and data reliability at the same time \cite{Shan_Tsai_Zhang.DSPM.2017}.
We specifically discuss the single-writer portion of its coherence protocol. The data reliability guarantees proposed by the \textit{Hotpot} system require each shared page to be replicated to some \textit{degree of replication}. Nodes that always store the latest replica of a shared page are referred to as ``owner nodes''; they arbitrate which other nodes store additional replicas in order to meet the degree-of-replication quota. At acquisition time, the acquiring node asks the access-management node for single-writer access to a shared page, which is granted if no other critical section exists, alongside the list of current owner nodes. At release time, the releaser first commits its changes to all owner nodes, each of which in turn commits the received changes across the remaining sharers to achieve the required degree of replication. Both operations are acknowledged back in reverse order. Once the releaser has received all acknowledgements from the owner nodes, it tells them to delete their commit logs and, finally, tells the manager node to exit the critical section.
The required degree of replication, together with commit logs that persist until explicit deletion, facilitates crash recovery at the expense of additional release-time I/O. While the study of crash recovery in shared memory systems is beyond the scope of this thesis, this paper provides a good framework for a \textbf{correct} coherence protocol for a single-writer, multiple-reader shared memory system, particularly when the protocol needs to cater to a great variety of nodes, each with its own memory preferences (e.g., write-update vs. write-invalidate, prefetching, etc.).
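The release path described above can be paraphrased by the following illustrative sketch. Every identifier in it is hypothetical; the sketch restates the protocol steps rather than Hotpot's implementation.
\begin{minted}[linenos]{c}
/* Illustrative-only sketch of the single-writer release path; all names
 * are hypothetical and do not correspond to Hotpot's code. */
struct page_meta {
    int manager;       /* node arbitrating critical-section entry */
    int owners[8];     /* owner nodes holding up-to-date replicas */
    int nr_owners;
};

/* Assumed messaging helpers provided by some RDMA transport layer. */
extern void send_commit(int node, const void *diff, unsigned len);
extern void wait_ack(int node);
extern void send_delete_log(int node);
extern void send_release(int manager);

void release_page(struct page_meta *m, const void *diff, unsigned len)
{
    int i;
    /* 1. Commit changes to every owner node; owners fan the data out to
     *    the remaining sharers until the replication degree is met. */
    for (i = 0; i < m->nr_owners; i++)
        send_commit(m->owners[i], diff, len);
    /* 2. Acknowledgements come back in reverse order. */
    for (i = m->nr_owners - 1; i >= 0; i--)
        wait_ack(m->owners[i]);
    /* 3. Owners may now discard their commit logs. */
    for (i = 0; i < m->nr_owners; i++)
        send_delete_log(m->owners[i]);
    /* 4. Finally, tell the manager node to exit the critical section. */
    send_release(m->manager);
}
\end{minted}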
\subsection{MENPS: A Return to DSM}
MENPS\cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable interconnects as a proof of concept that DSM systems and programming models can be as efficient as \textit{partitioned global address space} (PGAS) approaches on today's network interfaces. It builds upon \textit{TreadMarks}' \cite{Amza_etal.Treadmarks.1996} coherence protocol and crucially alters it into a \textit{floating home-based} protocol, based on the insight that diff transfers across the network are costly compared to RDMA intrinsics -- which implies a preference for local diff-merging. The home node then acts as the data supplier for every shared page within the system.
Compared to PGAS frameworks (e.g., MPI), experimentation over a subset of \textit{NAS Parallel Benchmarks} shows that MENPS can obtain comparable speedup in some of the computation tasks, while achieving much better productivity due to DSM's support for transparent caching, etc. \cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up their claim that DSM systems are at least as viable as traditional PGAS/message-passing frameworks for scientific computing, also corroborated by the resurgence of DSM studies later on\cite{Masouros_etal.Adrias.2023}.
\section{PGAS and Message Passing}
While the feasibility of transparent DSM systems spanning multiple machines on a network has been apparent since the 1980s, the predominant approaches to ``scaling out'' programs over the network rely on message passing \cite{AST_Steen.Distributed_Systems-3ed.2017}. The reasons are twofold:
\begin{enumerate}
\item {
Programmers would rather resort to more intricate, more predictable approaches to scaling-out programs over the network \cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies manual/controlled data sharding over nodes, separation of compute and communication ``stages'' of computation, etc., which benefit performance analysis and engineering.
}
\item {
Enterprise applications value throughput and uptime of relatively computationally inexpensive tasks/resources \cite{BOOK.Hennessy_Patterson.CArch.2011}, which requires easy scalability of tried-and-true, latency-inexpensive applications. Studies in transparent DSM systems mostly require exotic, purpose-written programs to exploit the global address space, which is fundamentally at odds with the reusability and flexibility required.
}
\end{enumerate}
\subsection{PGAS}
\textit{Partitioned Global Address Space} (PGAS) is a parallel programming model that (1) exposes a global address space to all machines within a network and (2) explicates distinction between local and remote memory \cite{De_Wael_etal.PGAS_Survey.2015}. Oftentimes, message-passing frameworks, for example \textit{OpenMPI}, \textit{OpenFabrics}, and \textit{UCX}, are used as backends to provide the PGAS model over various network interfaces/platforms (e.g., Ethernet and Infiniband)\cites{WEB.LBNL.UPC_man_1_upcc.2022} {WEB.HPE.Chapel_Platforms-v1.33.2023}.
Notably, implementing a \emph{global} address space on top of machines already equipped with their own \emph{local} address spaces (e.g., cluster nodes running commercial Linux) necessitates a global addressing mechanism for shared data objects. DART\cite{Zhou_etal.DART-MPI.2014}, for example, utilizes a 128-bit ``global pointer'' that encodes the global memory object/segment ID and access flags in the upper 64 bits and a virtual address in the lower 64 bits for each (slice of a) memory object allocated within the PGAS model. A \textit{non-collective} PGAS object is allocated entirely in the allocating node's local memory, but registered globally; consequently, a single global pointer is recorded in the runtime, with corresponding permission flags, for the context of some user-defined group of associated nodes. Comparatively, a \textit{collective} PGAS object is allocated such that a partition of the object (i.e., a sub-array of its representation) is stored on each of the associated nodes -- for a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each pointing to the same object with a different offset and an (intuitively) independently chosen virtual address. Note that this design naturally requires virtual addresses within each node to be \emph{pinned} -- the allocated object cannot be re-addressed to a different virtual address, which would otherwise spontaneously invalidate the global pointer that records the local virtual address.
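For concreteness, such a 128-bit global pointer could be laid out as in the sketch below; the exact field widths and names are assumptions for illustration and do not reproduce DART's actual encoding.
\begin{minted}[linenos]{c}
/* Sketch of a DART-style 128-bit global pointer; the split of the upper
 * 64 bits is illustrative rather than DART's real layout. */
#include <stdint.h>

typedef struct {
    uint32_t segid;    /* global memory segment / object ID           */
    uint16_t unitid;   /* node (unit) holding the addressed slice     */
    uint16_t flags;    /* access permissions for the addressing group */
    uint64_t vaddr;    /* pinned local virtual address on that node   */
} global_ptr_t;        /* 128 bits in total */

/* Resolve to a native pointer only when the slice is local. */
static inline void *gptr_deref_local(global_ptr_t gp, uint16_t my_unit)
{
    return (gp.unitid == my_unit) ? (void *)(uintptr_t)gp.vaddr : (void *)0;
}
\end{minted}
The \texttt{vaddr} field is exactly the quantity that must remain pinned for the recorded pointer to stay valid.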
Similar schemes can be observed in other PGAS backends/runtimes, though they may opt to use a map-like data structure for addressing instead. In general, although both PGAS and DSM systems provide memory management over remote nodes, PGAS frameworks provide no transparent caching and transfer of remote memory objects accessed by local nodes. The programmer is still expected to handle data/thread movement manually when working with shared memory over the network to maximize their performance metrics of interest.
\subsection{Message Passing}
\label{sec:msg-passing}
\textit{Message Passing} remains the predominant programming model for parallelism between loosely coupled nodes within a computer system, much as it is ubiquitous in supporting all levels of abstraction among the concurrent components of a computer system. Specific to cluster computing is the message-passing programming model, where parallel programs (or instances of the same parallel program) on different nodes communicate by exchanging messages over the network. Such models trade programming-model productivity for more fine-grained control over the messages passed, as well as a more explicit separation between the communication and computation stages of a programming subproblem.
Commonly, message-passing backends function as \textit{middlewares} -- communication runtimes -- to aid distributed software development \cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend exposes facilities for inter-application communication to frontend developers while transparently providing security, accounting, and fault tolerance, much like how an operating system may provide resource management, scheduling, and security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}. This is the case for implementing the PGAS programming model, which mostly relies on common message-passing backends to facilitate orchestrated data manipulation across distributed nodes. Likewise, message-passing backends, including RDMA APIs, form the backbone of many research-oriented DSM systems \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
Message-passing between network-connected nodes may be \textit{two-sided} or \textit{one-sided}. The former models an intuitive workflow for sending and receiving datagrams over the network -- the sender initiates a transfer; the receiver copies a received packet from the network card into a kernel buffer; the receiver's kernel filters the packet and (optionally) \cite{FreeBSD.man-BPF-4.2021} copies the internal message into the message-passing runtime/middleware's address space; the receiver's middleware inspects the copied message and acts on it accordingly, likely also copying slices of message data into some registered distributed-shared-memory buffer for the distributed application to access. Despite being a highly intuitive model of data manipulation over the network, this poses a fundamental performance issue: both the receiver's kernel and its userspace must spend CPU time on every received message to move the data from the bytes read off the NIC up to userspace. Because this happens concurrently with other kernel and userspace routines, a preemptible kernel may incur significant latency if the packet-filtering routine is pre-empted by another kernel routine, userspace, or IRQs.
Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows the network interface card to bypass in-kernel packet filters and perform DMA on registered memory regions. The NIC can then notify the CPU via interrupts, allowing the kernel and userspace programs to perform callbacks at reception time with reduced latency. Because of this advantage, many recent studies attempt to leverage RDMA APIs for improving distributed data workloads and building DSM middleware \cites{Lu_etal.Spark_over_RDMA.2014}{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
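To make the contrast concrete, the sketch below posts a one-sided RDMA write with the \texttt{libibverbs} API. Queue-pair creation, memory registration, and the out-of-band exchange of the remote address and \texttt{rkey} are assumed to have happened elsewhere; the helper function and its parameters are ours.
\begin{minted}[linenos]{c}
/* One-sided RDMA WRITE via libibverbs: the remote CPU is not involved.
 * QP setup, ibv_reg_mr() and the remote_addr/rkey exchange are assumed. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp,
                    void *local_buf, uint32_t len, uint32_t lkey,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,               /* from ibv_reg_mr() on local_buf */
    };
    struct ibv_send_wr send_wr, *bad_wr = NULL;

    memset(&send_wr, 0, sizeof(send_wr));
    send_wr.opcode              = IBV_WR_RDMA_WRITE;
    send_wr.sg_list             = &sge;
    send_wr.num_sge             = 1;
    send_wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion  */
    send_wr.wr.rdma.remote_addr = remote_addr;        /* advertised by the peer */
    send_wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &send_wr, &bad_wr);
}
\end{minted}
Completion is later harvested from the completion queue (e.g., via \texttt{ibv\_poll\_cq}), which is the only point at which the initiating CPU is involved.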
\section{Consistency Model and Cache Coherence}
A consistency model specifies a contract on the allowed behaviors of multi-processing programs with regard to a shared memory \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious conflict, which consistency models aim to resolve, lies in the interaction between processor-native programs and multiprocessors, all of which need to operate on a shared memory with heterogeneous cache topologies. Here, a well-defined consistency model resolves the conflict at the architectural level. Beyond consistency models for bare-metal systems, programming languages \cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024} and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for parallel access to shared memory on top of program-order guarantees, making program behavior under shared-memory parallel programming explicit across underlying implementations.
Related to the definition of a consistency model is the coherence problem, which arises whenever multiple actors hold copies of some datum that must be kept synchronized in the presence of write accesses \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. While less relevant to programming-language design, coherence must be maintained via a coherence protocol \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020} at both the microarchitectural and the network scale. For DSM systems, the design of a correct and performant coherence protocol is of especially high priority and is a major part of many DSM studies throughout history \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Pinto_etal.Thymesisflow.2020}{Endo_Sato_Taura.MENPS_DSM.2020}{Couceiro_etal.D2STM.2009}.
\subsection{Consistency Model in DSM}
Distributed shared memory systems with node-local caching naturally imply the consistency problem with regard to contending read/write accesses. Indeed, a significant subset of DSM studies explicitly characterize themselves as adhering to one of the well-known consistency models to better understand system behavior and to optimize their coherence protocols \cites{Amza_etal.Treadmarks.1996}{Hu_Shi_Tang.JIAJIA.1999}{Carter_Bennett_Zwaenepoel.Munin.1991}{Endo_Sato_Taura.MENPS_DSM.2020}{Wang_etal.Concordia.2021}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kim_etal.DeX-upon-Linux.2020}, each adhering to a different consistency model to balance communication costs against ease of programming.
In particular, we note that DSM studies tend to conform to either release consistency \cites{Amza_etal.Treadmarks.1996}{Endo_Sato_Taura.MENPS_DSM.2020} {Carter_Bennett_Zwaenepoel.Munin.1991} or weaker \cite{Hu_Shi_Tang.JIAJIA.1999}, or sequential consistency \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991} {Wang_etal.Concordia.2021}{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}, with few works \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018} pertaining to moderately constrained consistency models in-between. While older works, as well as works which center performance of their proposed DSM systems over existing approaches \cites{Endo_Sato_Taura.MENPS_DSM.2020} {Cai_etal.Distributed_Memory_RDMA_Cached.2018}, favor release consistency due to its performance benefits (e.g., in terms of coherence costs \cite{Endo_Sato_Taura.MENPS_DSM.2020}), newer works tend to adopt stricter consistency models, sometimes due to improved productivity offered to programmers \cite{Kim_etal.DeX-upon-Linux.2020}.
\begin{table}[h]
\centering
\begin{tabular}{|l|c c c c c c|}
\hline
% ...
& Sequential
& TSO
& PSO
& Release
& Acquire
& Scope \\
\hline
Home; Invalidate
& \cites{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}{Zhang_etal.GiantVM.2020}
&
&
& \cites{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}
& \cites{Holsapple.DSM64.2012}
& \cites{Hu_Shi_Tang.JIAJIA.1999} \\
\hline
Home; Update
& & & & & & \\
\hline
Float; Invalidate
&
&
&
& \cites{Endo_Sato_Taura.MENPS_DSM.2020}
&
& \\
\hline
Float; Update
& & & & & & \\
\hline
Directory; Inval.
& \cites{Wang_etal.Concordia.2021}
&
&
&
&
& \\
\hline
Directory; Update
& & & & & & \\
\hline
Dist. Dir.; Inval.
& \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991}
&
& \cites{Cai_etal.Distributed_Memory_RDMA_Cached.2018}
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}
& \\
\hline
Dist. Dir.; Update
&
&
&
& \cites{Carter_Bennett_Zwaenepoel.Munin.1991}
&
& \\
\hline
\end{tabular}
\caption{
Coherence Protocol vs. Consistency Model in Selected Disaggregated Memory Studies. ``Float'' short for ``floating home''. Studies selected for clearly described consistency model and coherence protocol.
}
\label{table:1}
\end{table}
We especially note the role of balancing productivity and performance when selecting the ideal consistency model for a system. It is well understood that weaker consistency models are harder to program with, in exchange for fewer (implied) coherence communications and thus better overall throughput -- provided that the programmer can guarantee correctness, a weaker consistency model allows less invalidation of node-local cache entries, thereby allowing multiple nodes to compute in parallel on (likely) outdated local copies of data such that the result of the computation remains semantically correct with respect to the program. This point was made explicit in \textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991}, which (to reiterate) introduces the concept of consistency ``protocol parameters'' to annotate shared-memory access patterns, in order to reduce the amount of coherence communication necessary between nodes computing on distributed shared memory. For example, a DSM object (a memory object accounted for by the DSM system) can be annotated with ``delayed operations'' to delay coherence operations beyond any write access, or shared without the ``write'' annotation to disable write access from sharing nodes, thereby disabling all coherence operations with regard to this DSM object. Via programmer annotation of DSM objects, the Munin DSM system explicates the effect of weaker consistency on the amount of synchronization overhead necessary among shared-memory nodes. To our knowledge, no more recent DSM work has explored this interaction between consistency and coherence costs on DSM objects, though, relatedly, \textit{Resilient Distributed Datasets (RDD)} \cite{Zaharia_etal.RDD.2012} also highlights the performance and flexibility benefits of opting for an immutable data representation over network-disaggregated memory when compared to contemporary DSM approaches.
\subsection{Coherence Protocol}
Coherence protocols are hence the means by which DSM systems implement their consistency-model guarantees. As Table \ref{table:1} shows, DSM studies tend to implement write-invalidate coherence under a \textit{home-based} or \textit{directory-based} protocol framework, while a subset of DSM studies seek to reduce communication overheads and/or improve data persistence by offering write-update protocol extensions \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Shan_Tsai_Zhang.DSPM.2017}.
\subsubsection{Home-Based Protocols}
\textit{Home-based} protocols assign each shared memory object a corresponding ``home'' node, under the assumption that a many-node network would distribute home-node ownership of shared memory objects across all hosts \cite{Hu_Shi_Tang.JIAJIA.1999}. On top of home-node ownership, each mutable shared memory object may additionally be cached by other nodes within the network, creating the coherence problem. To our knowledge, in addition to Table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Fleisch_Popek.Mirage.1989}{Schaefer_Li.Shiva.1989}{Hu_Shi_Tang.JIAJIA.1999}{Nelson_etal.Grappa_DSM.2015}{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}.
We note that home-based protocols are conceptually straightforward compared to directory-based protocols, centering communication on the storage of global metadata (in this case, the ownership of each shared memory object). This leads to greater flexibility in implementing coherence protocols. A shared memory object may, at its creation, be made known globally via broadcast, or made known to only a subset of nodes (zero or more) via multicast. Likewise, the metadata may be cached locally at each node and invalidated alongside the object itself, or fetched from a fixed node per object. This implementation flexibility is further taken advantage of in \textit{Hotpot}\cite{Shan_Tsai_Zhang.DSPM.2017}, which refines the ``home node'' concept into the \textit{owner node} to provide replication and persistence, in addition to adopting a dynamic home protocol similar to that of \cite{Endo_Sato_Taura.MENPS_DSM.2020}.
\subsubsection{Directory-Based Protocols}
\textit{Directory-based} protocols instead take a shared-database approach, denoting each shared memory object with a globally shared entry describing its ownership and sharing status. In its non-distributed form (e.g., \cite{Wang_etal.Concordia.2021}), a global, central directory is maintained for all nodes in the network to hold ownership information; the directory hence becomes a bottleneck, imposing latency and bandwidth constraints on parallel processing systems. Comparatively, a distributed directory scheme may delegate responsibilities across all nodes in the network, mostly in accordance with a sharded address space \cites{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}. Though theoretically sound, this scheme performs no dynamic load-balancing for commonly shared memory objects and in the worst case behaves exactly like a non-distributed directory coherence scheme. To our knowledge, in addition to Table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Schoinas_etal.Sirocco.1998}{Eisley_Peh_Shang.In-net-coherence.2006}{Hong_etal.NUMA-to-RDMA-DSM.2019}.
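As a minimal illustration of the per-page metadata such a directory must track, consider the sketch below; the field layout and the sharding rule are ours and do not follow any of the cited systems.
\begin{minted}[linenos]{c}
/* Minimal sketch of a per-page directory entry; illustrative only. */
#include <stdint.h>

#define MAX_NODES 64

enum page_state { PAGE_INVALID, PAGE_SHARED, PAGE_EXCLUSIVE };

struct dir_entry {
    enum page_state state;
    uint16_t        owner;                    /* current writer, if EXCLUSIVE   */
    uint64_t        sharers[MAX_NODES / 64];  /* bitmap of nodes caching copies */
};

/* In a distributed directory, the entry for a page is located by sharding
 * its global address onto a home directory node (no load balancing). */
static inline unsigned dir_node_for(uint64_t gaddr, unsigned nr_nodes)
{
    return (unsigned)((gaddr >> 12) % nr_nodes);  /* page-granular sharding */
}
\end{minted}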
\subsection{DMA and Cache Coherence}
The advent of high-speed RDMA-capable network interfaces introduces opportunities for designing more performant DSM systems over RDMA (as established in \ref{sec:msg-passing}). Orthogonally, RDMA-capable NICs fundamentally perform direct memory access on main memory to achieve one-sided RDMA operations and reduce the effect of OS jitter on RDMA latencies. For modern computer systems with cached multiprocessors, this poses a potential cache coherence problem at the local level: RDMA operations happen concurrently with memory accesses by CPUs, whose caches may \cites{Kjos_etal.HP-HW-CC-IO.1996}{Ven.LKML_x86_DMA.2008} or may not \cites{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018}{Corbet.LWN-NC-DMA.2021} be kept coherent by the DMA mechanism, so any DMA operation performed by the RDMA NIC may be incoherent with the cached copy of the same data inside the CPU caches (as is the case for accelerators, etc.). This issue is of particular concern to the kernel development community, which needs to ensure that the behavior of DMA operations remains identical across architectures regardless of support for cache-coherent DMA \cite{Corbet.LWN-NC-DMA.2021}. Like existing RDMA implementations, which make heavy use of architecture-specific DMA memory allocation, implementing RDMA-based DSM systems in the kernel also requires careful use of the kernel API functions that ensure cache coherency where necessary.
\subsection{Cache Coherence in ARMv8-A}
We specifically focus on the implementation of cache coherence in ARMv8-A. Unlike x86 which guarantees cache-coherent DMA \cites{Ven.LKML_x86_DMA.2008}{Corbet.LWN-NC-DMA.2021}, the ARMv8-A architecture (and many other popular ISAs, for example \textit{RISC-V}) \emph{does not} guarantee cache-coherency of DMA operations across vendor implementations. ARMv8 defines a hierarchical model for coherency organization to support \textit{heterogeneous} and \textit{asymmetric} multi-processing systems \cite{ARM.ARMv8-A.v1.0.2015}.
\begin{definition}[cluster]
A \textit{cluster} defines a minimal cache-coherent region for Cortex-A53 and Cortex-A57 processors. Each cluster usually comprises one or more cores as well as a shared last-level cache.
\end{definition}
\begin{definition}[sharable domain]
A \textit{sharable domain} defines a vendor-defined cache-coherent region. Sharable domains can be \textit{inner} or \textit{outer}, which limits the scope of broadcast coherence messages to \textit{point-of-unification} and \textit{point-of-coherence}, respectively.
Usually, the \textit{inner} sharable domain comprises all (closely coupled) processors inside a heterogeneous multiprocessing system (see \ref{def:het-mp}), while the \textit{outer} sharable domain defines the largest memory-sharing domain for the system (e.g., inclusive of the DMA bus).
\end{definition}
\begin{definition}[Point-of-Unification]\label{def:pou}
The \textit{point-of-unification} (\textit{PoU}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{inner} sharable domain see the same copy of data.
Consequently, \textit{PoU} defines a point at which every core of an ARMv8-A processor sees the same (i.e., a \emph{unified}) copy of a memory location, regardless of whether the access goes via the instruction caches, data caches, or TLB.
\end{definition}
\begin{definition}[Point-of-Coherence]\label{def:poc}
The \textit{point-of-coherence} (\textit{PoC}) under ARMv8 defines a level of coherency such that all sharers inside the \textbf{outer} sharable domain see the same copy of data.
Consequently, \textit{PoC} defines a point at which all \textit{observers} (e.g., cores, DSPs, DMA engines) to memory will observe the same copy of a memory location.
\end{definition}
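The distinction between the two points surfaces directly in the A64 cache-maintenance instructions. The sketch below (GNU C inline assembly) is illustrative only; cache-line alignment and iteration over an address range, which production code performs, are omitted.
\begin{minted}[linenos]{c}
/* Illustrative GNU C inline assembly: cache maintenance to PoU vs. PoC.
 * Real code must align to the cache line size and iterate over a range. */
static inline void clean_dcache_to_pou(void *va)
{
    /* Visible to instruction/data caches of the inner sharable domain,
     * e.g. after writing code that is about to be executed. */
    asm volatile("dc cvau, %0" : : "r"(va) : "memory");
}

static inline void clean_inval_dcache_to_poc(void *va)
{
    /* Visible to every observer of memory, including non-coherent DMA
     * masters outside the inner domain. */
    asm volatile("dc civac, %0" : : "r"(va) : "memory");
}

static inline void complete_maintenance(void)
{
    /* Ensure the maintenance operations have completed before proceeding. */
    asm volatile("dsb sy" : : : "memory");
}
\end{minted}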
\subsubsection{Addendum: \textit{Heterogeneous} \& \textit{Asymmetric} Multiprocessing}
Using these definitions, a vendor could build \textit{heterogeneous} and \textit{asymmetric} multi-processing systems as follows:
\begin{definition}[Heterogeneous Multiprocessing]\label{def:het-mp}
A \textit{heterogeneous multiprocessing} system incorporates ARMv8 processors of diverse microarchitectures that are fully coherent with one another, running the same system image.
\end{definition}
\begin{definition}[Asymmetric Multiprocessing]
An \textit{asymmetric multiprocessing} system need not contain fully coherent processors. For example, a system-on-a-chip may contain a non-coherent co-processor for secure-computing purposes \cite{ARM.ARMv8-A.v1.0.2015}.
\end{definition}
\subsection{ARMv8-A Software Cache Coherence in Linux Kernel}
Because the architecture does not guarantee DMA coherency in hardware (though such support exists \cite{Parris.AMBA_4_ACE-Lite.2013}), programmers need to invoke architecture-specific cache-coherency instructions, often encapsulated in problem-specific subroutines, when porting DMA hardware support across a diverse range of ARMv8 microarchitectures.
Notably, kernel (driver) programming warrants attention to software-maintained coherency whenever downstream userspace programmers expect data flows interspersed between CPU and DMA operations to follow program order and the (driver vendor's) specification. One such example arises in the Linux kernel's DMA memory-management API \cite{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024}\footnote[1]{Based on Linux kernel v6.7.0.}:
\begin{definition}[DMA Mappings]
The Linux kernel DMA memory allocation API, imported via
\begin{minted}[linenos]{c}
#include <linux/dma-mapping.h>
\end{minted}
defines two variants of DMA mappings:
\begin{itemize}
\item {\label{def:consistent-dma-map}
\textit{Consistent} DMA mappings:
They are guaranteed to be coherent between concurrent CPU and DMA accesses without explicit software flushing.\footnote[2]{
However, this does not preclude CPU store reordering, so memory barriers remain necessary in a multiprocessing context.
}
}
\item {
\textit{Streaming} DMA mappings:
They provide no coherency guarantee between concurrent CPU and DMA accesses; programmers must manually apply coherency-maintenance subroutines for synchronization.
}
\end{itemize}
\end{definition}
Consistent DMA mappings can be created simply by allocating non-cacheable memory, which guarantees \textit{PoC} for all memory observers (though system-specific fast paths exist).
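For comparison with the streaming case below, a consistent mapping is obtained roughly as in the following sketch; the surrounding driver context (the \texttt{struct device}, buffer size, and error handling) is assumed.
\begin{minted}[linenos]{c}
/* Kernel-context sketch: a consistent (coherent) DMA mapping.  No explicit
 * cache maintenance is needed between CPU and device accesses to cpu_va. */
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

static void *cpu_va;
static dma_addr_t dma_handle;

static int alloc_coherent_buf(struct device *dev, size_t size)
{
    cpu_va = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
    if (!cpu_va)
        return -ENOMEM;
    /* Hand dma_handle to the device; use cpu_va from the CPU. */
    return 0;
}
\end{minted}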
On the other hand, streaming DMA mappings require manual synchronization around programmed CPU/DMA accesses. Take, for example, single-buffer synchronization on the CPU after a DMA access:
\begin{minted}[linenos, mathescape]{c}
/* In kernel/dma/mapping.c $\label{code:dma_sync_single_for_cpu}$*/
void dma_sync_single_for_cpu(
struct device *dev, // kernel repr for DMA device
dma_addr_t addr, // DMA address
size_t size, // Synchronization buffer size
enum dma_data_direction dir // Data-flow direction
) {
/* Translate DMA address to physical address */
phys_addr_t paddr = dma_to_phys(dev, addr);
if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu(paddr, size, dir);
arch_sync_dma_for_cpu_all(); // MIPS quirks...
}
/* Miscellaneous cases... */
}
\end{minted}
\begin{minted}[linenos]{c}
/* In arch/arm64/mm/dma-mapping.c */
void arch_sync_dma_for_cpu(
phys_addr_t paddr,
size_t size,
enum dma_data_direction dir // Data-flow direction
) {
/* Translate physical address to (kernel) virtual address */
unsigned long start = (unsigned long)phys_to_virt(paddr);
/* Early exit for DMA read: no action needed for CPU */
if (dir == DMA_TO_DEVICE)
return;
/* ARM64-specific: invalidate CPU cache to PoC */
dcache_inval_poc(start, start + size);
}
\end{minted}
This call chain and its mirror case, which maintains cache coherency for the DMA device after CPU access: \mint[breaklines=true]{c}|dma_sync_single_for_device(struct device *, dma_addr_t, size_t, enum dma_data_direction)|, call into the following procedures, respectively:
\begin{minted}[linenos]{c}
/* Exported @ arch/arm64/include/asm/cacheflush.h */
/* Defined @ arch/arm64/mm/cache.S */
/* All functions accept virtual start, end addresses. */
/* Invalidate data cache region [start, end) to PoC.
*
* Invalidate CPU cache entries that intersect with [start, end),
* such that data from external writers becomes visible to CPU.
*/
extern void dcache_inval_poc(
unsigned long start, unsigned long end
);
/* Clean data cache region [start, end) to PoC.
*
* Write-back CPU cache entries that intersect with [start, end),
* such that data from CPU becomes visible to external writers.
*/
extern void dcache_clean_poc(
unsigned long start, unsigned long end
);
\end{minted}
\subsubsection{Addendum: \texttt{enum dma\_data\_direction}}
The Linux kernel defines 4 direction \texttt{enum} values for fine-tuning synchronization behaviors:
\begin{minted}[linenos]{c}
/* In include/linux/dma-direction.h */
enum dma_data_direction {
DMA_BIDIRECTIONAL = 0, // data transfer direction uncertain.
DMA_TO_DEVICE = 1, // data from main memory to device.
DMA_FROM_DEVICE = 2, // data from device to main memory.
DMA_NONE = 3, // invalid repr for runtime errors.
};
\end{minted}
These values allow for certain fast-paths to be taken at runtime. For example, \texttt{DMA\_TO\_DEVICE} implies that the device reads data from memory without modification, and hence precludes software coherence instructions from being run when synchronizing for CPU after DMA operation.
% TODO: Move to addendum section.
\subsubsection{Use-case: Kernel-space \textit{SMBDirect} Driver}
\textit{SMBDirect} is an extension of the \textit{SMB} (\textit{Server Message Block}) protocol that opportunistically establishes the transport over RDMA-capable network interfaces \cite{many.MSFTLearn-SMBDirect.2024}.
We focus on two procedures inside the in-kernel SMBDirect implementation:
\paragraph{Before send: \texttt{smbd\_post\_send}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static int smbd_post_send(
struct smbd_connection *info, // SMBDirect transport context
struct smbd_request *request // SMBDirect request context
) // ...
\end{minted}
This function sits downstream of \texttt{smbd\_send}, which sends an SMBDirect payload for transport over the network. Payloads are constructed and batched to maximize bandwidth, after which \texttt{smbd\_post\_send} is called to signal the RDMA NIC for transport.
The function body is roughly as follows:
\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
struct ib_send_wr send_wr; // "Work Request" for entire payload
int rc, i;
/* For each message in batched payload */
for (i = 0; i < request->num_sge; i++) {
/* Log to the kernel message ring buffer... */
/* RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ $\label{code:ib_dma_sync_single_for_device}$*/
ib_dma_sync_single_for_device(
info->id->device, // struct ib_device *
request->sge[i].addr, // u64 (as dma_addr_t)
request->sge[i].length, // size_t
DMA_TO_DEVICE // enum dma_data_direction
);
}
/* Populate `request`, `send_wr`... */
rc = ib_post_send(
info->id->qp, // struct ib_qp * ("Queue Pair")
&send_wr, // const struct ib_send_wr *
NULL // const struct ib_send_wr ** (err handling)
);
/* Error handling... */
return rc;
}
\end{minted}
Line \ref{code:ib_dma_sync_single_for_device} writes back CPU cache lines so that they are visible to the RDMA NIC, in preparation for the DMA operations performed when the posted \textit{send request} is processed.
\paragraph{Upon reception: \texttt{recv\_done}}
\begin{minted}[linenos]{c}
/* In fs/smb/client/smbdirect.c */
static void recv_done(
struct ib_cq *cq, // "Completion Queue"
struct ib_wc *wc // "Work Completion"
) // ...
\end{minted}
This callback is invoked when the RDMA subsystem works on a payload received over RDMA. Mirroring \texttt{smbd\_post\_send}, it invalidates CPU cache lines so that DMA-ed data is visible to the CPU cores prior to any operation on the received data:
\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
struct smbd_data_transfer *data_transfer;
struct smbd_response *response = container_of(
wc->wr_cqe, // ptr: pointer to member
struct smbd_response, // type: type of container struct
cqe // name: name of member in struct
); // Cast member of struct into containing struct (C magic)
struct smbd_connection *info = response->info;
int data_length = 0;
/* Logging, error handling... */
/* Likewise, RDMA wrapper over DMA API$\ref{code:dma_sync_single_for_cpu}$ */
ib_dma_sync_single_for_cpu(
wc->qp->device,
response->sge.addr,
response->sge.length,
DMA_FROM_DEVICE // enum dma_data_direction
);
/* ... */
}
\end{minted}
\chapter{Software Coherency Latency}
\chapter{DSM System Design}
% \bibliographystyle{plain}
% \bibliographystyle{plainnat}
% \bibliography{mybibfile}
\printbibliography
% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix
\chapter{First appendix}
\section{First section}
Any appendices, including any required ethics information, should be included
after the references.
Markers do not have to consider appendices. Make sure that your contributions
are made clear in the main body of the dissertation (within the page limit).
% \chapter{Participants' information sheet}
% If you had human participants, include key information that they were given in
% an appendix, and point to it from the ethics declaration.
% \chapter{Participants' consent form}
% If you had human participants, include information about how consent was
% gathered in an appendix, and point to it from the ethics declaration.
% This information is often a copy of a consent form.
\end{document}

61
tex/draft/ugcheck.sty Normal file
View file

@ -0,0 +1,61 @@
% Historically a small number of students change the page layout,
% often accidentally by including a package like geometry or fullpage.
% Here we check if the basic page setup is correct. It does not
% check all aspects of the style guide, or any page limits.
%
% Changing the style in a way that fools these simple checks is still not ok!
%
\RequirePackage{printlen}
\AtBeginDocument{%
% To get the numbers below, include printlen package above and see lengths like this:
%\printlength\oddsidemargin\\
%\printlength\headheight\\
%\printlength\textheight\\
%\printlength\marginparsep\\
%\printlength\footskip\\
%\printlength\hoffset\\
%\printlength\paperwidth\\
%\printlength\topmargin\\
%\printlength\headsep\\
%\printlength\textwidth\\
%\printlength\marginparwidth\\
%\printlength\marginparpush\\
%\printlength\voffset\\
%\printlength\paperheight\\
%\baselinestretch\\
%\@thesispoints\\
%
\newif\ifmarginsmessedwith
\marginsmessedwithfalse
\ifdim\oddsidemargin=41.54103pt \else oddsidemargin has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\headheight=12.0pt \else headheight has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\textheight=674.33032pt \else textheight has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\marginparsep=10.0pt \else marginparsep has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\footskip=30.0pt \else footskip has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\hoffset=0.0pt \else hoffset has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\paperwidth=597.50787pt \else paperwidth has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\topmargin=-52.36449pt \else topmargin has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\headsep=25.0pt \else headsep has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\textwidth=412.56497pt \else textwidth has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\marginparwidth=35.0pt \else marginparwidth has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\marginparpush=7.0pt \else marginparpush has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\voffset=0.0pt \else voffset has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\paperheight=845.04684pt \else paperheight has been altered.\\ \marginsmessedwithtrue\fi
\newcommand{\pts}[1]{#1pt}
\ifdim\pts\baselinestretch = 1pt \else linespacing has been altered.\\ \marginsmessedwithtrue\fi
\ifdim\@thesispoints=12pt \else font size has been altered.\\ \marginsmessedwithtrue\fi
\ifmarginsmessedwith
\textbf{\large \em The required page layout has been changed.}
Please set up your document as in the example skeleton thesis document.
Do not change the page layout, or include packages like geometry,
savetrees, or fullpage, which change it for you.
We're not able to reliably undo arbitrary changes to the style. Please remove
the offending package(s), or layout-changing commands and try again. If you
can't figure out the problem, try adding your \LaTeX\ code a part at a time
to the example document.
\fi}