Reorganization + Added intro to skeleton

parent 0d78e11a97
commit 9bef473315
32 changed files with 788 additions and 2781 deletions
@@ -1,28 +0,0 @@
@inproceedings{!BGW.2010.CDN,
  title={Distributed caching algorithms for content distribution networks},
  author={Borst, Sem and Gupta, Varun and Walid, Anwar},
  booktitle={2010 Proceedings IEEE INFOCOM},
  pages={1--9},
  year={2010},
  organization={IEEE}
}

@article{KD.2002.Akamai_CoordCacheRepl,
  title={Coordinated placement and replacement for large-scale distributed caches},
  author={Korupolu, Madhukar R. and Dahlin, Michael},
  journal={IEEE Transactions on Knowledge and Data Engineering},
  volume={14},
  number={6},
  pages={1317--1329},
  year={2002},
  publisher={IEEE}
}

@misc{Z.2022.Linux_LRU_GEN,
  title={Multi-Gen LRU},
  url={https://www.kernel.org/doc/html/v6.6-rc5/mm/multigen_lru.html},
  journal={The Linux Kernel documentation},
  author={Zhao, Yu},
  editor={Alumbaugh, T J},
  year={2022}
}
@@ -1,10 +0,0 @@
> A High-Performance Framework for Dynamic Cache-Replacement-Strategy-Selection in Distributed Shared Memory Systems

# Background
> Various Kinds of (Distributed) Systems (What makes a system "distributed", anyways?) $\rightarrow$
> (Distributed) Cache Replacement Algorithms (Strategies) $\rightarrow$
> Limitations to common distributed cache replacement practices in extremely time-sensitive scenarios (like ours) $\rightarrow$
> Variables that need to be accounted for in cache replacement problems $\rightarrow$
> Need for dynamic manipulation of the cache replacement strategy, which implies probing & measurement & comparison, etc. $\rightarrow$
> Framework for such a thing, which is what we explore in this paper.

Binary file not shown.
@@ -1,63 +0,0 @@
\documentclass{article}
\usepackage{biblatex}

\title{Thesis Background}
\author{Zhengyi Chen}
\date{\today}

\addbibresource{../main.bib}
\addbibresource{background.bib}

\begin{document}
\maketitle

% Phil Karlton's famous quote about the 2 hard problems in CS here, maybe.

The problem of cache replacement is general to computer systems of all scales and topologies:
topologically massive systems, such as cellular stations\cite{GWHSZ.2014.CacheReplAsMDP-QLearning}
and CDNs\cites{EHOFK.2020.IBM-LRUvsFIFO}{!BGW.2010.CDN}{KD.2002.Akamai_CoordCacheRepl}, and
data-path level implementations for processors\cites{QJPSE.2007.DIP}{JTSE.2010.RRIP}{SYS.2021.RLR}
alike require good solutions to maintain and maximize application performance
at various levels of granularity. On the other hand, the set of feasible/performant solutions
(i.e., cache replacement policies) for one system may or may not inspire performance
improvements on another system of different scale, objectives, and tasks, constrained by a
(mostly) different context of available inputs, metadata, etc.

We propose a framework for dynamic cache-replacement-strategy selection that balances computation
cost, optimality, and working-set estimation for each strategy while incurring minimal performance
penalties for a shared-kernel cooperative Distributed Shared Memory system. (We identify \dots)

\section{Existing Cache Replacement Strategies}
\subsection{LRU-derived Algorithms}
\subsection{FIFO-derived Algorithms}
\subsection{Cache Replacement in Processors}
\subsection{Machine Learning and Heuristics}

\section{The Cache Replacement Problem}

\section{Page Replacement in (SMP or?) Linux}
%-- But LRU_GEN is interop-ed with an array of other systems,
% how could we trivially implement alternative page replacement algorithms with maximum feature
% compliance?
%

% Cache replacement strategies local to their own resources, for example CPU cache line replacement strategies, may not optimally perform cache eviction and
% replacement for CDNs, which (1) center \textit{frequency} over \textit{recency} and (2) could
% cooperate to utilize a nearby cache with small additional transfer cost\cite{KD.2002.Akamai_CoordCacheRepl}.
% Orthogonally, cache replacement strategies that perform well on one task might perform less well on
% another, as implied by \cite{SYS.2021.RLR} among others.

% this is the case for Linux's \textit{multi-gen LRU} page replacement algorithm which
% by default prioritizes memory access via page table to be stored in cache over those via file
% descriptors (though it dynamically self-adjusts)\cite{Z.2022.Linux_LRU_GEN} -- the kernel developers
% assume that the former is costlier upon page fault. This is well and good for programs with

% This is not to say that some amount of "technological transfer" from cache replacement strategies
% intended for one specific setting could not be

% A performant cache replacement strategy, relative to its hosting
% system, needs to strike a balance between optimality and the necessary computation needed to make a
% replacement/eviction decision.

\printbibliography
\end{document}
623 tex/draft/mybibfile.bib Normal file

@@ -0,0 +1,623 @@
@article{Aguilar_Leiss.Coherence-Replacement.2006,
  title = {A Coherence-Replacement Protocol For Web Proxy Cache Systems},
  author = {J. Aguilar and E.L. Leiss},
  year = 2006,
  journal = {International Journal of Computers and Applications},
  publisher = {Taylor \& Francis},
  volume = 28,
  number = 1,
  pages = {12--18},
  doi = {10.1080/1206212X.2006.11441783},
  url = {https://doi.org/10.1080/1206212X.2006.11441783},
  eprint = {https://doi.org/10.1080/1206212X.2006.11441783}
}

@article{Amza_etal.Treadmarks.1996,
  title = {Treadmarks: Shared memory computing on networks of workstations},
  author = {Amza, Cristiana and Cox, Alan L and Dwarkadas, Sandhya and Keleher, Pete and Lu, Honghui and Rajamony, Ramakrishnan and Yu, Weimin and Zwaenepoel, Willy},
  journal = {Computer},
  volume = {29},
  number = {2},
  pages = {18--28},
  year = {1996},
  publisher = {IEEE}
}

@misc{ARM.ARMv8-A.v1.0.2015,
  title = {ARM® Cortex®-A Series Programmer's Guide for ARMv8-A},
  url = {https://developer.arm.com/documentation/den0024/a},
  journal = {Documentation - arm developer},
  publisher = {ARM},
  author = {ARM},
  year = {2015}
}

@book{AST_Steen.Distributed_Systems-3ed.2017,
  title = {Distributed systems},
  author = {Van Steen, Maarten and Tanenbaum, Andrew S},
  year = {2017},
  publisher = {Maarten van Steen, Leiden, The Netherlands}
}

@article{Bell_Gray.HPC_is_Cluster.2002,
  title = {What's next in high-performance computing?},
  author = {Bell, Gordon and Gray, Jim},
  journal = {Communications of the ACM},
  volume = {45},
  number = {2},
  pages = {91--95},
  year = {2002},
  publisher = {ACM New York, NY, USA}
}

@book{BOOK.Hennessy_Patterson.CArch.2011,
  title = {Computer architecture: a quantitative approach},
  author = {Hennessy, John L and Patterson, David A},
  year = 2011,
  publisher = {Elsevier}
}

@inproceedings{Cabezas_etal.GPU-SM.2015,
  title = {GPU-SM: shared memory multi-GPU programming},
  author = {Cabezas, Javier and Jord{\`a}, Marc and Gelado, Isaac and Navarro, Nacho and Hwu, Wen-mei},
  year = 2015,
  booktitle = {Proceedings of the 8th Workshop on General Purpose Processing using GPUs},
  pages = {13--24}
}

@article{Cai_etal.Distributed_Memory_RDMA_Cached.2018,
  title = {Efficient distributed memory management with RDMA and caching},
  author = {Cai, Qingchao and Guo, Wentian and Zhang, Hao and Agrawal, Divyakant and Chen, Gang and Ooi, Beng Chin and Tan, Kian-Lee and Teo, Yong Meng and Wang, Sheng},
  journal = {Proceedings of the VLDB Endowment},
  volume = {11},
  number = {11},
  pages = {1604--1617},
  year = {2018},
  publisher = {VLDB Endowment}
}

@article{Carter_Bennett_Zwaenepoel.Munin.1991,
  title = {Implementation and performance of Munin},
  author = {Carter, John B and Bennett, John K and Zwaenepoel, Willy},
  journal = {ACM SIGOPS Operating Systems Review},
  volume = {25},
  number = {5},
  pages = {152--164},
  year = {1991},
  publisher = {ACM New York, NY, USA}
}

@inproceedings{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991,
  author = {Chaiken, David and Kubiatowicz, John and Agarwal, Anant},
  title = {LimitLESS directories: A scalable cache coherence scheme},
  year = {1991},
  isbn = {0897913809},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/106972.106995},
  doi = {10.1145/106972.106995},
  booktitle = {Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems},
  pages = {224--234},
  numpages = {11},
  location = {Santa Clara, California, USA},
  series = {ASPLOS IV}
}

@misc{Corbet.LWN-NC-DMA.2021,
  title = {Noncoherent DMA mappings},
  url = {https://lwn.net/Articles/855328/},
  journal = {LWN.net},
  publisher = {LWN.net},
  author = {Corbet, Jonathan},
  year = {2021}
}

@inproceedings{Couceiro_etal.D2STM.2009,
  title = {D2STM: Dependable distributed software transactional memory},
  author = {Couceiro, Maria and Romano, Paolo and Carvalho, Nuno and Rodrigues, Lu{\'\i}s},
  booktitle = {2009 15th IEEE Pacific Rim International Symposium on Dependable Computing},
  pages = {307--313},
  year = {2009},
  organization = {IEEE}
}

@article{De_Wael_etal.PGAS_Survey.2015,
  title = {Partitioned global address space languages},
  author = {De Wael, Mattias and Marr, Stefan and De Fraine, Bruno and Van Cutsem, Tom and De Meuter, Wolfgang},
  journal = {ACM Computing Surveys (CSUR)},
  volume = {47},
  number = {4},
  pages = {1--27},
  year = {2015},
  publisher = {ACM New York, NY, USA}
}

@inproceedings{Ding.vDSM.2018,
  author = {Ding, Zhuocheng},
  booktitle = {2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS)},
  title = {vDSM: Distributed Shared Memory in Virtualized Environments},
  year = {2018},
  pages = {1112--1115},
  keywords = {Virtual machine monitors;Optimization;Protocols;Virtualization;Operating systems;Stress;Analytical models;component;distributed shared memory;virtualization;low-latency network},
  doi = {10.1109/ICSESS.2018.8663720}
}

@inproceedings{Eisley_Peh_Shang.In-net-coherence.2006,
  title = {In-network cache coherence},
  author = {Eisley, Noel and Peh, Li-Shiuan and Shang, Li},
  booktitle = {2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)},
  pages = {321--332},
  year = {2006},
  organization = {IEEE}
}

@inproceedings{Endo_Sato_Taura.MENPS_DSM.2020,
  title = {MENPS: a decentralized distributed shared memory exploiting RDMA},
  author = {Endo, Wataru and Sato, Shigeyuki and Taura, Kenjiro},
  booktitle = {2020 IEEE/ACM Fourth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM)},
  pages = {9--16},
  year = {2020},
  organization = {IEEE}
}

@article{Fleisch_Popek.Mirage.1989,
  title = {Mirage: A coherent distributed shared memory design},
  author = {Fleisch, Brett and Popek, Gerald},
  journal = {ACM SIGOPS Operating Systems Review},
  volume = {23},
  number = {5},
  pages = {211--223},
  year = {1989},
  publisher = {ACM New York, NY, USA}
}

@misc{FreeBSD.man-BPF-4.2021,
  title = {FreeBSD manual pages},
  url = {https://man.freebsd.org/cgi/man.cgi?query=bpf&manpath=FreeBSD+14.0-RELEASE+and+Ports},
  journal = {BPF(4) Kernel Interfaces Manual},
  publisher = {The FreeBSD Project},
  author = {The FreeBSD Project},
  year = {2021}
}

@inproceedings{Giri_Mantovani_Carloni.NoC-CC-over-SoC.2018,
  title = {NoC-based support of heterogeneous cache-coherence models for accelerators},
  author = {Giri, Davide and Mantovani, Paolo and Carloni, Luca P},
  booktitle = {2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS)},
  pages = {1--8},
  year = {2018},
  organization = {IEEE}
}

@book{Holsapple.DSM64.2012,
  title = {DSM64: A Distributed Shared Memory System in User-Space},
  author = {Holsapple, Stephen Alan},
  year = {2012},
  publisher = {California Polytechnic State University}
}

@article{Hong_etal.NUMA-to-RDMA-DSM.2019,
  title = {Scaling out NUMA-aware applications with RDMA-based distributed shared memory},
  author = {Hong, Yang and Zheng, Yang and Yang, Fan and Zang, Bin-Yu and Guan, Hai-Bing and Chen, Hai-Bo},
  journal = {Journal of Computer Science and Technology},
  volume = {34},
  pages = {94--112},
  year = {2019},
  publisher = {Springer}
}

@inproceedings{Hu_Shi_Tang.JIAJIA.1999,
  title = {JIAJIA: A software DSM system based on a new cache coherence protocol},
  author = {Hu, Weiwu and Shi, Weisong and Tang, Zhimin},
  booktitle = {High-Performance Computing and Networking: 7th International Conference, HPCN Europe 1999 Amsterdam, The Netherlands, April 12--14, 1999 Proceedings 7},
  pages = {461--472},
  year = {1999},
  organization = {Springer}
}

@misc{ISO/IEC_9899:2011.C11,
  abstract = {Edition Status: Withdrawn on 2018-07-13},
  isbn = {9780580801655},
  keywords = {Data processing ; Data representation ; Languages used in information technology ; Programming ; Programming languages ; Semantics ; Syntax},
  language = {eng},
  publisher = {British Standards Institute},
  title = {BS ISO/IEC 9899:2011: Information technology. Programming languages. C},
  year = {2013}
}

@misc{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007,
  title = {C++ Atomic Types and Operations},
  url = {https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html},
  journal = {C++ atomic types and operations},
  publisher = {ISO/IEC JTC 1},
  author = {Boehm, Hans J and Crowl, Lawrence},
  year = {2007}
}

@article{Itzkovitz_Schuster_Shalev.Millipede.1998,
  title = {Thread migration and its applications in distributed shared memory systems},
  author = {Itzkovitz, Ayal and Schuster, Assaf and Shalev, Lea},
  journal = {Journal of Systems and Software},
  volume = {42},
  number = {1},
  pages = {71--87},
  year = {1998},
  publisher = {Elsevier}
}

@article{Jaleel_etal.RRIP.2010,
  title = {High performance cache replacement using re-reference interval prediction (RRIP)},
  author = {Jaleel, Aamer and Theobald, Kevin B and Steely Jr, Simon C and Emer, Joel},
  year = 2010,
  journal = {ACM SIGARCH computer architecture news},
  publisher = {ACM New York, NY, USA},
  volume = 38,
  number = 3,
  pages = {60--71}
}

@article{Jia_etal.Tensorflow_over_RDMA.2018,
  title = {Improving the performance of distributed tensorflow with RDMA},
  author = {Jia, Chengfan and Liu, Junnan and Jin, Xu and Lin, Han and An, Hong and Han, Wenting and Wu, Zheng and Chi, Mengxian},
  journal = {International Journal of Parallel Programming},
  volume = {46},
  pages = {674--685},
  year = {2018},
  publisher = {Springer}
}

@inproceedings{Kaxiras_etal.DSM-Argos.2015,
  author = {Kaxiras, Stefanos and Klaftenegger, David and Norgren, Magnus and Ros, Alberto and Sagonas, Konstantinos},
  title = {Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory},
  year = {2015},
  isbn = {9781450335508},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/2749246.2749250},
  doi = {10.1145/2749246.2749250},
  abstract = {A coherent global address space in a distributed system enables shared memory programming in a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical section execution, we propose a DSM system with a novel coherence protocol, and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.},
  booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing},
  pages = {3--14},
  numpages = {12},
  location = {Portland, Oregon, USA},
  series = {HPDC '15}
}

@inproceedings{Khawaja_etal.AmorphOS.2018,
  title = {Sharing, Protection, and Compatibility for Reconfigurable Fabric with {AmorphOS}},
  author = {Khawaja, Ahmed and Landgraf, Joshua and Prakash, Rohith and Wei, Michael and Schkufza, Eric and Rossbach, Christopher J},
  booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)},
  pages = {107--127},
  year = {2018}
}

@article{Khokhar_etal.HetComputingVision.1993,
  title = {Heterogeneous computing: Challenges and opportunities},
  author = {Khokhar, Ashfaq A. and Prasanna, Viktor K. and Shaaban, Muhammad E. and Wang, C-L},
  year = 1993,
  journal = {Computer},
  publisher = {IEEE},
  volume = 26,
  number = 6,
  pages = {18--27}
}

@inproceedings{Kim_etal.DeX-upon-Linux.2020,
  author = {Kim, Sang-Hoon and Chuang, Ho-Ren and Lyerly, Robert and Olivier, Pierre and Min, Changwoo and Ravindran, Binoy},
  booktitle = {2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)},
  title = {DeX: Scaling Applications Beyond Machine Boundaries},
  year = {2020},
  pages = {864--876},
  keywords = {Protocols;Instruction sets;Linux;Prototypes;Distributed databases;Programming;Kernel;Thread migration;distributed execution;distributed memory;RDMA},
  doi = {10.1109/ICDCS47774.2020.00021}
}

@misc{Kjos_etal.HP-HW-CC-IO.1996,
  copyright = {Copyright 2006 Elsevier B.V., All rights reserved.},
  issn = {0018-1153},
  journal = {Hewlett-Packard journal},
  keywords = {Computer Science ; Computer Science, Hardware & Architecture ; Engineering ; Engineering, Electrical & Electronic ; Instruments & Instrumentation ; Science & Technology ; Technology},
  language = {eng},
  number = {1},
  pages = {52--59},
  publisher = {Hewlett-Packard Co},
  abstract = {Hardware cache coherent I/O is a new feature of the PA-RISC architecture that involves the I/O hardware in ensuring cache coherence, thereby reducing CPU and memory overhead and increasing performance.},
  author = {Kjos, Todd J and Nusbaum, Helen and Traynor, Michael K and Voge, Brendan A},
  address = {PALO ALTO},
  title = {Hardware cache coherent input/output},
  volume = {47},
  year = {1996}
}

@article{LaRowe_Ellis.Repl_NUMA.1991,
  title = {Page placement policies for NUMA multiprocessors},
  author = {Richard P. LaRowe and Carla Schlatter Ellis},
  year = 1991,
  journal = {Journal of Parallel and Distributed Computing},
  volume = 11,
  number = 2,
  pages = {112--129},
  doi = {10.1016/0743-7315(91)90117-R},
  issn = {0743-7315},
  url = {https://www.sciencedirect.com/science/article/pii/074373159190117R},
  abstract = {In many parallel applications, the size of the program's data exceeds even the very large amount of main memory available on large-scale multiprocessors. Virtual memory, in the sense of a transparent management of the main/secondary memory hierarchy, is a natural solution. The replacement, fetch, and placement policies used in uniprocessor paging systems need to be reexamined in light of the differences in the behavior of parallel computations and in the memory architectures of multiprocessors. In particular, we investigate the impact of page placement in nonuniform memory access time (NUMA) shared memory MIMD machines. We experimentally evaluate several paging algorithms that incorporate different approaches to the placement issue. Under certain workload assumptions, our results show that placement algorithms that are strongly biased toward local frame allocation but are able to borrow remote frames can reduce the number of page faults over strictly local allocation. The increased cost of memory operations due to the extra remote accesses is more than compensated for by the savings resulting from the reduction in demand fetches, effectively reducing the computation completion time for these programs without having adverse effects on the performance of “typical” NUMA programs. We also discuss some early results obtained from an actual kernel implementation of one of our page placement algorithms.}
}

@article{Lenoski_etal.Stanford_DASH.1992,
  title = {The Stanford DASH multiprocessor},
  author = {Lenoski, Daniel and Laudon, James and Gharachorloo, Kourosh and Weber, W-D and Gupta, Anoop and Hennessy, John and Horowitz, Mark and Lam, Monica S.},
  journal = {Computer},
  volume = {25},
  number = {3},
  pages = {63--79},
  year = {1992},
  publisher = {IEEE}
}

@inproceedings{Li_etal.RelDB_RDMA.2016,
  title = {Accelerating relational databases by leveraging remote memory and RDMA},
  author = {Li, Feng and Das, Sudipto and Syamala, Manoj and Narasayya, Vivek R},
  booktitle = {Proceedings of the 2016 International Conference on Management of Data},
  pages = {355--370},
  year = {2016}
}

@inproceedings{Lu_etal.MPI_vs_DSM_over_cluster.1995,
  title = {Message passing versus distributed shared memory on networks of workstations},
  author = {Lu, Honghui and Dwarkadas, Sandhya and Cox, Alan L and Zwaenepoel, Willy},
  booktitle = {Supercomputing'95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing},
  pages = {37--37},
  year = {1995},
  organization = {IEEE}
}

@inproceedings{Lu_etal.Spark_over_RDMA.2014,
  title = {Accelerating spark with RDMA for big data processing: Early experiences},
  author = {Lu, Xiaoyi and Rahman, Md Wasi Ur and Islam, Nusrat and Shankar, Dipti and Panda, Dhabaleswar K},
  booktitle = {2014 IEEE 22nd Annual Symposium on High-Performance Interconnects},
  pages = {9--16},
  year = {2014},
  organization = {IEEE}
}

@inproceedings{Ma_etal.SHM_FPGA.2020,
  title = {A hypervisor for shared-memory FPGA platforms},
  author = {Ma, Jiacheng and Zuo, Gefei and Loughlin, Kevin and Cheng, Xiaohe and Liu, Yanqiang and Eneyew, Abel Mulugeta and Qi, Zhengwei and Kasikci, Baris},
  booktitle = {Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems},
  pages = {827--844},
  year = {2020}
}

@misc{Manson_Goetz.JSR_133.Java_5.2004,
  url = {https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html},
  journal = {JSR 133 (Java Memory Model) FAQ},
  publisher = {Department of Computer Science, University of Maryland},
  author = {Manson, Jeremy and Goetz, Brian},
  year = {2004}
}

@misc{many.MSFTLearn-SMBDirect.2024,
  title = {SMB Direct},
  url = {https://learn.microsoft.com/en-us/windows-server/storage/file-server/smb-direct},
  journal = {Microsoft Learn},
  publisher = {Microsoft},
  author = {Xelu86 and ManikaDhiman and dknappettmsft and v-alje and nedpyle and eross-msft and SubodhBhargava and JasonGerend and lizap and Heidilohr},
  year = {2024}
}

@inproceedings{Masouros_etal.Adrias.2023,
  title = {Adrias: Interference-Aware Memory Orchestration for Disaggregated Cloud Infrastructures},
  author = {Masouros, Dimosthenis and Pinto, Christian and Gazzetti, Michele and Xydis, Sotirios and Soudris, Dimitrios},
  year = 2023,
  booktitle = {2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)},
  pages = {855--869},
  organization = {IEEE}
}

@misc{Miller_Henderson_Jelinek.Kernelv6.7-DMA_guide.2024,
  title = {Dynamic DMA mapping Guide},
  url = {https://www.kernel.org/doc/html/v6.7/core-api/dma-api-howto.html},
  journal = {The Linux Kernel},
  author = {Miller, David S and Henderson, Richard and Jelinek, Jakub},
  year = {2024}
}

@book{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020,
  title = {A primer on memory consistency and cache coherence},
  author = {Nagarajan, Vijay and Sorin, Daniel J and Hill, Mark D and Wood, David A},
  year = {2020},
  publisher = {Springer Nature}
}

@inproceedings{narayanan2020heterogeneity,
  title = {{Heterogeneity-Aware} cluster scheduling policies for deep learning workloads},
  author = {Narayanan, Deepak and Santhanam, Keshav and Kazhamiaka, Fiodar and Phanishayee, Amar and Zaharia, Matei},
  year = 2020,
  booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
  pages = {481--498}
}

@inproceedings{Nelson_etal.Grappa_DSM.2015,
  title = {{Latency-Tolerant} software distributed shared memory},
  author = {Nelson, Jacob and Holt, Brandon and Myers, Brandon and Briggs, Preston and Ceze, Luis and Kahan, Simon and Oskin, Mark},
  booktitle = {2015 USENIX Annual Technical Conference (USENIX ATC 15)},
  pages = {291--305},
  year = {2015}
}

@inproceedings{Oh_Kim.Container_Migration.2018,
  title = {Stateful Container Migration employing Checkpoint-based Restoration for Orchestrated Container Clusters},
  author = {Oh, SeungYong and Kim, JongWon},
  year = 2018,
  booktitle = {2018 International Conference on Information and Communication Technology Convergence (ICTC)},
  pages = {25--30},
  doi = {10.1109/ICTC.2018.8539562}
}

@misc{Parris.AMBA_4_ACE-Lite.2013,
  title = {Extended system coherency: Cache Coherency Fundamentals},
  url = {https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/extended-system-coherency---part-1---cache-coherency-fundamentals},
  journal = {Extended System Coherency: Cache Coherency Fundamentals - Architectures and Processors blog - Arm Community blogs - Arm Community},
  publisher = {ARM Community Blogs},
  author = {Parris, Neil},
  year = {2013}
}

@inproceedings{Pinto_etal.Thymesisflow.2020,
  title = {Thymesisflow: A software-defined, HW/SW co-designed interconnect stack for rack-scale memory disaggregation},
  author = {Pinto, Christian and Syrivelis, Dimitris and Gazzetti, Michele and Koutsovasilis, Panos and Reale, Andrea and Katrinis, Kostas and Hofstee, H Peter},
  booktitle = {2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  pages = {868--880},
  year = {2020},
  organization = {IEEE}
}

@article{Rodriguez_etal.HPC_Cluster_Migration.2019,
  title = {Job migration in HPC clusters by means of checkpoint/restart},
  author = {Rodr{\'\i}guez-Pascual, Manuel and Cao, Jiajun and Mor{\'\i}{\~n}igo, Jos{\'e} A and Cooperman, Gene and Mayo-Garc{\'\i}a, Rafael},
  year = 2019,
  journal = {The Journal of Supercomputing},
  publisher = {Springer},
  volume = 75,
  pages = {6517--6541}
}

@misc{Rust.core::sync::atomic::Ordering.2024,
  title = {Ordering in core::sync::atomic - Rust},
  url = {https://doc.rust-lang.org/core/sync/atomic/enum.Ordering.html},
  journal = {The Rust Core Library},
  publisher = {the Rust Team},
  year = {2024}
}

@article{Schaefer_Li.Shiva.1989,
  title = {Shiva: An operating system transforming a hypercube into a shared-memory machine},
  author = {Li, Kai and Schaefer, Richard},
  year = {1989}
}

@inproceedings{Schoinas_etal.Sirocco.1998,
  title = {Sirocco: Cost-effective fine-grain distributed shared memory},
  author = {Schoinas, Ioannis and Falsafi, Babak and Hill, Mark D and Larus, James R and Wood, David A},
  booktitle = {Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. 98EX192)},
  pages = {40--49},
  year = {1998},
  organization = {IEEE}
}

@inproceedings{Shan_Tsai_Zhang.DSPM.2017,
  title = {Distributed Shared Persistent Memory},
  author = {Shan, Yizhou and Tsai, Shin-Yeh and Zhang, Yiying},
  year = 2017,
  booktitle = {Proceedings of the 2017 Symposium on Cloud Computing},
  location = {Santa Clara, California},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  series = {SoCC '17},
  pages = {323--337},
  doi = {10.1145/3127479.3128610},
  isbn = 9781450350280,
  url = {https://doi.org/10.1145/3127479.3128610},
  abstract = {Next-generation non-volatile memories (NVMs) will provide byte addressability, persistence, high density, and DRAM-like performance. They have the potential to benefit many datacenter applications. However, most previous research on NVMs has focused on using them in a single machine environment. It is still unclear how to best utilize them in distributed, datacenter environments. We introduce Distributed Shared Persistent Memory (DSPM), a new framework for using persistent memories in distributed data-center environments. DSPM provides a new abstraction that allows applications to both perform traditional memory load and store instructions and to name, share, and persist their data. We built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability. The key ideas of Hotpot are to integrate distributed memory caching and data replication techniques and to exploit application hints. We implemented Hotpot in the Linux kernel and demonstrated its benefits by building a distributed graph engine on Hotpot and porting a NoSQL database to Hotpot. Our evaluation shows that Hotpot outperforms a recent distributed shared memory system by 1.3\texttimes{} to 3.2\texttimes{} and a recent distributed PM-based file system by 1.5\texttimes{} to 3.0\texttimes{}.},
  numpages = 15,
  keywords = {distributed shared memory, persistent memory}
}

@misc{Ven.LKML_x86_DMA.2008,
  title = {Background on ioremap, cacheing, cache coherency on x86},
  url = {https://lkml.org/lkml/2008/4/29/480},
  journal = {lkml.org},
  author = {Ven, Arjan van de},
  year = {2008}
}

@inproceedings{Wang_etal.Concordia.2021,
  author = {Qing Wang and Youyou Lu and Erci Xu and Junru Li and Youmin Chen and Jiwu Shu},
  title = {Concordia: Distributed Shared Memory with {In-Network} Cache Coherence},
  booktitle = {19th USENIX Conference on File and Storage Technologies (FAST 21)},
  year = {2021},
  isbn = {978-1-939133-20-5},
  pages = {277--292},
  url = {https://www.usenix.org/conference/fast21/presentation/wang},
  publisher = {USENIX Association},
  month = feb
}

@misc{WEB.Ampere..Ampere_Altra_Datasheet.2023,
  url = {https://uawartifacts.blob.core.windows.net/upload-files/Altra_Max_Rev_A1_DS_v1_15_20230809_b7cdce449e_424d129849.pdf},
  journal = {Ampere Altra Max Rev A1 64-Bit Multi-Core Processor Datasheet},
  publisher = {Ampere Computing}
}

@misc{WEB.APACHE..Apache_Hadoop.2023,
  url = {https://hadoop.apache.org/},
  journal = {Apache Hadoop},
  publisher = {The APACHE Software Foundation}
}

@misc{WEB.APACHE..Apache_Spark.2023,
  url = {https://spark.apache.org/},
  journal = {Apache SparkTM - Unified Engine for large-scale data analytics},
  publisher = {The APACHE Software Foundation}
}

@misc{WEB.HPE.Chapel_Platforms-v1.33.2023,
  title = {Platform-Specific Notes},
  url = {https://chapel-lang.org/docs/platforms/index.html#},
  journal = {Chapel Documentation 1.33},
  publisher = {Hewlett Packard Enterprise Development LP.},
  year = {2023}
}

@misc{WEB.LBNL.UPC_man_1_upcc.2022,
  title = {upcc.1},
  url = {https://upc.lbl.gov/docs/user/upcc.html},
  journal = {Manual Reference Pages - UPCC (1)},
  publisher = {Lawrence Berkeley National Laboratory},
  year = {2022}
}

@misc{WEB.LWN.Corbet.HMM_GPL_woes.2018,
  title = {Heterogeneous memory management meets EXPORT\_SYMBOL\_GPL()},
  author = {Corbet, Jonathan},
  year = 2018,
  journal = {LWN.net},
  publisher = {LWN.net},
  url = {https://lwn.net/Articles/757124/}
}
@comment{TODO: or was the order of authors the other way around?}

@misc{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017,
  title = {Unified memory for cuda beginners},
  author = {Harris, Mark},
  year = 2017,
  journal = {Unified Memory for CUDA Beginners},
  publisher = {NVIDIA},
  url = {https://developer.nvidia.com/blog/unified-memory-cuda-beginners/}
}

@misc{WEB.Phoronix..HMM_Search_Results.2023,
  journal = {Heterogeneous Memory Management - Phoronix},
  publisher = {Phoronix},
  url = {https://www.phoronix.com/search/Heterogeneous%20Memory%20Management}
}

@inproceedings{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003,
  title = {A performance comparison of DSM, PVM, and MPI},
  author = {Werstein, Paul and Pethick, Mark and Huang, Zhiyi},
  booktitle = {Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies},
  pages = {476--482},
  year = {2003},
  organization = {IEEE}
}

@inproceedings{Yang_etal.FIFO-LPQD.2023,
  title = {FIFO can be Better than LRU: the Power of Lazy Promotion and Quick Demotion},
  author = {Yang, Juncheng and Qiu, Ziyue and Zhang, Yazhuo and Yue, Yao and Rashmi, KV},
  year = 2023,
  booktitle = {Proceedings of the 19th Workshop on Hot Topics in Operating Systems},
  pages = {70--79}
}

@inproceedings{Zaharia_etal.RDD.2012,
  author = {Matei Zaharia and Mosharaf Chowdhury and Tathagata Das and Ankur Dave and Justin Ma and Murphy McCauly and Michael J. Franklin and Scott Shenker and Ion Stoica},
  title = {Resilient Distributed Datasets: A {Fault-Tolerant} Abstraction for {In-Memory} Cluster Computing},
  booktitle = {9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)},
  year = {2012},
  isbn = {978-931971-92-8},
  address = {San Jose, CA},
  pages = {15--28},
  url = {https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia},
  publisher = {USENIX Association},
  month = apr
}

@inproceedings{Zhang_etal.GiantVM.2020,
  title = {GiantVM: A type-II hypervisor implementing many-to-one virtualization},
  author = {Zhang, Jin and Ding, Zhuocheng and Chen, Yubin and Jia, Xingguo and Yu, Boshi and Qi, Zhengwei and Guan, Haibing},
  booktitle = {Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
  pages = {30--44},
  year = {2020}
}

@inproceedings{Zhou_etal.DART-MPI.2014,
  title = {DART-MPI: An MPI-based implementation of a PGAS runtime system},
  author = {Zhou, Huan and Mhedheb, Yousri and Idrees, Kamran and Glass, Colin W and Gracia, Jos{\'e} and F{\"u}rlinger, Karl},
  booktitle = {Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models},
  pages = {1--11},
  year = {2014}
}

BIN tex/draft/skeleton.pdf Normal file
Binary file not shown.
@@ -1,37 +1,117 @@
\documentclass{article}
\usepackage[english]{babel}
% UG project example file, February 2022
% A minor change in citation, September 2023 [HS]
% Do not change the first two lines of code, except you may delete "logo," if causing problems.
% Understand any problems and seek approval before assuming it's ok to remove ugcheck.
\documentclass[logo,bsc,singlespacing,parskip]{infthesis}
\usepackage{ugcheck}

% Include any packages you need below, but don't include any that change the page
% layout or style of the dissertation. By including the ugcheck package above,
% you should catch most accidental changes of page layout though.

\usepackage{microtype} % recommended, but you can remove if it causes problems
% \usepackage{natbib} % recommended for citations
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{biblatex}
\usepackage{graphicx}
\usepackage[justification=centering]{caption}
\usepackage{hyperref}
\usepackage{amsthm}
\usepackage[justification=centering]{caption}
\usepackage{graphicx}
\usepackage[english]{babel}
% -> biblatex
\usepackage{biblatex} % full of mischief
\addbibresource{mybibfile.bib}
% <- biblatex
% -> nice definition listings
\usepackage{csquotes}
% \usepackage{listings}
% \usepackage{xcolor}
\usepackage{minted}

\addbibresource{background_draft.bib}
\usepackage{amsthm}
\theoremstyle{definition}
\newtheorem{definition}{Definition}

% Code listings
% <- definition
% -> code listing
% [!] Requires external program: pypi:Pygments
\usepackage{minted}
\usemintedstyle{vs}
% \definecolor{code-comment}{rgb}{0.5, 0.5, 0.4}
% \definecolor{code-background}{rgb}{0.96, 0.96, 0.96}
% \lstset{
%   backgroundcolor=\color{code-background},
%   keywordstyle=\color{magenta},
%   commentstyle=\color{code-comment},
%   stringstyle=\color{purple},
%   basicstyle=\ttfamily\footnotesize,
%   emphstyle=\underline,
%   numbers=left,
%   tabsize=4
% }
% <- code listing

\begin{document}
\begin{preliminary}

\title{Cache Coherency in ARMv8-A for Cross-Architectural DSM Systems}

\author{Zhengyi Chen}

% CHOOSE YOUR DEGREE a):
% please leave just one of the following un-commented
% \course{Artificial Intelligence}
%\course{Artificial Intelligence and Computer Science}
%\course{Artificial Intelligence and Mathematics}
%\course{Artificial Intelligence and Software Engineering}
%\course{Cognitive Science}
\course{Computer Science}
%\course{Computer Science and Management Science}
%\course{Computer Science and Mathematics}
%\course{Computer Science and Physics}
%\course{Software Engineering}
%\course{Master of Informatics} % MInf students

% CHOOSE YOUR DEGREE b):
% please leave just one of the following un-commented
%\project{MInf Project (Part 1) Report} % 4th year MInf students
%\project{MInf Project (Part 2) Report} % 5th year MInf students
\project{4th Year Project Report} % all other UG4 students

\date{\today}

\abstract{
This skeleton demonstrates how to use the \texttt{infthesis} style for
undergraduate dissertations in the School of Informatics. It also emphasises the
page limit, and that you must not deviate from the required style.
The file \texttt{skeleton.tex} generates this document and should be used as a
starting point for your thesis. Replace this abstract text with a concise
summary of your report.
}

\maketitle

\newenvironment{ethics}
{\begin{frontenv}{Research Ethics Approval}{\LARGE}}
{\end{frontenv}\newpage}

\begin{ethics}
% \textbf{Instructions:} \emph{Agree with your supervisor which
% statement you need to include. Then delete the statement that you are not using,
% and the instructions in italics.\\
% \textbf{Either complete and include this statement:}}\\ % DELETE THESE INSTRUCTIONS
% %
% % IF ETHICS APPROVAL WAS REQUIRED:
% This project obtained approval from the Informatics Research Ethics committee.\\
% Ethics application number: ???\\
% Date when approval was obtained: YYYY-MM-DD\\
% %
% \emph{[If the project required human participants, edit as appropriate, otherwise delete:]}\\ % DELETE THIS LINE
% The participants' information sheet and a consent form are included in the appendix.\\
% %
% IF ETHICS APPROVAL WAS NOT REQUIRED:
% \textbf{\emph{Or include this statement:}}\\ % DELETE THIS LINE
This project was planned in accordance with the Informatics Research
Ethics policy. It did not involve any aspects that required approval
from the Informatics Research Ethics committee.

\standarddeclaration
\end{ethics}


\begin{acknowledgements}
Jordanian River to the Mediterranean Sea, maybe\dots
\end{acknowledgements}


\tableofcontents
\end{preliminary}


\chapter{Introduction}
Though large-scale cluster systems remain the dominant solution for request- and data-level parallelism \cite{BOOK.Hennessy_Patterson.CArch.2011}, there has been a resurgence in applying HPC techniques (e.g., DSM) for more efficient heterogeneous computation, with more tightly coupled heterogeneous nodes providing (hardware) acceleration for one another \cites{Cabezas_etal.GPU-SM.2015}{Ma_etal.SHM_FPGA.2020}{Khawaja_etal.AmorphOS.2018}. Orthogonally, within the scope of one motherboard, \emph{heterogeneous memory management (HMM)} enables an OS-controlled, unified memory view across both main memory and device memory \cite{WEB.NVIDIA.Harris.Unified_Memory_CUDA.2017}, all while using the same libc function calls as one would with SMP programming, with the underlying complexities of memory ownership and data placement managed automatically by the OS kernel. However, while HMM promises a distributed-shared-memory approach to exposing CPU and peripheral memory, applications (drivers and front-ends) that exploit HMM to provide ergonomic programming models remain fragmented and narrowly focused. Existing efforts to exploit HMM in Linux predominantly focus on exposing a global address space abstraction over GPU memory -- a largely uncoordinated effort spanning both \textit{in-tree} and proprietary code \cites{WEB.LWN.Corbet.HMM_GPL_woes.2018}{WEB.Phoronix..HMM_Search_Results.2023}. Little effort has gone into incorporating HMM into other kinds of accelerators across various system topologies.
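
As a concrete point of reference for the unified-memory model mentioned above, the host-side sketch below allocates a single buffer visible to both CPU and GPU via CUDA managed memory. \texttt{cudaMallocManaged} is the standard CUDA runtime call; the surrounding program structure is our own illustration, and the device kernel that would consume the buffer is omitted:

\begin{minted}{c}
#include <stdio.h>
#include <cuda_runtime.h>   /* compile and link with the CUDA toolkit, e.g. nvcc */

int main(void) {
    float *data;
    /* One allocation visible to both CPU and GPU; pages migrate on
     * demand rather than being copied explicitly with cudaMemcpy(). */
    if (cudaMallocManaged((void **)&data, 1024 * sizeof(float),
                          cudaMemAttachGlobal) != cudaSuccess)
        return 1;

    for (int i = 0; i < 1024; i++)   /* plain CPU stores, no staging buffer */
        data[i] = (float)i;

    /* A GPU kernel could now dereference `data` directly (omitted here);
     * HMM generalizes this page-migration model beyond GPUs. */
    printf("%f\n", data[42]);
    cudaFree(data);
    return 0;
}
\end{minted}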
Orthogonally, allocation of hardware accelerator resources in a cluster computing environment becomes difficult when the required hardware accelerator resources of one workload cannot be easily determined and/or isolated as a ``stage'' of computation. Within a cluster system there may exist a large number of general-purpose worker nodes and a limited number of hardware-accelerated nodes. Further, it is possible that every workload performed on this cluster asks for hardware acceleration from time to time, but never for a relatively long time. Many job scheduling mechanisms within a cluster \emph{move data near computation} by migrating the entire job/container between general-purpose and accelerator nodes \cites{Rodriguez_etal.HPC_Cluster_Migration.2019}{Oh_Kim.Container_Migration.2018}. This style of migration naturally incurs a large overhead -- accelerator nodes which strictly perform computation on data in memory, without ever needing to touch the container's filesystem, should not have to install the entire filesystem locally, for starters. Moreover, must \emph{all} computations be performed near data? \textit{Adrias}\cite{Masouros_etal.Adrias.2023}, for example, shows that RDMA over fast network interfaces (25 Gbps $\times$ 8), when compared to node-local setups, results in negligible impact on tail latencies but a high impact on throughput when bandwidth is maximized.

@@ -42,9 +122,6 @@ This thesis paper builds upon an ongoing research effort in implementing a tight
\item {
    The effect of cache coherency maintenance, specifically OS-initiated, on RDMA programs.
}
\item {
    Implementation of cache coherency in cache-incoherent kernel-side RDMA clients.
}
\item {
    Discussion of memory models and coherence protocol designs for a single-writer, multi-reader RDMA-based DSM system.
}

@@ -71,56 +148,17 @@ A majority of contributions to software DSM systems come from the 1990s \cites{A
While developments in hardware DSM materialized into a universal approach to cache coherence in contemporary many-core processors (e.g., \textit{Ampere Altra}\cite{WEB.Ampere..Ampere_Altra_Datasheet.2023}), software DSMs in clustered computing languished in favor of loosely-coupled nodes performing data-parallel computation and communicating via message-passing. The bandwidth of late-1990s network interfaces was insufficient to support the high traffic incurred by DSM and its programming model \cites{Werstein_Pethick_Huang.PerfAnalysis_DSM_MPI.2003}{Lu_etal.MPI_vs_DSM_over_cluster.1995}.

% Different to DSM-over-RDMA, we try to expose RDMA as a device with HMM capability,
% i.e., we do it in kernel as opposed to in userspace. The accelerator node can access
% the local node's shared page the way a DMA device does, via HMM.
New developments in network interfaces provide much-improved bandwidth and latency compared to the Ethernet of the 1990s. RDMA-capable NICs have been shown to improve training efficiency sixfold compared to distributed \textit{TensorFlow} via RPC, scaling positively over non-distributed training \cite{Jia_etal.Tensorflow_over_RDMA.2018}. Similar results have been observed for \textit{APACHE Spark} \cite{Lu_etal.Spark_over_RDMA.2014} and \textit{SMBDirect} \cite{Li_etal.RelDB_RDMA.2016}. Consequently, there has been a resurgence of interest in software DSM systems and programming models \cites{Nelson_etal.Grappa_DSM.2015}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}.
\subsection{Munin: Multi-Consistency Protocol}
\textit{Munin}\cite{Carter_Bennett_Zwaenepoel.Munin.1991} is one of the older developments in software DSM systems. The authors of Munin identify that \textit{false sharing}, which occurs when multiple processors write to different offsets of the same page and thereby trigger invalidations, is strongly detrimental to the performance of shared-memory systems. To combat this, Munin exposes annotations as part of its programming model to facilitate multiple consistency protocols on top of release consistency. An immutable shared memory object, for example, can be safely copied across readers without concern for coherence between processors. On the other hand, the \textit{write-shared} annotation explicates that a memory object is written by multiple processors without synchronization -- i.e., the programmer guarantees that only false sharing occurs within this granularity. Annotations such as these explicitly disable subsets of consistency procedures to reduce communication in the network fabric, thereby improving the performance of the DSM system.
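
To make the false-sharing pattern concrete, the sketch below has two threads (standing in for two DSM nodes) write to disjoint halves of one shared page: there is no data race at byte granularity, yet page-granular invalidation would bounce the page back and forth on every write. The \texttt{munin\_annotate} call in the comment is a hypothetical stand-in for Munin's annotation interface, not its actual API:

\begin{minted}{c}
#include <pthread.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* One shared page; the two writers touch disjoint halves. */
static char shared_page[PAGE_SIZE];

static void *writer(void *arg) {
    size_t base = (size_t)arg;               /* 0 or PAGE_SIZE / 2 */
    for (int i = 0; i < 1000000; i++)
        shared_page[base + (i % (PAGE_SIZE / 2))]++;
    return NULL;
}

int main(void) {
    /* Hypothetical Munin-style annotation: concurrent writes to this
     * object are only ever false sharing, so the DSM layer may buffer
     * and merge diffs instead of invalidating the whole page:
     *
     *   munin_annotate(shared_page, PAGE_SIZE, WRITE_SHARED);
     */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, (void *)(size_t)0);
    pthread_create(&t2, NULL, writer, (void *)(size_t)(PAGE_SIZE / 2));
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d %d\n", shared_page[0], shared_page[PAGE_SIZE / 2]);
    return 0;
}
\end{minted}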
Perhaps most importantly, experiences from Munin show that \emph{restricting the flexibility of the programming model can lead to more performant coherence models}, as exhibited by the now-foundational \textit{Resilient Distributed Datasets} paper \cite{Zaharia_etal.RDD.2012} which powered many now-popular scalable data processing frameworks such as \textit{Hadoop MapReduce} \cite{WEB.APACHE..Apache_Hadoop.2023} and \textit{APACHE Spark} \cite{WEB.APACHE..Apache_Spark.2023}. ``To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory [based on]\dots transformations rather than\dots updates to shared state'' \cite{Zaharia_etal.RDD.2012}. This allows for the use of transformation logs to cheaply synchronize states between unshared address spaces -- a much-desired property for highly scalable, loosely-coupled clustered systems.
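
A toy illustration of this log-based synchronization idea, under our own simplification of deterministic integer transformations in place of RDD operations: instead of shipping state, a node ships the transformation log, and the peer replays it against a shared base snapshot:

\begin{minted}{c}
#include <stdio.h>

/* Deterministic "transformations" standing in for RDD operations. */
enum op { ADD, MUL };
struct xform { enum op op; int arg; };

/* Replaying the same log against the same snapshot reproduces the
 * same state, with no shared address space required. */
static int replay(int snapshot, const struct xform *log, int n) {
    for (int i = 0; i < n; i++)
        snapshot = (log[i].op == ADD) ? snapshot + log[i].arg
                                      : snapshot * log[i].arg;
    return snapshot;
}

int main(void) {
    const struct xform log[] = { {ADD, 3}, {MUL, 4} };
    printf("%d\n", replay(10, log, 2));   /* both nodes compute 52 */
    return 0;
}
\end{minted}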

\subsection{Treadmarks: Multi-Writer Protocol}
\textit{Treadmarks}\cite{Amza_etal.Treadmarks.1996} is a software DSM system developed in 1996, which featured an intricate \textit{interval}-based multi-writer protocol that allows multiple nodes to write to the same page without false sharing. The system follows a release-consistent memory model, which requires the use of either locks (via \texttt{acquire}, \texttt{release}) or barriers (via \texttt{barrier}) to synchronize. Each \textit{interval} represents a time period in between page creation, a \texttt{release} to another processor, or a \texttt{barrier}; each interval also corresponds to a \textit{write notice}, which is used for page invalidation. Each \texttt{acquire} message is sent to the statically-assigned lock-manager node, which forwards the message to the last releaser. The last releaser computes the outstanding write notices and piggy-backs them back for the acquirer to invalidate its own cached page entries, thus signifying entry into the critical section. Consistency information, including write notices, intervals, and page diffs, is routinely garbage-collected, which forces each node to validate its cached pages.
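
The usage pattern implied by this protocol looks roughly as follows. The \texttt{Tmk\_}-style entry points are modeled on TreadMarks' API but are stubbed out here, so the names and signatures should be read as illustrative assumptions rather than the library's exact interface:

\begin{minted}{c}
#include <stdio.h>

/* Illustrative stubs modeled on TreadMarks-style entry points. */
static void Tmk_lock_acquire(unsigned id) { (void)id; /* pull write notices, invalidate stale pages */ }
static void Tmk_lock_release(unsigned id) { (void)id; /* close the current interval, record a write notice */ }
static void Tmk_barrier(unsigned id)      { (void)id; /* global sync; intervals also end here */ }

static int shared_counter;   /* imagine this lives on a DSM-managed page */

int main(void) {
    /* The acquire goes to the lock-manager node, is forwarded to the
     * last releaser, and the reply piggy-backs outstanding write
     * notices so the acquirer invalidates the cached pages they name. */
    Tmk_lock_acquire(0);
    shared_counter++;         /* first write to the page starts a new interval */
    Tmk_lock_release(0);

    Tmk_barrier(0);           /* barriers also close intervals */
    printf("%d\n", shared_counter);
    return 0;
}
\end{minted}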
Compared to \textit{Treadmarks}, the system described in this paper uses a single-writer protocol, thus eliminating the concept of ``intervals'' -- with regards to synchronization, each page can be either in-sync (in which case it can be safely shared) or out-of-sync (in which case it must be invalidated/updated). This comes with the following advantages (a minimal two-state sketch follows the list):

\begin{itemize}
\item Less metadata for consistency-keeping.

@@ -128,355 +166,58 @@ invalidated/updated). This comes with the following advantage:
\item Much simpler coherence protocol, which reduces communication cost.
\end{itemize}
|
||||
|
||||
% The majority of contributions to DSM study come from the 1990s, for example
% \textbf{[Treadmark, Millipede, Munin, Shiva, etc.]}. These DSM systems attempt to
% leverage kernel system calls to allow for user-level DSM over ethernet NICs. While
% these systems provide a strong theoretical basis for today's majority-software
% DSM systems and applications that expose a \emph{(partitioned) global address space},
% they were nevertheless constrained by the limitations in NIC transfer rate and
% bandwidth, and the concept of DSM failed to take off (relative to cluster computing).
In view of the still-disparate throughput and latency between local and remote memory access \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, the simpler coherence protocol of a single-writer design should provide better performance on the critical paths of remote memory access.

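As a minimal illustration (hypothetical code, not drawn from any of the cited systems), the single-writer invariant reduces per-page consistency metadata to little more than a state flag and a writer identity:

\begin{minted}{c}
/* Hypothetical per-page metadata under a single-writer protocol. */
enum page_state { PAGE_IN_SYNC, PAGE_OUT_OF_SYNC };

struct page_meta {
    enum page_state state;
    int             writer_node;  /* meaningful only while out-of-sync */
};

/* A node may read its local copy only while the page is in-sync;
 * otherwise the copy must first be invalidated or updated. */
int may_read_locally(const struct page_meta *m)
{
    return m->state == PAGE_IN_SYNC;
}
\end{minted}
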
\subsection{Hotpot: Single-Writer \& Data Replication}
Newer works such as \textit{Hotpot} \cite{Shan_Tsai_Zhang.DSPM.2017} apply distributed shared memory techniques to persistent memory to provide ``transparent memory accesses, data persistence, data reliability, and high availability''. Leveraging persistent memory devices allows DSM applications to bypass checkpoints to block-device storage \cite{Shan_Tsai_Zhang.DSPM.2017}, ensuring both distributed cache coherence and data reliability at the same time \cite{Shan_Tsai_Zhang.DSPM.2017}.

We specifically discuss the single-writer portion of its coherence protocol. The data reliability guarantees proposed by the \textit{Hotpot} system require each shared page to be replicated to some \textit{degree of replication}. Nodes that always store the latest replica of a shared page are referred to as ``owner nodes''; they arbitrate other nodes into storing additional replicas in order to reach the degree-of-replication quota. At acquisition time, the acquiring node asks the access-management node for single-writer access to a shared page, which is granted if no other critical section exists, alongside a list of the current owner nodes. At release time, the releaser first commits its changes to all owner nodes, which in turn commit the received changes across lesser sharers to achieve the required degree of replication. Both operations are acknowledged back in reverse order. Once the releaser has received all acknowledgements from the owner nodes, it tells them to delete their commit logs and, finally, tells the manager node to exit the critical section.
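
The release-time ordering matters for correctness: changes propagate outward to the owner nodes and then to lesser sharers, and acknowledgements travel back before any commit log is discarded. The following is a hedged pseudocode sketch of the releaser's side under these assumptions; the helper names (\texttt{send\_commit}, \texttt{wait\_ack}, etc.) are hypothetical, not Hotpot's actual interfaces:

\begin{minted}{c}
/* Hypothetical sketch of a Hotpot-style release, releaser's view.
 * send_* / wait_* are assumed blocking RPC helpers. */
void release_page(int page_id, const int *owners, int n_owners)
{
    /* 1. Commit changes to every owner node; each owner then fans the
     *    changes out to lesser sharers to meet the replication degree. */
    for (int i = 0; i < n_owners; i++)
        send_commit(owners[i], page_id);

    /* 2. Wait for the acknowledgements, which return in reverse order. */
    for (int i = n_owners - 1; i >= 0; i--)
        wait_ack(owners[i], page_id);

    /* 3. Only now is it safe to discard the logged commit transactions. */
    for (int i = 0; i < n_owners; i++)
        send_delete_commit_log(owners[i], page_id);

    /* 4. Finally, tell the manager node to exit the critical section. */
    send_exit_critical_section(manager_node(page_id), page_id);
}
\end{minted}
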
The required degree of replication, together with commit transactions that remain logged until explicit deletion, facilitates crash recovery at the expense of extra release-time I/O. While the study of crash recovery with respect to shared memory systems is out of the scope of this thesis, this paper provides a good framework for a \textbf{correct} coherence protocol for a single-writer, multiple-reader shared memory system, particularly when the protocol needs to cater for a great variety of nodes, each with its own memory preferences (e.g., write-update vs. write-invalidate, prefetching, etc.).

\subsection{MENPS: A Return to DSM}
MENPS \cite{Endo_Sato_Taura.MENPS_DSM.2020} leverages new RDMA-capable interconnects as a proof-of-concept that DSM systems and programming models can be as efficient as \textit{partitioned global address space} (PGAS) approaches on today's network interfaces. It builds upon \textit{TreadMarks}' \cite{Amza_etal.Treadmarks.1996} coherence protocol and crucially alters it into a \textit{floating home-based} protocol, based on the insight that diff-transfers across the network are costly compared to RDMA intrinsics -- which implies a preference towards local diff-merging. The home node then acts as the data supplier for every shared page within the system.

Compared to PGAS frameworks (e.g., MPI), experimentation over a subset of the \textit{NAS Parallel Benchmarks} shows that MENPS can obtain comparable speedup in some of the computation tasks, while achieving much better productivity due to DSM's support for transparent caching \cite{Endo_Sato_Taura.MENPS_DSM.2020}. These results back up their claim that DSM systems are at least as viable as traditional PGAS/message-passing frameworks for scientific computing, a claim also corroborated by the later resurgence of DSM studies \cite{Masouros_etal.Adrias.2023}.

\section{PGAS and Message Passing}
While the feasibility of transparent DSM systems over multiple machines on the network has been apparent since the 1980s, predominant approaches to ``scaling-out'' programs over the network rely on the message-passing approach \cite{AST_Steen.Distributed_Systems-3ed.2017}. The reasons are twofold:

\begin{enumerate}
    \item {
        Programmers would rather resort to more intricate, more predictable approaches to scaling-out programs over the network \cite{AST_Steen.Distributed_Systems-3ed.2017}. This implies manual/controlled data sharding over nodes, separation of the compute and communication ``stages'' of computation, etc., which benefit performance analysis and engineering.
    }
    \item {
        Enterprise applications value throughput and uptime of relatively computationally inexpensive tasks/resources \cite{BOOK.Hennessy_Patterson.CArch.2011}, which requires easy scalability of tried-and-true, latency-inexpensive applications. Studies in transparent DSM systems mostly require exotic, specifically-written programs to exploit the global address space, which is fundamentally at odds with the reusability and flexibility required.
    }
\end{enumerate}

\subsection{PGAS}
\textit{Partitioned Global Address Space} (PGAS) is a parallel programming model that (1) exposes a global address space to all machines within a network and (2) makes explicit the distinction between local and remote memory \cite{De_Wael_etal.PGAS_Survey.2015}. Oftentimes, message-passing frameworks -- for example \textit{OpenMPI}, \textit{OpenFabrics}, and \textit{UCX} -- are used as backends to provide the PGAS model over various network interfaces/platforms (e.g., Ethernet and InfiniBand) \cites{WEB.LBNL.UPC_man_1_upcc.2022}{WEB.HPE.Chapel_Platforms-v1.33.2023}.

Notably, implementing a \emph{global} address space across machines that already possess their own \emph{local} address spaces (e.g., cluster nodes running commercial Linux) necessitates a global addressing mechanism for shared data objects. DART \cite{Zhou_etal.DART-MPI.2014}, for example, utilizes a 128-bit ``global pointer'' that encodes a global memory object/segment ID and access flags in the upper 64 bits and a virtual address in the lower 64 bits, for each (slice of a) memory object allocated within the PGAS model. A \textit{non-collective} PGAS object is allocated entirely in the allocating node's local memory, but registered globally; consequently, a single global pointer is recorded in the runtime with corresponding permission flags for the context of some user-defined group of associated nodes. Comparatively, a \textit{collective} PGAS object is allocated such that a partition of the object (i.e., a sub-array of its representation) is stored in each of the associated nodes -- for a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each pointing to the same object with a different offset and an (intuitively) independently-chosen virtual address. Note that this design naturally requires virtual addresses within each node to be \emph{pinned} -- the allocated object cannot be re-addressed to a different virtual address, which prevents the global pointer recording the local virtual address from becoming spontaneously invalidated.
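
A minimal sketch of such an encoding follows; this is illustrative only, and DART's actual field widths and flag layout differ in detail:

\begin{minted}{c}
#include <stdint.h>

/* Illustrative 128-bit global pointer in the spirit of DART. */
struct global_ptr {
    uint32_t segment_id;  /* global memory object/segment ID (upper 64 bits, */
    uint32_t flags;       /* together with the access flags)                 */
    uint64_t vaddr;       /* pinned local virtual address    (lower 64 bits) */
};

/* Dereferencing is only legal on the node that owns the addressed slice;
 * remote slices must be reached through the runtime's one-sided operations. */
static inline void *local_deref(struct global_ptr gp)
{
    return (void *)(uintptr_t)gp.vaddr;
}
\end{minted}

Note how the pinning requirement falls out of the layout: since \texttt{vaddr} is baked into every registered global pointer, relocating the object locally would silently invalidate all of them.
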
Similar schemes can be observed in other PGAS backends/runtimes, although they may opt to use a map-like data structure for addressing instead. In general, while both PGAS and DSM systems provide memory management over remote nodes, PGAS frameworks provide no transparent caching and transfer of remote memory objects accessed by local nodes. The programmer is still expected to handle data/thread movement manually when working with shared memory over the network to maximize their performance metrics of interest.

\subsection{Message Passing}
\label{sec:msg-passing}
\textit{Message Passing} remains the predominant programming model for parallelism between loosely-coupled nodes within a computer system, much as it is ubiquitous in supporting all levels of abstraction within the concurrent components of a computer system. Specific to cluster computing systems is the message-passing programming model, where parallel programs (or instances of the same parallel program) on different nodes within the system communicate by exchanging messages over the network. Such models trade programming-model productivity for more fine-grained control over the messages passed, as well as a more explicit separation between the communication and computation stages within a programming subproblem.

Commonly, message-passing backends function as \textit{middlewares} -- communication runtimes -- to aid distributed software development \cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a backend exposes facilities for inter-application communication to frontend developers while transparently providing security, accounting, and fault-tolerance, much like how an operating system provides resource management, scheduling, and security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}. This is the case for implementations of the PGAS programming model, which mostly rely on common message-passing backends to facilitate orchestrated data manipulation across distributed nodes. Likewise, message-passing backends, including RDMA APIs, form the backbone of many research-oriented DSM systems \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.

Message-passing between network-connected nodes may be \textit{two-sided} or \textit{one-sided}. The former models an intuitive workflow for sending and receiving datagrams over the network: the sender initiates a transfer; the receiver copies a received packet from the network card into a kernel buffer; the receiver's kernel filters the packet and (optionally) \cite{FreeBSD.man-BPF-4.2021} copies the internal message into the message-passing runtime/middleware's address space; the receiver's middleware inspects the copied message and performs some procedures accordingly, likely also copying slices of message data into some registered distributed shared memory buffer for the distributed application to access. Despite being a highly intuitive model of data manipulation over the network, this poses a fundamental performance issue: because the process requires both the receiver's kernel and its userspace to exert CPU time upon reception of each message, the receiver node must proactively spend CPU time moving the received data from bytes read off the NIC into userspace. Because this happens concurrently with other kernel and userspace routines, a preemptable kernel may incur significant latency if the kernel routine for packet filtering is pre-empted by another kernel routine, userspace, or IRQs.

Comparatively, a ``one-sided'' message-passing scheme, for example RDMA, allows the network interface card to bypass in-kernel packet filters and perform DMA on registered memory regions. The NIC can then notify the CPU via interrupts, allowing the kernel and userspace programs to run callbacks at reception time with reduced latency. Because of this advantage, many recent studies attempt to leverage RDMA APIs for improved distributed data workloads and for creating DSM middlewares \cites{Lu_etal.Spark_over_RDMA.2014}{Jia_etal.Tensorflow_over_RDMA.2018}{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
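
For concreteness, a one-sided RDMA write through the Linux \texttt{libibverbs} API looks roughly as follows. This is a minimal sketch, assuming a connected queue pair \texttt{qp}, a locally registered memory region \texttt{mr}, and a peer-advertised \texttt{remote\_addr}/\texttt{rkey} exchanged out-of-band; completion polling is omitted:

\begin{minted}{c}
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA WRITE: the remote CPU is not involved. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
               uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,  /* request a completion entry */
    };
    struct ibv_send_wr *bad_wr = NULL;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{minted}
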
% \subsection{Data to Process, or Process to Data?}
% Hypothetically, instead of moving data back-and-forth between nodes within a
% shared storage domain, nodes could instead opt to perform remote procedure
% calls to other nodes which have access to their own share of data and
% acknowledge its completion at return. In the latter case, nodes connected within
% a network exchange task information -- data necessary to (re)construct the task
% in question on a remote node -- which can lead to significantly smaller packets
% than transmitting data over network. Provided that the time necessary to
% reconstruct the task on a remote node is less than the time necessary to
% transmit the data over network.

% Indeed, RPC have been shown
% (TBD -- The former is costly for data-intensive computation, but the latter may
% be impossible for certain tasks, and greatly hardens the replacement problem.)

% \section{Replacement Policy}
% In general, three variants of replacement strategies have been proposed for either
% generic cache block replacement problems, or specific use-cases where contextual
% factors can facilitate more efficient cache resource allocation:
% \begin{itemize}
% \item General-Purpose Replacement Algorithms, for example LRU.
% \item Cost-Model Analysis
% \item Probabilistic and Learned Algorithms
% \end{itemize}

% \subsection{General-Purpose Replacement Algorithms}
% Practically speaking, in the general case of the cache replacement problem,
% we desire to predict the re-reference interval of a cache block
% \cite{Jaleel_etal.RRIP.2010}. This follows from Belady's algorithm -- the
% optimal case for the \emph{ideal} replacement problem occurs when, at eviction
% time, the entry with the highest re-reference interval is replaced. Under this
% framework, therefore, the commonly-used LRU algorithm could be seen as a heuristic
% where the re-reference interval for each entry is predicted to be immediate.
% Fortunately, memory access traces of real computer systems agree with this
% tendency due to spatial locality \textbf{[source]}. (Real systems are complex,
% however, and there are other behaviors...) On the other hand, the hypothetical
% LFU algorithm is a heuristic that captures frequency. \textbf{[\dots]} While the
% textbook LFU algorithm suffers from needing to maintain a priority-queue for
% frequency analysis, it was nevertheless useful for keeping recurrent (though
% non-recent) blocks from being evicted from the cache \textbf{[source]}.

% Derivatives of the LRU algorithm attempt to balance between frequency and
% recency. \textbf{[Talk about LRU-K, LRU-2Q, LRU-MQ, LIRS, ARC here \dots]}

% Advancements in parallel/concurrent systems have led to a rediscovery of the benefits
% of using FIFO-derived replacement policies over their LRU/LFU counterparts, as
% book-keeping operations on the uniform LRU/LFU state prove to be (1) difficult
% for synchronization and, relatedly, (2) cache-unfriendly \cite{Yang_etal.FIFO-LPQD.2023}.
% \textbf{[Talk about FIFO, FIFO-CLOCK, FIFO-CAR, FIFO-QuickDemotion, and Dueling
% CLOCK here \dots]}

% Finally, real-life experiences have shown the need to reduce CPU time in practical
% applications, owing to one simple observation -- during the fetch-execute
% cycle, all processors perform blocking I/O on the memory. A cache-unfriendly
% design, despite its hypothetical optimality, could nevertheless degrade the performance
% of a system during low-memory situations. In fact, this proved to be the driving
% motivation behind Linux's transition away from the old LRU-2Q page replacement
% algorithm to the more coarse-grained Multi-generational LRU algorithm, which has
% been mainlined since v6.1.

% \subsection{Cost-Model Analysis}
% The ideal case for the replacement problem fails to account for invalidation of
% cache entries. It also assumes a uniform, dual-hierarchy cache-store model
% that is insufficient to capture the heterogeneity of today's massively-parallel,
% distributed systems. High-speed network interfaces are capable of exposing RDMA
% interfaces between computer nodes, which amount to almost twice-as-fast RDMA transfers
% when compared to swapping over the kernel I/O stack, while software that bypasses
% the kernel I/O stack is capable of stretching the bandwidth advantage even more
% (source). This creates an interesting network topology between RDMA-enabled nodes,
% where, in addition to swapping in low-memory situations, a node may opt to ``swap''
% or simply drop the physical page in order to lessen the cost of page misses.

% \textbf{[Talk about GreedyDual, GDSF, BCL, Amortization]}

% Traditionally, replacement policies based on cost-model analysis were utilized in
% content-delivery networks, which had different consistency models compared to
% finer-grained systems. HTTP servers need not pertain to strong consistency models,
% as out-of-date information is considered permissible, and single-writer scenarios
% are common. Consequently, most replacement policies for static content servers,
% while making strong distinctions regarding network topology, fail to account for the
% cases where an entry might become invalidated, let alone multi-writer protocols.
% One early paper \cite{LaRowe_Ellis.Repl_NUMA.1991} examines the efficacy of using
% page fault frequency as an indicator of preference towards working set inclusion
% (which I personally think is highly flawed -- to be explained). Another paper
% \cite{Aguilar_Leiss.Coherence-Replacement.2006} explores the possibility of taking
% page faults into consideration for eviction, but fails to go beyond the obvious
% implication that pages that have been faulted \emph{must} be evicted.

% The concept of cost models for RDMA and NUMA systems is relatively underdeveloped,
% too. (Expand)

% \subsection{Probabilistic and Learned Algorithms for Cache Replacement}
% Finally, machine learning techniques and low-cost probabilistic approaches have
% been applied to the ideal cache replacement problem with some level of success.
% \textbf{[Talk about LeCaR, CACHEUS here]}.

% XXX: I will be writing about replacement as postfix...

\section{Consistency Model and Cache Coherence}
A consistency model specifies a contract on the allowed behaviors of multi-processing programs with regards to shared memory \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. One obvious conflict, which consistency models aim to resolve, lies within the interaction between processor-native programs and multi-processors, all of which need to operate on a shared memory with heterogeneous cache topologies. Here, a well-defined consistency model aims to resolve the conflict at an architectural scope. Beyond consistency models for bare-metal systems, programming languages \cites{ISO/IEC_9899:2011.C11}{ISO/IEC_JTC1_SC22_WG21_N2427.C++11.2007}{Manson_Goetz.JSR_133.Java_5.2004}{Rust.core::sync::atomic::Ordering.2024} and paradigms \cites{Amza_etal.Treadmarks.1996}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018} define consistency models for parallel access to shared memory on top of program-order guarantees, in order to explicate program behavior under shared-memory parallel programming across underlying implementations.

Related to the definition of a consistency model is the coherence problem, which arises whenever multiple actors hold copies of some datum that must be kept synchronized across those actors with regards to write-accesses \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020}. While less relevant to programming language design, coherence must be maintained via a coherence protocol \cite{Nagarajan_etal.Primer_consistency_coherence_arch.2ed.2020} in systems of both microarchitectural and network scales. For DSM systems, the design of a correct and performant coherence protocol is of especially high priority and is a major part of many studies in DSM systems throughout history \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Amza_etal.Treadmarks.1996}{Pinto_etal.Thymesisflow.2020}{Endo_Sato_Taura.MENPS_DSM.2020}{Couceiro_etal.D2STM.2009}.

% \subsection{Common Consistency Models}
% ... should I even write this section? imo it's too basic for anyone to read
% and really just serves as a means to increase word count

\subsection{Consistency Model in DSM}
Distributed shared memory systems with node-local caching naturally imply the existence of the consistency problem with regards to contending read/write accesses. Indeed, a significant subset of DSM studies explicitly characterize themselves as adhering to one of the well-known consistency models, both to better understand system behavior and to provide optimizations in coherence protocols \cites{Amza_etal.Treadmarks.1996}{Hu_Shi_Tang.JIAJIA.1999}{Carter_Bennett_Zwaenepoel.Munin.1991}{Endo_Sato_Taura.MENPS_DSM.2020}{Wang_etal.Concordia.2021}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kim_etal.DeX-upon-Linux.2020}, with each system balancing between communication costs and ease of programming.

In particular, we note that DSM studies tend to conform either to release consistency \cites{Amza_etal.Treadmarks.1996}{Endo_Sato_Taura.MENPS_DSM.2020}{Carter_Bennett_Zwaenepoel.Munin.1991} or weaker \cite{Hu_Shi_Tang.JIAJIA.1999}, or to sequential consistency \cites{Chaiken_Kubiatowicz_Agarwal.LimitLESS-with-Alewife.1991}{Wang_etal.Concordia.2021}{Kim_etal.DeX-upon-Linux.2020}{Ding.vDSM.2018}, with few works \cite{Cai_etal.Distributed_Memory_RDMA_Cached.2018} pertaining to moderately constrained consistency models in between. While older works, as well as works that center the performance of their proposed DSM systems over existing approaches \cites{Endo_Sato_Taura.MENPS_DSM.2020}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}, favor release consistency due to its performance benefits (e.g., in terms of coherence costs \cite{Endo_Sato_Taura.MENPS_DSM.2020}), newer works tend to adopt stricter consistency models, sometimes due to the improved productivity offered to programmers \cite{Kim_etal.DeX-upon-Linux.2020}.

\begin{table}[h]
\centering
% ...
\label{table:1}
\end{table}

We especially note the role of balancing productivity and performance when selecting the ideal consistency model for a system. It is common knowledge that weaker consistency models are harder to program with, at the benefit of fewer (implied) coherence communications and hence better overall throughput: provided that the programmer can guarantee correctness, a weaker consistency model allows for less invalidation of node-local cache entries, thereby allowing multiple nodes to compute in parallel on (likely outdated) local copies of data such that the result of the computation remains semantically correct with regards to the program. This point was made explicit in \textit{Munin} \cite{Carter_Bennett_Zwaenepoel.Munin.1991}, which (to reiterate) introduces the concept of consistency ``protocol parameters'' to annotate shared memory access patterns, in order to reduce the amount of coherence communication necessary between nodes computing in distributed shared memory. For example, a DSM object (a memory object accounted for by the DSM system) can be annotated with ``delayed operations'' to delay coherence operations beyond any write-access, or shared without the ``write'' annotation to disable write-access over sharing nodes, thereby disabling all coherence operations with regards to this DSM object. Via programmer annotation of DSM objects, the Munin DSM system explicates the effect of weaker consistency in relation to the amount of synchronization overhead necessary among shared memory nodes. To our knowledge, no more recent DSM works have explored this interaction between consistency and coherence costs on DSM objects, though relatedly, the \textit{Resilient Distributed Dataset (RDD)} \cite{Zaharia_etal.RDD.2012} also highlights the performance and flexibility benefits of opting for an immutable data representation over network-disaggregated memory when compared to contemporary DSM approaches.
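
As a hedged illustration of what such annotations buy (a hypothetical API; Munin declares its protocol parameters differently), marking an object read-shared lets the runtime skip coherence traffic for it entirely, while ``delayed'' annotations batch it:

\begin{minted}{c}
#include <stddef.h>

/* Hypothetical Munin-style annotation sketch; all names illustrative. */
enum dsm_annotation {
    DSM_READ_ONLY,  /* no write-access shared: coherence ops disabled   */
    DSM_DELAYED,    /* buffer coherence ops past individual writes      */
    DSM_DEFAULT,    /* conservative: coherence op on every write-access */
};

/* The annotation travels with the allocation, steering the protocol. */
void *dsm_alloc(size_t size, enum dsm_annotation a);

/* e.g., a lookup table shared by all nodes but never mutated:
 *     table = dsm_alloc(sz, DSM_READ_ONLY);   -- zero coherence cost  */
\end{minted}
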
\subsection{Coherence Protocol}
Coherence protocols hence become the means by which DSM systems implement their consistency-model guarantees. As Table \ref{table:1} shows, DSM studies tend to implement write-invalidate coherence under a \textit{home-based} or \textit{directory-based} protocol framework, while a subset of DSM studies sought to reduce communication overheads and/or improve data persistence by offering write-update protocol extensions \cites{Carter_Bennett_Zwaenepoel.Munin.1991}{Shan_Tsai_Zhang.DSPM.2017}.
% The concepts of \textit{home-based} vs. \textit{directory-based} protocols are not parallels, however, but instead differentiate the perspective

\subsubsection{Home-Based Protocols}
\textit{Home-based} protocols assign each shared memory object a corresponding ``home'' node, under the assumption that a many-node network would distribute home-node ownership of shared memory objects across all hosts \cite{Hu_Shi_Tang.JIAJIA.1999}. On top of home-node ownership, each mutable shared memory object may additionally be cached by other nodes within the network, creating the coherence problem. To our knowledge, in addition to Table \ref{table:1}, this protocol and its derivatives have been adopted by \cites{Fleisch_Popek.Mirage.1989}{Schaefer_Li.Shiva.1989}{Hu_Shi_Tang.JIAJIA.1999}{Nelson_etal.Grappa_DSM.2015}{Shan_Tsai_Zhang.DSPM.2017}{Endo_Sato_Taura.MENPS_DSM.2020}.
% ...

\begin{minted}{c}
static void recv_done(
    // ...
) // ...
\end{minted}
Called when the RDMA subsystem completes work on a received payload. Mirroring the case for \texttt{smbd\_post\_send}, it invalidates CPU cache lines so that DMA-ed data becomes visible to CPU cores prior to any operations on the received data:

\begin{minted}[linenos, firstnumber=last, mathescape]{c}
{
    struct smbd_data_transfer *data_transfer;
    struct smbd_response *response = container_of(
        // ...
}
\end{minted}

% TODO: lead to cache coherence mechanism in Linux kernel
\chapter{Software Coherency Latency}

% Experiment: ...
% Discussion: (1) Linux and DMA and RDMA (2) replacement and other ideas...

% (I need to read more into this. Most of the contribution comes from CPU caches,
% less so for DSM systems.) \textbf{[Talk about JIAJIA and Treadmark's coherence
% protocol.]}

% Consistency and communication protocols naturally affect the cost for each faulted
% memory access \dots

% \textbf{[Talk about directory, transactional, scope, and library cache coherence,
% which allow for multi-casted communications at page fault but all with different
% levels of book-keeping.]}
\chapter{DSM System Design}

% \bibliographystyle{plain}
% \bibliographystyle{plainnat}
% \bibliography{mybibfile}
\printbibliography

% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix

\chapter{First appendix}

\section{First section}

Any appendices, including any required ethics information, should be included
after the references.

Markers do not have to consider appendices. Make sure that your contributions
are made clear in the main body of the dissertation (within the page limit).

% \chapter{Participants' information sheet}

% If you had human participants, include key information that they were given in
% an appendix, and point to it from the ethics declaration.

% \chapter{Participants' consent form}

% If you had human participants, include information about how consent was
% gathered in an appendix, and point to it from the ethics declaration.
% This information is often a copy of a consent form.

\end{document}
|
||||
author={Shan, Yizhou and Tsai, Shin-Yeh and Zhang, Yiying},
|
||||
booktitle={Proceedings of the 2017 Symposium on Cloud Computing},
|
||||
pages={323--337},
|
||||
year={2017}
|
||||
}
|
||||
|
||||
@inproceedings{EndoWataru2020MADD,
|
||||
abstract = {The spread of RDMA-capable interconnects on supercomputers has enabled the middleware developers to explore new design options for runtime systems based on efficient communications. Observing low-latency networks and shared-memory infrastructure for multi-core processors, we have focused on extending shared-memory abstraction into multiple nodes exploiting RDMA, i.e., Distributed Shared Memory (DSM). We have found that the traditional protocols of DSM designed for two-sided communications cannot fully exploit the performance of RDMA, which necessitates decentralization and coarse-grained communications. To solve this problem, we introduced two methods for the DSM coherence protocol to exploit RDMA and implemented a DSM library MENPS using this protocol. Our evaluation shows that MENPS could accelerate two of five shared-memory applications with minimal modifications and beat an existing RDMA-based DSM runtime.},
|
||||
author = {Endo, Wataru and Sato, Shigeyuki and Taura, Kenjiro},
|
||||
address = {LOS ALAMITOS},
|
||||
booktitle = {2020 IEEE/ACM Fourth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM)},
|
||||
isbn = {1665422769},
|
||||
keywords = {cache coherence protocol ; coarse-grained communications ; Coherence ; Computer Science ; Computer Science, Hardware & Architecture ; Computer Science, Software Engineering ; Computer Science, Theory & Methods ; decentralized distributed shared memory ; design options ; distributed shared memory ; distributed shared memory systems ; DSM coherence protocol ; DSM library MENPS ; efficient communications ; existing RDMA-based DSM runtime ; home migration ; Libraries ; Merging ; message passing ; middleware ; middleware developers ; multicore processors ; multiple nodes ; Program processors ; protocols ; RDMA ; RDMA-capable interconnects ; Runtime ; runtime systems ; Science & Technology ; shared memory systems ; shared-memory abstraction ; shared-memory applications ; shared-memory infrastructure ; Synchronization ; Technology ; timestamp based coherence ; traditional protocols ; two-sided communications},
|
||||
language = {eng},
|
||||
organization = {IEEE Comp Soc},
|
||||
pages = {9-16},
|
||||
publisher = {IEEE},
|
||||
title = {MENPS: A Decentralized Distributed Shared Memory Exploiting RDMA},
|
||||
year = {2020},
|
||||
}
|
|
Binary file not shown.
Binary file not shown.
|
|
@ -1,423 +0,0 @@
|
|||
% Yeah "slices" whatever lol
\documentclass{beamer}
\usepackage[style=authortitle-comp]{biblatex}
\usepackage[export]{adjustbox}

\title{Progress Report: Page Cache Consistency Model}
\author{Zhengyi Chen}
\date{\today}

\addbibresource{../main.bib}

\begin{document}
% Title page
\frame{\titlepage}

% Page -2
\begin{frame}
\frametitle{
Literature Review: (Shan, Tsai, \& Zhang. 2017\footcite{shan2017distributed})
}
\begin{itemize}
\item {
Concerns with the sharing of persistent memory --
\begin{itemize}
\item More or less similar to sharing regular memory, but\dots
\item Data replication is key $\Rightarrow$ multiple data providers.
\end{itemize}
}
\item {
Supports both Multi-Reader Multi-Writer (MRMW) and Multi-Reader Single-Writer (MRSW) protocols
\begin{itemize}
\item MRMW ``support(s) great parallelism''
\item MRSW enables ``stronger consistency''
\end{itemize}
}
\item {
Makes a distinction between 3 variants of nodes:
\begin{itemize}
\item Commit Node -- node that wishes to commit changes wrt. the system.
\item Owner Node -- node(s) that act as data providers for the latest page content.
\item Manager Node -- node that provides (serialized) write access control to a page.
\end{itemize}
}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{
Literature Review: (Shan, Tsai, \& Zhang. 2017\footcite{shan2017distributed})
}
\begin{itemize}
\item {
For data replication and fault tolerance, necessitates:
\begin{enumerate}
\item Commit status logging (akin to a journaled file system)
\item Persistent Commit ID
\item \textbf{Required} deg. of replication -- each ON shares to $N$ nodes.
\end{enumerate}
}
\item {
Fault tolerance is out of this thesis's scope. However\dots
\begin{itemize}
\item Prob. no need for requiring any degree of data replication.
\item Dropping the data replication req. $\Rightarrow$ no need for replication comms.
\item Commit status logging \& persistent CID can be helpful \& should not introduce additional comms.
\end{itemize}
}
\item {
MRSW provides ``simpler and more efficient'' commits than MRMW -- no concurrent
commits to the same shared memory object exist.
\begin{itemize}
\item Also makes more sense from a CPU-accelerator dichotomy outlook (ofc. wrt. this thesis's system).
\end{itemize}
}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{MRSW: (Shan, Tsai, \& Zhang. 2017\footcite{shan2017distributed})}
\begin{figure}
\includegraphics[width=\linewidth]{w12_slides_resources/dspm.fig8.png}
\end{figure}
Note: CN: Node 1, MN: Node 2, ON: Nodes 2 \& 3. Node 4 may or may not already
share the committed page prior to acquire.
\end{frame}

% Page 0
\begin{frame}
\frametitle{Literature Review: (Ramesh. 2023)}
\begin{itemize}
\item Popcorn-derived.
\item {
Sequential consistency, MRSW protocol offloaded onto the sNIC:
\begin{itemize}
\item DSM protocol processor implemented on the sNIC FPGA core.
\item The sNIC \textbf{keeps track of memory ownership, status, R/W permissions} at page-level granularity.
\item Removes the need for distinct memory management nodes.
\item (i.e., the sNIC IS the memory management node -- except, of course, allocation).
\end{itemize}
}
\item {
A similar idea occurred in \textit{Concordia}\footcite{wang2021concordia}:
\begin{itemize}
\item Concurrency control and multicast offloaded to the network switch.
\item Authors claim this is more scalable (?)
\end{itemize}
}
\end{itemize}
\footnote{
Ramesh, ``SNIC-DSM: SmartNIC based DSM Infrastructure for Heterogeneous-ISA Machines''
}
\end{frame}

\begin{frame}
\frametitle{Literature Review: (Endo, Sato, \& Taura. 2020)\footcite{EndoWataru2020MADD}}
\begin{itemize}
\item Eager Release Consistency.
\item Prob. using the MSI coherence protocol? The authors did not mention it.
\item MRMW: uses timestamps to store reader ``intervals''.
\item {
Introduces the home-migration concept:
\begin{itemize}
\item At commit, make the CN the home node instead of invalidating the home node.
\item This removes the communications needed for diff-merging at the home node -- this can be done locally.
\item No support for multiple home nodes.
\end{itemize}
}
\item {
No performance improvement over the message-passing framework (OpenMPI).
}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{Literature Review: (Endo, Sato, \& Taura. 2020)\footcite{EndoWataru2020MADD}}
\begin{figure}
\includegraphics[width=\linewidth]{w12_slides_resources/menps.fig5.png}
\end{figure}
\end{frame}

% Page 1
\begin{frame}
\frametitle{The System}
\begin{itemize}
\item Remote node(s) abstracted as a shared memory device ``\texttt{/dev/rshm}''
\item {
Heterogeneous Memory Management (HMM) ensures a unified address space between
local and device memory.
}
\item {
Migration of pages between CPU and ``device'' is transparent to userspace
-- no need for copying/mapping.
}
\item {
In reality, ``\texttt{/dev/rshm}'' is a handler for RDMA access between nodes.
\begin{itemize}
\item This involves remote read/write and moving page content between nodes.
\item The local node serves as \emph{home node \& address space host} at share time.
\item Remote nodes are attached on \texttt{/dev/rshm} as accelerators.
\end{itemize}
}
\end{itemize}
\end{frame}

% Page 2
\begin{frame}
\frametitle{The Problem: Consistency Protocol}
\begin{itemize}
\item Single-Writer, Multiple-Reader Protocol
% Why?
% It may be that this mimics all sorts of logic for hardware acceleration
% -- that is, in an HMM node each PCIe device has sole access to a page of memory.
% For example, during machine learning you naturally can't access the same, say,
% kernel by both CPU and GPU.
% That said, I never shed a doubt on this issue except my advisor telling me not
% to worry about it -- if I was asked this problem for some reason I'd be cooked!
\item Needs to be performant\dots with some ergonomics
\item {
Two Hypothetical Protocols:
\begin{itemize}
\item ``RwLock'' Consistency Protocol
\item Acq-Rel Consistency Protocol
\end{itemize}
}
\item {
The former ensures \emph{strong} single-writer consistency
\begin{itemize}
\item -- Also easier to program with!
\end{itemize}
}
\item The latter allows concurrent in-memory \emph{non-committal} computation
\end{itemize}
\end{frame}

% Page 3
\begin{frame}
\frametitle{``RwLock'' Consistency Protocol}
Similar to a read-write lock, where:
\begin{itemize}
\item Multiple readers can exist for a clean page -- the page is \textbf{shared}.
\item Only one writer is allowed for a clean page -- the page becomes \textbf{exclusive}.
\item {
For one writer node to be allowed sole write access to some page, all other
readers need to have their page cache invalidated.
}
\item {
While the sole writer node has not yet committed, no other reader or writer nodes
are allowed to be served this page.
}
\item {
When the sole writer commits, it becomes the new home node, which serves the
updated page content.
}
\item {
Invalidated readers must fetch from the MN for read access, which maintains RAW ordering.
}
\end{itemize}
\end{frame}
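% As a minimal illustration of the MN-side bookkeeping described in the frame
% above, the hedged C sketch below models one page's metadata as a
% reader/writer state machine. All names (page_meta, grant_write, etc.) are
% hypothetical and not part of any existing implementation; the invalidation
% message itself is abstracted to a stub.
%
%   #include <stdbool.h>
%   #include <stdint.h>
%
%   enum page_state { P_SHARED, P_EXCLUSIVE };
%
%   struct page_meta {
%       enum page_state state;
%       uint64_t readers;   /* bitmap of nodes holding read-only copies */
%       int writer;         /* node id of the sole writer, -1 if none */
%   };
%
%   /* hypothetical: send an invalidation message to one reader node */
%   static void invalidate_reader(int node) { /* RDMA send elided */ }
%
%   /* Grant sole write access: invalidate every reader first. */
%   static bool grant_write(struct page_meta *pg, int node)
%   {
%       if (pg->state == P_EXCLUSIVE)
%           return false;               /* single writer only */
%       for (int n = 0; n < 64; n++)
%           if (pg->readers & (1ULL << n))
%               invalidate_reader(n);
%       pg->readers = 0;
%       pg->state = P_EXCLUSIVE;
%       pg->writer = node;
%       return true;
%   }
%
%   /* Reads are served only while no writer holds the page. */
%   static bool grant_read(struct page_meta *pg, int node)
%   {
%       if (pg->state == P_EXCLUSIVE)
%           return false;               /* held until the writer commits */
%       pg->readers |= 1ULL << node;
%       return true;
%   }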
|
||||
% Page 4
|
||||
\begin{frame}
|
||||
\frametitle{``RwLock'' Consistency Protocol}
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{
|
||||
w12_slides_resources/Fig-RwlockProtocol 2023-12-06 19_05_06.pdf
|
||||
}
|
||||
\end{figure}
|
||||
\end{frame}
|
||||
|
||||
% Page 5
|
||||
\begin{frame}
|
||||
\frametitle{Acq-Rel Consistency Protocol}
|
||||
In RwLock's case, read requests result in installation of read-only pages at
|
||||
remote nodes.
|
||||
|
||||
Alternatively, this protocol allows read/write pages to be installed at remote
|
||||
nodes at read time. Such writes are \emph{non-committal} and cannot be synced
|
||||
with the entire system.
|
||||
|
||||
To summarize:
|
||||
\begin{itemize}
|
||||
\item {
|
||||
``Readers'' can write to its locally installed page without any means
|
||||
to synchronize the change.
|
||||
}
|
||||
\item {
|
||||
``Writers'' need to acquire global write access from the \emph{PT node},
|
||||
which invalidates all shared pages.
|
||||
}
|
||||
\item {
|
||||
i.e., Instead of write-invalidate, perform acquire-invalidate.
|
||||
}
|
||||
\end{itemize}
|
||||
|
||||
This may require pages to be marked as CoW if the sharer wants also to act as a home node.
|
||||
\end{frame}
|
||||
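% A hedged sketch of the node-side discipline under this Acq-Rel protocol:
% local (non-committal) writes need no messages, while a committing writer
% first acquires global write access. Every name here (rshm_acquire,
% rshm_release, the PT-node messaging) is hypothetical, for illustration only.
%
%   /* hypothetical node-side API, assuming a mapped shared page */
%   void rshm_acquire(void *page);   /* ask PT node; invalidates sharers */
%   void rshm_release(void *page);   /* publish the page; become home */
%
%   void update(volatile long *p)
%   {
%       *p += 1;                  /* non-committal: local copy only     */
%       rshm_acquire((void *)p);  /* acquire-invalidate via the PT node */
%       *p += 1;                  /* committal write                    */
%       rshm_release((void *)p);  /* commit; this node becomes home     */
%   }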
% Page 6
\begin{frame}
\frametitle{Consistency Protocol: Knobs and Mods}
We can modify these two protocols further as follows:
\begin{itemize}
\item {
Multi-home Protocol: instead of having one home at a time, have
multiple homes (e.g., when a writer commits) to prevent a network bottleneck.
\begin{itemize}
\item Home nodes can be dynamically assigned.
\item Extra metadata can limit scalability.
\end{itemize}
}
\item {
Auto-share: automatically share pages at commit time using 1-way
communications.
\begin{itemize}
\item Potential for communication reduction -- debatable.
\end{itemize}
}
\item {
Request aggregation: aggregate RDMA requests for optimal RDMA transfer performance.
\begin{itemize}
\item Needs to be coherent with program sequence!
\end{itemize}
}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{Why this design?}
\begin{itemize}
\item Largely inspired by DSPM\footcite{shan2017distributed}.
\item Removed arrows for enforced data duplication -- duplication is solely on-demand.
\item {
Introduces the transitional state ``T'':
\begin{itemize}
\item Used to flag a page as unserviceable -- visible only at the MN.
\item All read/write access to a T-page is kept on hold until the MN receives the commit msg.
\item After commit, the MN forwards queued R/W accesses to the moved home.
\item This (at least) maintains RAW and WAW data dependences under whichever interleaving arises.
\item Removing T allows stale data to be served -- violates RAW for better throughput.
\end{itemize}
}
\item Extensible (as mentioned on the prior page).
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{Why not this design?}
At the very least\dots
\begin{itemize}
\item {
De-coupled home and access-management nodes require:
\begin{itemize}
\item Each home node needs to be MN-aware (easy).
\item {
The MN needs to be home-aware (also easy with single-writer, but space complexity is a concern):
\begin{itemize}
\item A naive directory scheme is not scalable.
\item A coarse directory scheme (e.g., SGI Origin 2000) is wasteful (but may be the fastest in practice).
\item A distributed directory scheme may incur terrible latency.
\item More sophisticated schemes are possible but need work \& experimentation.
\end{itemize}
% (see the directory sketch in a comment after this frame)
}
\end{itemize}
}
\item {
Strict consistency limits throughput.
}
\end{itemize}
\end{frame}

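% The directory-size tradeoff above, sketched concretely with hedged,
% hypothetical sizes: a full-map directory keeps one presence bit per node per
% page, while a coarse vector keeps one bit per *group* of nodes, trading
% precision (extra invalidations) for space -- the Origin-2000-style scheme
% mentioned above.
%
%   #include <stdint.h>
%
%   #define NODES      64
%   #define GROUP_SIZE 8                  /* nodes per coarse bit */
%
%   struct dir_full   { uint64_t present; };  /* 1 bit / node  */
%   struct dir_coarse { uint8_t  present; };  /* 1 bit / group */
%
%   /* Coarse lookup over-approximates: a set bit means "some node in this
%    * group may hold the page", so all GROUP_SIZE members get invalidated. */
%   static int coarse_may_hold(const struct dir_coarse *d, int node)
%   {
%       return (d->present >> (node / GROUP_SIZE)) & 1;
%   }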
% Page 7
\begin{frame}
\frametitle{What about the Consistency \textbf{Model}?}
\begin{itemize}
\item {
The weaker a consistency model is, the more difficult it is to program with.
\begin{itemize}
\item {
Weak-ordering architectures (e.g., ARMv8) more or less depend on the
compiler/interpreter to emit barriers as it sees fit\footcite{Haynes_2022}.
}
\item {
Bad for usability/portability -- programs may need
to be compiled using a modified toolchain, or else add these
synchronization instructions/function calls everywhere.
}
\end{itemize}
}
\item {
\footcite{cai2018efficient} uses Partial Store Order.
\begin{itemize}
\item Preserves RAR, WAR -- ``synchronous read\dots asynchronous write''
\item Easier to use than relaxed ordering.
\end{itemize}
}
\item {
\footcite{wang2021concordia} uses strong consistency, but warns about its scalability.
}
\end{itemize}
\end{frame}

% Page 8
\begin{frame}
\frametitle{Consistency Model: Cont.}
\begin{itemize}
\item {
Similar to Concordia\footcite{wang2021concordia}, the proposed protocols also assume
strong consistency.
}
\item {
Further work is needed to see how to adapt these protocols to weaker consistency models.
\begin{itemize}
\item Low-hanging fruit: TSO
\item Allowing read requests to be served for T-pages @ MN: W$\rightarrow$R violation.
\item {
Allowing read requests to be served via non-MN homes: also a W$\rightarrow$R violation
(exploits a race condition between the write msg and the invalidation msg).
}
\item Request workers work on one request at a time: no R$\rightarrow$W violation.
\item W$\rightarrow$W violations simply cannot happen -- they always serialize @ the MN.
\end{itemize}
}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{Summary}
\begin{itemize}
\item {
Based on the MSI coherence protocol, with a possible T-state extension.
\begin{itemize}
\item The T-state can instead be implemented as an additional flag parallel to the MSI FSM.
\item T-pages cannot be serviced by the MN -- all read/write requests are blocked.
\end{itemize}
}
\item {
One consistency model (for now): sequential consistency.
\begin{itemize}
\item Maintains RAW via the T-state @ MN -- removing blocking on T-pages results in TSO.
% Reads issued before Write -- read requests received before write.
% Because RDMA QPs are FIFO, either read issued before or after write.
% Assuming one worker thread works on requests sequentially, naturally WAR is preserved.
% RAW is preserved because writes cannot be finished until the commit message is received.
% During which, T-state pages are blocked from being serviced.
% This does introduce a semaphore-like situation, however...
\item Maintains WAR via the sequentially worked RDMA RQ.
\item Maintains WAW via single-writer.
\end{itemize}
}
\item {
Two consistency protocols:
\begin{itemize}
\item The RwLock consistency protocol only allows read-only sharing.
\item {
The Acq-Rel consistency protocol differentiates non-committal writes,
and allows proc-local writable sharing.
}
\end{itemize}
}
\end{itemize}
\end{frame}

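% A compact, hedged sketch of the MSI FSM with the T flag kept parallel to it,
% as summarized above. Names are hypothetical; the point is that T is not a
% fourth MSI state but an orthogonal "unserviceable" bit checked by the MN.
%
%   enum msi_state { MSI_MODIFIED, MSI_SHARED, MSI_INVALID };
%
%   struct mn_page {
%       enum msi_state msi;
%       _Bool in_transit;          /* the "T" flag: commit pending */
%   };
%
%   /* MN request loop: T-pages block all service until the commit arrives. */
%   static _Bool mn_can_service(const struct mn_page *pg)
%   {
%       return !pg->in_transit;    /* dropping this check => TSO-like behavior */
%   }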
\end{document}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
|
|
@ -1,638 +0,0 @@
|
|||
\documentclass{beamer}
\usepackage[]{biblatex}
\usepackage[export]{adjustbox}
\usepackage{hyperref}

\title{
Cache Coherency \& Memory Model in RDMA-Backed Software-Coherent DSM
}
\author{Zhengyi Chen}
\date{\today}

\addbibresource{../main.bib}

\begin{document}
% Title Page
\frame{\titlepage}

% Table of Content
\begin{frame}
\frametitle{Table of Contents}
\tableofcontents
\end{frame}

% Part 1: Overview
% =============================================================================
\section{1. Overview}
% Page 1
\begin{frame}
\frametitle{1. Overview}
\begin{itemize}
\item {
DSM used to be constrained by NIC bandwidth \& transfer rate (e.g.,
during the 1990s).
}
\item {
The advent of high(er) transfer rate NICs allows the DSM idea to be
revived.
}
\item {
Orthogonally, hardware acceleration resources are scarce and highly
valuable.
\begin{itemize}
\item {
Traditional scheduling mechanisms within a cluster cannot
dynamically allocate hardware accelerators without high
overhead.
}
\end{itemize}
}
\item {
Ideally, via high-speed NICs, a hardware accelerator could be
statically allocated such that:
\begin{itemize}
\item {
Every node has access to the hardware accelerator node in a
time-shared fashion.
}
\item {
The accelerator-attached node can access remote memory much like
attaching an accelerator over, say, PCIe.
}
\end{itemize}
}
\end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Heterogeneous Memory Management}
    \begin{itemize}
        \item {
            \textbf{HMM} facilitates shared address space and transparent data
            migration between CPU and peripherals. Specifically:
            \begin{itemize}
                \item {
                    HMM provides an interface for duplicating the CPU page table
                    into the device's, with the two kept transparently
                    synchronized.
                }
                \item {
                    It also provides corresponding \texttt{struct page}
                    representations of device memory pages, which are faulted
                    between the CPU and device.
                }
            \end{itemize}
        }
        \item {
            Theoretically, this should allow devices in remote nodes to
            perform HMM using the DMA-capable NIC as a ``proxy HMM device''.
        }
        \item {
            Details of implementing DSM-over-HMM are beyond this thesis's
            scope.
            \begin{itemize}
                \item {
                    This thesis focuses on studying and implementing cache
                    coherency and, later, the memory model for the DSM subsystem
                    of this wider, ongoing project.
                }
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Cache Coherency, and Why It Matters Here}
    \begin{itemize}
        \item {
            Cache-incoherent RDMA (e.g., mlx) performs DMA without
            synchronizing with the CPU cache.
        }
        \item {
            We cannot assume the MMU magically maintains coherence.
            \begin{itemize}
                \item {
                    This seems to be the case for x86\_64 (cache-coherent DMA),
                    but not ARM64.
                }
            \end{itemize}
        }
        \item {
            At transfer time:
            \begin{itemize}
                \item {
                    Send to remote: flush the cache into memory before posting
                    the send message.
                }
                \item {
                    Receive from remote: invalidate cache entries after the
                    recv message has been worked.
                }
            \end{itemize}
        }
        \item {
            Example: Linux kernel tree, \textit{smbdirect} implementation.
            \begin{itemize}
                \item {
                    \textit{smbdirect} opportunistically establishes SMB over an
                    RDMA-capable network.
                }
                \item {
                    \texttt{smbd\_post\_send} cleans cache entries prior to
                    posting a send request.
                }
                \item {
                    \texttt{recv\_done} invalidates cache entries after exiting
                    the softirq for a recv request (as a callback from the RDMA
                    driver).
                }
            \end{itemize}
        }
    \end{itemize}
    A sketch of this clean/invalidate pattern follows on the next slide.
\end{frame}
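
\begin{frame}[fragile]
    \frametitle{Sketch: Coherency Maintenance Around RDMA Transfers}
    A minimal sketch of the clean-before-send / invalidate-after-recv pattern
    above, using the ARM64 maintenance routines discussed later.
    \texttt{dsm\_post\_send}, \texttt{dsm\_recv\_done}, and
    \texttt{post\_send\_wr} are hypothetical stand-ins, not the module's actual
    API.
    \begin{verbatim}
/* Before posting a send WR: write dirty cache lines
 * back so the NIC's DMA reads fresh data. */
static int dsm_post_send(void *buf, size_t len)
{
        dcache_clean_poc((unsigned long)buf,
                         (unsigned long)buf + len);
        return post_send_wr(buf, len); /* hypothetical */
}

/* After a recv completion: drop stale cache lines so
 * subsequent CPU reads see the DMA-written memory. */
static void dsm_recv_done(void *buf, size_t len)
{
        dcache_inval_poc((unsigned long)buf,
                         (unsigned long)buf + len);
}
    \end{verbatim}
\end{frame}
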
\begin{frame}
    \frametitle{Consistency Model and Protocol}
    \begin{itemize}
        \item {
            The majority of the DSM literature applies \textbf{release
            consistency} as the system's memory model.
        }
        \item {
            With a \textbf{single-writer} protocol, however, the memory model
            can be strengthened with little increase in code complexity.
            \begin{itemize}
                \item {
                    \textit{DSPM}\cite{shan2017distributed}, for example,
                    achieves a \textit{de facto} TSO consistency from its
                    multi-writer release-consistency counterpart -- assuming
                    correct memory barriers within each node's CPU, distributed
                    writes are never reordered, and distributed reads can
                    overtake writes.
                }
                \item {
                    Consequently, one can easily achieve sequential consistency
                    by designating the entire write-access duration as a
                    critical section.
                }
            \end{itemize}
        }
        \item {
            HMM's ``CPU-or-device'' data migration model also strongly implies
            a single-writer consistency protocol.
        }
    \end{itemize}
\end{frame}

% Part 2: Design
% =============================================================================
\section{2. Design}

\begin{frame}
    \frametitle{2. Design}
    \begin{itemize}
        \item {
            Designing a DSM necessitates designing:
            \begin{itemize}
                \item Consistency Model.
                \item Coherence Protocol and State Machine.
                \item Access Control.
            \end{itemize}
        }
        \item {
            Care needs to be taken to ensure that the in-kernel implementation
            is:
            \begin{itemize}
                \item Correct,
                \item Performant,
                \item Able to exploit RDMA's traits.
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Protocol Overview}
    \begin{itemize}
        \item {
            Multiple readers can exist for a clean page -- the page is
            \textbf{shared}.
        }
        \item {
            Only one writer is allowed for a clean page -- the page becomes
            \textbf{exclusive}.
        }
        \item {
            For a writer node to be allowed sole write access to some page, all
            other sharers need to have their page cache invalidated prior to
            making the change global (commit-invalidate).
        }
        \item {
            While the sole writer node has not yet committed, either:
            \begin{itemize}
                \item {
                    no other reader or writer nodes are allowed to be served
                    this page (stronger consistency model), or
                }
                \item {
                    no writers are allowed to be served this page. Readers can
                    be served stale data (provided the data providers do not
                    receive the invalidation message prior to service).
                }
            \end{itemize}
        }
        \item {
            When the sole writer commits, it becomes the sole home node
            (data provider), which serves the updated page content.
            \begin{itemize}
                \item {
                    Optionally, some nodes can register to have commits written
                    back instead.
                }
            \end{itemize}
        }
    \end{itemize}
    A minimal state-machine sketch of this protocol follows on the next slide.
\end{frame}
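
\begin{frame}[fragile]
    \frametitle{Sketch: Per-Page Coherence FSM (MSI + T)}
    A minimal sketch of the manager-side state machine implied above: MSI
    states plus a transitional \textit{T} state during which the page is not
    serviced (the stronger variant). State and request names are illustrative,
    not the finalized protocol.
    \begin{verbatim}
/* Illustrative manager-side FSM for one page. */
enum page_state { PS_I, PS_S, PS_M, PS_T };
enum request    { REQ_READ, REQ_WRITE, REQ_COMMIT };

/* Returns 0 on success, -1 if the request must block. */
static int mn_handle(enum page_state *st, enum request req)
{
        switch (*st) {
        case PS_T: /* uncommitted write in flight */
                if (req == REQ_COMMIT) { *st = PS_M; return 0; }
                return -1; /* block reads/writes until commit */
        case PS_I:
        case PS_S:
        case PS_M:
                if (req == REQ_READ)  { *st = PS_S; return 0; }
                if (req == REQ_WRITE) { *st = PS_T; return 0; }
                /* (entering PS_T implies invalidating sharers) */
                return -1;
        }
        return -1;
}
    \end{verbatim}
\end{frame}
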
\begin{frame}
    \frametitle{Protocol Excerpt: Write-Invalidate}
    \begin{figure}
        \centering
        \includegraphics[width=\linewidth]{
            w12_slides_resources/Fig-RwlockProtocol 2023-12-06 19_05_06.pdf
        }
    \end{figure}
    The \textit{T}-state indicates a transitional state for some shared page.
\end{frame}

\begin{frame}
    \frametitle{Consistency Model: TSO}
    \begin{itemize}
        \item {
            Total Store Ordering allows reads to overtake stores.
        }
        \item {
            Assuming correct use of node-local synchronization on all nodes,
            applying TSO in a home-based DSM allows for:
            \begin{itemize}
                \item {
                    Another node trying to read a T-page from the access-control
                    node may be served stale data: a W$\rightarrow$R violation.
                }
                \item {
                    Another node trying to read an S-page from data-provider
                    nodes may be served stale data: a W$\rightarrow$R violation
                    (if, e.g., the invalidation message from the access-control
                    node is received afterwards).
                }
                \item {
                    Data-provider and access-control nodes work on one request
                    at a time: no R$\rightarrow$W violation.
                }
                \item {
                    Write accesses are serialized at the access-control node: no
                    W$\rightarrow$W violation.
                }
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Consistency Model: Strengthen to Sequential}
    \begin{itemize}
        \item {
            By corollary, one can reverse the previous slide's statements to
            strengthen the model to sequential consistency:
            \begin{itemize}
                \item {
                    Disallow T-pages from being serviced until the new page
                    content is installed: lengthens the critical section.
                }
                \item {
                    Abolish data-provider nodes: access-control nodes become the
                    bottleneck.
                }
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Coherence Protocol: Possible Features}
    \begin{itemize}
        \item {
            Multi-data-provider protocol: instead of having one data provider,
            have multiple data-provider nodes that are automatically written
            back to, preventing a network bottleneck.
            \begin{itemize}
                \item Data-provider nodes may be dynamically assigned.
                \item Extra metadata can limit scalability.
            \end{itemize}
        }
        \item {
            Auto-share: likewise, write back pages to non-data-provider nodes,
            taking advantage of the one-sided communications provided by RDMA.
        }
        \item {
            Request aggregation: aggregate RDMA transfers for optimal transfer
            performance.
            \begin{itemize}
                \item Must remain coherent with program order!
                \item Enables write-request merging.
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Stateful Nodes \& Transitions (Provisional)}
    \begin{itemize}
        \item {
            Nodes (e.g., within the cluster) become tightly bound to the
            properties of each shared page.
        }
    \end{itemize}
    \begin{figure}
        \centering
        \includegraphics[width=\linewidth]{
            w15_resources/截屏 2024-01-30 19.15.45 2024-01-30 19_16_19.png
        }
    \end{figure}
\end{frame}

\begin{frame}
    \frametitle{Stateful Nodes \& Transitions (Provisional) (Cont.)}
    \begin{itemize}
        \item {
            MN (Manager Nodes): provide access control and (fallback)
            data provision.
        }
        \item {
            HN (Home Nodes): provide data provision. Can be write-back or
            write-invalidate.
        }
        \item {
            SN (Sharer Nodes): share data within a read-only ``epoch''. Can be
            write-back or write-invalidate.
        }
        \item {
            NSN (Non-sharer Nodes): nodes in the network that do not share the
            particular page(s).
        }
        \item {
            CN (Commit Node): the node that has acquired single-writer access
            to the shared page.
        }
        \item {
            Problem: message variants are not finalized:
            \begin{itemize}
                \item {
                    Goal: composable message chains that allow for
                    ``piggy-backing'' of multiple procedures.
                }
            \end{itemize}
        }
    \end{itemize}
    A sketch of the per-page metadata these roles imply follows on the next
    slide.
\end{frame}
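
\begin{frame}[fragile]
    \frametitle{Sketch: Per-Page Sharing Metadata}
    One possible shape of the per-page metadata implied by the node roles
    above; every field name and width here is an illustrative guess, not the
    module's actual layout.
    \begin{verbatim}
#include <stdint.h>

enum node_role {
        ROLE_MN,  /* manager: access control + fallback data */
        ROLE_HN,  /* home: data provision                    */
        ROLE_SN,  /* sharer within a read-only epoch         */
        ROLE_NSN, /* does not share this page                */
        ROLE_CN,  /* current single writer                   */
};

struct dsm_page_meta {
        uint64_t sharers;   /* bitmap of SN node ids (<= 64) */
        uint16_t home;      /* HN node id                    */
        int16_t  committer; /* CN node id, -1 if none        */
        uint8_t  writeback; /* commit-writeback vs.
                               commit-invalidate             */
};
    \end{verbatim}
\end{frame}
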
\begin{frame}
    \frametitle{Stateful Nodes: Transition Paths}
    \begin{itemize}
        \item {
            Solid-line transitions indicate that the local node requests the
            remote node to perform a state transition.
        }
        \item {
            Dashed-line transitions indicate that the local node transitions
            implicitly before sending a request to the remote node.
        }
        \item {
            The \textit{non-committal} path concerns read-only and
            copy-on-write sharing. Sharers cannot make global modifications to
            cached local data.
        }
        \item {
            The \textit{invalidation} path pairs with commit operations (due to
            write-invalidation).
        }
        \item {
            The \textit{committal} path concerns global write sharing. Only
            one writer is allowed to write and commit at any one time.
        }
        \item {
            Problem: how exactly to integrate RDMA remote read/write into this?
        }
    \end{itemize}
\end{frame}

% Part 3: Progress
% =============================================================================
\section{3. Progress}

\begin{frame}
    \frametitle{3. Progress}
    \begin{itemize}
        \item {
            Goal: in-kernel implementation of software cache coherency via
            non-coherent RDMA hardware.
        }
        \item {
            Optimistic goal: in-kernel implementation of a memory model in the
            DSM.
        }
        \item {
            Progress: studied and isolated the mechanism for data cache
            invalidation/flushing on ARM64, which allows the DSM to run in
            heterogeneous-ISA clusters.
        }
        \item {
            Integration with the kernel \& the main DSM kernel module remains
            at hand: is it absolutely necessary to export new symbols for such
            an important operation?
        }
        \item {
            Repository: \url{https://github.com/rubberhead/unnamed_ba_thesis.git}.
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{On-demand Coherency in ARM64}
    \begin{itemize}
        \item {
            ARMv8 defines two levels of cache coherence:
            \begin{itemize}
                \item {
                    \textit{Point-of-Unification}: within a core, the
                    instruction cache, data cache, and TLB all agree on the
                    copy seen for a particular address.
                    \begin{itemize}
                        \item Notably, changing a PTE requires PoU.
                    \end{itemize}
                }
                \item {
                    \textit{Point-of-Coherency}: all DMA-capable agents (CPU or
                    otherwise) agree on the copy seen for a particular address.
                }
            \end{itemize}
            For this thesis's purposes, strive for PoC.
        }
        \item {
            Operations to achieve the latter are encapsulated in the Linux
            kernel as \texttt{(d|i)cache\_(clean|inval)\_poc}.
            \begin{itemize}
                \item Declared under \texttt{arch/arm64/include/asm/cacheflush.h}.
                \item Defined in \texttt{arch/arm64/mm/cache.S}.
                \item {
                    Take virtual addresses wrt.\ the \textit{current} address
                    space to write back/invalidate cache entries.
                }
                \item {
                    Problem: can they only be called in process context (for
                    user virtual addresses), or in all contexts (for kernel
                    virtual addresses)?
                }
            \end{itemize}
        }
    \end{itemize}
\end{frame}

\begin{frame}
    \frametitle{Kernel Patch for On-demand Coherency}
    \begin{itemize}
        \item {
            Problem: these symbols are not exported -- they are not intended
            for driver use.
        }
        \item {
            Temporary solution: re-export them by patching the kernel.
            \begin{itemize}
                \item Note: kernel version v6.7.0.
                \item {
                    Longish-term solution: arrange the kernel module code in a
                    way that takes advantage of existing driver APIs
                    (e.g., the DMA API, which \textit{smbdirect} uses).
                }
            \end{itemize}
        }
        \item {
            Implements a wrapper function \texttt{\_\_dcache\_clean\_poc} to
            re-export \texttt{dcache\_clean\_poc} into the driver namespace
            (patch sketch on the next slide).
        }
        \item {
            Exports the symbol via a separate header file.
            \begin{itemize}
                \item {
                    Declared in
                    \texttt{arch/arm64/include/asm/cacheflush\_extra.h}.
                }
                \item Defined in \texttt{arch/arm64/mm/flush.c}.
            \end{itemize}
        }
    \end{itemize}
\end{frame}
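
\begin{frame}[fragile]
    \frametitle{Patch Sketch: Re-exporting \texttt{dcache\_clean\_poc}}
    A minimal sketch of what such a patch could look like, assuming the
    wrapper simply forwards to the unexported routine; the exact placement and
    the \texttt{\_GPL} choice are illustrative, not the patch as written.
    \begin{verbatim}
/* arch/arm64/mm/flush.c (patched) */
#include <asm/cacheflush.h>
#include <linux/export.h>

/* Forward to the unexported assembly routine so that
 * modules can request a clean to the PoC. */
void __dcache_clean_poc(unsigned long start, unsigned long end)
{
        dcache_clean_poc(start, end);
}
EXPORT_SYMBOL_GPL(__dcache_clean_poc);
    \end{verbatim}
\end{frame}
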
\begin{frame}
    \frametitle{Proof-of-Concept Kernel Module}
    \begin{itemize}
        \item {
            Dynamically allocates \texttt{GFP\_USER} pages and remaps them to
            userspace on \texttt{mmap}.
            \begin{itemize}
                \item {
                    \texttt{GFP\_USER} so that (for convenience) pages are
                    directly addressable in kernelspace (via the kernel page
                    table).
                }
                \item {
                    Pages are lazily allocated and shared between multiple
                    processes (i.e., user address spaces).
                }
                \item {
                    Exposed as character device \texttt{/dev/my\_shmem}.
                }
            \end{itemize}
        }
        \item Around 300+ LoC.
        \item {
            Problem: flawed premise for testing cache writeback!
            \begin{itemize}
                \item {
                    Summary: the CPU datapath differs from the DMA datapath,
                    and common cache coherency maintenance operations are
                    already performed in generic file/virtual-memory-area
                    operation code.
                }
                \item {
                    Idea: perform cache write-back on \texttt{vm\_ops->close}.
                }
                \item {
                    Reality: the virtual memory area is already cleaned from
                    the cache and removed from the address space before
                    \texttt{vm\_ops->close} is called.
                }
                \item {
                    Fix: implement a custom \texttt{ioctl}? (Sketch on the next
                    slide.)
                }
            \end{itemize}
        }
    \end{itemize}
\end{frame}
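
\begin{frame}[fragile]
    \frametitle{Sketch: \texttt{ioctl}-Triggered Write-back}
    A sketch of the proposed fix, assuming the module tracks the kernel-side
    mapping of its shared pages. \texttt{MY\_SHMEM\_IOC\_FLUSH},
    \texttt{shmem\_kaddr}, and \texttt{shmem\_size} are hypothetical names, not
    the module's actual interface.
    \begin{verbatim}
#include <linux/fs.h>
#include <linux/errno.h>

/* Userspace calls ioctl(fd, MY_SHMEM_IOC_FLUSH) to force
 * a clean to the PoC while the mapping is still live
 * (unlike vm_ops->close, which runs too late). */
static long my_shmem_ioctl(struct file *filp, unsigned int cmd,
                           unsigned long arg)
{
        unsigned long start = (unsigned long)shmem_kaddr;

        switch (cmd) {
        case MY_SHMEM_IOC_FLUSH:
                __dcache_clean_poc(start, start + shmem_size);
                return 0;
        default:
                return -ENOTTY;
        }
}
    \end{verbatim}
\end{frame}
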
% Part 4: Future Work
% =============================================================================
\section{4. Future Work}

\begin{frame}
    \frametitle{4. Future Work}
    \begin{itemize}
        \item {
            TBD:
            \begin{enumerate}
                \item {
                    Incorporate the cache coherence mechanism into the larger
                    project.
                }
                \item {
                    Implement the memory model within the larger project. This
                    involves:
                    \begin{itemize}
                        \item {
                            Making adjustments to message type and structure
                            specifications for better inter-operation with
                            RDMA.
                        }
                        \item {
                            Implementing the memory model programmatically.
                        }
                    \end{itemize}
                }
            \end{enumerate}
        }
        \item {
            Further Studies:
            \begin{enumerate}
                \item {
                    Swappable RDMA memory regions.
                    \begin{itemize}
                        \item {
                            As of now, all DMA pages are non-swappable -- they
                            must be allocated using the SLAB/SLUB allocator for
                            kernel memory, or via GFP page allocators.
                        }
                    \end{itemize}
                }
                \item {
                    Automatic frequent-sharer detection for MUX-ing between
                    commit-invalidation and commit-writeback.
                }
            \end{enumerate}
        }
    \end{itemize}
\end{frame}

% References
\begin{frame}
    \frametitle{References}
    \printbibliography
\end{frame}

\end{document}

Binary file not shown.
Binary file not shown.
Binary file not shown.

@@ -1,159 +0,0 @@

\documentclass{beamer}
\usepackage[style=authortitle-comp]{biblatex}
\usepackage[export]{adjustbox}

\title{Progress Report: Cache Replacement, Application Performance, and Relations to DSM}
\author{Zhengyi Chen} % Amir?
\date{\today}

\addbibresource{../main.bib}

\begin{document}
% Title page
\frame{\titlepage}

% Page 1
\begin{frame}
    \frametitle{(Cache) Replacement Strategies}
    \begin{itemize}
        \item There has been significant development in (CPU) cache replacement strategies over the last decades.
        \item e.g., RRIP(++)\footcite{JTSE.2010.RRIP} and, more recently, (various) ML-\textit{derived} heuristics.
        \item Studying adequate cache replacement strategies for distributed systems is also popular (though more stagnant).
        % There is a lack of translation between advancements in hardware and their efficacy in software.
        % That said, this might be because they are (able to afford) machine learning techniques in dynamic replacement strategies at edge nodes...
        \item There are many variables within each cached system (whether CPU or distributed FS, etc.) that affect which strategy is more \textit{efficient} in operation.
        % A cached/distributed FS or CDN, for example, primarily captures frequency rather than recency.
        % An operating system might juggle between both depending on the type of access -- Linux's LRU_GEN attempts to capture this difference between file descriptor
        % accesses and just plain stack/heap/text section accesses.
        % The replacement problem for our kernel DSM is similar -- we want to capture the working set of all in-scope processes for each node in the system. The existence of
        % swap space only complicates the matter:
        % - Swap locally to a swap file?
        % - Swap remotely to another node's memory, which our DSM might be able to do?
        % - Swap remotely to another node's swap file?
        % As Amir mentioned, there is also the question of speed -- the replacement algorithm needs to be fast enough for the system to not stall, though the problem
        % of selecting a replacement algorithm may not (need to) be as time-sensitive?
        \item Moreover, different applications (e.g., threads) exhibit different access patterns which may be better served by one strategy than another.\footcite{SYS.2021.RLR}
    \end{itemize}
\end{frame}

% Page 2
\begin{frame}
    \frametitle{Notable (i.e., Encountered) Strategies}
    \begin{itemize}
        \item LRU family
        \item FIFO family
        \item Adaptive Replacement Cache
        \item CPU-LLC-intended: Dynamic Insertion Policy, Re-Reference Interval Prediction, Signature-based Hit Predictor, \dots
        % RRIP is basically an M-bit LFU.
        \item ML-derived: Reinforcement Learned Replacement, LeCaR, Cache Replacement Problem as Markov Decision Process\footcite{GWHSZ.2014.CacheReplAsMDP-QLearning},
        \dots
    \end{itemize}
\end{frame}

% Page 3
\begin{frame}
    \frametitle{Notable (i.e., Encountered) Strategies}
    \begin{itemize}
        \item The performance of replacement strategies correlates strongly with the context of their operation.
        \item For example, LRU is theoretically better-performing than FIFO \textit{in their most textbook implementations}, but recent studies
        \footcites{EHOFK.2020.IBM-LRUvsFIFO}{YQZYR.2023.FIFOwithTwist} have shown that FIFO can outperform LRU in practice (in CDNs, for example, where even cache
        bookkeeping structures can be costly).
        % Now it's probable that these papers are unfairly pitting a more state-of-the-art FIFO-esque algorithm against a less state-of-the-art LRU-esque one...
        % In general:
        \item To summarize, \textbf{the (dynamic) choice of replacement algorithm in any system is of practical concern!}
    \end{itemize}
\end{frame}

% Page 4
\begin{frame}
    \frametitle{LRU \& FIFO family -- Patches and Applications}
    \begin{itemize}
        \item State-of-the-art implementations of LRU or FIFO are a far cry from their textbook counterparts.
        % Also there are a LOT of them -- I can't find enough time to gather all of them RN.
        \item This is so that they can capture both \emph{recency} and \emph{frequency}:
        we desire to use both to predict the \emph{re-reference interval} of a given entry.
        \item e.g., Linux uses \texttt{LRU\_GEN}, a multi-queue LRU strategy wherein each queue (generation) represents a ``similar'' level of access recency and is evicted in batch.
        \item The kernel developers wanted a \emph{fast and reasonably good} replacer as opposed to an optimal one.
        % optimality and performance should both be considered when selecting replacement strategies.
        \item Likewise, Yang et al.\footcite{YQZYR.2023.FIFOwithTwist} show that FIFO with \textit{lazy promotion} and \textit{quick demotion} outperforms textbook LRU.
        % recall that FIFO can exploit spatial locality better than LRU, particularly in systems with slow data access!
        % i.e., algorithm performance can be constrained by system topology.
    \end{itemize}
    % The documentation of LRU_GEN really shows that the developers wanted the strategy itself to decide fast (as opposed to merely deciding well): the strategy itself "[tries] to profit from spatial
    % locality"
\end{frame}

% Page 5
\begin{frame}
    \frametitle{\texttt{LRU\_GEN} and Access Patterns}
    The \texttt{LRU\_GEN} algorithm specifically protects pages accessed through page tables (PT) more strongly than those accessed through file descriptors (FD):
    \begin{itemize}
        \item Heap/stack/text access misses have a higher cost -- executables perform blocking I/O at memory access, which is less likely for file access.
        \item They are also more likely to miss, as their in-kernel dirty bits are approximated.
        \item Finally, they can reasonably be assumed to be more likely to exhibit temporal locality.
    \end{itemize}
    Nevertheless, the algorithm is capable of dynamic adjustment on re-faults -- \textbf{the data model of programs can be file-based or object-based}.
    % Though as we know, files (i.e., blocks in which file data reside) are loaded into (kernel) memory and heap allocations can always be swapped out,
    % so I guess object-based storage wins with fewer intermediate steps (e.g., filesystem calls), sans data protection.
    The same algorithm can exhibit different fault rates across programs on the same node.
    A toy sketch of the generational idea follows on the next slide.
    % i.e., any good algorithm must be able to dynamically adapt to fault rate feedback.
    % However, we don't want to run them through any complex learner...
\end{frame}
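
\begin{frame}[fragile]
    \frametitle{Sketch: Generation-Based Batch Eviction}
    A toy model of the generational idea above, not \texttt{LRU\_GEN}'s actual
    implementation: pages carry a generation stamp, and reclaim evicts the
    whole oldest generation in one batch. All names and the fixed page array
    are illustrative.
    \begin{verbatim}
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 1024

struct toy_page { unsigned long gen; bool resident; };

static struct toy_page pages[NR_PAGES];
static unsigned long max_gen; /* youngest generation */
static unsigned long min_gen; /* oldest generation   */

/* On access: promote the page to the youngest gen. */
static void on_access(struct toy_page *p) { p->gen = max_gen; }

/* Aging tick: open a new (younger) generation. */
static void age(void) { max_gen++; }

/* Reclaim: evict the whole oldest generation at once. */
static size_t reap_oldest(void)
{
        size_t n = 0;
        for (size_t i = 0; i < NR_PAGES; i++) {
                if (pages[i].resident && pages[i].gen <= min_gen) {
                        pages[i].resident = false;
                        n++;
                }
        }
        min_gen++;
        return n;
}
    \end{verbatim}
\end{frame}
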
% Page 6
\begin{frame}
    \frametitle{Machine Learning as Analytic Tool: RLR, etc.}
    \begin{itemize}
        \item Large distributed systems (e.g., CDNs) can afford to perform machine learning for cache replacement tasks
        \footcite{GWHSZ.2014.CacheReplAsMDP-QLearning}: run-time is much faster than I/O, so some cycles can be afforded.
        \item For page replacement in the kernel, we can't really afford to run anything costly (Amir).
        \item ML analysis\footcite{SYS.2021.RLR} shows how different (computation-intensive) programs exhibit distinct
        access patterns.
    \end{itemize}
\end{frame}

% Page 7
\begin{frame}
    \frametitle{Machine Learning as Analytic Tool: RLR, etc.}
    \includegraphics[height=0.6\textheight, center]{w4_slices_resources/RLR.Fig3.png}
    \footcite{SYS.2021.RLR}
    P.S. \textit{preuse}: the number of accesses to the set since the last access to the address/line.
\end{frame}

% Page 8
\begin{frame}
    \frametitle{DSM, Applications, and Memory (Contention)}
    The speedup of applications on DSM systems is negatively correlated with shared memory contention.

    Take TreadMarks\footcite{CDKP.1994.TreadMarks} for example:
    \begin{itemize}
        \item \textit{Jacobi} is a solver for linear systems of equations via the \textit{successive over-relaxation} method.
        The memory access pattern should be map-reduce-like: the problem is parallelized w/ partial matrices for each node, with immutable storage of the relevant matrices?
        TreadMarks achieves a $\sim7\times$ speedup on an 8-node system over one single-core node.
        \item \textit{Water} is a parallel $N$-body molecular dynamics simulator that requires at least $O(\frac{N}{2})$ communications per processor.
        TreadMarks only achieves a $\sim4\times$ speedup, with around $47\%$ of time spent in blocking communications.
    \end{itemize}
\end{frame}

% Page 9
\begin{frame}
    \frametitle{DSM, Applications, and Memory (Contention)}
    \begin{itemize}
        \item It is difficult to compare statistics from different DSM systems.
        \item Even with the same programs being run, different parameters make for different program behaviors wrt.\ contention, etc.
        \item Logically speaking, the more contention on the same address, the less speedup is possible for the system\footcite{de2000effect}.
        \item Should cache replacement strategies be aware of how contended a page may be, e.g., to prevent it from being swapped out?
    \end{itemize}
\end{frame}

% Page 10
\begin{frame}
    \frametitle{Hardware-based Dynamic Strategy Selection: DIP}
    Hardware-based replacement strategies can provide low-cost inspiration for software replacement strategies.

    \includegraphics[height=0.6\textheight, center]{w4_slices_resources/DIP.Fig10.png}\footcite{QJPSE.2007.DIP}

    Problem: How can this be scaled for multiple strategies? A set-dueling sketch follows on the next slide.
\end{frame}
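
\begin{frame}[fragile]
    \frametitle{Sketch: DIP-Style Set Dueling}
    A sketch of the set-dueling mechanism behind DIP: a few sampled sets are
    pinned to each candidate policy, and a saturating counter steers the
    follower sets toward the policy that misses less. The sampling scheme and
    counter width here are illustrative.
    \begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

#define PSEL_BITS 10
#define PSEL_MAX  ((1u << PSEL_BITS) - 1)

static uint16_t psel = PSEL_MAX / 2; /* policy selector */

/* Illustrative sampling: every 32nd set is dedicated. */
static bool sample_lru(unsigned set) { return set % 32 == 0; }
static bool sample_bip(unsigned set) { return set % 32 == 1; }

/* Called on a miss in `set`; returns true if the miss
 * handler should use LRU insertion, false for BIP. */
bool policy_on_miss(unsigned set)
{
        if (sample_lru(set)) {
                if (psel < PSEL_MAX) psel++; /* LRU missed */
                return true;
        }
        if (sample_bip(set)) {
                if (psel > 0) psel--;        /* BIP missed */
                return false;
        }
        /* Follower sets: PSEL's MSB picks the winner. */
        return (psel >> (PSEL_BITS - 1)) == 0;
}
    \end{verbatim}
\end{frame}
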
\end{document}

@@ -1,17 +0,0 @@

@book{P2,
    author    = {Chen-Chung Chang and H. Jerome Keisler},
    title     = {Model Theory},
    publisher = {North-Holland},
    edition   = {Third},
    year      = {1990},
}

@inproceedings{P1,
    author    = {Hiroki Arimura},
    title     = {Learning Acyclic First-Order Horn Sentences from Entailment},
    booktitle = {Proc. of the 8th Intl. Conf. on Algorithmic Learning Theory, ALT '97},
    year      = {1997},
    pages     = {432-445},
    ee        = {http://dx.doi.org/10.1007/3-540-63577-7_59},
    bibsource = {DBLP, http://dblp.uni-trier.de}
}

BIN tex/skeleton.pdf
206 tex/skeleton.tex

@@ -1,206 +0,0 @@

% UG project example file, February 2022
% Do not change the first two lines of code, except you may delete "logo," if causing problems.
% Understand any problems and seek approval before assuming it's ok to remove ugcheck.
\documentclass[logo,bsc,singlespacing,parskip]{infthesis}
\usepackage{ugcheck}

% Include any packages you need below, but don't include any that change the page
% layout or style of the dissertation. By including the ugcheck package above,
% you should catch most accidental changes of page layout though.

\usepackage{microtype} % recommended, but you can remove if it causes problems

\begin{document}
\begin{preliminary}

\title{Unnamed Honours Thesis}

\author{Zhengyi Chen}

% CHOOSE YOUR DEGREE a):
% please leave just one of the following un-commented
%\course{Artificial Intelligence}
%\course{Artificial Intelligence and Computer Science}
%\course{Artificial Intelligence and Mathematics}
%\course{Artificial Intelligence and Software Engineering}
%\course{Cognitive Science}
\course{Computer Science}
%\course{Computer Science and Management Science}
%\course{Computer Science and Mathematics}
%\course{Computer Science and Physics}
%\course{Software Engineering}
%\course{Master of Informatics} % MInf students

% CHOOSE YOUR DEGREE b):
% please leave just one of the following un-commented
%\project{MInf Project (Part 1) Report} % 4th year MInf students
%\project{MInf Project (Part 2) Report} % 5th year MInf students
\project{4th Year Project Report} % all other UG4 students

\date{\today}

\abstract{
This skeleton demonstrates how to use the \texttt{infthesis} style for
undergraduate dissertations in the School of Informatics. It also emphasises the
page limit, and that you must not deviate from the required style.
The file \texttt{skeleton.tex} generates this document and should be used as a
starting point for your thesis. Replace this abstract text with a concise
summary of your report.
}

\maketitle

\newenvironment{ethics}
   {\begin{frontenv}{Research Ethics Approval}{\LARGE}}
   {\end{frontenv}\newpage}

\begin{ethics}
% \textbf{Instructions:} \emph{Agree with your supervisor which
% statement you need to include. Then delete the statement that you are not using,
% and the instructions in italics.\\
% \textbf{Either complete and include this statement:}}\\ % DELETE THESE INSTRUCTIONS
% %
% % IF ETHICS APPROVAL WAS REQUIRED:
% This project obtained approval from the Informatics Research Ethics committee.\\
% Ethics application number: ???\\
% Date when approval was obtained: YYYY-MM-DD\\
% %
% \emph{[If the project required human participants, edit as appropriate, otherwise delete:]}\\ % DELETE THIS LINE
% The participants' information sheet and a consent form are included in the appendix.\\
%
% IF ETHICS APPROVAL WAS NOT REQUIRED:
% \textbf{\emph{Or include this statement:}}\\ % DELETE THIS LINE
This project was planned in accordance with the Informatics Research
Ethics policy. It did not involve any aspects that required approval
from the Informatics Research Ethics committee.

\standarddeclaration
\end{ethics}


\begin{acknowledgements}
Any acknowledgements go here.
\end{acknowledgements}


\tableofcontents
\end{preliminary}

\chapter{Introduction}

The preliminary material of your report should contain:
\begin{itemize}
\item
The title page.
\item
An abstract page.
\item
Declaration of ethics and own work.
\item
Optionally an acknowledgements page.
\item
The table of contents.
\end{itemize}

As in this example \texttt{skeleton.tex}, the above material should be
included between:
\begin{verbatim}
\begin{preliminary}
...
\end{preliminary}
\end{verbatim}
This style file uses roman numeral page numbers for the preliminary material.

The main content of the dissertation, starting with the first chapter,
starts with page~1. \emph{\textbf{The main content must not go beyond page~40.}}

The report then contains a bibliography and any appendices, which may go beyond
page~40. The appendices are only for any supporting material that's important to
go on record. However, you cannot assume markers of dissertations will read them.

You may not change the dissertation format (e.g., reduce the font size, change
the margins, or reduce the line spacing from the default single spacing). Be
careful if you copy-paste packages into your document preamble from elsewhere.
Some \LaTeX{} packages, such as \texttt{fullpage} or \texttt{savetrees}, change
the margins of your document. Do not include them!

Over-length or incorrectly-formatted dissertations will not be accepted and you
would have to modify your dissertation and resubmit. You cannot assume we will
check your submission before the final deadline and if it requires resubmission
after the deadline to conform to the page and style requirements you will be
subject to the usual late penalties based on your final submission time.

\section{Using Sections}

Divide your chapters into sub-parts as appropriate.

\section{Citations}

Citations (such as \cite{P1} or \cite{P2}) can be generated using
\texttt{BibTeX}. For more advanced usage, we recommend using the \texttt{natbib}
package or the newer \texttt{biblatex} system.

These examples use a numerical citation style. You may use any consistent
reference style that you prefer, including ``(Author, Year)'' citations.

\chapter{Background}

This chapter provides an overview of


\chapter{Your next chapter}

A dissertation usually contains several chapters.

\chapter{Conclusions}

\section{Final Reminder}

The body of your dissertation, before the references and any appendices,
\emph{must} finish by page~40. The introduction, after preliminary material,
should have started on page~1.

You may not change the dissertation format (e.g., reduce the font size, change
the margins, or reduce the line spacing from the default single spacing). Be
careful if you copy-paste packages into your document preamble from elsewhere.
Some \LaTeX{} packages, such as \texttt{fullpage} or \texttt{savetrees}, change
the margins of your document. Do not include them!

Over-length or incorrectly-formatted dissertations will not be accepted and you
would have to modify your dissertation and resubmit. You cannot assume we will
check your submission before the final deadline and if it requires resubmission
after the deadline to conform to the page and style requirements you will be
subject to the usual late penalties based on your final submission time.

\bibliographystyle{plain}
\bibliography{mybibfile}


% You may delete everything from \appendix up to \end{document} if you don't need it.
\appendix

\chapter{First appendix}

\section{First section}

Any appendices, including any required ethics information, should be included
after the references.

Markers do not have to consider appendices. Make sure that your contributions
are made clear in the main body of the dissertation (within the page limit).

\chapter{Participants' information sheet}

If you had human participants, include key information that they were given in
an appendix, and point to it from the ethics declaration.

\chapter{Participants' consent form}

If you had human participants, include information about how consent was
gathered in an appendix, and point to it from the ethics declaration.
This information is often a copy of a consent form.


\end{document}