Added stuff, somehow biblatex fails to compile?

commit a6d78ffc04
parent 8e430d13f2
5 changed files with 93 additions and 24 deletions
@@ -5,7 +5,7 @@ CC += ${MY_CFLAGS}
 KDIR := /lib/modules/$(shell uname -r)/build
 KDIR_CROSS := ${HOME}/Git/linux
-KDIR_UOE := /disk/scratch/s2018374/linux
+KDIR_UOE := /tmp/s2018374/linux
 KDIR_SSHFS := /tmp/inf-sshfs/linux
 
 PWD := $(shell pwd)
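If this Makefile follows the usual out-of-tree kbuild pattern, the chosen KDIR variable is presumably handed to the kernel build system as in make -C $(KDIR) M=$(PWD) modules (an assumption; the recipe itself is outside this hunk). The change itself only relocates the UoE kernel tree from /disk/scratch to /tmp.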
@@ -54,9 +54,13 @@ const char* DEV_NAME = "my_shmem";
  */
 static void my_shmem_vmops_close(struct vm_area_struct *vma)
 {
-	pr_info("[%s] Entered.\n", __func__);
+	size_t nr_pages_in_cache = list_count_nodes(&my_shmem_pages);
 	size_t nr_pages_of_vma = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	pr_info(
+		"[%s] Entered. vma size: %zu; cached pages: %zu.\n",
+		__func__, nr_pages_of_vma, nr_pages_in_cache
+	);
 
 	size_t nr_pages_offset = vma->vm_pgoff;
 	struct my_shmem_page *entry;
 	// u64 clean_time_bgn, clean_time_end;
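A minimal sketch of the bookkeeping the close handler above appears to walk; the field layout here is hypothetical (only the names my_shmem_pages and struct my_shmem_page are visible in the diff), while list_count_nodes() is the stock <linux/list.h> helper available since Linux 6.3:

#include <linux/list.h>
#include <linux/mm_types.h>

/* Hypothetical layout: each cached page is tracked by one list entry
 * hanging off a module-global list head; list_count_nodes() walks the
 * list and returns its length as a size_t. */
struct my_shmem_page {
	struct page *page;      /* backing page (assumed) */
	pgoff_t pgoff;          /* offset within the shared region (assumed) */
	struct list_head node;  /* links this entry into my_shmem_pages */
};

static LIST_HEAD(my_shmem_pages);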
@@ -344,8 +344,7 @@
   url={https://lkml.org/lkml/2008/4/29/480},
   journal={lkml.org},
   author={Ven, Arjan van de},
-  year={2008},
-  month={Apr}
+  year={2008}
 }
 
 @inproceedings{Li_etal.RelDB_RDMA.2016,
@@ -356,3 +355,38 @@
   year={2016}
 }
 
+@article{Hong_etal.NUMA-to-RDMA-DSM.2019,
+  title={Scaling out NUMA-aware applications with RDMA-based distributed shared memory},
+  author={Hong, Yang and Zheng, Yang and Yang, Fan and Zang, Bin-Yu and Guan, Hai-Bing and Chen, Hai-Bo},
+  journal={Journal of Computer Science and Technology},
+  volume={34},
+  pages={94--112},
+  year={2019},
+  publisher={Springer}
+}
+
+@inproceedings{Kaxiras_etal.DSM-Argos.2015,
+  author={Kaxiras, Stefanos and Klaftenegger, David and Norgren, Magnus and Ros, Alberto and Sagonas, Konstantinos},
+  title={Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory},
+  year={2015},
+  isbn={9781450335508},
+  publisher={Association for Computing Machinery},
+  address={New York, NY, USA},
+  url={https://doi.org/10.1145/2749246.2749250},
+  doi={10.1145/2749246.2749250},
+  abstract={A coherent global address space in a distributed system enables shared memory programming in a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical section execution, we propose a DSM system with a novel coherence protocol, and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.},
+  booktitle={Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing},
+  pages={3--14},
+  numpages={12},
+  location={Portland, Oregon, USA},
+  series={HPDC '15}
+}
+
+@misc{FreeBSD.man-BPF-4.2021,
+  title={FreeBSD manual pages},
+  url={https://man.freebsd.org/cgi/man.cgi?query=bpf&manpath=FreeBSD+14.0-RELEASE+and+Ports},
+  journal={BPF(4) Kernel Interfaces Manual},
+  publisher={The FreeBSD Project},
+  author={The FreeBSD Project},
+  year={2021}
+}
Binary file not shown.
@@ -300,11 +300,12 @@ context of some user-defined group of associated nodes. Comparatively, a
 \textit{collective} PGAS object is allocated such that a partition of the object
 (i.e., a sub-array of the repr) is stored in each of the associated nodes -- for
 a $k$-partitioned object, $k$ global pointers are recorded in the runtime, each
-pointing to the same object, with different offsets and (naturally)
+pointing to the same object, with different offsets and (intuitively)
 independently-chosen virtual addresses. Note that this design naturally requires
 virtual addresses within each node to be \emph{pinned} -- the allocated object
-cannot be re-addressed to a different virtual address i.e., the global pointer
-that records the local virtual address cannot be auto-invalidated.
+cannot be re-addressed to a different virtual address, thus preventing the
+global pointer that records the local virtual address from becoming
+spontaneously invalidated.
 
 Similar schemes can be observed in other PGAS backends/runtimes, albeit they may
 opt to use a map-like data structure for addressing instead. In general, despite
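To make the addressing scheme in the hunk above concrete, a minimal sketch of the $k$ recorded global pointers; the struct layout and the block-partitioning rule are hypothetical illustrations, not the layout of any particular PGAS runtime:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical global pointer: one such record per partition, all
 * naming the same logical object, each with its node's pinned base
 * virtual address and this partition's offset into the object. */
struct gptr {
	uint32_t node;     /* rank of the node holding this partition */
	uint64_t base;     /* pinned local virtual address on that node */
	uint64_t offset;   /* element offset where this partition begins */
};

/* Resolve global element index i of an n-element, block-partitioned
 * array spread over k nodes to (node, local address). */
static struct gptr locate(const struct gptr table[], size_t k, size_t n,
                          size_t i, size_t elem_size)
{
	size_t block = (n + k - 1) / k;        /* elements per partition */
	struct gptr g = table[i / block];      /* partition owning index i */

	g.base += (i - g.offset) * elem_size;  /* address within that node */
	return g;
}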
@@ -315,27 +316,57 @@ movement manually when working with shared memory over network to maximize
 their performance metrics of interest.
 
 \subsection{Message Passing}
-% \dots
+\textit{Message Passing} remains the predominant programming model for
+parallelism between loosely-coupled nodes within a computer system, much as it
+is ubiquitous in supporting all levels of abstraction within the concurrent
+components of a computer system. Specific to cluster computing systems is the
+message-passing programming model, where parallel programs (or instances of
+the same parallel program) on different nodes within the system communicate by
+exchanging messages over the network. Such models trade programming
+productivity for finer-grained control over the messages passed, as well as a
+more explicit separation between the communication and computation stages of a
+programming subproblem.
+
+Commonly, message-passing backends function as \textit{middlewares} --
+communication runtimes -- that aid distributed software development
+\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend
+exposes facilities for inter-application communication to frontend developers
+while transparently providing security, accounting, and fault-tolerance, much
+like how an operating system provides resource management, scheduling, and
+security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}.
+This is the case for implementations of the PGAS programming model, which
+mostly rely on common message-passing backends to facilitate orchestrated data
+manipulation across distributed nodes. Likewise, message-passing backends,
+including RDMA APIs, form the backbone of many research-oriented DSM systems
+\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
+
+Message-passing between network-connected nodes may be \textit{two-sided} or
+\textit{one-sided}. The former models an intuitive workflow for sending and
+receiving datagrams over the network: the sender initiates a transfer; the
+receiver copies a received packet from the network card into a kernel buffer;
+the receiver's kernel filters the packet and
+(optionally)~\cite{FreeBSD.man-BPF-4.2021} copies the inner message into the
+message-passing runtime/middleware's address space; finally, the receiver's
+middleware inspects the copied message and acts on it accordingly, likely also
+copying slices of the message data into some registered distributed
+shared-memory buffer for the distributed application to access. Despite being
+a highly intuitive model of data manipulation over the network, this poses a
+fundamental performance issue: upon reception of each message, both the
+receiver's kernel and its userspace must proactively spend CPU time moving the
+received data from the NIC into userspace. Because this work happens
+concurrently with other kernel and userspace routines in a multi-processing
+system, a preemptible kernel may incur significant latency whenever the
+packet-filtering routine is preempted by another kernel routine, userspace, or
+IRQs.
 
-% Improvement in NIC bandwidth and transfer rate benefits DSM applications that expose
-% global address space, and those that leverage single-writer capabilities over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
-% technologies for example Openshmem, OpenMPI, Cray Chapel, etc. that leverage
-% specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
+Comparatively, a ``one-sided'' message-passing scheme, notably \textit{RDMA},
+allows the network interface card to bypass in-kernel packet filters and
+perform DMA on registered memory regions directly. The NIC can then notify the
+CPU via interrupts, allowing the kernel and userspace programs to run their
+reception-time callbacks with reduced latency. Because of this advantage,
+many recent studies attempt to leverage RDMA APIs \dots
 
-% Contemporary works on DSM systems focus more on leveraging hardware advancements
-% to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
-% for example, implements a complex system for memory disaggregation over multiple
-% compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
-% they observed significant performance improvements over existing data-intensive
-% processing frameworks, for example APACHE Spark, Memcached, and Redis, over
-% no-disaggregation (i.e., using node-local memory only, similar to cluster computing)
-% systems.
 
-% \subsection{Programming Model}
 
 \subsection{Data to Process, or Process to Data?}
 (TBD -- The former is costly for data-intensive computation, but the latter may
 be impossible for certain tasks, and greatly hardens the replacement problem.)
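The two-sided receive path described in the new subsection can be sketched with a plain UDP socket standing in for a message-passing middleware; the port number is an arbitrary example, and the copies called out in comments are the receiver-side CPU costs the text refers to:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	char buf[2048];
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(7000),  /* arbitrary example port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;

	for (;;) {
		/* First copy (NIC -> kernel socket buffer) already happened
		 * in the kernel; this is the second copy, kernel -> userspace. */
		ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
		if (n < 0)
			break;
		/* A middleware would now inspect the message and copy payload
		 * slices into its registered shared-memory buffer. */
		printf("got %zd bytes\n", n);
	}
	close(fd);
	return 0;
}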
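Likewise, a minimal libibverbs sketch of the one-sided RDMA write the subsection contrasts against; it assumes, outside this snippet, a connected queue pair, a registered memory region, and an out-of-band exchange of the remote address and rkey:

#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Post a single one-sided RDMA write: the local NIC moves len bytes
 * from a registered local buffer straight into the remote node's
 * registered memory, with no receiver-side CPU involvement. */
static int rdma_write_once(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)local_buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_RDMA_WRITE,
		.send_flags = IBV_SEND_SIGNALED,  /* request a completion event */
	};
	struct ibv_send_wr *bad_wr = NULL;

	wr.wr.rdma.remote_addr = remote_addr;  /* exchanged out-of-band */
	wr.wr.rdma.rkey        = rkey;
	return ibv_post_send(qp, &wr, &bad_wr);
}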