Added stuff, somehow biblatex fails to compile?

This commit is contained in:
Zhengyi Chen 2024-02-27 17:06:31 +00:00
parent 8e430d13f2
commit a6d78ffc04
5 changed files with 93 additions and 24 deletions

@@ -5,7 +5,7 @@ CC += ${MY_CFLAGS}
KDIR := /lib/modules/$(shell uname -r)/build
KDIR_CROSS := ${HOME}/Git/linux
-KDIR_UOE := /disk/scratch/s2018374/linux
+KDIR_UOE := /tmp/s2018374/linux
KDIR_SSHFS := /tmp/inf-sshfs/linux
PWD := $(shell pwd)

@@ -54,9 +54,13 @@ const char* DEV_NAME = "my_shmem";
*/
static void my_shmem_vmops_close(struct vm_area_struct *vma)
{
-pr_info("[%s] Entered.\n", __func__);
+size_t nr_pages_in_cache = list_count_nodes(&my_shmem_pages);
+size_t nr_pages_of_vma = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+pr_info(
+	"[%s] Entered. vma size: %zu; cached pages: %zu.\n",
+	__func__, nr_pages_of_vma, nr_pages_in_cache
+);
size_t nr_pages_offset = vma->vm_pgoff;
struct my_shmem_page *entry;
// u64 clean_time_bgn, clean_time_end;

@@ -344,8 +344,7 @@
url={https://lkml.org/lkml/2008/4/29/480},
journal={lkml.org},
author={van de Ven, Arjan},
-year={2008},
-month={Apr}
+year={2008}
}
@inproceedings{Li_etal.RelDB_RDMA.2016,
@@ -356,3 +355,38 @@
year={2016}
}
@article{Hong_etal.NUMA-to-RDMA-DSM.2019,
title={Scaling out NUMA-aware applications with RDMA-based distributed shared memory},
author={Hong, Yang and Zheng, Yang and Yang, Fan and Zang, Bin-Yu and Guan, Hai-Bing and Chen, Hai-Bo},
journal={Journal of Computer Science and Technology},
volume={34},
pages={94--112},
year={2019},
publisher={Springer}
}
@inproceedings{Kaxiras_etal.DSM-Argos.2015,
author = {Kaxiras, Stefanos and Klaftenegger, David and Norgren, Magnus and Ros, Alberto and Sagonas, Konstantinos},
title = {Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory},
year = {2015},
isbn = {9781450335508},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2749246.2749250},
doi = {10.1145/2749246.2749250},
abstract = {A coherent global address space in a distributed system enables shared memory programming in a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical section execution, we propose a DSM system with a novel coherence protocol, and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.},
booktitle = {Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing},
pages = {3--14},
numpages = {12},
location = {Portland, Oregon, USA},
series = {HPDC '15}
}
@misc{FreeBSD.man-BPF-4.2021,
title={FreeBSD manual pages},
url={https://man.freebsd.org/cgi/man.cgi?query=bpf&manpath=FreeBSD+14.0-RELEASE+and+Ports},
journal={BPF(4) Kernel Interfaces Manual},
publisher={The FreeBSD Project},
author={{The FreeBSD Project}},
year={2021}
}

Binary file not shown.

@@ -300,11 +300,12 @@ context of some user-defined group of associated nodes. Comparatively, a
\textit{collective} PGAS object is allocated such that a partition of the object
(i.e., a sub-array of the representation) is stored in each of the associated
nodes -- for a $k$-partitioned object, $k$ global pointers are recorded in the
runtime, each
-pointing to the same object, with different offsets and (naturally)
+pointing to the same object, with different offsets and (intuitively)
independently-chosen virtual addresses. Note that this design naturally requires
virtual addresses within each node to be \emph{pinned} -- the allocated object
-cannot be re-addressed to a different virtual address i.e., the global pointer
-that records the local virtual address cannot be auto-invalidated.
+cannot be re-addressed to a different virtual address, thus preventing the
+global pointer that records the local virtual address from becoming
+spontaneously invalidated.
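For concreteness, the per-partition bookkeeping might be sketched as follows
(plain C; the names are illustrative and not taken from any particular PGAS
runtime):
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* One global pointer per partition of a k-partitioned collective
 * object (hypothetical layout). */
struct gptr {
    int       node;   /* node holding this partition                */
    uintptr_t vaddr;  /* pinned local virtual address on that node  */
    size_t    offset; /* offset of this partition within the object */
};

/* The runtime records k such pointers per collective object; each
 * vaddr must stay pinned, or the recorded pointer would silently
 * become invalid. */
struct collective_obj {
    size_t      nr_parts;
    struct gptr parts[]; /* one entry per associated node */
};
\end{verbatim}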
Similar schemes can be observed in other PGAS backends/runtimes, although they
may opt to use a map-like data structure for addressing instead. In general, despite
@@ -315,27 +316,57 @@ movement manually when working with shared memory over network to maximize
their performance metrics of interest.
\subsection{Message Passing}
\textit{Message Passing} remains the predominant programming model for
parallelism between loosely-coupled nodes, much as it is ubiquitous in
supporting all levels of abstraction within the concurrent components of a
computer system. In cluster computing systems specifically, parallel programs
(or instances of the same parallel program) on different nodes communicate by
exchanging messages over the network. Such models trade programming
productivity for finer-grained control over the messages passed, as well as a
more explicit separation between the communication and computation stages of a
programming subproblem.
Commonly, message-passing backends function as \textit{middlewares} --
communication runtimes -- that aid distributed software development
\cite{AST_Steen.Distributed_Systems-3ed.2017}. Such a message-passing backend
exposes facilities for inter-application communication to frontend developers
while transparently providing security, accounting, and fault-tolerance, much
like how an operating system provides resource management, scheduling, and
security to traditional applications \cite{AST_Steen.Distributed_Systems-3ed.2017}.
This is the case for implementations of the PGAS programming model, which mostly
rely on common message-passing backends to facilitate orchestrated data
manipulation across distributed nodes. Likewise, message-passing backends,
including the RDMA API, form the backbone of many research-oriented DSM systems
\cites{Endo_Sato_Taura.MENPS_DSM.2020}{Hong_etal.NUMA-to-RDMA-DSM.2019}{Cai_etal.Distributed_Memory_RDMA_Cached.2018}{Kaxiras_etal.DSM-Argos.2015}.
% \dots
Message-passing between network-connected nodes may be \textit{two-sided} or
\textit{one-sided}. The former models an intuitive workflow for sending and
receiving datagrams over the network -- the sender initiates a transfer; the
receiver copies a received packet from the network card into a kernel buffer;
the receiver's kernel filters the packet and (optionally)
\cite{FreeBSD.man-BPF-4.2021} copies the contained message into the
message-passing runtime/middleware's address space; the receiver's middleware
inspects the copied message and performs some procedure accordingly, likely
also copying slices of message data into some registered distributed shared
memory buffer for the distributed application to access. Despite being a
highly intuitive model of data manipulation over the network, this workflow
poses a fundamental performance issue: upon reception of each message, both
the receiver's kernel and its userspace must proactively exert CPU-time to
move the received data from the NIC into userspace. Because this happens
concurrently with other kernel and userspace routines in a multi-processing
system, a preemptible kernel may incur significant latency if the kernel
routine for packet filtering is pre-empted by another kernel routine,
userspace, or IRQs.
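To make the cost concrete, the following minimal sketch (POSIX sockets; the
buffer names are hypothetical) walks the same receive path, where each step is
a copy paid for with the receiver's CPU-time:
\begin{verbatim}
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical registered DSM buffer exposed to the application. */
extern char shm_region[4096];

ssize_t handle_message(int sock)
{
    char msg[4096];

    /* Kernel -> middleware copy: the kernel has already copied the
     * packet off the NIC; recv() copies it again into our address
     * space. */
    ssize_t n = recv(sock, msg, sizeof msg, 0);
    if (n <= 0)
        return n;

    /* Middleware -> application copy: after inspecting the message,
     * place its payload into the registered shared-memory buffer. */
    memcpy(shm_region, msg, (size_t)n);
    return n;
}
\end{verbatim}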
% Improvement in NIC bandwidth and transfer rate benefits DSM applications that expose
% global address space, and those that leverage single-writer capabilities over hierarchical memory nodes. \textbf{[GAS and PGAS (Partitioned GAS)
% technologies for example Openshmem, OpenMPI, Cray Chapel, etc. that leverage
% specially-linked memory sections and \texttt{/dev/shm} to abstract away RDMA access]}.
Comparatively, a ``one-sided'' message-passing scheme, notably \textit{RDMA},
allows the network interface card to bypass in-kernel packet filters and
perform DMA directly on registered memory regions. The NIC can then notify the
CPU via interrupts, allowing the kernel and userspace programs to perform
callbacks at reception time with reduced latency. Because of this advantage,
many recent studies attempt to leverage RDMA APIs \dots
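The sketch below illustrates such a one-sided transfer through the libibverbs
API, assuming an already-established queue pair \texttt{qp} and a registered
local memory region \texttt{mr} (connection setup omitted); servicing the
write involves neither the remote kernel nor remote userspace:
\begin{verbatim}
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr, /* local registered buffer  */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* no recv() on the far end */
        .send_flags = IBV_SEND_SIGNALED, /* completion raised locally */
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Remote address and rkey come from the peer's earlier memory
     * registration, exchanged out-of-band during setup. */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
\end{verbatim}
Completion is then detected locally, e.g., by polling the completion queue
with \texttt{ibv\_poll\_cq}.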
% Contemporary works on DSM systems focus more on leveraging hardware advancements
% to provide fast and/or seamless software support. Adrias \cite{Masouros_etal.Adrias.2023},
% for example, implements a complex system for memory disaggregation over multiple
% compute nodes connected via the \textit{ThymesisFlow}-based RDMA fabric, where
% they observed significant performance improvements over existing data-intensive
% processing frameworks, for example APACHE Spark, Memcached, and Redis, over
% no-disaggregation (i.e., using node-local memory only, similar to cluster computing)
% systems.
% \subsection{Programming Model}
\subsection{Data to Process, or Process to Data?}
(TBD -- The former is costly for data-intensive computation, but the latter may
be impossible for certain tasks and greatly complicates the replacement problem.)