Designing Databases for Future High-Performance Networks
Asst. Lect. Ban Kadhim Al-Ammari

05/10/2022

Distributed query and transaction processing has been an active field of research ever since the volume of the data to be processed outgrew the storage and processing capacity of a single machine. Two platforms of choice for data processing are database appliances, i.e., rack-scale clusters composed of several machines connected through a low-latency network, and scale-out infrastructure platforms for batch processing, e.g., data analysis applications such as Map-Reduce.

A fundamental design rule on how software for these systems should be implemented is the assumption that the network is relatively slow compared to local in-memory processing. Therefore, the execution time of a query or a transaction is assumed to be dominated by network transfer times and the costs of synchronization. However, data processing systems are starting to be equipped with high-bandwidth, low-latency interconnects that can transmit vast amounts of data between the compute nodes and provide single-digit microsecond latencies. In light of this new generation of network technologies, such a rule is being re-evaluated, leading to new types of database algorithms [6, 7, 29] and fundamental system design changes [8, 23, 30].

Modern high-throughput, low-latency networks originate from high-performance computing (HPC) systems. Similar to database systems, the performance of scientific applications depends on the ability of the system to move large amounts of data between compute nodes. Several key features offered by these networks are (i) user-level networking, (ii) an asynchronous network interface that allows the algorithm to interleave computation and communication, and (iii) the ability of the network card to directly access regions of main memory without going through the processor, i.e., remote direct memory access (RDMA).
To leverage the advantages of these networks, database algorithms and systems need to be adapted to the underlying hardware.

2 Background and Definitions

In this section, we explain how the concepts of Remote Direct Memory Access (RDMA), Remote Memory Access (RMA), and Partitioned Global Address Space (PGAS) relate to each other. Furthermore, we include an overview of several low-latency, high-bandwidth network technologies implementing these mechanisms.

2.1 Remote Direct Memory Access

Remote Direct Memory Access (RDMA) is a hardware mechanism through which the network card can directly access all or parts of the main memory of a remote node without involving the processor. Bypassing the CPU and the operating system makes it possible to interleave computation and communication and avoids copying data across different buffers within the network stack and user space, which significantly lowers the cost of large data transfers and reduces the end-to-end communication latency.

In many implementations, buffers need to be registered with the network card before they are accessible over the interconnect. During the registration process, the memory is pinned such that it cannot be swapped out, and the necessary address translation information is installed on the card, operations that can have a significant overhead [14]. Although this registration process is needed for many high-speed networks, it is worth noting that some network implementations also support registration-free memory access [10, 27].

RDMA as a hardware mechanism does not specify the semantics of a data transfer. Most modern networks provide support for one-sided and two-sided memory accesses.
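As a toy illustration of the registration requirement, the following sketch models a network card that only serves one-sided reads falling entirely within a previously registered memory window. All names (`ToyNic`, `register_region`, `rdma_read`) are invented for this sketch and do not correspond to a real RDMA API; real stacks expose analogous steps through explicit buffer-registration calls.

```python
# Toy model of RDMA memory registration: the "NIC" may only access
# memory that was registered (pinned + address-translated) beforehand.
# All names are invented for illustration.

class ToyNic:
    def __init__(self, memory: bytearray):
        self.memory = memory          # the node's main memory
        self.regions = []             # registered (offset, length) windows

    def register_region(self, offset: int, length: int) -> None:
        """Pin a memory window and install its translation on the card."""
        self.regions.append((offset, length))

    def rdma_read(self, offset: int, length: int) -> bytes:
        """One-sided read: served by the card, no remote CPU involved."""
        for start, size in self.regions:
            if start <= offset and offset + length <= start + size:
                return bytes(self.memory[offset:offset + length])
        raise PermissionError("access outside any registered region")

remote = ToyNic(bytearray(b"hello world, rdma!"))
remote.register_region(0, 11)         # expose only the first 11 bytes
print(remote.rdma_read(0, 5))         # b'hello'
```

The model also captures why registration overhead matters: every window that should be remotely accessible must be set up in advance, which is costly to do on the critical path of a query.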
Two-sided operations represent traditional message-passing semantics in which the source process (i.e., the sender of a message) and the destination process (i.e., the receiver of a message) are actively involved in the communication and need to be synchronized: for every send operation there must exist exactly one corresponding receive operation. One-sided operations, on the other hand, represent memory access semantics in which only the source process (i.e., the initiator of a request) is involved in the remote memory access. In order to use one-sided remote memory operations efficiently, multiple programming models have been developed, the most popular of which are the Remote Memory Access (RMA) and the Partitioned Global Address Space (PGAS) concepts.

2.2 Remote Memory Access

Remote Memory Access (RMA) is a shared memory programming abstraction. RMA provides access to remote memory regions through explicit one-sided read and write operations. These operations move data from one buffer to another: a read operation fetches data from a remote machine and transfers it to a local buffer, while a write operation transmits data in the opposite direction. Data located on a remote machine can therefore not be loaded immediately into a register, but needs to be read into a local main memory buffer first. Using the RMA memory abstractions is similar to programming non-cache-coherent machines, in which data has to be explicitly loaded into the cache-coherency domain before it can be used, and changes to the data have to be explicitly flushed back to the source in order for the modifications to be visible on the remote machine.

The processes on the target machine are generally not notified about an RMA access, although many interfaces offer read and write calls with remote process notifications.
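These read/write semantics can be sketched in a few lines (the `rma_get`/`rma_put` names are hypothetical, not an actual RMA library): remote data is first staged into a local buffer, operated on there, and explicitly written back, mirroring how RMA cannot load remote data directly into a register.

```python
# Sketch of one-sided RMA semantics: get/put copy between a local
# buffer and "remote" memory; the local copy is then used and flushed
# back explicitly, as on a non-cache-coherent machine. Invented names.

def rma_get(remote_mem: bytearray, offset: int, local_buf: bytearray) -> None:
    """Fetch len(local_buf) bytes of remote memory into a local buffer."""
    local_buf[:] = remote_mem[offset:offset + len(local_buf)]

def rma_put(remote_mem: bytearray, offset: int, local_buf: bytearray) -> None:
    """Write the local buffer back, making changes visible remotely."""
    remote_mem[offset:offset + len(local_buf)] = local_buf

remote_mem = bytearray(b"\x05\x00\x00\x00")    # a counter stored remotely
buf = bytearray(4)
rma_get(remote_mem, 0, buf)                    # 1. stage into local memory
value = int.from_bytes(buf, "little") + 1      # 2. operate on the local copy
rma_put(remote_mem, 0, bytearray(value.to_bytes(4, "little")))  # 3. flush back
print(int.from_bytes(remote_mem, "little"))    # 6
```

Note that only the initiator runs any code here; the process owning `remote_mem` is never involved, which is exactly the one-sided property described above.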
Apart from read and write operations, some RMA implementations provide support for additional functionality, most notably remote atomic operations. Examples of such atomic operations are remote fetch-and-add and compare-and-swap instructions.

RMA has been designed to be a thin and portable layer compatible with many lower-level data movement interfaces. It has been adopted by many libraries, such as ibVerbs [17] and MPI-3 [25], as their one-sided communication and remote memory access abstraction.

RDMA-capable networks implement the functionality necessary for efficient low-latency, high-bandwidth one-sided memory accesses. It is worth pointing out that RMA programming abstractions can also be used over networks which do not support RDMA, for example by implementing the required operations in software [26].

2.3 Partitioned Global Address Space

Partitioned Global Address Space (PGAS) is a programming language concept for writing parallel applications for large distributed memory machines. PGAS assumes a single global memory address space that is partitioned among all the processes. The programming model distinguishes between local and remote memory, which can be specified by the developer through the use of special keywords or annotations [9]. PGAS is therefore usually found in the form of a programming language extension and is one of the main concepts behind several languages, such as Co-Array Fortran or Unified Parallel C.

Local variables can only be accessed by the local process, while shared variables can be written or read over the network. In most PGAS languages, both types of variables can be accessed in the same way; it is the responsibility of the compiler to add the necessary code to implement a remote variable access.
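The partitioning of the global address space can be sketched as follows. This is a toy model with invented names (`ToyPgasArray`); in a real PGAS language such as Unified Parallel C, the owner lookup and the local-versus-remote routing below are emitted by the compiler, so that every access uses the same syntax.

```python
# Toy PGAS model: one global index space, block-partitioned across
# "processes". Accesses look uniform; the owner lookup decides whether
# an element is local or would require communication. Invented names.

class ToyPgasArray:
    def __init__(self, size: int, nprocs: int, my_rank: int):
        self.block = -(-size // nprocs)        # ceil: elements per process
        self.my_rank = my_rank
        self.partitions = [[0] * self.block for _ in range(nprocs)]

    def owner(self, index: int) -> int:
        return index // self.block             # which process holds index

    def read(self, index: int) -> int:
        rank = self.owner(index)
        # A PGAS compiler would emit a plain load if rank == my_rank
        # and a network transfer otherwise; the syntax stays identical.
        return self.partitions[rank][index % self.block]

    def write(self, index: int, value: int) -> None:
        self.partitions[self.owner(index)][index % self.block] = value

arr = ToyPgasArray(size=8, nprocs=4, my_rank=0)   # 2 elements per process
arr.write(5, 42)                                   # index 5 lives on rank 2
print(arr.owner(5), arr.read(5))                   # 2 42
```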
From a programming perspective, this means that a remote variable can be assigned directly to a local variable or a register and does not need to be explicitly loaded into main memory first, as is the case with RMA. When programming with a PGAS language, the developer needs to be aware of implicit data movement when accessing shared variable data, and careful non-uniform memory access (NUMA) optimizations are required for applications to achieve high performance.

2.4 Low-latency, High-bandwidth Networks

Many high-performance networks offer RDMA functionality. Examples of such networks are InfiniBand [19] and Cray Aries [2]. Both networks offer a bandwidth of 100 Gb/s or more and a latency in the single-digit microsecond range. However, RDMA is not exclusively available on networks originally designed for supercomputers: RDMA over Converged Ethernet (RoCE) [20] hardware adds RDMA capabilities to a conventional Ethernet network.
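To put these numbers in perspective, a back-of-the-envelope estimate using a simple latency-plus-bandwidth cost model is helpful. The 100 Gb/s and 2 µs figures below are illustrative assumptions in line with the networks named above, not measurements of a specific device.

```python
# Simple transfer-time model: time = latency + size / bandwidth.
# The figures are illustrative (100 Gb/s, single-digit microsecond latency).

def transfer_time(size_bytes: float, bandwidth_gbps: float, latency_us: float) -> float:
    """Estimated seconds to move size_bytes over the network."""
    return latency_us * 1e-6 + (size_bytes * 8) / (bandwidth_gbps * 1e9)

# A small 64-byte message is dominated by latency ...
print(f"{transfer_time(64, 100, 2) * 1e6:.3f} us")      # ~2.005 us
# ... while a 1 GiB transfer is dominated by bandwidth.
print(f"{transfer_time(2**30, 100, 2):.3f} s")          # ~0.086 s
```

The contrast explains why both ends of the spectrum matter for databases: transaction processing is sensitive to the latency term, while bulk operators such as joins and data shuffles are sensitive to the bandwidth term.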