Design

COMEX Design

COMEX is realized by a page table extension that tracks remote physical memory (on other connected nodes) allocated to an application on demand during its execution, allowing the related PTEs (page table entries) to point to COMEX page frames that harbor evicted pages on remote nodes.
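
The following is a minimal sketch of the bookkeeping this extension implies; the structure and field names are illustrative assumptions, not the actual COMEX data structures.

    #include <stdint.h>

    /* Hypothetical descriptor of a remote COMEX frame holding an evicted page.
     * A PTE for an evicted page no longer maps local DRAM; instead it encodes
     * (directly or via an index) which remote frame holds the page's contents,
     * so a later fault can locate and fetch the page. */
    struct comex_remote_frame {
        uint16_t nid;          /* memory node holding the frame                 */
        uint64_t remote_addr;  /* frame address within that node's pindown area */
        uint32_t rkey;         /* RDMA key registered for that area             */
    };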


  • Page frame reclaim

    COMEX is integrated into page frame reclaim by holding reclaimed pages in a pre-registered RDMA write buffer before they are eventually transmitted to pindown memory page frames via RDMA writes. This allows reclaimed page frames to be freed immediately after their contents are copied to the COMEX RDMA write buffer, accelerating page frame reclaim compared with a typical OS, which waits asynchronously for slow confirmation from the block I/O device driver. COMEX issues one RDMA write to transfer all reclaimed pages destined for the same remote node, as sketched below.
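
    The reclaim path maps naturally onto one RDMA write per destination node. Below is a minimal user-level sketch using libibverbs (the in-kernel implementation would use the equivalent kernel verbs); the queue pair, memory registration, and the remote address/rkey advertised by the memory node are assumed to have been set up at initialization.

        #include <infiniband/verbs.h>
        #include <stdint.h>
        #include <string.h>

        #define PAGE_SIZE 4096

        /* Copy a batch of reclaimed pages into the pre-registered write buffer
         * and push them to contiguous remote frames with a single RDMA write. */
        static int comex_write_batch(struct ibv_qp *qp, struct ibv_mr *mr,
                                     void *wbuf, void **pages, int npages,
                                     uint64_t remote_addr, uint32_t rkey)
        {
            for (int i = 0; i < npages; i++)      /* stage pages locally */
                memcpy((char *)wbuf + (size_t)i * PAGE_SIZE, pages[i], PAGE_SIZE);
            /* ...the local page frames can be freed as soon as this copy is done */

            struct ibv_sge sge = {
                .addr   = (uintptr_t)wbuf,
                .length = (uint32_t)(npages * PAGE_SIZE),
                .lkey   = mr->lkey,
            };
            struct ibv_send_wr wr = {
                .wr_id               = 1,
                .sg_list             = &sge,
                .num_sge             = 1,
                .opcode              = IBV_WR_RDMA_WRITE,
                .send_flags          = IBV_SEND_SIGNALED,
                .wr.rdma.remote_addr = remote_addr,
                .wr.rdma.rkey        = rkey,
            };
            struct ibv_send_wr *bad = NULL;
            return ibv_post_send(qp, &wr, &bad);  /* one write for the whole batch */
        }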

  • Pindown memory area management

    The pre-allocated pindown memory area contributed to the COMEX global memory pool must be managed properly for efficient utilization in satisfying requests from other compute nodes to hold swap-out pages during job execution on those nodes. It is managed by a buddy system, with one free list tracking groups of contiguous page frames of a given size. When page frames allocated to remote compute nodes become unneeded (after their contents are fetched back to the compute nodes where the page faults occurred), those compute nodes return the unneeded frames to the buddy system of the node that contributed them, which gathers the returned frames and gradually merges them into larger contiguous areas, as sketched below.
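
    The sketch below illustrates the per-order free lists and the merge-on-return behavior of such a buddy system; the structure names, MAX_ORDER value, and use of malloc for list nodes are assumptions for illustration, not the COMEX implementation.

        #include <stdlib.h>

        #define MAX_ORDER 11          /* runs of 2^0 .. 2^(MAX_ORDER-1) frames (assumed) */

        struct free_run {
            struct free_run *next;
            unsigned long    first_pfn;   /* first page frame number of the run */
        };

        struct pindown_pool {
            struct free_run *free_list[MAX_ORDER];   /* one list per run size */
        };

        /* Return a run of 2^order frames to the pool, merging it with its buddy
         * run whenever the buddy is also free, so larger contiguous areas reappear. */
        static void pool_free(struct pindown_pool *p, unsigned long pfn, int order)
        {
            while (order < MAX_ORDER - 1) {
                unsigned long buddy = pfn ^ (1UL << order);
                struct free_run **pp = &p->free_list[order];
                while (*pp && (*pp)->first_pfn != buddy)
                    pp = &(*pp)->next;
                if (!*pp)
                    break;                       /* buddy still allocated: stop merging */
                struct free_run *b = *pp;
                *pp = b->next;                   /* unlink buddy ...                    */
                free(b);
                pfn &= ~(1UL << order);          /* ... merged run starts at lower half */
                order++;
            }
            struct free_run *r = malloc(sizeof(*r));
            if (!r)
                return;                          /* allocation-failure handling elided  */
            r->first_pfn = pfn;
            r->next = p->free_list[order];
            p->free_list[order] = r;             /* insert the (possibly merged) run    */
        }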

  • Remote page frame management

    Remote page frames allocated by memory nodes to a compute node are tracked via linked lists, with one entry per contiguous area. Multiple swap-out pages of a given process are transferred by a single RDMA write to contiguous page frames on the same memory node. With swap-out pages staged in contiguous page frames, a later page fault (on a swapped-out page) is fulfilled by fetching the target page plus its neighboring pages (say, 32 pages in total) via one RDMA read, held in one entry of the compute node's RDMA read buffer.

    A threshold is assigned to each linked list. If the number of page frames on a list drops below its threshold, a memory replenishment request is issued to the corresponding memory node via a lightweight RDMA verb. Upon receiving such a request, the memory node allocates as many contiguous page frames as possible (but no fewer than 2^7 frames) from its buddy system (as mentioned earlier) and returns them in one RDMA verb reply. A sketch of the per-node tracking state and the threshold check follows.
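
    The following sketch shows how a compute node might track the frames granted by one memory node and trigger replenishment; the structure names and the send_request callback standing in for the lightweight RDMA (send) verb are illustrative assumptions.

        #include <stdbool.h>
        #include <stdint.h>

        struct remote_area {
            struct remote_area *next;
            uint64_t remote_addr;         /* start of the contiguous area         */
            uint32_t rkey;                /* RDMA key for the area                */
            uint32_t nframes;             /* frames still unused in this area     */
        };

        struct remote_pool {
            struct remote_area *areas;    /* areas granted by one memory node     */
            uint32_t total_frames;        /* sum of nframes over all areas        */
            uint32_t threshold;           /* replenish when the total drops below */
            bool     request_pending;
        };

        /* Called after frames are consumed for a batch of swap-out pages. */
        static void maybe_replenish(struct remote_pool *p,
                                    void (*send_request)(uint16_t nid), uint16_t nid)
        {
            if (p->total_frames < p->threshold && !p->request_pending) {
                p->request_pending = true;
                send_request(nid);        /* lightweight verb request to memory node */
            }
        }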

  • Page frame discovery

    The COMEX design aims to have participating nodes work independently and to avoid a single point of failure. Given that multiple memory nodes together contribute to the shared memory pool for staging swap-out pages from all execution processes in the system, it is highly desirable to balance traffic and memory utilization across the contributing memory nodes to ensure high performance. COMEX employs locality-aware page frame discovery, realized by involving the process ID (PID), so that swap-out pages from a given execution process are staged in the same memory node (called its preferred node). In a clustered system, different compute nodes may run execution processes with identical PIDs, making it desirable from a load-balancing standpoint to also include the node ID (NID) when deciding where page frames are placed. Hence, under this design option, a given execution process hashes its NID and PID to identify the preferred memory node where its swap-out pages are staged, balancing traffic and utilization across the available memory nodes for all execution processes (see the sketch below).
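
    A minimal sketch of this placement rule follows; the particular mixing function is illustrative (a standard 64-bit finalizer), not the hash COMEX actually uses.

        #include <stdint.h>

        /* Hash NID and PID so each process gets a stable preferred memory node,
         * while different processes spread across the available memory nodes. */
        static uint32_t preferred_memory_node(uint16_t nid, uint32_t pid,
                                              uint32_t num_memory_nodes)
        {
            uint64_t key = ((uint64_t)nid << 32) | pid;
            key ^= key >> 33;                      /* simple 64-bit mix */
            key *= 0xff51afd7ed558ccdULL;
            key ^= key >> 33;
            return (uint32_t)(key % num_memory_nodes);
        }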

  • Page fault handling and prefetching

    Like native Linux, COMEX is designed to handle a page fault by not only fetching the faulting page but also prefetching its neighboring pages from the memory node where they are staged, in one RDMA read, with all pages transmitted in that read held in one RDMA read buffer entry. Unlike Linux, it prefetches more pages (say, 32 instead of 8) at a time. For high performance, an RDMA read buffer is pre-allocated statically at each compute node during initialization; the number of read buffer entries is a design parameter. Buffer entries are shared by all RDMA reads, regardless of which memory node they target. This shared-buffer design is preferred over a dedicated counterpart (where each memory node is statically given a separate buffer slice), as it promises better buffer utilization. A sketch of the fault-time prefetch read follows.
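
    Below is a minimal user-level sketch of the fault-time prefetch using libibverbs (the in-kernel implementation would use the equivalent kernel verbs); the queue pair, the registered read buffer entry, the 32-page prefetch size, and the remote address/rkey recorded at swap-out time are assumed.

        #include <infiniband/verbs.h>
        #include <stdint.h>

        #define PAGE_SIZE      4096
        #define PREFETCH_PAGES 32      /* pages fetched per fault (assumed) */

        /* On a fault, read the target page plus its neighbors from the memory
         * node where they are staged into one entry of the shared read buffer. */
        static int comex_prefetch(struct ibv_qp *qp, struct ibv_mr *mr,
                                  void *read_entry,
                                  uint64_t remote_addr, uint32_t rkey)
        {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)read_entry,
                .length = PREFETCH_PAGES * PAGE_SIZE,
                .lkey   = mr->lkey,
            };
            struct ibv_send_wr wr = {
                .wr_id               = 2,
                .sg_list             = &sge,
                .num_sge             = 1,
                .opcode              = IBV_WR_RDMA_READ,
                .send_flags          = IBV_SEND_SIGNALED,
                .wr.rdma.remote_addr = remote_addr,
                .wr.rdma.rkey        = rkey,
            };
            struct ibv_send_wr *bad = NULL;
            return ibv_post_send(qp, &wr, &bad);   /* one read for all prefetched pages */
        }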


