Abstract

As a basic algorithm for big data processing, external sorting suffers from massive read and write operations to external memory. Recent works offload part of the data processing from the host side to the solid state drive (SSD) to reduce data transmission. However, the internal memory of the SSD is limited, and undesirable data retention can occur during the merge phase. Therefore, to improve memory efficiency, we propose an algorithm named ISort. Specifically, we build an index table that maps each page's minimum key to its address. The index table determines the order in which pages are read during the merge phase according to their minimum values: pages are read into memory sequentially, reducing the data residing in memory and improving memory efficiency. Since the merge phase is performed inside the SSD, ISort can exploit the high internal I/O bandwidth of the SSD to speed up the merge phase. Because channel utilization directly influences performance, we search for the optimal ratio of read and write channels by comparing the read and write performance of "specialized channels" and "hybrid channels". Experimental results show that ISort maintains better data processing speed when SSD memory is limited, outperforming other robust algorithms. In addition, the algorithm performs better with the crossover (hybrid channel) strategy than with the specialization strategy.

1. Introduction

The development of storage technology and cloud computing has made it possible to process terabytes and petabytes of data [1]. However, the widespread use of data-intensive applications and personal mobile devices generates massive amounts of data, estimated to reach 185 zettabytes in 2025 [2]. Therefore, quickly mining valuable information from massive data has become an urgent problem in this era of rapid data growth.

External sorting is one of the most fundamental algorithms in data management systems; it handles the situation in which main memory cannot hold all the data because the data volume is too large. For example, in the MapReduce framework of Hadoop, external sorting is used extensively to sort intermediate and final data in both the map and reduce operations [3]. External sorting also plays a critical role in database query processing, since finding the desired results often involves large amounts of data [4]. External sorting consists of two phases: run generation and run merge. In the first phase, the input data are divided into blocks that fit in memory and loaded into memory for sorting; each sorted block (a run) is then written back to the storage device. In the second phase, multiple runs are merged into fully sorted data [5], which requires many I/O operations and thus incurs high I/O overhead. Therefore, I/O time is critical in the elapsed time of external merge sorting.

Traditional external sorting algorithms are mainly designed for hard disk drives (HDDs), which are characterized by slow speed, high power consumption, and poor shock resistance. In contrast, SSDs have clear advantages: no mechanical arm, random access, high read and write bandwidth, shock resistance, low energy consumption, high stability, long service life, and no noise [6]. With the development of flash memory technology and falling prices, SSDs are gradually replacing HDDs in the storage market [7]. However, external sorting is I/O-intensive: its execution involves many read and write operations on the storage device, which affects both the performance of the algorithm and the service life of the SSD. To solve this problem, researchers have tried to move computation into the SSD, an approach known as the fusion of computing and storage (in-storage computing).

Many efforts have been made towards external sorting. Reference [8] makes use of the computing resources in the SSD to accelerate deep learning. The BlueDBM architecture [9] accelerates data queries with in-SSD computing. Reference [10] offloaded the external sorting work to the SSD. In Reference [11], source data are divided into multiple blocks and sorted separately in memory, and the merge work begins when there is an access request. This approach uses the channel parallelism of the SSD but does not consider the case in which the data are partially sorted. All the above methods suffer from the issue that when pages holding large keys and pages holding small keys are read simultaneously, the pages with large keys remain in memory for a long time, reducing memory utilization. To tackle this issue, we build an index table that records the minimum value of each page of every run; pages are then read sequentially into the input buffer and merged within the SSD. The channel congestion problem caused by the read/write rate is also discussed. In summary, our major contributions can be summarized as follows:
(i) We present a new external sorting algorithm named ISort that implements rapid sorting within the SSD. For partially sorted data, it records the minimum values of sorted blocks and indexes them to determine the merge order. By avoiding the extended retention of large values in memory, ISort enhances the internal memory utilization of the SSD and significantly improves external sorting performance.
(ii) We tune the configuration of the SSD hardware during the operation of the ISort algorithm, finding the best ratio of parallel read and write channels by comparing the effects of different read/write channel ratios on external sorting.
(iii) Experimental results show that ISort has better read and write performance than previous works. For example, ISort improves read and write performance as the total amount of data increases, and it also improves performance when the data size remains fixed and the memory size increases.

The rest of this paper is organized as follows. The background and motivation are introduced in Section 2. Section 3 describes the detailed implementation of ISort and different channel strategies. Simulation experiments are presented in Section 4. Section 5 provides an overview of the related work. The conclusion is presented in Section 6.

2. Background and Motivation

In this section, we first describe the basic external sorting algorithm and the general architecture of a typical SSD, and then we discuss the motivation for this work.

2.1. External Sorting

Traditional external sorting generally has two phases, as illustrated in Figure 1. The source data initially reside on the storage device. The first phase divides the data into blocks sized to fit the input buffer; each block is loaded into host memory and sorted, and the sorted blocks are written back to the storage device. The second phase merges the sorted blocks generated in the first phase into one sorted output over several iterations [5]. These iterations produce many read and write operations on the storage device, resulting in high I/O overhead. Because of the large performance gap between DRAM and storage devices, I/O time is decisive in the elapsed time of external merge sorting. For data-intensive applications, a critical evaluation factor is whether the data can be processed quickly enough to produce a timely response.
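
To make the two phases concrete, the following C sketch generates sorted runs from a binary file of integers and then merges them with a simple k-way merge. The file names and the tiny in-memory capacity are invented for illustration, and error handling is reduced to the essentials; it illustrates the flow described above, not any particular implementation from the literature.

```c
/* Minimal two-phase external merge sort sketch (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

#define MEM_KEYS 1024                 /* keys that fit in "memory" (assumed) */

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Phase 1: cut the input into memory-sized blocks, sort each block,
 * and write it back as a run file. Returns the number of runs. */
static int generate_runs(const char *input) {
    FILE *in = fopen(input, "rb");
    if (!in) { perror(input); exit(1); }
    int *buf = malloc(MEM_KEYS * sizeof *buf);
    int nruns = 0;
    size_t n;
    while ((n = fread(buf, sizeof *buf, MEM_KEYS, in)) > 0) {
        qsort(buf, n, sizeof *buf, cmp);
        char name[32];
        snprintf(name, sizeof name, "run%d.bin", nruns++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, sizeof *buf, n, out);
        fclose(out);
    }
    free(buf);
    fclose(in);
    return nruns;
}

/* Phase 2: k-way merge; each iteration scans the run heads, emits the
 * smallest one, and refills that head from its run file. */
static void merge_runs(int nruns, const char *output) {
    FILE **run = malloc(nruns * sizeof *run);
    int *head = malloc(nruns * sizeof *head);
    int *alive = malloc(nruns * sizeof *alive);
    for (int i = 0; i < nruns; i++) {
        char name[32];
        snprintf(name, sizeof name, "run%d.bin", i);
        run[i] = fopen(name, "rb");
        alive[i] = fread(&head[i], sizeof head[i], 1, run[i]) == 1;
    }
    FILE *out = fopen(output, "wb");
    for (;;) {
        int min = -1;
        for (int i = 0; i < nruns; i++)          /* pick the smallest head */
            if (alive[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0) break;                      /* all runs drained */
        fwrite(&head[min], sizeof head[min], 1, out);
        alive[min] = fread(&head[min], sizeof head[min], 1, run[min]) == 1;
    }
    fclose(out);
    for (int i = 0; i < nruns; i++) fclose(run[i]);
    free(run); free(head); free(alive);
}

int main(void) {
    merge_runs(generate_runs("input.bin"), "sorted.bin");
    return 0;
}
```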

2.2. Solid State Drives (SSDs)

With the development of flash memory technology and falling prices, SSDs have gradually become the mainstream high-performance storage medium. Flash-based SSDs have been widely studied by industry and academia because they provide random access, high speed, high throughput, and low energy consumption.

By cell architecture, flash memory can be divided into single-level cell (SLC) and multilevel cell (MLC) flash [12]. MLC allows a single storage cell to hold twice as much data, making it cheaper to manufacture, but it has slower write speed, higher power consumption, shorter lifetime, and a higher error rate than SLC. By interface type, flash memory can be divided into NOR flash and NAND flash. NOR flash offers fast random reads, but its erase and program operations are slow, and its capacity is relatively small; it allows random access, suits frequent-read situations, and is usually used to store program code. Compared with NOR flash, NAND flash has lower cost, higher density, higher capacity, and faster erase and write speeds, making it suitable for data storage. This article only discusses NAND flash memory, which provides three basic operations: read, write (program), and erase.
(i) Read/write operations: the basic unit of read and write operations is the page. A flash write generally takes 200–700 µs, approximately ten times as long as a read.
(ii) Erase operation: the basic unit of the erase operation is the block. An erase sets all bits of the target block to 1. If a flash page has already been written and we want to write it again, the containing block must be erased first; this is called erase-before-write [13]. The latency of an erase operation is about 2–3 ms, far longer than that of a read or write, so frequent erase operations degrade overall performance.

Flash memory can endure only a limited number of erasures; a block that is erased too frequently can no longer be used. Flash memory also does not support in-place overwriting, so the SSD controller adds a translation layer called the flash translation layer (FTL) to hide the erase-before-write constraint. When data are updated, the FTL writes the new data to another free page and marks the original page invalid. The FTL has three main functions: address mapping, wear leveling, and garbage collection [14]. Address mapping can be divided into page mapping, block mapping, and hybrid mapping [15]. The mapping table for block mapping is tiny, reducing memory overhead while still offering a good response to read requests. Wear leveling improves the performance of the SSD and prolongs its service life. Garbage collection [16] periodically reclaims the space occupied by invalid data and erases appropriate blocks to recycle free pages.
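
The out-of-place update behavior of the FTL can be illustrated with a minimal page-mapping sketch in C. The table sizes and the naive free-page allocator are assumptions made for illustration, and wear leveling and garbage collection are omitted.

```c
/* Hypothetical page-level FTL sketch: logical pages map to physical
 * pages; an update is written out of place and the old page is marked
 * invalid for later garbage collection. */
#include <stdio.h>

#define LOGICAL_PAGES  64
#define PHYSICAL_PAGES 128
#define FREE    0
#define VALID   1
#define INVALID 2

static int l2p[LOGICAL_PAGES];       /* logical -> physical page map */
static int state[PHYSICAL_PAGES];    /* FREE / VALID / INVALID       */
static int next_free = 0;            /* naive free-page allocator    */

static void ftl_init(void) {
    for (int i = 0; i < LOGICAL_PAGES; i++) l2p[i] = -1;
}

/* Write: grab a free physical page and invalidate the old mapping. */
static int ftl_write(int lpn) {
    if (next_free >= PHYSICAL_PAGES) return -1;   /* GC needed (omitted)  */
    if (l2p[lpn] >= 0) state[l2p[lpn]] = INVALID; /* out-of-place update  */
    l2p[lpn] = next_free;
    state[next_free] = VALID;
    return next_free++;
}

int main(void) {
    ftl_init();
    ftl_write(3);                    /* first write of logical page 3 */
    int ppn = ftl_write(3);          /* the update goes to a new page */
    printf("logical 3 -> physical %d\n", ppn);
    return 0;
}
```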

Figure 2 illustrates the general architecture of a NAND SSD, which is composed of a controller chip, DRAM, host interfaces, and an array of NAND flash memory chips connected to flash controllers through multiple channels.

2.3. Motivation

In practical applications, source data usually exhibit locality and are often partially sorted. Recent research mainly focuses on reducing the amount of data transferred between memory and external storage devices. Reference [10] takes advantage of the SSD's internal computing power but does not consider its limited internal memory resources. Completely ignoring the characteristics of the data leads to a large amount of data being stuck in memory for a long time. Therefore, we aim to make full use of the internal resources of the SSD and the characteristics of the data itself to accelerate the external sorting algorithm.

3. ISort Design

We present ISort, an external sorting algorithm that performs data merging by exploiting the internal hardware infrastructure of SSDs. We introduce its architecture and elaborate on its design techniques.

3.1. Overall Architecture

We propose a new external sorting mechanism called ISort that performs data merging by exploiting the internal hardware infrastructure of SSDs. The traditional external sorting algorithm cannot simply be moved into the SSD, because the standard SSD software stack uses the FTL only to process host-side data requests [17], as described in Figure 2; we therefore need to modify the SSD's standard software architecture. However, directly running the merge sort algorithm inside the SSD would consume a large amount of the SSD's memory resources, significantly hurting SSD performance. To solve this problem, we build a page-min index that records the minimum key of every page; these minimum values determine the order in which pages enter the input buffer. The whole process of ISort is shown in Figure 3, where gray blocks represent unsorted data, yellow and blue blocks represent internally ordered data, and green blocks represent fully sorted data. ISort has two phases. The first is the run generation phase, which differs from traditional external sorting in how the data are written back to the storage device. The run merge phase is performed inside the SSD.
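
The following C sketch shows one plausible shape for the page-min index described above; the field and function names are our own, not taken from the paper's implementation. Each entry pairs a page's minimum key with its address, and sorting the entries by minimum key yields the order in which pages are fetched during the merge.

```c
/* Sketch of a page-min index (names and values hypothetical). */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int min_key;     /* smallest key stored on the page */
    int page_id;     /* flash address of the page       */
} MinIndexEntry;

static int entry_cmp(const void *a, const void *b) {
    const MinIndexEntry *x = a, *y = b;
    return (x->min_key > y->min_key) - (x->min_key < y->min_key);
}

int main(void) {
    /* Minimum keys as they might be recorded during run generation. */
    MinIndexEntry idx[] = {
        { 42, 0 }, {  7, 1 }, { 19, 2 }, {  3, 3 }, { 28, 4 },
    };
    size_t n = sizeof idx / sizeof idx[0];
    qsort(idx, n, sizeof idx[0], entry_cmp);   /* cf. SortMinIndex()     */
    for (size_t i = 0; i < n; i++)             /* fetch order for merge  */
        printf("fetch page %d (min key %d)\n", idx[i].page_id, idx[i].min_key);
    return 0;
}
```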

Table 1 defines the notations used by ISort. K is the size of the key in a record. Keys are allocated in blocks, each represented as Bi with 0 ≤ i < N. We use M to denote the host memory size assigned to perform the sort. The record set R is divided into N parts, so N also indicates the number of runs. P indicates the number of pages in a block; C is the number of channels; S is the fully sorted dataset; Pmin denotes the minimum pages (the unread pages with the smallest minimum keys); and Psec denotes the second-minimum pages.

3.2. Run Generation Phase

Algorithm 1 generates intermediate sorted files called runs. We discard the value in each record because only the key needs to be sorted. We split R into B0, …, BN−1, each of which is the size of one input buffer (lines 3–7 in Algorithm 1). To accelerate writes to storage, ISort activates multiple flash channels simultaneously, making full use of the parallelism of the SSD. However, when the key values in a run are skewed, a channel can become blocked, slowing down later read operations; we therefore slice each run into pages and write them interleaved across channels. During the write-back process, we record each page's minimum key in a table kept in SSD memory, building an index called the page-min index.

(1)Input: Unsorted data R
(2)Output: Sorted runs B0, …, BN−1
(3)N ← size(R)/M
(4)for i from 0 to N − 1 do
(5)  Bi ← the i-th block of R of size M
(6)  Sort(Bi)
(7)end for
(8)page.id ← 0
(9)for i from 0 to N − 1 do
(10)for j from 0 to P − 1 do
(11)  Open channel j mod C
(12)  write page j of Bi to channel j mod C
(13)  InsertIndexToMinIndex(minimum key of page j, page.id); page.id ← page.id + 1
(14)  SortMinIndex()
(15)end for
(16)end for
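
A minimal C sketch of the write-back step of Algorithm 1, under assumed parameters (N runs, P pages per run, C channels, and synthetic minimum keys), shows how pages are striped round-robin across channels while the min index is filled:

```c
/* Illustrative striping of run pages across channels (assumed sizes). */
#include <stdio.h>

#define C 4                       /* flash channels (assumed) */
#define N 3                       /* runs                     */
#define P 6                       /* pages per run            */

typedef struct { int min_key, page_id, channel; } Entry;

int main(void) {
    Entry index[N * P];
    int id = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < P; j++, id++) {
            int channel = j % C;              /* interleaved write     */
            int min_key = i * 100 + j * 10;   /* stand-in minimum key  */
            index[id] = (Entry){ min_key, id, channel };
        }
    }
    for (int k = 0; k < id; k++)              /* min index after writes */
        printf("page %d: min %d on channel %d\n",
               index[k].page_id, index[k].min_key, index[k].channel);
    return 0;
}
```
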
3.3. Run Merge Phase

Algorithm 2 describes the run merge phase of ISort. Pages are read into the SSD's internal input buffer in the order given by the page-min index recorded by Algorithm 1. Unlike traditional merge methods, we do not search every run for its current minimum page, and the pages read in parallel at one time may even come from the same run; in ISort, the order in which pages are loaded into the SSD's internal memory follows only the page-min index. Because the data are only partially ordered, a page read under the traditional scheme may belong to a run of mostly large values that will not be consumed for a long time; such data are not output soon after being read into memory but only once comparably large pages arrive. Next, we read the Psec pages into the input buffer, in index order, as standby buffers for Pmin. By doing so, the data transmission capacity can be better matched with the computing power: when a page is consumed, we can refill the buffer without affecting the sorting of Pmin. Once there are C pages in memory, the merging process starts, overlapping with buffer data transfers. In each iteration, the minimum key in the input buffer is copied to the output buffer; we sort in memory with quicksort, whose average computational complexity is O(n log n). When the output buffer becomes full, we flush it to a flash chip. Because the pages of one run are interleaved across different channels, a channel rarely runs out of pages, except in the final phase. However, parallel read/write operations may cause channel congestion, which is discussed in the experiment section.

(1)Input: Partially sorted runs
(2)Output: Sorted data S
(3)Read Pmin and Psec
(4)while ISort has not yet processed all pages do
(5)if there are C pages in the memory then
(6)  Sort the keys in the input buffer
(7)  Output the minimum key into the output buffer
(8)  if the output buffer is full then
(9)   Flush the output buffer to the flash chip
(10)  end if
(11)end if
(12)end while
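
The following C sketch illustrates the essence of the merge loop under simplifying assumptions: tiny hand-made pages already ordered by their minimum key, one merged stream, and no channel model. Buffered keys can be safely emitted once they are no larger than the minimum key of the next unloaded page, since every later page starts at or above that value.

```c
/* Merge in page-min-index order, emitting keys as they become safe. */
#include <stdio.h>
#include <stdlib.h>

#define KEYS_PER_PAGE 3
#define NPAGES 4

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Pages already ordered by their minimum key (the min index). */
    int pages[NPAGES][KEYS_PER_PAGE] = {
        { 1, 5, 9 }, { 2, 3, 11 }, { 4, 6, 7 }, { 8, 10, 12 },
    };
    int buf[NPAGES * KEYS_PER_PAGE], n = 0;
    for (int p = 0; p < NPAGES; p++) {
        for (int k = 0; k < KEYS_PER_PAGE; k++)   /* load the next page */
            buf[n++] = pages[p][k];
        qsort(buf, n, sizeof buf[0], cmp);
        /* Everything <= the next page's minimum is final output. */
        int bound = (p + 1 < NPAGES) ? pages[p + 1][0] : buf[n - 1] + 1;
        int out = 0;
        while (out < n && buf[out] <= bound) printf("%d ", buf[out++]);
        n -= out;                                 /* keep the remainder  */
        for (int k = 0; k < n; k++) buf[k] = buf[k + out];
    }
    printf("\n");    /* prints 1 2 3 4 5 6 7 8 9 10 11 12 */
    return 0;
}
```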

Figure 4 illustrates an example. For convenience of demonstration, we draw six channels and six input buffers to describe the merge phase of ISort in more detail. Suppose there are six runs interleaved across six channels, shown in different colors. The Pmin pages in the flash chips are transferred to the input buffers in parallel, as shown in process 1. The Psec pages are transferred to the input buffers in parallel, as shown in process 2. When a page in Pmin is exhausted, the Psec page on the same channel is immediately promoted to Pmin, and at the same time the read of the next page is triggered; Psec is thus continuously converted into Pmin as pages are consumed. In the best case, the Pmin pages are spread over different channels and can be read concurrently, as shown in process 1. In the worst case, the Pmin pages all reside on the same channel and can only be read serially, as in the traditional method. Because our data are partially ordered and the pages of each run are placed across different channels, the probability of the worst case is negligible.

Figure 5 shows six sorted runs; each run consists of three pages, and each page contains three keys. Assume that the input buffer can hold 12 pages, as in Figure 4. The middle of Figure 5 shows the traditional method, which reads the minimum page of each run into the input buffer. When a page with large values and a page with small values appear in the input buffer simultaneously, the large-valued page is retained in memory for a long time, reducing memory utilization. The lower part of Figure 5 shows our method: guided by the page-min index, ISort reads pages sequentially, avoiding long-retained pages in the input buffer and improving input buffer space utilization.

4. Experimental Results

4.1. Evaluation Design

This section describes the experimental platform setup and the methodology to evaluate ISort.

In the following experiments, we used SSDsim [18], an open-source SSD simulator that follows the ONFI protocol and offers high accuracy and modularity. The hardware configuration parameters of the SSDsim simulator used in this paper are shown in Table 2.

We take ActiveSort as the baseline; it performs one more write-back operation than ISort. The comparison is conducted from the perspectives of dataset size and memory size. We also evaluate the impact of SSD memory and I/O traces on performance. In addition, we use different channel ratios to compare the performance of specialized channels and hybrid channels.

4.2. Experimental Results

Since external sorting is an I/O-intensive algorithm, read and write requests are initiated frequently and alternately in the merge phase, as shown in Figure 6. When more channels are used for writing, both the read time (RT) and write time (WT) of ActiveSort increase noticeably, while those of ISort decrease, indicating the superior performance of ISort.

If read and write requests are processed by separate channel groups, increasing the number of read channels congests write-request processing, and decreasing it congests read-request processing. To avoid idle channels and improve channel resource utilization, we can instead let every channel serve both read and write requests during the merge phase.
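
A toy C model (all parameters invented) contrasts the two strategies: with specialized channels, a skewed read/write mix piles requests onto the smaller channel group, while hybrid channels spread the same requests evenly. The deepest per-channel queue serves as a crude proxy for congestion.

```c
/* Toy dispatch model: specialized vs. hybrid channel assignment. */
#include <stdio.h>

#define CHANNELS 8
#define READ_CH  6          /* specialized: channels 0..5 read-only */

int main(void) {
    /* Request mix: 1 = read, 0 = write (10 reads, 6 writes). */
    int reqs[] = { 1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,1 };
    int n = sizeof reqs / sizeof reqs[0];
    int spec[CHANNELS] = {0}, hyb[CHANNELS] = {0};
    int r = 0, w = 0, h = 0;
    for (int i = 0; i < n; i++) {
        if (reqs[i])                               /* read channel group  */
            spec[(r++) % READ_CH]++;
        else                                       /* write channel group */
            spec[READ_CH + (w++) % (CHANNELS - READ_CH)]++;
        hyb[(h++) % CHANNELS]++;     /* hybrid: round robin on all */
    }
    int max_s = 0, max_h = 0;
    for (int c = 0; c < CHANNELS; c++) {   /* deepest queue ~ congestion */
        if (spec[c] > max_s) max_s = spec[c];
        if (hyb[c]  > max_h) max_h = hyb[c];
    }
    printf("specialized: max queue depth %d\n", max_s);  /* prints 3 */
    printf("hybrid:      max queue depth %d\n", max_h);  /* prints 2 */
    return 0;
}
```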

As shown in Figure 7, the DRAM within the SSD can cache read and write requests, and increasing the DRAM capacity improves the hit rate of read and write requests.

Figure 8 shows the results on different datasets. ISort exhibits a more stable performance improvement than ActiveSort.

Figure 9 shows the results for different page sizes. ISort performs best with 4 KB or 8 KB pages; performance degrades as the page size grows or shrinks beyond this range. When the page size is relatively large, it causes channel congestion and increases request processing time.

5. Related Work

When source data are too large to be loaded into limited memory all at once, external sorting is necessary. External sorting can be divided into HDD-based external sorting, embedded flash-based external sorting, SSD-based external sorting, and NVM-based external sorting.

External sorting based on HDDs generally reduces seek time and rotational delay by optimizing the algorithm and reducing random accesses to the external storage device. In Reference [19], staggered placement and a new reading strategy are proposed, speeding up the execution of HDD-based external sorting and improving performance. Reference [20] proposed an HDD-based external sorting algorithm that requires no additional disk space and generates no intermediate data; its main idea is to use quicksort and a particular merging strategy to reduce the number of comparisons in the sorting process and thus improve execution performance.

Compared with HDDs, flash-based SSDs have no disk head or mechanical arm, so there is no seek time or rotational delay [21]. Reference [22] designed an FTL (FTL-SS) based on a single channel and single way and extended it to the multichannel, multiway case, verifying the versatility and effectiveness of the method. References [23–26] make full use of the internal parallelism of SSDs through two-phase scheduling, request rescheduling, and dynamic write-request mapping to improve SSD performance. Reference [27] proposes a channel striping technique to improve channel resource utilization. FMsort makes full use of the SSD's low access latency and high random I/O bandwidth to speed up the execution of external merge sorting [28]. Montres [29] takes advantage of SSD performance characteristics to speed up external sorting. ActiveSort implements the merge operation of external sorting inside the SSD by using an Active SSD [10], a special kind of SSD [30]. Kang et al. proposed a multichannel storage system based on NAND flash memory, in which each of several independent channels has multiple NAND flash memory chips [31].

With the development of storage technology, new media such as PCM, STT-RAM, and ReRAM have been widely studied. PCM [32, 33] is a new nonvolatile storage medium offering byte addressability, high density, and high persistence. NVM is a class of nonvolatile storage devices characterized by byte addressability, random access, high density, low energy consumption, and high access speed [34]. However, NVM also has limitations: its service life is limited, and its read and write performance is asymmetric [35–37]. Khernache et al. proposed MONTRES-NVM, an external sorting algorithm for a hybrid PCM and DRAM storage system [38].

6. Conclusion

The amount of data has increased exponentially in recent years, and the demand for data processing speed has grown with it. The emergence of Active SSDs makes near-data processing possible, but traditional sorting algorithms need to be adapted to the limited memory inside the SSD. This paper analyzes recent algorithms and observes that pages with large values brought into memory remain there for a long time, hurting memory utilization. The main idea of ISort is to use the computing resources within the SSD to handle the merge phase, reading data in the order of each page's minimum key to cope with the SSD's limited internal memory. To further improve speed, we adopt an interleaving strategy in the write-back part of the run generation phase. Because heavy I/O produces varying degrees of read and write congestion, we tested different channel read/write ratios. We evaluated performance under different read/write channel ratios, data sizes, page sizes, and SSD memory sizes. Compared with the ActiveSort + write baseline, ISort reduces execution time by more than 36%. As future work, it would be valuable to study the influence of different storage devices on various algorithms and to exploit their access characteristics to further reduce time overhead. This paper only discusses channel-level parallelism; deeper levels of parallelism (such as die- and plane-level) remain to be explored. We also plan to study how data placement increases the garbage collection load: when the ordered output is written back to the SSD during the merge phase, it remains to be determined whether allocating fresh contiguous space or filling scattered free space performs better.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China under grant no. 2021YFB0300103, the National Natural Science Foundation of China (nos. 61872392 and U1911401), and the Major Program of Guangdong Basic and Applied Research (no. 2019B030302002).