What Is the Heterogeneous-Memory Storage Engine?

Micron Technology | June 2020

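In case you haven't heard, Micron recently released its Heterogeneous-Memory Storage Engine (HSE) to the open-source community. Our design focuses on providing a solution that makes storage class memory (SCM) and SSDs more performant and increases the effective lifetime of SSDs through reduced write amplification, all while being deployed at massive scale. Compared with legacy storage engines, HSE often improves performance on workloads such as the Yahoo! Cloud Serving Benchmark (YCSB) several times over.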

What Is the Heterogeneous-Memory Storage Engine (HSE)?

Why heterogeneous? Micron has a broad portfolio of DRAM, SCM and SSDs that gives us the insight and expertise to build a storage engine that intelligently manages data placement across disparate memory and storage media types. Unlike legacy storage engines written for hard disk drives, HSE was designed from the ground up to exploit the high throughput and low latency of SCM and SSDs.

Implementation

HSE uses the advantages of discrete media types to support two media classes for data storage: a “staging” media class and a “capacity” media class. A staging media class is typically configured to run on high-performance (IOPS and/or MB/s), low-latency and high-write-endurance media (for example, SCM or data center class SSDs with NVMe™). Hot, short-lived data is assigned to the staging media class, while cold, long-lived data is typically configured to reside on lower-cost, lower-write-endurance media (like quad-level cell [QLC] SSDs) in the capacity media class tier. This enables HSE to achieve high throughput and low latency while also conserving write cycles on lower-endurance media.
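The split between the two media classes can be pictured with a small Python sketch. The names `MediaClass`, `DataProfile` and `choose_media_class` are hypothetical and are not part of the HSE API; the rule that data must be both hot and short-lived to land on staging media is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MediaClass(Enum):
    STAGING = "staging"    # fast, high-write-endurance media (e.g., SCM or NVMe SSDs)
    CAPACITY = "capacity"  # lower-cost, lower-write-endurance media (e.g., QLC SSDs)


@dataclass
class DataProfile:
    hot: bool          # accessed frequently
    short_lived: bool  # expected to be overwritten or deleted soon


def choose_media_class(profile: DataProfile) -> MediaClass:
    """Hypothetical rule: hot, short-lived data goes to staging, the rest to capacity."""
    if profile.hot and profile.short_lived:
        return MediaClass.STAGING
    return MediaClass.CAPACITY


if __name__ == "__main__":
    print(choose_media_class(DataProfile(hot=True, short_lived=True)))
    print(choose_media_class(DataProfile(hot=False, short_lived=False)))
```

Keeping short-lived data off the capacity devices is what saves their write cycles: data that is overwritten quickly never needs to be written to QLC at all.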

Configurable Durability Layer

The HSE durability layer is a user-configurable logical construct that resides on the staging media class. The durability layer provides user-definable data persistence in which the user specifies an upper bound on how many milliseconds of data may be lost in the event of a system failure, such as a power loss.

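Data is initially ingested from DRAM into the durability layer. Storage is allocated from the faster staging media class to meet the low-latency, high-throughput requirements of the durability layer. Unlike a traditional write-ahead log (WAL), this durability layer avoids the “double write problem” common with classic journaling, which significantly reduces write amplification.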
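To make the "upper bound in milliseconds" idea concrete, here is a toy Python model. It is a sketch under stated assumptions, not how HSE implements its durability layer: the class `DurabilityBuffer`, its fields and the explicit `now_ms` clock are all hypothetical, and a real engine would drive the flush from a timer rather than from incoming writes.

```python
class DurabilityBuffer:
    """Toy model: buffered writes are persisted once max_loss_ms has elapsed."""

    def __init__(self, max_loss_ms: int):
        self.max_loss_ms = max_loss_ms  # user-specified bound on data loss
        self.pending = []               # writes held only in DRAM
        self.persisted = []             # writes that reached staging media
        self.last_flush_ms = 0

    def put(self, now_ms: int, key: str, value: str) -> None:
        self.pending.append((key, value))
        self.tick(now_ms)

    def tick(self, now_ms: int) -> None:
        # Flush when at least max_loss_ms has elapsed since the last flush.
        if now_ms - self.last_flush_ms >= self.max_loss_ms:
            self.persisted.extend(self.pending)
            self.pending.clear()
            self.last_flush_ms = now_ms

    def crash(self) -> list:
        """Return the writes that would be lost if power failed right now."""
        return list(self.pending)


if __name__ == "__main__":
    buf = DurabilityBuffer(max_loss_ms=50)
    buf.put(10, "a", "1")
    print("lost on crash at t=10:", buf.crash())
    buf.put(60, "b", "2")
    print("lost on crash at t=60:", buf.crash())
```

As long as `tick` is called at least once per interval, anything still in `pending` was written within the last `max_loss_ms` milliseconds, which is the bound the user asked for.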

Data Aging

As stored data ages, the data migrates through multiple layers of the system and is rewritten as part of garbage collection to optimize query performance (completion time). Here is the high-level flow:

When new data needs to be stored, it is first written to the durability layer.

As the data ages, it is rewritten to the capacity media class as a background maintenance operation.

As new data arrives, that new data may render existing data obsolete (by updating or deleting records that were previously written). Maintenance operations periodically scan existing data to reclaim space. If most of the data is now invalid or obsolete, these operations reclaim space by rewriting just the data that is still valid, freeing up all the space the old data occupied (i.e., garbage collection). To service queries efficiently, valid data is also arranged so that it can be scanned easily.
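The garbage collection step described above can be sketched in a few lines of Python. This is an illustrative model only: the record layout, the 50% threshold and the function names are assumptions, and delete markers are kept rather than purged so that older versions in untouched segments stay obsolete.

```python
from typing import Optional

# (key, sequence number, value); a value of None marks a delete
Record = tuple[str, int, Optional[str]]


def latest_versions(segments: list[list[Record]]) -> dict[str, int]:
    """Map each key to the highest sequence number written for it."""
    latest: dict[str, int] = {}
    for segment in segments:
        for key, seq, _ in segment:
            latest[key] = max(seq, latest.get(key, -1))
    return latest


def collect_garbage(segments: list[list[Record]],
                    min_valid_fraction: float = 0.5) -> list[list[Record]]:
    """Rewrite mostly-obsolete segments, keeping only their valid records."""
    latest = latest_versions(segments)
    kept: list[list[Record]] = []
    survivors: list[Record] = []
    for segment in segments:
        if not segment:
            continue  # nothing to keep or rewrite
        valid = [r for r in segment if latest[r[0]] == r[1]]
        if len(valid) / len(segment) < min_valid_fraction:
            survivors.extend(valid)  # rewrite just the data that is still valid
        else:
            kept.append(segment)     # mostly valid: leave in place
    if survivors:
        kept.append(sorted(survivors))  # key order makes later scans easy
    return kept


if __name__ == "__main__":
    old = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "z")]
    new = [("a", 4, "x2"), ("b", 5, None), ("d", 6, "w")]
    print(collect_garbage([old, new]))
```

In the example, two of the three records in the older segment were superseded, so the segment is dropped and only `("c", 3, "z")` is rewritten.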

Valid data is reorganized into tiers so that queries can be processed faster. Key and value data are isolated into separate streams throughout this process; keys are written to the staging media class to facilitate faster lookups. Eventually, older data at the bottom tier is written to the designated capacity media class devices.

As queries are serviced and data is read from both media classes, indexes are page-cached into DRAM. An LRU (least recently used) algorithm dynamically ranks indexes to facilitate index tracking, pinning the hottest (i.e., most frequently accessed) indexes in memory, subject to available system DRAM.
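For readers unfamiliar with LRU ranking, the following Python sketch shows the basic mechanism using the standard library's `OrderedDict`. The class `IndexCache` and the `load_page` callback are hypothetical, and the sketch shows plain LRU eviction only; it does not model HSE's pinning of the hottest indexes.

```python
from collections import OrderedDict
from typing import Callable


class IndexCache:
    """Minimal LRU cache for index pages, bounded by a page count."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._pages: OrderedDict = OrderedDict()

    def get(self, page_id: int, load_page: Callable[[int], str]) -> str:
        if page_id in self._pages:
            self._pages.move_to_end(page_id)  # mark as most recently used
            return self._pages[page_id]
        page = load_page(page_id)             # cache miss: read from media
        self._pages[page_id] = page
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)   # evict the least recently used page
        return page


if __name__ == "__main__":
    loads = []

    def load(page_id: int) -> str:
        loads.append(page_id)
        return f"index-{page_id}"

    cache = IndexCache(capacity=2)
    for page_id in (1, 2, 1, 3, 2):
        cache.get(page_id, load)
    print("pages read from media:", loads)
```

With room for two pages, page 2 is evicted when page 3 arrives because page 1 was touched more recently, so the final lookup of page 2 has to go back to the media.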

Media Class Performance

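Our test setup used one Micron 9300 SSD with NVMe™ as the staging media class device and four Micron 5210 SATA QLC SSDs as the capacity media class devices. We used the Yahoo!® Cloud Serving Benchmark (YCSB) to compare operations per second and 99.9% tail latency: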

  • First run: Four Micron 5210 QLC SSDs configured as capacity media class devices
  • Second run: Four 5210 SSDs configured as capacity media class devices and one Micron 9300 SSD with NVMe as a staging media class device

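We ran YCSB workloads A, B, C, D and F with the same thread count for both configurations (see note 1). Table 1 summarizes the YCSB workload mixes, with application examples taken from the YCSB documentation. Tables 2 through 4 share additional test details about the hardware, software and benchmark configuration.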

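Table 1: Workloads
YCSB Workload | I/O Operations | Application Example
A | 50% read, 50% update | Session store recording user session activity
B | 95% read, 5% update | Photo tagging
C | 100% read | User profile cache
D | 95% read, 5% insert | User status updates
F | 50% read, 50% read-modify-write | User database or recording user activity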

1. Workload E was not tested because it is not universally supported.

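Table 2: Hardware Details
Server platform | Intel®-based (dual-socket)
Processor | 2x Intel E5-2690 v4
Memory | 256GB DDR4
SSDs | Staging media class: 1x Micron 9300 SSD with NVMe; capacity media class: 4x Micron 5210 7.68TB SATA SSDs
Media configuration | LVM striped logical volume

Table 3: Software Details
Operating system | Red Hat Enterprise Linux 8.1
HSE version | 1.7.0
RocksDB version | 6.6.4
YCSB version | 0.17.0

Table 4: YCSB Benchmark Configuration (see note 2)
Dataset | 2TB (2 billion 1,000-byte records)
Client threads | 96
Operations | 2 billion per workload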

2. Different configurations may show different results.
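As a quick sanity check on the figures in Tables 1 and 4, the short Python snippet below recomputes the dataset size and confirms that each workload mix adds up to 100%. The variable names are illustrative; the numbers come straight from the tables.

```python
RECORD_COUNT = 2_000_000_000  # records loaded (Table 4)
RECORD_BYTES = 1_000          # bytes per record (Table 4)

# Dataset size in decimal terabytes (1 TB = 10**12 bytes).
dataset_tb = RECORD_COUNT * RECORD_BYTES / 10**12

# Operation mixes from Table 1, expressed as proportions.
WORKLOAD_MIX = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "F": {"read": 0.50, "read_modify_write": 0.50},
}


def mix_is_complete(mix: dict) -> bool:
    """True when the proportions in a workload mix sum to 1."""
    return abs(sum(mix.values()) - 1.0) < 1e-9


if __name__ == "__main__":
    print(f"dataset size: {dataset_tb} TB")
    for name, mix in WORKLOAD_MIX.items():
        print(name, mix, "complete" if mix_is_complete(mix) else "incomplete")
```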

Throughput

YCSB begins by loading the database. This is a 100% insert workload. Adding a 9300 to the mix reduces the time taken to load the 2TB database by a factor of four.

Figure 1 shows throughput for the load phase and the run phase of the five YCSB workloads. For write-intensive workloads like Workload A (50% update) and Workload F (50% read-modify-write), adding the Micron 9300 as a staging media class improves overall throughput by 2.3 and 2.1 times, respectively. Workloads B and D (5% updates/inserts) show more modest improvements in throughput because 95% of these workloads are reads coming almost entirely from the 5210 SSDs comprising the capacity media class.

Figure 1: YCSB workloads (operations per second for the Micron 9300 and 5210 SSDs)

Latency

Figure 2 shows the 99.9% read (tail) latency. The read tail latencies for all workloads are considerably improved (2 to 3 times) after adding the 9300, except for Workload C, which is 100% reads. Recall that newly arrived writes are first absorbed by the 9300 and gradually written in the background to the 5210s as the data ages. Key data (indexes) are written to the 9300, making lookups faster in the second configuration. A fraction of the reads are serviced by the 9300 instead of the 5210s (depending on the query distribution and age of the data being read).

In addition, because fewer writes go to the 5210s, even the reads that are serviced by the 5210s suffer less contention from ongoing writes, so read tail latency is lower. The insert/update latencies are not pictured as they are similar in the two configurations during the run phase.

Figure 2: YCSB workload latency (read latency for the Micron 9300 vs. 5210 SSDs)

Bytes Written

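Finally, we measured the amount of data written to the 5210s in the course of executing each workload. Adding the 9300 as a staging media class reduces the number of bytes written to the 5210s, conserving write cycles and extending the write lifetime of the 5210s. During the load (insert-only) phase, the number of bytes written to the 5210s was reduced by a factor of 2.4, as shown in Table 5.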

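Table 5: Write Reduction (Bytes Written During Load)
Configuration | 4x 5210 | 9300 + 4x 5210
GB written to 5210s (capacity media) | 7,260 | 2,978
GB written to 9300 (staging media) | N/A | 4,158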
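The 2.4x figure follows directly from Table 5, as this short Python calculation shows. The variable names are illustrative; the values are the ones reported in the table.

```python
# GB written during the load phase (Table 5).
gb_to_5210_without_staging = 7260  # 4x 5210 only
gb_to_5210_with_staging = 2978     # 9300 + 4x 5210
gb_to_9300 = 4158                  # 9300 + 4x 5210

# How many times fewer bytes reach the capacity media once staging is added.
reduction_factor = gb_to_5210_without_staging / gb_to_5210_with_staging

# Total bytes written across both media classes in the second configuration.
total_with_staging = gb_to_5210_with_staging + gb_to_9300

if __name__ == "__main__":
    print(f"reduction in bytes written to the 5210s: {reduction_factor:.1f}x")
    print(f"total GB written with staging: {total_with_staging}")
```

Note that the total written across both media classes in the second configuration (7,136 GB) is close to the 7,260 GB of the first: the writes have mostly moved to the higher-endurance device rather than disappeared.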

Figure 3 shows the total number of gigabytes written during the run phase of the YCSB workloads. Note that this includes both user and background writes. Except for Workload C (100% reads), the workloads show at least a twofold reduction in the total number of bytes written to the 5210s when one 9300 is added to the configuration.

Figure 3: Reduction in data written (total GB written, Micron 9300 vs. 5210 SSDs)

Future Work

As part of future work, we are looking to expand specific aspects of the HSE API to improve its usability, such as custom media class policies that give applications more control. For example, if an application creates a key-value store (KVS, the equivalent of a table in a relational database) that will be used only for indexing, it can specify that the particular KVS should use a staging media class to speed up lookups. If the size of the indexing KVS grows too large to be accommodated on the staging device, the application can specify a policy that uses staging media but falls back to capacity media. We may also introduce predefined media class policy templates and extend the HSE API to allow an application to use them based on its needs. Stay tuned for potential developments.
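A "staging with fallback to capacity" policy of the kind described above might behave like the following Python sketch. Everything here is hypothetical: the class `StagingWithFallback`, its `place` method and the byte-count bookkeeping are invented for illustration and do not correspond to any existing or planned HSE API.

```python
from enum import Enum


class MediaClass(Enum):
    STAGING = "staging"
    CAPACITY = "capacity"


class StagingWithFallback:
    """Hypothetical policy: prefer staging media, fall back to capacity when full."""

    def __init__(self, staging_free_bytes: int):
        self.staging_free_bytes = staging_free_bytes

    def place(self, nbytes: int) -> MediaClass:
        if nbytes <= self.staging_free_bytes:
            self.staging_free_bytes -= nbytes  # reserve room on the staging device
            return MediaClass.STAGING
        return MediaClass.CAPACITY             # not enough room: use capacity media


if __name__ == "__main__":
    policy = StagingWithFallback(staging_free_bytes=100)
    for size in (60, 60, 40):
        print(size, "bytes ->", policy.place(size).value)
```

An indexing KVS using such a policy would keep its lookups fast while it fits on the staging device and would degrade gracefully, rather than fail, once it outgrows it.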