This is an update of the previous year's presentation about scaling LDISKFS. LDISKFS is a commonly used Lustre FS backend. OST and MDT targets can be formatted as LDISKFS to store Lustre objects. Total data storage size depends on target size and target count. Increasing the target count requires adding new hardware, whereas target size can be increased by switching to hard drives with larger capacity. Last year, Seagate significantly increased the capacity of its hard drives. Current solutions use 8TB and 10TB hard drives, and 12TB and 16TB drives are coming soon. LDISKFS, which is based on the EXT4 file system, is scalable, but given the size of the target device (~300TB with 10TB hard drives), some preparation is needed. The verification steps used in previous LDISKFS scaling phases, described in last year's presentation, were repeated for this iteration:
- Issues with external inode xattr are fixed. New tests are added;
- Large memory structures are checked to be ready;
- Public discussion of the inode count limitation has started;
- Large_dir support is added to ext4 and ready to be added to Lustre FS.
Another important problem in this scaling iteration is the EXT4 (and LDISKFS) metadata limits. Without additional support, block groups can be allocated only for partitions smaller than 256TB. There are two possible solutions to this problem. The “bigalloc” feature makes the allocation unit larger (a cluster of several blocks instead of a single block), so fewer block groups need to be allocated. “Meta_bg” changes the filesystem structure so that enough block groups can be allocated. Each approach has its own advantages and disadvantages. To make the right choice, Seagate performed functional and stress testing, with the following results:
- Bigalloc has some known issues. Fixes exist in ext4 and need to be ported to Lustre;
- Some new issues in bigalloc have been found: quota and symbolic links are not ready;
- Mount time with meta_bg is too long. Some patches were added to preload metadata, decreasing mount time dramatically;
- mkfs.lustre requires a fix to exclude resize_inode if meta_bg is enabled (these options cannot be set at the same time).
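The last item above can be illustrated with a minimal sketch. This is not the actual mkfs.lustre change. The helper name `mkfs_feature_string` is hypothetical. The only things taken from mke2fs are the feature names `meta_bg` and `resize_inode` and the `^feature` syntax for disabling a feature in a `-O` list.

```python
def mkfs_feature_string(features):
    """Build an mke2fs-style -O feature list (hypothetical helper).

    If meta_bg is requested, resize_inode is dropped from the enabled
    features and explicitly disabled, since the two cannot be combined.
    """
    enabled = [f for f in features if not f.startswith("^")]
    disabled = [f[1:] for f in features if f.startswith("^")]

    if "meta_bg" in enabled:
        # Remove the conflicting feature and disable it explicitly.
        enabled = [f for f in enabled if f != "resize_inode"]
        if "resize_inode" not in disabled:
            disabled.append("resize_inode")

    return ",".join(enabled + ["^" + f for f in disabled])


print(mkfs_feature_string(["64bit", "meta_bg", "resize_inode"]))
# 64bit,meta_bg,^resize_inode
```

Disabling resize_inode explicitly, rather than only omitting it, guards against the formatting tool enabling it by default.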
At this moment “meta_bg” looks more stable and more attractive as a solution. As a result of this work, the safe partition size limit has increased to 512TB. Hard drive capacity is continuously increasing, so in the future the Lustre FS community will face the need to create partitions > 512TB. Meanwhile, Seagate continues to work to extend LDISKFS capacity.
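The size limits discussed above can be sanity-checked with back-of-envelope arithmetic. The sketch below rests on assumptions not stated in the abstract: 4 KiB blocks, 64-byte group descriptors (the 64bit feature), "TB" read as TiB, and, for the bigalloc case, a 64 KiB cluster size chosen only for illustration. Without meta_bg, all group descriptors must fit in the first block group, which gives an upper bound matching the 256TB figure.

```python
TIB = 2 ** 40
BLOCK_SIZE = 4096   # bytes; assumed 4 KiB blocks
DESC_SIZE = 64      # bytes per group descriptor; assumed 64bit feature


def group_bytes(block_size=BLOCK_SIZE, cluster_size=None):
    """Bytes covered by one block group.

    One bitmap block holds 8 * block_size bits, one bit per allocation
    unit (a block, or a cluster when bigalloc is used).
    """
    unit = cluster_size if cluster_size else block_size
    return 8 * block_size * unit


def max_bytes_without_meta_bg(block_size=BLOCK_SIZE, desc_size=DESC_SIZE):
    """Upper bound on filesystem size when every group descriptor
    has to be stored inside the first block group (no meta_bg)."""
    max_groups = group_bytes(block_size) // desc_size
    return max_groups * group_bytes(block_size)


def groups_needed(fs_bytes, block_size=BLOCK_SIZE, cluster_size=None):
    """Number of block groups needed to cover fs_bytes (rounded up)."""
    per_group = group_bytes(block_size, cluster_size)
    return -(-fs_bytes // per_group)


print(max_bytes_without_meta_bg() // TIB)               # 256
print(groups_needed(512 * TIB))                         # 4194304
print(groups_needed(512 * TIB, cluster_size=64 * 1024)) # 262144
```

Under these assumptions, a 512 TiB partition needs twice as many block groups as the first group can describe, which is the gap meta_bg closes. With 64 KiB clusters, bigalloc would need 16 times fewer groups.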