Evaluate the I/O topology of the machine in use and eliminate bottlenecks
e.g., distribute SSDs to multiple controllers
Where is your SSD connected?
Some SATA/M.2 ports are connected through DMI.
DMI 2.0, introduced in 2011, doubles the data transfer rate to 2 GB/s with a ×4 link. It is used to link an Intel CPU with the Intel Platform Controller Hub (PCH), which supersedes the historic implementation of a separate northbridge and southbridge.
DMI 3.0, released in August 2015, allows an 8 GT/s transfer rate per lane, for a total of four lanes and 3.93 GB/s for the CPU–PCH link. It is used by two-chip variants of the Intel Skylake microprocessors, which are used in conjunction with Intel 100 Series chipsets.
- Use software RAID to overcome the performance limitations of hardware RAID
- Bottlenecks can still occur in software RAID implementations, though at a higher performance level
- Reading flash pages is faster than writing them
- Writes in parity-based RAIDs are slower than reads due to Read-Modify-Write operations
- Effects can accumulate: even faster reads and slower writes
- SSDs have a limited number of erase cycles
- Lifespan of an SSD depends on its write workload
- In RAIDs, writes are often distributed equally to all drives
- Multiple drives may wear out at the same time
If a worn-out SSD simply becomes read-only, that is fine.
As of 2016, I think write endurance is not a problem.
- Increase spare capacity to ensure that enough free flash blocks will be available anytime
- Garbage collector
- Write amplification
OP (over-provisioning) is very important for garbage collection and write amplification.
More OP means more performance, fewer fragments (lower write amplification), and more write endurance….
There's a difference in over-provisioning levels too: the S3710 features a higher 30-40% over-provisioning, while the S3610 has only 10-20%; in both models the exact over-provisioning depends on the capacity.
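To see why the exact OP level depends on capacity, the effective over-provisioning can be computed from raw NAND capacity versus user capacity. A quick sketch (the 1 TiB raw / 800 GB user figures are hypothetical, not the actual geometry of any particular model):

```shell
# Hypothetical example: effective over-provisioning of an SSD sold as
# 800 GB (user capacity) built on 1 TiB of raw NAND.
raw=$((1024 * 1024 * 1024 * 1024))    # 1 TiB raw NAND, in bytes
user=800000000000                     # 800 GB user capacity, in bytes
op=$(( (raw - user) * 100 / user ))   # OP% = (raw - user) / user
echo "over-provisioning: ${op}%"      # prints "over-provisioning: 37%"
```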
In Intel's ssd-server-storage-applications paper
The document said “With SSDs, RAID level selection criteria are similar to HDDs. However, because of shorter rebuild times under load and lower performance cost of reads during Read-Modify-Write operations, RAID 5/6/50/60 may be more effective in some applications that traditionally would require the use of RAID 10 with HDDs.”
Does it mean you can run SSDs in RAID 50/60 (on Intel hardware)?
kernel version > 3.7 (ext4,btrfs,xfs,jfs)
add the "discard" mount option to each partition of the SSD
/dev/sda1 / ext4 defaults,noatime,discard 0 1
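The "discard" mount option issues TRIM continuously, which adds some overhead on certain drives; a common alternative (an assumption on my part, not from these notes) is periodic TRIM with fstrim:

```shell
# Periodic TRIM instead of the "discard" mount option.
# One-off trim of a mounted filesystem:
fstrim -v /
# Or enable the weekly timer on distributions that ship it with systemd:
systemctl enable --now fstrim.timer
```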
ATA Secure Erase: wiping out all fragmentation and physically erasing all NAND blocks
hdparm --user-master master --security-set-pass password /dev/sdX
hdparm --user-master master --security-erase password /dev/sdX
hdparm -Np468862128 /dev/sdX # permanently limit the visible sector count (HPA)
<Power cycle the drive>
ATA defines two classes of transfer mode, called PIO Mode (Programmed I/O Mode) and DMA Mode (Direct Memory Access Mode). PIO mode transfers are much slower and require the processor to arbitrate transfers between the device and memory. DMA mode transfers are much faster and occur without processor intervention
(Recommended) Create partition(s) that occupy only the desired usable capacity and leave the
remaining capacity unused
User Capacity: 900,184,411,136 bytes [900 GB]
libata.ignore_hpa= [LIBATA] Ignore HPA limit
echo 1 > /sys/module/libata/parameters/ignore_hpa
echo options libata ignore_hpa=1 > /etc/modprobe.d/libata.conf
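Putting the above together, the sector count passed to hdparm -Np can be computed from the drive's total LBA count and the desired spare percentage. A sketch with hypothetical numbers (937703088 LBAs, i.e. a typical 480 GB drive, and a 20% reservation are assumptions):

```shell
# Hypothetical: reserve ~20% of a 480 GB SSD (937703088 LBAs of 512 bytes)
# as extra spare area by shrinking the visible capacity (HPA).
total=937703088
keep=$(( total * 80 / 100 ))       # sectors left visible to the OS
# Print the command rather than running it; -Np is permanent and destructive:
echo "hdparm -Np${keep} /dev/sdX"
```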
With SSDs in Linux* using direct IO instead of buffered IO is recommended, when possible. The Linux
IO subsystem provides read and write buffering at the block device level. In most cases, buffering is
undesirable with SSDs for the following reasons:
- SSDs have lower latencies than HDDs, therefore the benefits of buffering are reduced.
- Buffering read IOs consumes extra CPU cycles for memory copy operations. At IO rates typical for
SSDs, this extra CPU consumption may be high and create a read performance bottleneck.
To use direct IO and bypass the buffer, software can set O_DIRECT flag when opening a file. Many
applications and test tools have configurable options that allow selecting direct or buffered IO, for example:
- FIO* utility: use '--direct=1' option
- MySQL InnoDB*: use '--innodb_flush_method=O_DIRECT' option
- Oracle* 10g/11g: use 'filesystemio_options = SETALL' or
'filesystemio_options = DIRECTIO'
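As a sketch, a direct-IO random-read test with FIO might look like this (the device name and job parameters are placeholders, not from the original notes — adjust to your setup):

```shell
# Illustrative only: 4K random reads with direct IO, bypassing the page cache.
fio --name=randread-direct --filename=/dev/sdX --rw=randread \
    --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based
```

Running the same job with --direct=0 shows the buffered-IO behavior for comparison.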
To optimize the speed of random IOPS, stripe unit size should be at least 2X of the typical transfer size used by the application. This minimizes the frequency of transfers crossing the stripe boundaries and writing to more than one drive at a time. If the files and transfers are aligned on the stripe boundaries, the stripe size can be set to be equal to the typical transfer size.
In large RAID sets, you must verify that your stripe size is large enough so that the stripe per drive is
not less than 4KB. To calculate the smallest stripe:
4096 X Number_of_Striped_Drives = Smallest_Stripe_Size
For higher sequential bandwidth with applications that cannot use IO queuing, the following stripe unit
size or smaller should be used:
0.5 x Transfer_size / Number_of_striped_drives
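A worked example of the three sizing rules above (the 8-drive set and 128 KB typical transfer size are hypothetical):

```shell
# Stripe sizing rules applied to 8 striped drives and 128 KB transfers.
drives=8
transfer=$(( 128 * 1024 ))               # typical transfer size, bytes
smallest=$(( 4096 * drives ))            # 4096 x Number_of_Striped_Drives
random_unit=$(( 2 * transfer ))          # random IOPS: stripe unit >= 2x transfer
seq_unit=$(( transfer / 2 / drives ))    # sequential: 0.5 x Transfer_size / drives
echo "$smallest $random_unit $seq_unit"  # prints "32768 262144 8192"
```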
That's the same as what I guessed. I'm so glad :)
Again, please remember that you can write multiple data blocks into one strip. A strip is not the smallest storage unit on the disk.
Let's take your example: we have a 32KB file and a 96KB file. On the file system they're stored in a total of 4 clusters, if you have a 32KB cluster size. Let's say you have a 64KB strip size on the RAID controller, and let's assume it's a sequential write. The first file will be written to the first 32KB of the strip on the first disk. Then the first 32KB of the 2nd file will be written to the remaining 32KB of the strip on the first disk. The remaining 64KB goes to the strip on the second disk.
Now let's change the example a little bit, with multiple (let's say eight) 8KB files, a 32KB cluster size, and a 64KB strip size. Because of the 32KB cluster size, each 8KB file will use 32KB on your file system. Each 64KB strip can still take two 32KB clusters, so you're using four strips in total, two on each drive. But you're wasting 24KB × 8 = 192KB of drive space here.
What if your cluster size is 4KB? Each 8KB file will take two clusters, and all 8 files can be stored in one 64KB strip.
What about a 1KB file? If you have a 4KB cluster size and 64KB strip size, you still have 60KB on the strip left for other files, wasting 3KB here, which is inevitable. But if you have a 32KB cluster size, you’ll have only 32KB on the strip left for other files, wasting 31KB.
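The arithmetic in the eight-file example above can be checked quickly in shell:

```shell
# Eight 8 KB files on a 32 KB filesystem cluster, 64 KB RAID strip.
files=8; file_kb=8; cluster_kb=32; strip_kb=64
used_kb=$(( files * cluster_kb ))                # each file occupies one 32 KB cluster
wasted_kb=$(( files * (cluster_kb - file_kb) ))  # 24 KB lost per file
strips=$(( used_kb / strip_kb ))                 # two clusters fit in each 64 KB strip
echo "${strips} strips, ${wasted_kb} KB wasted"  # prints "4 strips, 192 KB wasted"
```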
Using RAID Write-Back Caching (BBU)
Disabling RAID Read Ahead, except for sequential read bandwidth with single-threaded applications
Disabling RAID Read Cache: read caching consumes RAID memory that could be used by write-back caching.
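These settings are applied through the controller vendor's CLI. As a rough sketch for an LSI/Broadcom MegaRAID controller (the exact option names here are an assumption — verify against your controller's documentation before use):

```shell
# Illustrative MegaRAID settings for all logical drives on all adapters:
MegaCli64 -LDSetProp WB -LAll -aAll      # write-back caching (requires BBU)
MegaCli64 -LDSetProp NORA -LAll -aAll    # disable read-ahead
MegaCli64 -LDSetProp Direct -LAll -aAll  # direct IO, bypass controller read cache
```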
For more info