RAID on SSDs

Chipset bottlenecks can occur with SSDs in a RAID

Evaluate the I/O topology of the machine in use and eliminate bottlenecks,
e.g., distribute the SSDs across multiple controllers.

Where is your SSD connected?
Some SATA/M.2 ports are connected through DMI, so every drive on those ports shares the single DMI link between the PCH and the CPU.
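One way to answer that question on Linux is to look at the sysfs path of the block device, which lists every PCI hop between the root complex and the drive. A minimal sketch: the helper name pci_hops and the sample path are made up for illustration, and which PCI address belongs to the PCH differs per board, so compare the result against the output of lspci -tv.

```shell
# Hypothetical helper: print every PCI address (domain:bus:device.function)
# that appears in a sysfs device path, one per line.
pci_hops() {
    printf '%s\n' "$1" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]'
}

# On a real machine: pci_hops "$(readlink -f /sys/block/sda)"
# Illustrative sample path for a SATA disk:
pci_hops "/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda"
```

If the only hop is an on-board controller on bus 00, the drive most likely hangs off the PCH and competes for DMI bandwidth; treat that as a hint, not proof.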

DMI bandwidth
DMI 2.0, introduced in 2011, doubles the data transfer rate to 2 GB/s with a ×4 link. It is used to link an Intel CPU with the Intel Platform Controller Hub (PCH), which supersedes the historic implementation of a separate northbridge and southbridge.

DMI 3.0, released in August 2015, allows a transfer rate of 8 GT/s per lane, for a total of four lanes and 3.93 GB/s for the CPU–PCH link. It is used by two-chip variants of the Intel Skylake microprocessors, which are used in conjunction with Intel 100 Series chipsets.
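The two figures above follow from the per-lane transfer rate, the lane count and the line encoding (8b/10b for DMI 2.0, 128b/130b for DMI 3.0). A small sketch of that arithmetic; the helper name link_mb_per_s and the six-SSD example at roughly 500 MB/s each are assumptions for illustration.

```shell
# Usable link bandwidth in MB/s:
# transfer rate (MT/s per lane) x lanes x encoding efficiency / 8 bits per byte.
link_mb_per_s() {  # args: MT/s per lane, lanes, payload bits, encoded bits
    echo $(( $1 * $2 * $3 / $4 / 8 ))
}

link_mb_per_s 5000 4 8 10      # DMI 2.0: prints 2000
link_mb_per_s 8000 4 128 130   # DMI 3.0: prints 3938

# Assumed example: six SATA SSDs at about 500 MB/s each behind the PCH.
echo $(( 6 * 500 ))            # prints 3000, more than DMI 2.0 can carry
```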

As of 2016-04, popular servers still ship with DMI 2.0-only chipsets.
For example, the Supermicro 1028U-E1CR4+ and the 1028U-TN10RT+ both use the Intel C612 chipset.

IOPS bottlenecks of enterprise-class hardware RAID controllers

  • Use software RAID to overcome the performance limitations of hardware RAID
  • Bottlenecks in software RAID implementations can still occur, though at a higher performance level

Asymmetry between Read/Write Speed

  • Reading flash pages is faster than writing them
  • Writes in parity-based RAIDs are slower than reads due to Read-Modify-Write operations
  • The effects can accumulate: even faster reads and even slower writes
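The Read-Modify-Write cost can be put into numbers with the common write-penalty rule of thumb: a small random write costs about 2 device I/Os on RAID 10, 4 on RAID 5 (read data, read parity, write data, write parity) and 6 on RAID 6. A rough sketch; the helper name and the 20000 IOPS per drive are hypothetical, and the model ignores controller caches and full-stripe writes.

```shell
# Rough estimate of array-level random write IOPS.
raid_write_iops() {  # args: number of drives, write IOPS per drive, write penalty
    echo $(( $1 * $2 / $3 ))
}

raid_write_iops 6 20000 2   # RAID 10: prints 60000
raid_write_iops 6 20000 4   # RAID 5:  prints 30000
raid_write_iops 6 20000 6   # RAID 6:  prints 20000
```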

Synchronous SSD Aging

  • SSDs have a limited number of erase cycles
    • The lifespan of an SSD depends on its write workload
  • In RAIDs, writes are often distributed equally across all drives
    • Multiple drives may wear out at the same time
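A back-of-the-envelope way to see when drives would wear out is to divide the rated endurance by the daily write volume. The sketch below is only that; the helper name and the figures (450 TB written, 200 GB per day) are hypothetical, and write amplification is ignored.

```shell
# Days until the rated endurance is used up.
days_until_worn() {  # args: rated endurance in TB written, writes per drive in GB per day
    echo $(( $1 * 1000 / $2 ))
}

days_until_worn 450 200   # prints 2250 (about six years)

# Every member of an evenly loaded array gets the same result, which is
# exactly the synchronous-aging risk. Actual wear can be read per drive with
# "smartctl -A /dev/sdX"; the attribute names vary by vendor.
```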

If a worn-out SSD simply becomes read-only, that is fine.
As of 2016, I think write endurance is not a problem.

Workload History Dependency

  • Increase spare capacity to ensure that enough free flash blocks will be available anytime
  • Garbage collector
  • Write amplification

OP (over-provisioning) is very important for the garbage collector and for write amplification.

More OP means more performance, less fragmentation (lower write amplification) and more write endurance.

There’s a difference in over-provisioning levels too as the S3710 features a higher 30-40% over-provisioning with the S3610 having only 10-20%, in both models the exact over-provisioning depends on the capacity

The SM863 actually provides lower random write performance than the 845DC PRO, which is due to the reduced default over-provisioning as the SM863 only has 12% compared to 28% in the 845DC PRO
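The percentages quoted above use the usual definition of over-provisioning: spare capacity divided by user-visible capacity. A small sketch of that formula; the helper name and the capacities are hypothetical and do not describe any of the drives mentioned.

```shell
# Over-provisioning in percent: (raw flash - user capacity) / user capacity.
op_percent() {  # args: raw flash capacity, user-visible capacity (same unit)
    echo $(( ($1 - $2) * 100 / $2 ))
}

op_percent 512 480   # prints 6
op_percent 512 400   # prints 28
# Exposing only 80% of the 480 (= 384) raises the effective OP:
op_percent 512 384   # prints 33
```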


6x Intel DC S3500 RAID performance
4x Intel 530 RAID performance

Intel’s ssd-server-storage-applications paper says:
“With SSDs, RAID level selection criteria are similar to HDDs. However, because of shorter rebuild times under load and lower performance cost of reads during Read-Modify-Write operations, RAID 5/6/50/60 may be more effective in some applications that traditionally would require the use of RAID 10 with HDDs.”

Does this mean you could run SSDs in RAID 50/60 (on Intel hardware)?


POSIX file system

Kernel version 3.7 or later (ext4, btrfs, xfs, jfs).
Add the “discard” mount option to each partition of the SSD in /etc/fstab:

/dev/sda1 / ext4 defaults,noatime,discard 0 1
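Before adding the option it is worth checking that the drive advertises discard (TRIM) support at all. On Linux the block layer exposes this as queue/discard_max_bytes in sysfs, where 0 means unsupported. A minimal sketch; the helper name discard_supported is made up, and sda is a placeholder device.

```shell
# Hypothetical helper: succeed when the given discard_max_bytes file
# holds a value greater than zero.
discard_supported() {
    [ "$(cat "$1" 2>/dev/null || echo 0)" -gt 0 ]
}

if discard_supported /sys/block/sda/queue/discard_max_bytes; then
    echo "sda accepts discard requests"
else
    echo "sda does not advertise discard support"
fi
```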

Secure Erase

A secure erase wipes out all fragmentation and physically erases all NAND blocks.

hdparm --user-master master --security-set-pass password /dev/sdX
hdparm --user-master master --security-erase password /dev/sdX
hdparm -Np468862128 /dev/sdX # "p" = permanent max-sector (HPA) setting, not PIO mode
<Power cycle the drive>

ATA defines two classes of transfer mode, called PIO Mode (Programmed I/O Mode) and DMA Mode (Direct Memory Access Mode). PIO mode transfers are much slower and require the processor to arbitrate transfers between the device and memory. DMA mode transfers are much faster and occur without processor intervention. Note that the "p" in hdparm -Np is unrelated to PIO: it makes the new max-sector value permanent (non-volatile), so it survives a power cycle.

(Recommended) Create partition(s) that occupy only the desired usable capacity and leave the
remaining capacity unused

User Capacity:    900,184,411,136 bytes [900 GB]
awk 'BEGIN{print 900184411136/1024/1024/1024}'
hdparm -N /dev/sdb
max sectors = 1953525168/1953525168, HPA is disabled
awk 'BEGIN{print 1953525168*512/1024/1024/1024}'
# Expose only 80% of the sectors; the hidden 20% becomes extra OP capacity
awk 'BEGIN{printf "%0.2f",1953525168*0.8}'
hdparm -Np1562820134 --yes-i-know-what-i-am-doing /dev/sdb
setting max visible sectors to 1562820134 (permanent)
max sectors = 1562820134/1953525168, HPA is enabled

# Available space in GiB
awk 'BEGIN{print 1562820134*512/1024/1024/1024}'
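The same calculation can be wrapped in a small helper that works for any drive and percentage. This is a sketch, not part of the original procedure: the function name is made up, and rounding down to a multiple of 8 sectors (4 KiB) is my own assumption to keep the end of the visible area aligned.

```shell
# Sector count to pass to "hdparm -Np", rounded down to a multiple of 8.
hpa_sectors() {  # args: native max sectors, percentage to keep visible
    echo $(( $1 * $2 / 100 / 8 * 8 ))
}

hpa_sectors 1953525168 80   # prints 1562820128
```

The result is six sectors smaller than the 1562820134 used above; either value works, the aligned one just ends on a 4 KiB boundary.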

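HPA handling in kernel boot parameters

libata.ignore_hpa=      [LIBATA] Ignore HPA limit
libata.ignore_hpa=0 keep BIOS limits (default)
libata.ignore_hpa=1 ignore limits, using full disk

To keep the HPA (and with it the extra over-provisioning), the kernel must honor the limit, so ignore_hpa has to be 0. A value of 1 unlocks the full disk again.

echo 0 > /sys/module/libata/parameters/ignore_hpa
echo options libata ignore_hpa=0 > /etc/modprobe.d/libata.conf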

HPA reference

Use Direct IO

With SSDs in Linux* using direct IO instead of buffered IO is recommended, when possible. The Linux
IO subsystem provides read and write buffering at the block device level. In most cases, buffering is
undesirable with SSDs for the following reasons:

  • SSDs have lower latencies than HDDs, therefore the benefits of buffering are reduced.
  • Buffering read IOs consumes extra CPU cycles for memory copy operations. At IO rates typical for
    SSDs, this extra CPU consumption may be high and create a read performance bottleneck.

To use direct IO and bypass the buffer, software can set the O_DIRECT flag when opening a file. Many
applications and test tools have configurable options that allow selecting direct or buffered IO, for
example:

  • FIO* utility: use the ‘--direct=1’ option
  • MySQL InnoDB*: use the ‘--innodb_flush_method=O_DIRECT’ option
  • Oracle* 10g/11g: use ‘filesystemio_options = SETALL’ or
    ‘filesystemio_options = DIRECTIO’
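To compare direct and buffered IO on a given drive, the same fio job can be run twice with only the direct flag changed. The sketch below just prints the two command lines instead of running them; the helper name, the job parameters (4 KiB random reads, queue depth 32, 60 seconds) and /dev/sdX are placeholders chosen for illustration.

```shell
# Hypothetical helper: print (do not run) a fio command for a random-read test.
fio_randread_cmd() {  # args: target file or device, 1 = direct IO, 0 = buffered IO
    echo "fio --name=randread --filename=$1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --runtime=60 --time_based --direct=$2"
}

fio_randread_cmd /dev/sdX 1   # direct IO
fio_randread_cmd /dev/sdX 0   # buffered IO, for comparison
```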

Set the I/O scheduler to noop (on blk-mq kernels, 5.0 and later, the equivalent is "none"):

echo noop > /sys/block/sdX/queue/scheduler

RAID settings

To optimize the speed of random IOPS, stripe unit size should be at least 2X of the typical transfer size used by the application. This minimizes the frequency of transfers crossing the stripe boundaries and writing to more than one drive at a time. If the files and transfers are aligned on the stripe boundaries, the stripe size can be set to be equal to the typical transfer size.

In large RAID sets, you must verify that your stripe size is large enough so that the stripe per drive is
not less than 4KB. To calculate the smallest stripe:
4096 X Number_of_Striped_Drives = Smallest_Stripe_Size
For higher sequential bandwidth with applications that cannot use IO queuing, the following stripe unit
size or smaller should be used:
0.5 x Transfer_size / Number_of_striped_drives
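The three sizing rules above are simple enough to script. A sketch with made-up function names; all sizes are in bytes, and the 8 KiB and 1 MiB transfer sizes and the eight-drive array are only examples.

```shell
# Random IOPS: stripe unit of at least 2x the typical transfer size.
min_random_stripe_unit() {  # arg: typical transfer size
    echo $(( 2 * $1 ))
}
# Large RAID sets: smallest stripe so that each drive still gets 4 KB.
smallest_stripe() {  # arg: number of striped drives
    echo $(( 4096 * $1 ))
}
# Sequential bandwidth without IO queuing: stripe unit of at most this size.
max_sequential_stripe_unit() {  # args: transfer size, number of striped drives
    echo $(( $1 / 2 / $2 ))
}

min_random_stripe_unit 8192            # prints 16384
smallest_stripe 8                      # prints 32768
max_sequential_stripe_unit 1048576 8   # prints 65536
```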

How small files are stored in a RAID stripe

That matches what I guessed. I’m so glad :)

Again please remember that you can write multiple data blocks into one strip. Strip is not the smallest storage unit size on the disk.

Let’s take your example. We have a 32KB file and a 96KB file. On the file system they’re stored in a total of 4 clusters, if you have a 32KB cluster. Let’s say you have a 64KB strip size on the RAID controller, and let’s assume it’s a sequential write. The first file will be written to the first 32KB of the strip on the first disk. Then the first 32KB of the 2nd file will be written to the remaining 32KB of the strip on the first disk. The remaining 64KB goes to the strip on the second disk.

Now let’s change the example a little bit, with multiple (let’s say eight) 8KB files, 32KB cluster size and 64KB strip size. Because of the 32KB cluster size, each 8KB file will use 32KB on your file system. Each 64KB strip can still take two 32KB clusters, so you’re using four strips in total, two on each drive. But you’re wasting 24KB * 8 = 192KB of drive space here.

What if your cluster size is 4KB? Each 8KB file will take two clusters, and all 8 files can be stored in one 64KB strip.

What about a 1KB file? If you have a 4KB cluster size and 64KB strip size, you still have 60KB on the strip left for other files, wasting 3KB here, which is inevitable. But if you have a 32KB cluster size, you’ll have only 32KB on the strip left for other files, wasting 31KB.
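The numbers in these examples come from rounding each file up to whole clusters and then packing clusters into strips. A small sketch that reproduces them; the function names are made up and all sizes are in KB.

```shell
# Space a file occupies on the file system: size rounded up to whole clusters.
allocated_kb() {  # args: file size, cluster size
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}
# Slack space lost inside the last cluster.
wasted_kb() {  # args: file size, cluster size
    echo $(( $(allocated_kb "$1" "$2") - $1 ))
}

wasted_kb 8 32   # prints 24 (times eight files = 192)
wasted_kb 8 4    # prints 0
wasted_kb 1 4    # prints 3
wasted_kb 1 32   # prints 31

# 64KB strips needed for eight 8KB files:
echo $(( 8 * $(allocated_kb 8 32) / 64 ))   # 32KB clusters: prints 4
echo $(( 8 * $(allocated_kb 8 4) / 64 ))    # 4KB clusters:  prints 1
```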

  • Use RAID write-back caching (with a BBU)
  • Disable RAID read-ahead, except where sequential read bandwidth with single-threaded applications matters
  • Disable the RAID read cache; read caching consumes RAID memory that could be used by write-back caching
For more info

How stripe size

Garbage collector Reference