Enterprise Storage on RHEL4



Red Hat Enterprise Linux 4

Enterprise Storage Quickstart


By Rod Nayfield

(nayfield at redhat dot com)


A hands-on guide to enterprise storage features of Red Hat Enterprise Linux 4, including volume management and multipath I/O over Fibre Channel-attached storage.


Revised 02/21/06




Introduction


Recent releases of Red Hat Enterprise Linux include new features for advanced volume management and multipath disk I/O. This document provides a brief tour of the basic use of these features; it should help you become familiar with how LVM2 and device-mapper-multipath work, and provide a jumping-off point for your own storage work.


The multipath features in RHEL 4 operate via device-mapper, a lightweight kernel component that enables the creation of block devices in support of volume management, such as LVM2. Device-mapper-multipath allows you to multipath across any HBA supported by the kernel, regardless of vendor. It is possible to have eight paths with differing priorities on a single system. With grouping policies, you can effectively utilize HBAs of different speeds and different storage controllers, and create complex environments. The tools are also extensible if you need behavior that is not included by default.


The file /usr/share/doc/device-mapper-multipath-0.4.5/README contains more information on the supported controllers, which include most active-active SANs as well as active-passive storage such as the CX products. The file multipath.conf.defaults shows the built-in settings for these devices.
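
To browse these files on an installed system, something like the following should work (assuming multipath.conf.defaults ships in the same documentation directory as the README):

# less /usr/share/doc/device-mapper-multipath-0.4.5/README
# less /usr/share/doc/device-mapper-multipath-0.4.5/multipath.conf.defaults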


Configuration Used


The configuration described here was tested on a 1U 32-bit x86 server with two FC HBAs. The HBAs were a QLogic 2100 and a QLogic 2200, chosen in order to test different models. The SAN was a DotHill active/active system that presented three LUNs to the host.


I installed RHEL 4 Update 3, only selecting the web-server package group.


Let's dig in!



Initial Baseline


After the system boots, let's see what we can see:

# more /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SEAGATE  Model: ST318305LC       Rev: 2203
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: DELL     Model: 1x3 U2W SCSI BP  Rev: 1.21
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 04
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 06
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 04
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 06
  Vendor: DotHill  Model: SANnet RAID X300 Rev: 0315
  Type:   Direct-Access                    ANSI SCSI revision: 03


This shows our two FC controllers (scsi2 and scsi4), each of which sees the same three LUNs.
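
If you are unsure which scsiN host belongs to which physical HBA, sysfs can tell you; for example, the proc_name attribute reports the driver bound to a given host:

# ls /sys/class/scsi_host/
# cat /sys/class/scsi_host/host2/proc_name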


If you need to re-do a SCSI scan, you can run:


echo "- - -" > /sys/class/scsi_host/host0/scan


Where host0 is replaced by the HBA you wish to scan. You can also force a fabric rediscovery like this:


echo "1" > /sys/class/fc_host/host0/issue_lip
echo "- - -" > /sys/class/scsi_host/host0/scan

This will send a LIP (loop initialization primitive) to the fabric. During the initialization, HBA access may be slow and/or experience timeouts.




Install and Configure Multipath


The only package not installed by default is device-mapper-multipath. I installed it like so:


[root@clu1 RPMS]# rpm -Uvh device-mapper-multipath-0.4.5-8.0.RHEL4.i386.rpm
warning: device-mapper-multipath-0.4.5-8.0.RHEL4.i386.rpm: V3 DSA signature: NOKEY, key ID 897da07a
Preparing...                ########################################### [100%]
   1:device-mapper-multipath########################################### [100%]


Multipathing is off by default. This prevents unexpected behavior on a default installation where it is not needed.


I edited /etc/multipath.conf and commented out the blacklist-everything stanza:


#devnode_blacklist {
#  devnode "*"
#} 


It is possible to blacklist devices in this section; more information is available in multipath.conf.annotated. That file also covers adding new device types, custom actions, and grouping and prioritizing paths.
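
For example, a minimal stanza that excludes the local system disk and other non-SAN devices might look like this (a sketch only; adjust the devnode patterns to your environment):

devnode_blacklist {
        devnode "^sda$"
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
}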


Let's start the multipath daemon. This service is responsible for restoring failed paths automatically.


[root@clu1 RPMS]# chkconfig multipathd on
[root@clu1 RPMS]# service multipathd start
Starting multipathd daemon:                                [  OK  ]


Now we need to load the dm_multipath module ...


[root@clu1 RPMS]# modprobe dm_multipath


... and turn on multipathing. Running `multipath` creates and updates the devmaps. The -v2 option prints out extended information but is not required.


[root@clu1 ~]# multipath -v2
create: mpath1 (3600d0230003228bc000339414edb8102)
[size=52 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1]
 \_ 2:0:0:0 sdb 8:16 [ready]
\_ round-robin 0 [prio=1]
 \_ 3:0:0:0 sde 8:64 [ready]

create: mpath2 (3600d0230003228bc000339414edb8100)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1]
 \_ 2:0:0:4 sdc 8:32 [ready]
\_ round-robin 0 [prio=1]
 \_ 3:0:0:4 sdf 8:80 [ready]

create: mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1]
 \_ 2:0:0:6 sdd 8:48 [ready]
\_ round-robin 0 [prio=1]
 \_ 3:0:0:6 sdg 8:96 [ready]


Out of the box, RHEL multipathing creates friendly names such as /dev/mapper/mpath1. These names are persistently bound to the WWID (shown in parentheses above), are stored in /var/lib/multipath/bindings, and persist across reboots. If you comment out the user_friendly_names option, instead of devices like /dev/mapper/mpath1 you will see /dev/mapper/3600d0230003228bc000339414edb8102.
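
The bindings file is plain text, one alias/WWID pair per line; on this system it would contain something like the following (any comment lines omitted):

# cat /var/lib/multipath/bindings
mpath1 3600d0230003228bc000339414edb8102
mpath2 3600d0230003228bc000339414edb8100
mpath3 3600d0230003228bc000339414edb8101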


The devices sdd and sdg are the individual single-path devices and should not be used directly while the multipath device (mpath3) is in use.


I have decided to use the mpath3 LUN for multipath. Note that you can run vgscan if the LUNs already contain LVM metadata.


Let's look at the multipath information for mpath3.


[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
 \_ 2:0:0:6 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 3:0:0:6 sdg 8:96 [active][ready]


You can see that the LUN is available via two paths, 2:0:0:6 and 3:0:0:6. There are two path groups of equal priority, and each group contains one device. The system will fail over between these groups. This is due to the default of "path_grouping_policy failover" in multipath.conf.


Placing multiple paths within a group provides round-robin between the members (note that even the single-member groups above are listed as round-robin). You can accomplish this with "path_grouping_policy multibus". Adding this to the defaults section in multipath.conf results in the following output (note: `multipath` was run first to reload the paths):


[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=2][enabled]
 \_ 2:0:0:6 sdd 8:48 [active][ready]
 \_ 3:0:0:6 sdg 8:96 [active][ready]

It is possible to create groups by serial number, node name, and priority. See multipath.conf.annotated for more information.
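
For reference, the defaults stanza used to produce the multibus output above might look like this (a minimal sketch; any other default settings you need would go in the same section):

defaults {
        path_grouping_policy    multibus
}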


LVM2 configuration


LVM2 has three logical components: the physical volume (pv), the volume group (vg), and the logical volume (lv).


First we will mark the LUN as a physical volume.


[root@clu1 ~]# pvcreate /dev/mapper/mpath3
  Physical volume "/dev/mapper/mpath3" successfully created


Now I create a new volume group, new_volgrp. It contains only the pv we just created; you can add more pv's in the future.


[root@clu1 ~]# vgcreate new_volgrp /dev/mapper/mpath3
  Volume group "new_volgrp" successfully created


I have decided to create a 600M logical volume in the new volume group. The space left unused in the group can be used to create additional lv's or to grow our new logical volume later. Adding another pv to the group is trivial and lets you grow for future needs. Again, it is just a single command to create the LV:


[root@clu1 ~]# lvcreate -L 600M -n new_lvol new_volgrp
  Logical volume "new_lvol" created


Now I will make a default ext3 filesystem on the logical volume.


[root@clu1 ~]# mkfs -t ext3 /dev/new_volgrp/new_lvol
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
76800 inodes, 153600 blocks
7680 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=159383552
5 block groups
32768 blocks per group, 32768 fragments per group
15360 inodes per group
Superblock backups stored on blocks:
        32768, 98304

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.


Make a label:

[root@clu1 ~]# e2label /dev/new_volgrp/new_lvol TESTONE


Add it to /etc/fstab:

LABEL=TESTONE           /mnt/TEST        ext3    defaults        1 2


Make the mount point and mount it:

[root@clu1 ~]# mkdir /mnt/TEST
[root@clu1 ~]# mount /mnt/TEST/


Our new filesystem has been mounted.


Testing Failures


Earlier, I mentioned that I was using two different HBA models. The advantage in a lab environment is that I can simulate a failure by unloading one HBA driver but not the other. Let's look at the multipath information, fail an adapter, and look again:


[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:6 sdg 8:96  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 4:0:0:6 sdj 8:144 [active][ready]

[root@clu1 ~]# rmmod qla2100
[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
 \_ #:#:#:#  -   8:96  [failed][faulty]
\_ round-robin 0 [prio=1][active]
 \_ 4:0:0:6 sdj 8:144 [active][ready]


Let's bring it back:


[root@clu1 ~]# modprobe qla2100
[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [enabled]
 \_ #:#:#:#  -   8:96  [failed][faulty]
\_ round-robin 0 [prio=1][active]
 \_ 4:0:0:6 sdj 8:144 [active][ready]

[root@clu1 ~]# multipath -ll mpath3
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
 \_ 4:0:0:6 sdj 8:144 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:6 sdd 8:48  [active][ready]


Multipathd brought the path back into the configuration a few seconds after the adapter had initialized.
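
If you want to watch a path come back in real time, repeatedly polling the map is an easy way to do it (purely a convenience, not required):

# watch -n 5 multipath -ll mpath3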



Growing a partition with LVM


Let's use some of that extra space in our volume group to expand our filesystem.


[root@clu1 ~]# vgdisplay new_volgrp
  --- Volume group ---
  VG Name               new_volgrp
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               58.59 GB
  PE Size               4.00 MB
  Total PE              14999
  Alloc PE / Size       150 / 600.00 MB
  Free  PE / Size       14849 / 58.00 GB
  VG UUID               YWltLw-v4CV-6gC1-6CZm-WoCw-1A9w-UVppAV

[root@clu1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             16682200   3195684  12639088  21% /
/dev/sda1               147764      8825    131310   7% /boot
none                    127992         0    127992   0% /dev/shm
/dev/sda5               147764      5664    134471   5% /spare
/dev/mapper/new_volgrp-new_lvol
                        604736     16880    557136   3% /mnt/TEST


So we see that the current filesystem has 604736 blocks. Let's add 40G. If we didn't have 40G of free space in our VG, we could add a PV to the VG first. We need two commands: the first extends the logical volume, and the second resizes the ext3 filesystem while it is online.


[root@clu1 ~]# lvextend -L +40G /dev/new_volgrp/new_lvol
  Extending logical volume new_lvol to 40.59 GB
  Logical volume new_lvol successfully resized

[root@clu1 ~]# ext2online /mnt/TEST/
ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b

[root@clu1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             16682200   3195696  12639076  21% /
/dev/sda1               147764      8825    131310   7% /boot
none                    127992         0    127992   0% /dev/shm
/dev/sda5               147764      5664    134471   5% /spare
/dev/mapper/new_volgrp-new_lvol
                      41930648     18116  39787220   1% /mnt/TEST


Now we can see that we have another 40G of space on our filesystem.
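
Had the volume group run out of free space, we could have added another multipath LUN to it first and then extended the LV as above. A sketch, using mpath2 purely for illustration:

[root@clu1 ~]# pvcreate /dev/mapper/mpath2
[root@clu1 ~]# vgextend new_volgrp /dev/mapper/mpath2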


Appendix: Useful Commands


Get some information on the multipath device, including its UUID (which embeds the WWID):


[root@clu1 ~]# dmsetup info mpath3
Name:              mpath3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 2
Number of targets: 1
UUID: mpath-3600d0230003228bc000339414edb8101


Show the multipath devices that are configured:


[root@clu1 ~]# dmsetup ls --target=multipath
mpath2  (253, 1)
mpath1  (253, 0)
mpath3  (253, 2)




Determine which multipath and WWID a /dev/sd* device maps to:


[root@clu1 ~]# multipath -l /dev/sdd
mpath3 (3600d0230003228bc000339414edb8101)
[size=58 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
 \_ 2:0:0:6 sdd 8:48 [active][ready]
\_ round-robin 0 [enabled]
 \_ 3:0:0:6 sdg 8:96 [active][ready]


Get a WWID from an sd* device (without using device-mapper):


[root@clu1 ~]# scsi_id -g -s /block/sdd
3600d0230003228bc000339414edb8101

Note that this begins with "3" and is therefore a page 0x83 type 3 (NAA) identifier. Good news.


NOTE: If your system returns a value with a leading 1, you are looking at a page 0x83 type 1 identifier, the T10 vendor-ID-based identifier. You will need to check with your vendor to see what they are providing; if they are providing the serial number of the controller, it will not be unique per LU and will cause issues.
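
To see at a glance which sd devices are paths to the same LU, you can loop over all SCSI disks with the same scsi_id invocation as above (a quick sketch; devices that print the same WWID are paths to one LUN):

[root@clu1 ~]# for d in /sys/block/sd*; do echo -n "${d##*/}: "; scsi_id -g -s /block/${d##*/}; done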


Appendix: Tuning


Setting "no_path_retry queue" in your multipath configuration (see multipath.conf.annotated) will set your devices to queue I/O forever. This will keep transient SAN issues (that affect both paths) from causing I/O errors. You probably want to test failover (and multipathd's ability to restore a path!) before implementing this.
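
In the simplest case this can be set globally in the defaults section (a minimal sketch; it can also be set per-array in a devices stanza, see multipath.conf.annotated):

defaults {
        no_path_retry   queue
}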


You will also want to propagate errors up to the multipath layer sooner: instead of retrying a particular HBA for several rounds, you may want to fail it and use the alternate path. [TODO: add info on how to tune this in modprobe.conf]


Links


Device-mapper resource page:

http://sources.redhat.com/dm/


More information on LVM2 and device-mapper:

http://people.redhat.com/agk/talks/





TODO


Tuning failover on the HBA module options (ql2xloginretrycount and the like)


Using udev for persistence without device-mapper like

       BUS="scsi", PROGRAM="/sbin/scsi_id", NAME="disk%c%n"

or something like that.


remove and add devices by hand (something like:)

echo "1" > /sys/bus/scsi/devices/0:0:1:1/rescan ...

examples of more complex configs (grouping, more than 2 paths, etc.)

Can I boot off of SAN? (mkinitrd rebuild)



Other stuff:

# echo "1" > /sys/bus/scsi/devices/0:0:1:1/delete

# cat /sys/bus/scsi/devices/0:0:1:1/block:sda/dev

8:0

# ls -l /sys/bus/scsi/devices/0:0:1:1/

This document is provided as-is and represents my own experiences and is not official advice from Red Hat, Inc.
