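Hardware RAID Monitoring
Hardware RAID is no longer unusual, even on inexpensive servers.
I have supported a number of such systems, and surprisingly often the RAID was not being monitored at all.
So, for the controllers I have dealt with in the past, I have put together a brief summary of how to set up and use monitoring.
The terminal output is for reference only. Parts of it are masked, and in most cases it shows the healthy state.
How to judge an error or fault differs by tool and by the failing component, so see each tool's manuals for details.
Personally, I think it is fine to decide on the status alone and simply replace the part (most systems are presumably under a maintenance contract).
If the controller carries a battery for its cache, a worn-out battery is also a common cause.
Like disks, batteries are consumables, so ideally they should be replaced on a schedule before an alert is ever raised.
Note that in most cases monitoring software and tools ship with the hardware, so if they are available, use them.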
LSI Logic MegaRAID
- Identifying the RAID card
$ grep -i megaraid /var/log/dmesg
scsi0 : LSI Logic MegaRAID driver
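The same grep-style check is used for several controller families in this article, so it can be folded into one helper. The following is a minimal sketch, assuming the dmesg text has already been read into a string; the function name and the mapping are hypothetical, and the search strings are just the ones grepped for in this article (HP SmartArray is identified through /proc instead, so it is not covered here).

```python
# Search strings taken from the grep commands in this article,
# mapped to the section heading they belong to (illustrative only).
SIGNATURES = {
    "megaraid": "LSI Logic MegaRAID",
    "mptbase": "LSI Logic Fusion-MPT",
    "serveraid": "IBM/Adaptec ServeRAID",
}


def detect_raid_families(dmesg_text):
    """Return the sorted controller families whose search string appears."""
    lowered = dmesg_text.lower()  # emulate grep -i
    return sorted(name for needle, name in SIGNATURES.items() if needle in lowered)


if __name__ == "__main__":
    # /var/log/dmesg is the path used in this article; it may not exist everywhere.
    try:
        with open("/var/log/dmesg", errors="replace") as fh:
            print(detect_raid_families(fh.read()))
    except OSError as exc:
        print("could not read dmesg log:", exc)
```

An empty list only means none of these strings were found; it does not prove that no hardware RAID is present.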
A command-line tool called MegaCLI is publicly available; build it and use it.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5082327
http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
# yum info megacli
...
Name : megacli
Arch : i386
Version: 2.00.11
Release: 2
Size : 1.7 M
Repo : installed
Summary: MegaCli is used to manage SAS RAID controllers
...
- Adapter information
# megacli -AdpAllinfo -aALL

Adapter #0

==============================================================================
Versions
================
Product Name : MegaRAID SAS 8708EM2
Serial No : ...
FW Package Build: ...

Mfg. Data
================
Mfg. Date : ...
Rework Date : ...
Revision No : ...
Battery FRU : ...

Image Versions In Flash:
================
FW Version : ...
BIOS Version : ...
WebBIOS Version : ...
Ctrl-R Version : ...
Preboot CLI Version: ...
Boot Block Version : ...

Pending Images In Flash
================
None

PCI Info
================
Vendor Id : ...
Device Id : ...
SubVendorId : ...
SubDeviceId : ...

Host Interface : PCIE

Number of Frontend Port: 0
Device Interface : PCIE

Number of Backend Port: 8
Port : Address
0 ...
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...

HW Configuration
================
SAS Address : ...
BBU : Absent
Alarm : Present
NVRAM : Present
Serial Debugger : Present
Memory : Present
Flash : Present
Memory Size : 128MB

Settings
================
Current Time : ...
Predictive Fail Poll Interval : ...
Interrupt Throttle Active Count : 16
Interrupt Throttle Completion : 50us
Rebuild Rate : 30%
PR Rate : 30%
Resynch Rate : 30%
Check Consistency Rate : 30%
Reconstruction Rate : 30%
Cache Flush Interval : 4s
Max Drives to Spinup at One Time : 2
Delay Among Spinup Groups : 12s
Physical Drive Coercion Mode : Disabled
Cluster Mode : Disabled
Alarm : Disabled
Auto Rebuild : Enabled
Battery Warning : Disabled
Ecc Bucket Size : 15
Ecc Bucket Leak Rate : 1440 Minutes
Restore HotSpare on Insertion : Enabled
Expose Enclosure Devices : Disabled
Maintain PD Fail History : Disabled
Host Request Reordering : Enabled
Auto Detect BackPlane Enabled : SGPIO/i2c SEP
Load Balance Mode : Auto

Capabilities
================
RAID Level Supported : RAID0, RAID1, RAID10
Supported Drives : SAS, SATA

Allowed Mixing:

Mix In Enclosure Allowed

Status
================
ECC Bucket Count : 0

Limitations
================
Max Arms Per VD : 32
Max Spans Per VD : 8
Max Arrays : 128
Max Number of VDs : 64
Max Parallel Commands : 1008
Max SGE Count : 80
Max Data Transfer Size : 8192 sectors
Max Strips PerIO : 42
Min Stripe Size : 8kB
Max Stripe Size : 1024kB

Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 3
Disks : 2
Critical Disks : 0
Failed Disks : 0

Supported Adapter Operations
================
Rebuild Rate : Yes
CC Rate : Yes
BGI Rate : Yes
Reconstruct Rate : Yes
Patrol Read Rate : Yes
Alarm Control : Yes
Cluster Support : No
BBU : Yes
Spanning : Yes
Dedicated Hot Spare : Yes
Revertible Hot Spares : No
Foreign Config Import : Yes
Self Diagnostic : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares : Yes
Deny SCSI Passthrough : No
Deny SMP Passthrough : No
Deny STP Passthrough : No

Supported VD Operations
================
Read Policy : Yes
Write Policy : Yes
IO Policy : Yes
Access Policy : Yes
Disk Cache Policy : Yes
Reconstruction : Yes
Deny Locate : No
Deny CC : No

Supported PD Operations
================
Force Online : Yes
Force Offline : Yes
Force Rebuild : Yes
Deny Force Failed : No
Deny Force Good/Bad : No
Deny Missing Replace : No
Deny Clear : No
Deny Locate : No
Disable Copyback : No
Enable Copyback on SMART : No

Error Counters
================
Memory Correctable Errors : 0
Memory Uncorrectable Errors : 0

Cluster Information
================
Cluster Permitted : No
Cluster Active : No

Default Settings
================
Phy Polarity : 0
Phy PolaritySplit : 0
Background Rate : 30
Stripe Size : 64kB
Flush Time : 4 seconds
Write Policy : WB
Read Policy : None
Cache When BBU Bad : Disabled
Cached IO : No
SMART Mode : Mode 6
Alarm Disable : No
Coercion Mode : None
ZCR Config : Unknown
Dirty LED Shows Drive Activity : No
BIOS Continue on Error : Yes
Spin Down Mode : None
Allowed Device Type : SAS/SATA Mix
Allow Mix In Enclosure : Yes
Allow Mix In VD : No
Allow SATA In Cluster : No
Max Chained Enclosures : 3
Disable Ctrl-R : Yes
Enable Web BIOS : Yes
Direct PD Mapping : Yes
BIOS Enumerate VDs : Yes
Restore Hot Spare on Insertion : Yes
Expose Enclosure Devices : No
Maintain PD Fail History : No
Disable Puncturing : Yes
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled : No
LED Show Drive Activity : Yes
Cluster Disable : Yes
SAS Disable : No
Auto Detect BackPlane Enable : SGPIO/i2c SEP

Exit Code: 0x00
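For a quick health check, the counters in the "Device Present" section of this output are the most compact signal. The following is a minimal sketch, assuming the output of `megacli -AdpAllinfo -aALL` has been captured as text with one field per line; the function name and the choice of counters to watch are my own assumptions, not something the tool prescribes.

```python
# Counters from the "Device Present" section that are 0 in the healthy
# output shown in this article (the selection is an assumption).
WATCHED = ("Degraded", "Offline", "Critical Disks", "Failed Disks")


def adapter_problem_counts(adp_allinfo_output):
    """Return {counter: total} for watched counters that are non-zero.

    Values are summed across adapters when -aALL reports more than one.
    """
    problems = {}
    for line in adp_allinfo_output.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key in WATCHED and value.isdigit() and int(value) != 0:
            problems[key] = problems.get(key, 0) + int(value)
    return problems
```

An empty dictionary means every watched counter was zero; anything else is worth a closer look with the per-device commands.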
- Physical device information
# megacli -PDList -aALL

Adapter #0

Enclosure Device ID: 252
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ...
Foreign State: None

Enclosure Device ID: 252
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 1(path0)
Inquiry Data: SEAGATE ...
Foreign State: None

Exit Code: 0x00
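The per-drive blocks in this listing can be split on the "Enclosure Device ID" line and checked one drive at a time. The following is a minimal sketch, assuming the `megacli -PDList -aALL` output has been captured as text with one field per line. The function names are hypothetical, and the rule used here (flag any drive whose firmware state does not start with "Online" or whose error counters are non-zero) is my own assumption; hot spares and unconfigured drives would need extra handling.

```python
COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def parse_pdlist(output):
    """Split PDList output into one dict of fields per physical drive."""
    drives = []
    current = None
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and "Adapter #0" carry no field
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":  # first field of each drive block
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def suspect_drives(drives):
    """Return [(slot, [reasons])] for drives that look unhealthy."""
    suspects = []
    for drive in drives:
        reasons = []
        state = drive.get("Firmware state", "unknown")
        if not state.startswith("Online"):
            reasons.append("state=" + state)
        for counter in COUNTERS:
            if drive.get(counter, "0") != "0":
                reasons.append(counter + "=" + drive[counter])
        if reasons:
            suspects.append((drive.get("Slot Number"), reasons))
    return suspects
```

Calling `suspect_drives(parse_pdlist(text))` on the healthy output above yields an empty list.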
- Logical device information
# megacli -LDInfo -Lall -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:array0
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:...MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disabled

Exit Code: 0x00
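The "State" field is the one to watch in this output. The following is a minimal sketch, assuming the `megacli -LDInfo -Lall -aALL` output has been captured as text with one field per line. The function names are hypothetical; accepting "Virtual Drive:" as well as "Virtual Disk:" is a guess at wording differences between versions, and treating only "Optimal" as healthy is likewise an assumption based on the output shown here.

```python
import re

VD_HEADER = re.compile(r"Virtual (?:Disk|Drive):\s*(\d+)")
STATE = re.compile(r"State\s*:\s*(.+)")


def virtual_drive_states(ldinfo_output):
    """Return {virtual_drive_number: state} parsed from LDInfo output."""
    states = {}
    current = None
    for raw in ldinfo_output.splitlines():
        line = raw.strip()
        header = VD_HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        state = STATE.match(line)
        if state and current is not None:
            states[current] = state.group(1).strip()
    return states


def non_optimal(states):
    """Return the sorted drive numbers whose state is not 'Optimal'."""
    return sorted(n for n, s in states.items() if s != "Optimal")
```

If several adapters are listed, drive numbers repeat and later entries overwrite earlier ones in this sketch, so a real check would also key on the adapter number.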
Although this applies to one vendor only, NEC publishes tools for its Express5800 series.
On relatively recent Express5800 models, Universal RAID Utility can be used.
http://support.express.nec.co.jp/dload/420842-A01/index.html
On Express5800 models old enough to have dropped out of the product listings, MegaMonitor can be used.
(On such machines the RAID card firmware may be too old for megacli to work at all.)
http://www.express.nec.co.jp/linux/distributions/confirm/gam/megamgr.htm
LSI Logic Fusion-MPT
- Identifying the RAID card
$ grep -i mptbase /var/log/dmesg
mptbase: ioc0: Initiating bringup
A command-line tool called mpt-status is publicly available; build it and use it.
http://sven.stormbind.net/mpt-status-rhel/
daemonize is also available in EPEL, so you can install it from there instead.
# yum --enablerepo=epel info daemonize
...
Name : daemonize
Arch : x86_64
Version : 1.7.3
Release : 1.el6
Size : 19 k
Repo : installed
Summary : Run a command as a Unix daemon
URL : http://www.clapper.org/software/daemonize/
License : BSD
Description : daemonize runs a command as a Unix daemon. As defined in W. ...
# yum info mpt-status
...
Name : mpt-status
Arch : x86_64
Version : 1.2.0
Release : 3.el6
Size : 31 k
Repo : installed
Summary : Get RAID status out of mpt (and other) HW RAID controllers
URL : http://www.drugphish.ch/~ratz/mpt-status/
License : GPLv2+
...
# chkconfig --list mpt-statusd
mpt-statusd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# service mpt-statusd start
Starting mpt-status monitor: mpt-statusd [ OK ]
Healthy state:
# mpt-status -s
log_id 0 OPTIMAL
phys_id 0 ONLINE
phys_id 1 ONLINE
Failure state:
# mpt-status -s
log_id 0 DEGRADED
phys_id 0 ONLINE
phys_id 1 FAILED
# mpt-status -v
ioc0 vol_id 0 type IM, 2 phy, 33 GB, state DEGRADED, flags ENABLED
ioc0 phy 0 scsi_id 0 IBM-ESXS MAP3367NC FN B109, 33 GB, state ONLINE, flags NONE
ioc0 phy 1 scsi_id 1 IBM-ESXS MAP3367NC FN B109, 33 GB, state FAILED, flags OUT_OF_SYNC
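The short `mpt-status -s` form is easy to check from a script, because each line is just a kind, an id, and a state. The following is a minimal sketch, assuming the output has been captured as text with one entry per line; the function name is hypothetical, and the healthy values (OPTIMAL for volumes, ONLINE for disks) are taken from the healthy output shown above.

```python
# Healthy state per line type, as seen in the healthy output in this article.
HEALTHY_STATE = {"log_id": "OPTIMAL", "phys_id": "ONLINE"}


def mpt_problems(status_output):
    """Return [(kind, id, state)] for every entry that is not healthy."""
    problems = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] not in HEALTHY_STATE:
            continue  # ignore anything that is not "<kind> <id> <state>"
        kind, ident, state = fields
        if ident.isdigit() and state != HEALTHY_STATE[kind]:
            problems.append((kind, int(ident), state))
    return problems
```

States seen during a rebuild would also be reported by this check, which is usually what a monitor wants until the array is optimal again.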
HP SmartArray
- Identifying the RAID card
$ head -1 /proc/driver/cciss/cciss0
cciss0: HP Smart Array P400i Controller
http://www8.hp.com/jp/ja/support-drivers.html
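Search for your product on the support page above and download "HP Array Configuration Utility CLI for Linux" (listed on the Japanese site as 「HP アレイ コンフィギュレーション ユーティリティ CLI for Linux」; choose the 64-bit build if your OS requires it).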
http://www.datadisk.co.uk/html_docs/redhat/hpacucli.htm
# hpacucli ctrl all show config

Smart Array P400i in Slot 0 (Embedded) (sn: ... )

   array A (SAS, Unused Space: 0 MB)

      logicaldrive 1 (... GB, RAID 1, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, ... GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, ... GB, OK)

# hpacucli ctrl slot=0 array A show

Smart Array P400i in Slot 0 (Embedded)

   Array: A
      Interface Type: SAS
      Unused Space: 0 MB
      Status: OK
      MultiDomain Status: OK
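In the `show config` listing, each logicaldrive and physicaldrive line ends with its status inside the parentheses. The following is a minimal sketch, assuming the output of `hpacucli ctrl all show config` has been captured as text with one entry per line. The function name is hypothetical, and the sketch assumes the status is always the last comma-separated item, as it is in the output shown here; lines with extra trailing items would need adjusting.

```python
import re

# e.g. "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, ... GB, OK)"
DRIVE_LINE = re.compile(r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$")


def hp_not_ok(config_output):
    """Return [(kind, id, status)] for drives whose status is not 'OK'."""
    bad = []
    for line in config_output.splitlines():
        match = DRIVE_LINE.match(line)
        if not match:
            continue
        kind, ident, details = match.groups()
        status = details.rsplit(",", 1)[-1].strip()  # last item in the parentheses
        if status != "OK":
            bad.append((kind, ident, status))
    return bad
```

The status strings used for unhealthy drives vary, so the sketch reports whatever text it finds rather than matching a fixed list.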
A status-reporting tool called cciss_vol_status is also publicly available.
http://h50146.www5.hp.com/products/software/oe/linux/mainstream/support/download/cciss_vol_status/
# cciss_vol_status /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array P400i) RAID 1 Volume 0 status: OK.
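Since cciss_vol_status prints one line per volume, the status can be pulled out with a single pattern. The following is a minimal sketch, assuming the line layout shown above (device, controller name in parentheses, volume description, then "status:"). The function name is hypothetical, and any status text other than "OK" is simply passed through as a problem.

```python
import re

# "<device>: (<controller>) <volume description> status: <status>."
VOLUME_LINE = re.compile(r"^(\S+): \((.+?)\) (.+?) status: (.+?)\.?\s*$")


def cciss_volume_problems(output):
    """Return [(device, volume, status)] for volumes not reporting 'OK'."""
    problems = []
    for line in output.splitlines():
        match = VOLUME_LINE.match(line)
        if not match:
            continue
        device, _controller, volume, status = match.groups()
        if status != "OK":
            problems.append((device, volume, status))
    return problems
```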
IBM/Adaptec ServeRAID
- Identifying the RAID card
$ grep -i serveraid /var/log/dmesg
scsi0 : ServeRAID
A command-line tool called arcconf is publicly available; use it.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5073618&brandind=5000008
http://www.obvious.co.nz/aacraid/arcconf/
# arcconf -v
| ARCCONF | IBM uniform command line interface
| ARCCONF | Version 9.30 (B17006)
| ARCCONF | (C) Adaptec 2003-2007
| ARCCONF | All Rights Reserved
...
# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status : Okay
   Channel description : SAS/SATA
   Controller Model : IBM ServeRAID 8k
   Controller Serial Number : ...
   Physical Slot : 0
   Installed memory : 256 MB
   Copyback : Disabled
   Data scrubbing : Enabled
   Defunct disk drive count : 0
   Logical drives/Offline/Critical : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS : ...
   Firmware : ...
   Driver : ...
   Boot Flash : ...
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status : Okay
   Over temperature : No
   Capacity remaining : 100 percent
   Time remaining (at current draw) : ... days, ... hours, ... minutes
   --------------------------------------------------------
   Controller Vital Product Data
   --------------------------------------------------------
   VPD Assigned# : ...
   EC Version# : ...
   Controller FRU# : ...
   Battery FRU# : ...

----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
   Logical drive name : Drive 1
   RAID level : 5
   Status of logical drive : Okay
   Size : ... MB
   Read-cache mode : Enabled
   Write-cache mode : Enabled (write-back)
   Write-cache setting : Enabled (write-back) when protected by battery
   Partitioned : Yes
   Number of segments : 3
   Stripe-unit size : 256 KB
   Stripe order (Channel,Device) : 0,2 0,1 0,3
   Defunct segments : No
   Defunct stripes : No

----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device : 0,1
   Reported Location : Enclosure 0, Slot 1
   Reported ESD : 2,0
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   PFA : No
Device #1
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device : 0,2
   Reported Location : Enclosure 0, Slot 2
   Reported ESD : 2,0
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   PFA : No
Device #2
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device : 0,3
   Reported Location : Enclosure 0, Slot 0
   Reported ESD : 2,0
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   PFA : No
Device #3
   Device is an Enclosure services device
   Reported Channel,Device : 2,0
   Enclosure ID : 0
   Type : SES2
   Vendor : IBM
   Model : SAS SES-2 DEVICE
   Firmware : 1.10
   Status of Enclosure services device
      Temperature : Normal

Command completed successfully.
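The output differs slightly between versions of the command (compare the status values and field labels in the newer version below), so take care when building it into your monitoring.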
# arcconf -v
| UCLI | Adaptec by PMC uniform command line interface
| UCLI | Version 7.0 (B18786)
| UCLI | (C) Adaptec by PMC 2003-2011
| UCLI | All Rights Reserved
...
# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status : Optimal
   Channel description : SAS/SATA
   Controller Model : IBM ServeRAID 8k
   Controller Serial Number : ...
   Physical Slot : 0
   Installed memory : 256 MB
   Copyback : Disabled
   Background consistency check : Enabled
   Automatic Failover : Enabled
   Stayawake period : Disabled
   Spinup limit internal drives : 0
   Spinup limit external drives : 0
   Defunct disk drive count : 0
   Logical devices/Failed/Degraded : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS : ...
   Firmware : ...
   Driver : ...
   Boot Flash : ...
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status : Optimal
   Over temperature : No
   Capacity remaining : 100 percent
   Time remaining (at current draw) : ... days, ... hours, ... minutes

----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
   Logical device name : Drive 1
   RAID level : 5
   Status of logical device : Optimal
   Size : ... MB
   Stripe-unit size : 256 KB
   Read-cache mode : Enabled
   Write-cache mode : Enabled (write-back)
   Write-cache setting : Enabled (write-back) when protected by battery/ZMM
   Partitioned : Yes
   Protected by Hot-Spare : No
   Bootable : Yes
   Failed stripes : No
   Power settings : Disabled
   --------------------------------------------------------
   Logical device segment information
   --------------------------------------------------------
   Segment 0 : ...
   Segment 1 : ...
   Segment 2 : ...

----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device(T:L) : 0,1(1:0)
   Reported Location : Enclosure 0, Slot 1
   Reported ESD(T:L) : 2,0(0:0)
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   S.M.A.R.T. : No
   S.M.A.R.T. warnings : 0
Device #1
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device(T:L) : 0,2(2:0)
   Reported Location : Enclosure 0, Slot 2
   Reported ESD(T:L) : 2,0(0:0)
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   S.M.A.R.T. : No
   S.M.A.R.T. warnings : 0
Device #2
   Device is a Hard drive
   State : Online
   Supported : Yes
   Transfer Speed : SAS 3.0 Gb/s
   Reported Channel,Device(T:L) : 0,3(3:0)
   Reported Location : Enclosure 0, Slot 0
   Reported ESD(T:L) : 2,0(0:0)
   Vendor : IBM-ESXS
   Model : ...
   Firmware : ...
   Serial number : ...
   World-wide name : ...
   Size : ... MB
   Write Cache : Disabled (write-through)
   FRU : ...
   S.M.A.R.T. : No
   S.M.A.R.T. warnings : 0
Device #3
   Device is an Enclosure services device
   Reported Channel,Device(T:L) : 2,0(0:0)
   Enclosure ID : 0
   Type : SES2
   Vendor : IBM
   Model : SAS 4 DRIVE BP
   Firmware : 1.10
   Status of Enclosure services device

Command completed successfully.
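Because the two versions shown here disagree on both labels ("logical drive" vs "logical device") and values ("Okay" vs "Optimal"), a check that hard-codes one spelling will silently break after an upgrade. The following is a minimal sketch of a parser that tolerates both, assuming the `arcconf getconfig 1` output has been captured as text with one field per line. The function name, the set of labels inspected, and the set of values treated as healthy are assumptions drawn only from the two outputs in this article.

```python
import re

# Values reported for healthy components by the two versions shown here.
HEALTHY_VALUES = {"Okay", "Optimal", "Online"}

# Status-bearing labels, covering the older and the newer wording.
STATUS_LABELS = re.compile(
    r"^(Controller Status|Status|State|Status of logical (?:drive|device))$"
)

# "label : value" with whitespace around the separating colon, so colons
# inside a label such as "Device(T:L)" are not taken for the separator.
FIELD = re.compile(r"^\s*(.+?)\s+:\s+(.*?)\s*$")


def arcconf_problems(getconfig_output):
    """Return [(label, value)] for status fields with an unexpected value."""
    problems = []
    for line in getconfig_output.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        label, value = match.groups()
        if STATUS_LABELS.match(label) and value not in HEALTHY_VALUES:
            problems.append((label, value))
    return problems
```

The sketch reports only the label and value, not which device they belong to, and states that are normal for spare or unused drives would be flagged too; a production check would track the current "Device #" or "Logical device number" line as well.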
Using S.M.A.R.T.
With some RAID cards, the S.M.A.R.T. Monitoring Tools (smartmontools) can be used.
It is a good idea to use them alongside the RAID monitoring described above.
http://sourceforge.net/projects/smartmontools/
http://sourceforge.net/apps/trac/smartmontools/wiki/Supported_RAID-Controllers
# yum info smartmontools
...
Name : smartmontools
Arch : x86_64
Epoch : 1
Version : 5.42
Release : 2.el6
Size : 1.3 M
Repo : installed
From repo : base
Summary : Tools for monitoring SMART capable hard disks
URL : http://smartmontools.sourceforge.net/
License : GPLv2+
...
Incidentally, up through CentOS 4 the tools are included in kernel-utils, but that version may not work with RAID cards.
# smartctl -a -dcciss,1 /dev/cciss/c0d0p1
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.5.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor: HP
Product: ...
Revision: ...
User Capacity: ... bytes [... GB]
Logical block size: ... bytes
Logical Unit id: ...
Serial number: ...
Device type: disk
Transport protocol: SAS
Local Time is: ...
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 30 C
Drive Trip Temperature: 70 C
Manufactured in week 01 of year 2010
Specified cycle count over device lifetime: ...
Accumulated start-stop cycles: 10
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = ...

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count: 0
No self-tests have been logged
Long (extended) Self Test duration: ... seconds [... minutes]
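For monitoring purposes, the health line and the grown defect count are the fields most worth extracting from this output. The following is a minimal sketch, assuming the `smartctl -a` output has been captured as text with one field per line. The function names are hypothetical; the sketch also accepts the "overall-health self-assessment" line that smartctl prints for ATA disks, and treating a non-zero grown defect list as unhealthy is my own conservative assumption rather than a rule from smartmontools.

```python
import re

HEALTH_PATTERNS = (
    re.compile(r"SMART Health Status:\s*(.+)"),  # SCSI/SAS form, as shown above
    re.compile(r"SMART overall-health self-assessment test result:\s*(.+)"),  # ATA form
)
GROWN_DEFECTS = re.compile(r"Elements in grown defect list:\s*(\d+)")


def smart_summary(smartctl_output):
    """Return {'health': str or None, 'grown_defects': int or None}."""
    summary = {"health": None, "grown_defects": None}
    for raw in smartctl_output.splitlines():
        line = raw.strip()
        for pattern in HEALTH_PATTERNS:
            match = pattern.match(line)
            if match:
                summary["health"] = match.group(1).strip()
        match = GROWN_DEFECTS.match(line)
        if match:
            summary["grown_defects"] = int(match.group(1))
    return summary


def smart_is_healthy(summary):
    """True when health is OK/PASSED and no grown defects were reported."""
    return summary["health"] in ("OK", "PASSED") and not summary["grown_defects"]
```

A missing health line leaves `health` as `None`, which this sketch treats as unhealthy so that an unsupported controller or a failed command does not pass silently.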