CentOS_TIPS_000

Hardware RAID Monitoring

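Hardware RAID is no longer unusual, even on inexpensive servers.
I have supported such systems a number of times, and a surprising number of them are not monitored at all.
This page briefly summarizes how to monitor them, covering installation and usage of the tools, for the cases I have dealt with in the past.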

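The terminal output is shown for reference only. Parts of it are masked, and in most cases it reflects a healthy state.
How an error or fault shows up differs by tool and by the failing component, so see each tool's manual for details.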

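Personally, I think it is best to judge by status alone and simply replace the part (most systems will be under a maintenance contract anyway).
If the controller carries a battery for its cache, battery wear is also a common cause.
Like the disks, the battery is a consumable, so ideally it should be replaced on a schedule before it raises an alert.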

Note that monitoring software and tools usually ship with the hardware, so use them if they are available.

LSI Logic MegaRAID

  • Identifying the RAID card

$ grep -i megaraid /var/log/dmesg
scsi0 : LSI Logic MegaRAID driver

A command-line tool called MegaCLI is publicly available; build it and use it.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5082327
http://tools.rapidsoft.de/perc/perc-cheat-sheet.html

# yum info megacli
...
Name   : megacli
Arch   : i386
Version: 2.00.11
Release: 2
Size   : 1.7 M
Repo   : installed
Summary: MegaCli is used to manage SAS RAID controllers
...
  • Adapter information

# megacli -AdpAllinfo -aALL

Adapter #0

==============================================================================
                    Versions
                ================
Product Name    : MegaRAID SAS 8708EM2
Serial No       : ...
FW Package Build: ...

                    Mfg. Data
                ================
Mfg. Date       : ...
Rework Date     : ...
Revision No     : ...
Battery FRU     : ...

                Image Versions In Flash:
                ================
FW Version         : ...
BIOS Version       : ...
WebBIOS Version    : ...
Ctrl-R Version     : ...
Preboot CLI Version: ...
Boot Block Version : ...

                Pending Images In Flash
                ================
None

                PCI Info
                ================
Vendor Id       : ...
Device Id       : ...
SubVendorId     : ...
SubDeviceId     : ...

Host Interface  : PCIE

Number of Frontend Port: 0
Device Interface  : PCIE

Number of Backend Port: 8
Port  :  Address
0        ...
1        ...
2        ...
3        ...
4        ...
5        ...
6        ...
7        ...

                HW Configuration
                ================
SAS Address     : ...
BBU             : Absent
Alarm           : Present
NVRAM           : Present
Serial Debugger : Present
Memory          : Present
Flash           : Present
Memory Size     : 128MB

                Settings
                ================
Current Time                     : ...
Predictive Fail Poll Interval    : ...
Interrupt Throttle Active Count  : 16
Interrupt Throttle Completion    : 50us
Rebuild Rate                     : 30%
PR Rate                          : 30%
Resynch Rate                     : 30%
Check Consistency Rate           : 30%
Reconstruction Rate              : 30%
Cache Flush Interval             : 4s
Max Drives to Spinup at One Time : 2
Delay Among Spinup Groups        : 12s
Physical Drive Coercion Mode     : Disabled
Cluster Mode                     : Disabled
Alarm                            : Disabled
Auto Rebuild                     : Enabled
Battery Warning                  : Disabled
Ecc Bucket Size                  : 15
Ecc Bucket Leak Rate             : 1440 Minutes
Restore HotSpare on Insertion    : Enabled
Expose Enclosure Devices         : Disabled
Maintain PD Fail History         : Disabled
Host Request Reordering          : Enabled
Auto Detect BackPlane Enabled    : SGPIO/i2c SEP
Load Balance Mode                : Auto

                Capabilities
                ================
RAID Level Supported             : RAID0, RAID1, RAID10
Supported Drives                 : SAS, SATA

Allowed Mixing:
Mix In Enclosure Allowed

                Status
                ================
ECC Bucket Count                 : 0

                Limitations
                ================
Max Arms Per VD         : 32
Max Spans Per VD        : 8
Max Arrays              : 128
Max Number of VDs       : 64
Max Parallel Commands   : 1008
Max SGE Count           : 80
Max Data Transfer Size  : 8192 sectors
Max Strips PerIO        : 42
Min Stripe Size         : 8kB
Max Stripe Size         : 1024kB

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 3
  Disks           : 2
  Critical Disks  : 0
  Failed Disks    : 0

                Supported Adapter Operations
                ================
Rebuild Rate                    : Yes
CC Rate                         : Yes
BGI Rate                        : Yes
Reconstruct Rate                : Yes
Patrol Read Rate                : Yes
Alarm Control                   : Yes
Cluster Support                 : No
BBU                             : Yes
Spanning                        : Yes
Dedicated Hot Spare             : Yes
Revertible Hot Spares           : No
Foreign Config Import           : Yes
Self Diagnostic                 : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares               : Yes
Deny SCSI Passthrough           : No
Deny SMP Passthrough            : No
Deny STP Passthrough            : No

                Supported VD Operations
                ================
Read Policy          : Yes
Write Policy         : Yes
IO Policy            : Yes
Access Policy        : Yes
Disk Cache Policy    : Yes
Reconstruction       : Yes
Deny Locate          : No
Deny CC              : No

                Supported PD Operations
                ================
Force Online              : Yes
Force Offline             : Yes
Force Rebuild             : Yes
Deny Force Failed         : No
Deny Force Good/Bad       : No
Deny Missing Replace      : No
Deny Clear                : No
Deny Locate               : No
Disable Copyback          : No
Enable Copyback on SMART  : No

                Error Counters
                ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0

                Cluster Information
                ================
Cluster Permitted     : No
Cluster Active        : No

                Default Settings
                ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
Stripe Size                      : 64kB
Flush Time                       : 4 seconds
Write Policy                     : WB
Read Policy                      : None
Cache When BBU Bad               : Disabled
Cached IO                        : No
SMART Mode                       : Mode 6
Alarm Disable                    : No
Coercion Mode                    : None
ZCR Config                       : Unknown
Dirty LED Shows Drive Activity   : No
BIOS Continue on Error           : Yes
Spin Down Mode                   : None
Allowed Device Type              : SAS/SATA Mix
Allow Mix In Enclosure           : Yes
Allow Mix In VD                  : No
Allow SATA In Cluster            : No
Max Chained Enclosures           : 3
Disable Ctrl-R                   : Yes
Enable Web BIOS                  : Yes
Direct PD Mapping                : Yes
BIOS Enumerate VDs               : Yes
Restore Hot Spare on Insertion   : Yes
Expose Enclosure Devices         : No
Maintain PD Fail History         : No
Disable Puncturing               : Yes
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled              : No
LED Show Drive Activity          : Yes
Cluster Disable                  : Yes
SAS Disable                      : No
Auto Detect BackPlane Enable     : SGPIO/i2c SEP

Exit Code: 0x00
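
For monitoring, the counters in the "Device Present" block are often enough. The following is a minimal sketch, assuming the field labels appear exactly as in the sample output above (other MegaCLI versions may label them differently). The function name megacli_adapter_problems is hypothetical and not part of MegaCLI.

```shell
# Hypothetical helper (not part of MegaCLI).
# Reads `megacli -AdpAllinfo -aALL` output on stdin and prints every
# "Device Present" counter (Degraded, Offline, Critical Disks, Failed Disks)
# whose value is not 0. Exit status is 1 if anything was printed, else 0.
megacli_adapter_problems() {
    awk -F':' '
        /^[ \t]*(Degraded|Offline|Critical Disks|Failed Disks)[ \t]*:/ {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim the label
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim the counter
            if (value != "0") { print name "=" value; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Possible use (assumes megacli is installed and in PATH):
#   megacli -AdpAllinfo -aALL | megacli_adapter_problems || echo "RAID needs attention"
```
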
  • Physical device information

# megacli -PDList -aALL

Adapter #0

Enclosure Device ID: 252
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ...
Foreign State: None

Enclosure Device ID: 252
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: ...MB [... Sectors]
Non Coerced Size: ...MB [... Sectors]
Coerced Size: ...MB [... Sectors]
Firmware state: Online
SAS Address(0): ...
SAS Address(1): ...
Connected Port Number: 1(path0)
Inquiry Data: SEAGATE ...
Foreign State: None


Exit Code: 0x00
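
The per-drive fields above can be checked in the same way. This is only a sketch under the assumption that the output has the "Key: value" layout shown here. Other MegaCLI versions may print different state strings, and drives that are legitimately not "Online" (for example hot spares) would also be reported. The function name megacli_pd_problems is hypothetical.

```shell
# Hypothetical helper (not part of MegaCLI).
# Reads `megacli -PDList -aALL` output on stdin. For each slot it reports a
# non-zero media error or predictive failure count, and any firmware state
# that does not start with "Online". Exit status is 1 if anything was reported.
megacli_pd_problems() {
    awk -F':[ \t]*' '
        $1 == "Slot Number" { slot = $2 }
        $1 == "Media Error Count" && $2 != "0" {
            print "slot " slot ": media errors " $2; bad = 1
        }
        $1 == "Predictive Failure Count" && $2 != "0" {
            print "slot " slot ": predictive failures " $2; bad = 1
        }
        $1 == "Firmware state" && $2 !~ /^Online/ {
            print "slot " slot ": state " $2; bad = 1
        }
        END { exit bad + 0 }
    '
}

# Possible use (assumes megacli is installed and in PATH):
#   megacli -PDList -aALL | megacli_pd_problems
```
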
  • Logical device information

# megacli -LDInfo -Lall -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:array0
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:...MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disabled

Exit Code: 0x00
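
For the logical drive, the "State" line is the one to watch. A minimal sketch follows, assuming the "Virtual Disk: N" and "State: ..." lines look as they do in the sample above (the labels may differ in other MegaCLI versions). The function name megacli_ld_problems is hypothetical.

```shell
# Hypothetical helper (not part of MegaCLI).
# Reads `megacli -LDInfo -Lall -aALL` output on stdin and prints every
# virtual disk whose State is anything other than "Optimal".
# Exit status is 1 if anything was printed, else 0.
megacli_ld_problems() {
    awk '
        /^Virtual Disk:/ { vd = $3 }           # "Virtual Disk: 0 (Target Id: 0)"
        /^State[ \t]*:/ {
            state = $0
            sub(/^State[ \t]*:[ \t]*/, "", state)
            if (state != "Optimal") { print "virtual disk " vd ": " state; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Possible use (assumes megacli is installed and in PATH):
#   megacli -LDInfo -Lall -aALL | megacli_ld_problems
```
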

Although they are vendor-specific, NEC publishes tools for its Express5800 series.

Universal RAID Utility can be used on relatively recent Express5800 models.
http://support.express.nec.co.jp/dload/420842-A01/index.html

MegaMonitor can be used on Express5800 models that no longer appear in the product listings.
(On some of them the RAID card firmware is too old for megacli to work.)
http://www.express.nec.co.jp/linux/distributions/confirm/gam/megamgr.htm

LSI Logic Fusion-MPT

  • Identifying the RAID card

$ grep -i mptbase /var/log/dmesg
mptbase: ioc0: Initiating bringup

A command-line tool called mpt-status is publicly available; build it and use it.
http://sven.stormbind.net/mpt-status-rhel/

daemonize is also available in EPEL, so you can install it from there instead.

# yum --enablerepo=epel info daemonize
...
Name        : daemonize
Arch        : x86_64
Version     : 1.7.3
Release     : 1.el6
Size        : 19 k
Repo        : installed
Summary     : Run a command as a Unix daemon
URL         : http://www.clapper.org/software/daemonize/
License     : BSD
Description : daemonize runs a command as a Unix daemon. As defined in W.
...

# yum info mpt-status
...
Name        : mpt-status
Arch        : x86_64
Version     : 1.2.0
Release     : 3.el6
Size        : 31 k
Repo        : installed
Summary     : Get RAID status out of mpt (and other) HW RAID controllers
URL         : http://www.drugphish.ch/~ratz/mpt-status/
License     : GPLv2+
...

# chkconfig --list mpt-statusd
mpt-statusd     0:off   1:off   2:on    3:on    4:on    5:on    6:off

# service mpt-statusd start
Starting mpt-status monitor: mpt-statusd
                                                           [  OK  ]

Healthy state:

# mpt-status -s
log_id 0 OPTIMAL
phys_id 0 ONLINE
phys_id 1 ONLINE

Failure state:

# mpt-status -s
log_id 0 DEGRADED
phys_id 0 ONLINE
phys_id 1 FAILED

# mpt-status -v
ioc0 vol_id 0 type IM, 2 phy, 33 GB, state DEGRADED, flags ENABLED
ioc0 phy 0 scsi_id 0 IBM-ESXS MAP3367NC     FN B109, 33 GB, state ONLINE, flags NONE
ioc0 phy 1 scsi_id 1 IBM-ESXS MAP3367NC     FN B109, 33 GB, state FAILED, flags OUT_OF_SYNC
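
The short `mpt-status -s` format is easy to check from a script. The sketch below assumes the three-column layout shown in the two samples above, and that OPTIMAL and ONLINE are the only healthy values. The function name mpt_status_problems is hypothetical.

```shell
# Hypothetical helper (not part of mpt-status).
# Reads `mpt-status -s` output on stdin and prints every line whose logical
# volume is not OPTIMAL or whose physical disk is not ONLINE.
# Exit status is 1 if anything was printed, else 0.
mpt_status_problems() {
    awk '
        $1 == "log_id"  && $3 != "OPTIMAL" { print; bad = 1 }
        $1 == "phys_id" && $3 != "ONLINE"  { print; bad = 1 }
        END { exit bad + 0 }
    '
}

# Possible use (assumes mpt-status is installed and in PATH):
#   mpt-status -s | mpt_status_problems || echo "RAID needs attention"
```
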

HP SmartArray

  • Identifying the RAID card

$ head -1 /proc/driver/cciss/cciss0
cciss0: HP Smart Array P400i Controller

http://www8.hp.com/jp/ja/support-drivers.html
Search for your product on the support page above and download "HP Array Configuration Utility CLI for Linux" (the 64-bit version, depending on the OS in use).

http://www.datadisk.co.uk/html_docs/redhat/hpacucli.htm

# hpacucli ctrl all show config

Smart Array P400i in Slot 0 (Embedded)    (sn: ...     )

   array A (SAS, Unused Space: 0 MB)


      logicaldrive 1 (... GB, RAID 1, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, ... GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, ... GB, OK)

# hpacucli ctrl slot=0 array A show
Smart Array P400i in Slot 0 (Embedded)

   Array: A
      Interface Type: SAS
      Unused Space: 0 MB
      Status: OK
      MultiDomain Status: OK
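
In the `hpacucli ctrl all show config` output above, the status is the last comma-separated item inside the parentheses of each logicaldrive and physicaldrive line. The following sketch relies on that observation only, so treat it as an assumption about the layout rather than a documented format. The function name hpacucli_problems is hypothetical.

```shell
# Hypothetical helper (not part of hpacucli).
# Reads `hpacucli ctrl all show config` output on stdin and prints every
# logicaldrive / physicaldrive line whose trailing status is not "OK".
# Exit status is 1 if anything was printed, else 0.
hpacucli_problems() {
    awk '
        $1 == "logicaldrive" || $1 == "physicaldrive" {
            status = $0
            sub(/\)[ \t]*$/, "", status)     # drop the closing parenthesis
            sub(/^.*,[ \t]*/, "", status)    # keep the text after the last comma
            if (status != "OK") { print $1 " " $2 ": " status; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Possible use (assumes hpacucli is installed and in PATH):
#   hpacucli ctrl all show config | hpacucli_problems
```
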

A status tool called cciss_vol_status is also publicly available.
http://h50146.www5.hp.com/products/software/oe/linux/mainstream/support/download/cciss_vol_status/

# cciss_vol_status /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array P400i) RAID 1 Volume 0 status: OK.

IBM/Adaptec ServeRAID

  • Identifying the RAID card

$ grep -i serveraid /var/log/dmesg
scsi0 : ServeRAID

A command-line tool called arcconf is publicly available; use it.
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5073618&brandind=5000008
http://www.obvious.co.nz/aacraid/arcconf/

# arcconf -v
  | ARCCONF |  IBM uniform command line interface
  | ARCCONF |  Version 9.30 (B17006)
  | ARCCONF |  (C) Adaptec 2003-2007
  | ARCCONF |  All Rights Reserved
...

# arcconf getconfig 1

Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status                        : Okay
   Channel description                      : SAS/SATA
   Controller Model                         : IBM ServeRAID 8k
   Controller Serial Number                 : ...
   Physical Slot                            : 0
   Installed memory                         : 256 MB
   Copyback                                 : Disabled
   Data scrubbing                           : Enabled
   Defunct disk drive count                 : 0
   Logical drives/Offline/Critical          : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                                     : ...
   Firmware                                 : ...
   Driver                                   : ...
   Boot Flash                               : ...
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status                                   : Okay
   Over temperature                         : No
   Capacity remaining                       : 100 percent
   Time remaining (at current draw)         : ... days, ... hours, ... minutes
   --------------------------------------------------------
   Controller Vital Product Data
   --------------------------------------------------------
   VPD Assigned#                            : ...
   EC Version#                              : ...
   Controller FRU#                          : ...
   Battery FRU#                             : ...

----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
   Logical drive name                       : Drive 1
   RAID level                               : 5
   Status of logical drive                  : Okay
   Size                                     : ... MB
   Read-cache mode                          : Enabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery
   Partitioned                              : Yes
   Number of segments                       : 3
   Stripe-unit size                         : 256 KB
   Stripe order (Channel,Device)            : 0,2 0,1 0,3
   Defunct segments                         : No
   Defunct stripes                          : No

----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
      Device #0
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device            : 0,1
         Reported Location                  : Enclosure 0, Slot 1
         Reported ESD                       : 2,0
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         PFA                                : No
      Device #1
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device            : 0,2
         Reported Location                  : Enclosure 0, Slot 2
         Reported ESD                       : 2,0
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         PFA                                : No
      Device #2
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device            : 0,3
         Reported Location                  : Enclosure 0, Slot 0
         Reported ESD                       : 2,0
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         PFA                                : No
      Device #3
         Device is an Enclosure services device
         Reported Channel,Device            : 2,0
         Enclosure ID                       : 0
         Type                               : SES2
         Vendor                             : IBM
         Model                              : SAS SES-2 DEVICE
         Firmware                           : 1.10
         Status of Enclosure services device
            Temperature                     : Normal


Command completed successfully.

The output differs slightly between versions of the command, so take care when building it into your monitoring.

# arcconf -v
  | UCLI |  Adaptec by PMC uniform command line interface
  | UCLI |  Version 7.0 (B18786)
  | UCLI |  (C) Adaptec by PMC 2003-2011
  | UCLI |  All Rights Reserved
...

# arcconf getconfig 1

Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status                        : Optimal
   Channel description                      : SAS/SATA
   Controller Model                         : IBM ServeRAID 8k
   Controller Serial Number                 : ...
   Physical Slot                            : 0
   Installed memory                         : 256 MB
   Copyback                                 : Disabled
   Background consistency check             : Enabled
   Automatic Failover                       : Enabled
   Stayawake period                         : Disabled
   Spinup limit internal drives             : 0
   Spinup limit external drives             : 0
   Defunct disk drive count                 : 0
   Logical devices/Failed/Degraded          : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                                     : ...
   Firmware                                 : ...
   Driver                                   : ...
   Boot Flash                               : ...
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status                                   : Optimal
   Over temperature                         : No
   Capacity remaining                       : 100 percent
   Time remaining (at current draw)         : ... days, ... hours, ... minutes

----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
   Logical device name                      : Drive 1
   RAID level                               : 5
   Status of logical device                 : Optimal
   Size                                     : ... MB
   Stripe-unit size                         : 256 KB
   Read-cache mode                          : Enabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
   Partitioned                              : Yes
   Protected by Hot-Spare                   : No
   Bootable                                 : Yes
   Failed stripes                           : No
   Power settings                           : Disabled
   --------------------------------------------------------
   Logical device segment information
   --------------------------------------------------------
   Segment 0                                : ...
   Segment 1                                : ...
   Segment 2                                : ...


----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
      Device #0
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,1(1:0)
         Reported Location                  : Enclosure 0, Slot 1
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         S.M.A.R.T.                         : No
         S.M.A.R.T. warnings                : 0
      Device #1
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,2(2:0)
         Reported Location                  : Enclosure 0, Slot 2
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         S.M.A.R.T.                         : No
         S.M.A.R.T. warnings                : 0
      Device #2
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,3(3:0)
         Reported Location                  : Enclosure 0, Slot 0
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : ...
         Firmware                           : ...
         Serial number                      : ...
         World-wide name                    : ...
         Size                               : ... MB
         Write Cache                        : Disabled (write-through)
         FRU                                : ...
         S.M.A.R.T.                         : No
         S.M.A.R.T. warnings                : 0
      Device #3
         Device is an Enclosure services device
         Reported Channel,Device(T:L)       : 2,0(0:0)
         Enclosure ID                       : 0
         Type                               : SES2
         Vendor                             : IBM
         Model                              : SAS 4 DRIVE BP
         Firmware                           : 1.10
         Status of Enclosure services device


Command completed successfully.
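
The two listings differ in both labels ("logical drive" vs. "logical device") and healthy values ("Okay" vs. "Optimal"), so a check needs to accept both. A minimal sketch follows, assuming only the labels and values visible in the two samples above. Drives in a state other than "Online" that are nevertheless healthy (for example spares) would also be reported. The function name arcconf_problems is hypothetical.

```shell
# Hypothetical helper (not part of arcconf).
# Reads `arcconf getconfig 1` output on stdin and prints controller, battery
# and logical drive status lines that are neither "Okay" nor "Optimal", and
# physical device State lines that are not "Online".
# Exit status is 1 if anything was printed, else 0.
arcconf_problems() {
    awk -F'[ \t]*:[ \t]*' '
        { key = $1; sub(/^[ \t]+/, "", key) }    # label without indentation
        key == "Controller Status" || key == "Status" ||
        key ~ /^Status of logical (drive|device)$/ {
            if ($2 != "Okay" && $2 != "Optimal") { print key ": " $2; bad = 1 }
        }
        key == "State" && $2 != "Online" { print key ": " $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# Possible use (assumes arcconf is installed and in PATH):
#   arcconf getconfig 1 | arcconf_problems || echo "RAID needs attention"
```
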

Using S.M.A.R.T.

Depending on the RAID card, S.M.A.R.T. Monitoring Tools (smartmontools) can be used.
It is worth using alongside the RAID monitoring.
http://sourceforge.net/projects/smartmontools/
http://sourceforge.net/apps/trac/smartmontools/wiki/Supported_RAID-Controllers

# yum info smartmontools
...
Name        : smartmontools
Arch        : x86_64
Epoch       : 1
Version     : 5.42
Release     : 2.el6
Size        : 1.3 M
Repo        : installed
From repo   : base
Summary     : Tools for monitoring SMART capable hard disks
URL         : http://smartmontools.sourceforge.net/
License     : GPLv2+
...

Incidentally, up to CentOS 4 it is included in kernel-utils, but that version may not work with RAID cards.

# smartctl -a -dcciss,1 /dev/cciss/c0d0p1

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.5.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               HP
Product:              ...
Revision:             ...
User Capacity:        ... bytes [... GB]
Logical block size:   ... bytes
Logical Unit id:      ...
Serial number:        ...
Device type:          disk
Transport protocol:   SAS
Local Time is:        ...
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        70 C
Manufactured in week 01 of year 2010
Specified cycle count over device lifetime:  ...
Accumulated start-stop cycles:  10
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = ...

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count:       0
No self-tests have been logged
Long (extended) Self Test duration: ... seconds [... minutes]
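
The overall health line is the simplest thing to check from a script. This sketch accepts the SCSI/SAS wording shown above ("SMART Health Status: OK") as well as the wording smartctl prints for ATA drives ("SMART overall-health self-assessment test result: PASSED"). The function name smart_health_ok is hypothetical, and it treats missing health information as a failure.

```shell
# Hypothetical helper (not part of smartmontools).
# Reads smartctl output on stdin and exits 0 only if a health line reports
# OK (SCSI/SAS) or PASSED (ATA). Missing or different text exits 1.
smart_health_ok() {
    awk '
        /^SMART Health Status:[ \t]*OK[ \t]*$/ { ok = 1 }
        /^SMART overall-health self-assessment test result:[ \t]*PASSED[ \t]*$/ { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# Possible use, following the device naming in the example above:
#   smartctl -H -d cciss,1 /dev/cciss/c0d0 | smart_health_ok || echo "check disk 1"
```
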