A question about the smartd disk monitoring tool.

Author: kyonn
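After installing smartd, I attached a disk that reports Current_Pending_Sector errors to the motherboard, but smartd does not report the error and sends no email.
Here is what I have checked so far: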
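  1. I tested with the -M test directive in smartd.conf, and email delivery works.
  2. I confirmed that when smartd runs the short test it does read the abnormal Current_Pending_Sector value, but the failure handling is not triggered and no email is sent.
  3. On OMV5 this failing disk was detected correctly and alert emails were sent. It has since been moved to a new motherboard, and the operating system is now Debian 12.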
    /etc/smartd.conf
    /dev/sdd -a -o on -S on -n standby,q -T permissive -s (S/../.././09|L/../01/./04) -W 0,50,55 -m [email protected] -M exec /usr/share/smartmontools/smartd-runner
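As a sanity check on the -s schedule in the line above, here is a minimal Python sketch of how that expression can be read. According to the smartd.conf manual page, the regular expression is matched against strings of the form T/MM/DD/d/HH (test type, month, day of month, day of week with Monday = 1, hour). The helper name `scheduled_tests` is made up for illustration, Python's `re` stands in for the POSIX extended regex that smartd uses, and `re.fullmatch` is an assumption about how strictly smartd anchors the match.

```python
import re
from datetime import datetime

# Schedule regex copied from the -s directive in the smartd.conf line above.
SCHEDULE = r"(S/../.././09|L/../01/./04)"


def scheduled_tests(when, schedule=SCHEDULE):
    """Return the test types (L, S, C, O) whose T/MM/DD/d/HH string matches."""
    matched = []
    for test_type in "LSCO":
        # isoweekday(): Monday=1 ... Sunday=7, the numbering the man page documents.
        stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, stamp):
            matched.append(test_type)
    return matched


# Monday 22 Jan 2024, during the 09 hour: the short-test branch matches.
print(scheduled_tests(datetime(2024, 1, 22, 9, 30)))
# 1 Jan 2024, during the 04 hour: the long-test branch matches.
print(scheduled_tests(datetime(2024, 1, 1, 4, 0)))
# Any other hour: nothing is scheduled.
print(scheduled_tests(datetime(2024, 1, 22, 10, 0)))
```

Read this way, the schedule asks for a short test every day in the 09 hour and a long test on the 1st of each month in the 04 hour, which is consistent with the 09:51 timestamps in the syslog excerpt.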
    Syslog messages logged when smartd runs the short test:
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], previous self-test completed with error (read test element)
    Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], Self-Test Log error count increased from 9 to 10
    Jan 22 10:00:01 debian12 CRON[6148]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
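To make it easier to see exactly which problems smartd itself logged, the excerpt above can be filtered down to the smartd device messages. This is only a rough sketch: the regular expression is written against the five lines shown here, not against a documented log format, and the function name `smartd_events` is hypothetical.

```python
import re

# The five syslog lines quoted above, unchanged.
SYSLOG = """\
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], previous self-test completed with error (read test element)
Jan 22 09:51:19 debian12 smartd[5979]: Device: /dev/sdd [SAT], Self-Test Log error count increased from 9 to 10
Jan 22 10:00:01 debian12 CRON[6148]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
"""

# Pattern inferred from the lines above; other smartd versions may log differently.
LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+) \[(?P<type>[^\]]+)\], (?P<msg>.*)"
)


def smartd_events(text):
    """Return (device, message) pairs for every smartd device line in text."""
    events = []
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            events.append((match["dev"], match["msg"]))
    return events


for dev, msg in smartd_events(SYSLOG):
    print(f"{dev}: {msg}")
```

On this excerpt the filter keeps the four smartd lines and drops the unrelated CRON line. None of the smartd lines mentions a warning email or script, which matches the symptom described.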
    smartctl -l selftest /dev/sdd
    smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
    Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%     21653         792577336
    # 2  Short offline       Completed: read failure       90%     21647         792577336
    # 3  Short offline       Completed: read failure       90%     21645         792577336
    # 4  Short offline       Completed: read failure       90%     21644         792577336
    # 5  Short offline       Completed: read failure       90%     21642         792577336
    # 6  Short offline       Completed: read failure       90%     21639         792577336
    # 7  Extended offline    Completed: read failure       90%     21639         792577336
    # 8  Short offline       Completed: read failure       90%     21639         792577336
    # 9  Short offline       Completed: read failure       90%     21586         792577336
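The self-test log above can be summarized with a short script. The sketch below parses the nine rows shown, checks whether every failure points at the same LBA, and relates that LBA to the 512-byte logical / 4096-byte physical sector sizes that smartctl reports for this drive. The row regex is an assumption fitted to this output, and the conclusion about a single bad physical sector is an inference, not something smartctl states.

```python
import re

# The nine rows of `smartctl -l selftest /dev/sdd` quoted above.
SELFTEST_LOG = """\
# 1  Short offline       Completed: read failure       90%     21653         792577336
# 2  Short offline       Completed: read failure       90%     21647         792577336
# 3  Short offline       Completed: read failure       90%     21645         792577336
# 4  Short offline       Completed: read failure       90%     21644         792577336
# 5  Short offline       Completed: read failure       90%     21642         792577336
# 6  Short offline       Completed: read failure       90%     21639         792577336
# 7  Extended offline    Completed: read failure       90%     21639         792577336
# 8  Short offline       Completed: read failure       90%     21639         792577336
# 9  Short offline       Completed: read failure       90%     21586         792577336
"""

# Columns are separated by runs of two or more spaces in this output.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)$")

LOGICAL_BYTES, PHYSICAL_BYTES = 512, 4096  # from "Sector Sizes" in smartctl -a


def parse_selftests(text):
    """Parse self-test log rows into dictionaries."""
    tests = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            num, kind, status, _remaining, hours, lba = match.groups()
            tests.append({
                "num": int(num),
                "type": kind,
                "status": status,
                "hours": int(hours),
                "lba": None if lba == "-" else int(lba),
            })
    return tests


tests = parse_selftests(SELFTEST_LOG)
failed = [t for t in tests if "read failure" in t["status"]]
failing_lbas = {t["lba"] for t in failed}
sectors_per_physical = PHYSICAL_BYTES // LOGICAL_BYTES

print(f"{len(failed)} of {len(tests)} listed tests failed")
for lba in sorted(failing_lbas):
    aligned = lba % sectors_per_physical == 0
    print(f"LBA {lba}: 4K-aligned={aligned}, physical sector {lba // sectors_per_physical}")
```

All nine failures report the same first-error LBA, and that LBA is a multiple of 8. One plausible reading is that a single 4096-byte physical sector (8 logical sectors) is unreadable, which would also fit the raw value of 8 on attributes 197 and 198.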
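    Could anyone who has configured smartd suggest how to troubleshoot this, or recommend similar software that provides OMV-style SMART monitoring with email notifications? A full OMV install is too heavyweight, and I would rather not install it again.
    Full output of smartctl -a /dev/sdd: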
    smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-17-amd64] (local build)
    Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST2000DM001-1ER164
    Serial Number:    W56012C5
    LU WWN Device Id: 5 000c50 09b3a804f
    Firmware Version: CC26
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database 7.3/5577
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Mon Jan 22 10:45:42 2024 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    General SMART Values:
    Offline data collection status:  (0x00)        Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      ( 121)        The previous self-test completed having
                                            the read element of the test failed.
    Total time to complete Offline
    data collection:                 (   80) seconds.
    Offline data collection
    capabilities:                          (0x73) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003)        Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01)        Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:          (   1) minutes.
    Extended self-test routine
    recommended polling time:          ( 207) minutes.
    Conveyance self-test routine
    recommended polling time:          (   2) minutes.
    SCT capabilities:                (0x1085)        SCT Status supported.
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   117   088   006    Pre-fail  Always       -       151185992
      3 Spin_Up_Time            0x0003   097   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1370
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       82712512
      9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21655
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1262
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       5 5 5
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   071   056   045    Old_age   Always       -       29 (Min/Max 14/30)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       187
    193 Load_Cycle_Count        0x0032   090   090   000    Old_age   Always       -       20532
    194 Temperature_Celsius     0x0022   029   044   000    Old_age   Always       -       29 (0 7 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       20310h+10m+19.082s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       8742325499
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       146195076257
    SMART Error Log Version: 1
    ATA Error Count: 15 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    Error 15 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:54:51.942  READ FPDMA QUEUED
      61 00 78 ff ff ff 4f 00  36d+21:54:49.230  WRITE FPDMA QUEUED
      60 00 48 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
      60 00 b0 ff ff ff 4f 00  36d+21:54:48.864  READ FPDMA QUEUED
      60 00 48 ff ff ff 4f 00  36d+21:54:48.862  READ FPDMA QUEUED
    Error 14 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:54:01.959  READ FPDMA QUEUED
      61 00 08 00 22 06 40 00  36d+21:54:01.958  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:54:01.948  SET FEATURES [Enable SATA feature]
      27 00 00 00 00 00 e0 00  36d+21:54:01.921  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00  36d+21:54:01.921  IDENTIFY DEVICE
    Error 13 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:53:58.059  READ FPDMA QUEUED
      61 00 08 ff ff ff 4f 00  36d+21:53:58.046  WRITE FPDMA QUEUED
      61 00 08 00 22 06 40 00  36d+21:53:58.035  WRITE FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:58.015  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:53:58.005  SET FEATURES [Enable SATA feature]
    Error 12 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00  36d+21:53:54.205  READ FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:54.205  WRITE FPDMA QUEUED
      61 00 08 e8 21 06 40 00  36d+21:53:54.204  WRITE FPDMA QUEUED
      ef 10 02 00 00 00 a0 00  36d+21:53:54.193  SET FEATURES [Enable SATA feature]
      27 00 00 00 00 00 e0 00  36d+21:53:54.167  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
    Error 11 occurred at disk power-on lifetime: 21537 hours (897 days + 9 hours)
      When the command that caused the error occurred, the device was active or idle.
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 08 e8 21 06 40 00  36d+21:53:53.926  WRITE FPDMA QUEUED
      61 00 10 ff ff ff 4f 00  36d+21:53:53.671  WRITE FPDMA QUEUED
      60 00 08 ff ff ff 4f 00  36d+21:53:50.339  READ FPDMA QUEUED
      60 00 18 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED
      60 00 20 ff ff ff 4f 00  36d+21:53:50.279  READ FPDMA QUEUED
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%     21653         792577336
    # 2  Short offline       Completed: read failure       90%     21647         792577336
    # 3  Short offline       Completed: read failure       90%     21645         792577336
    # 4  Short offline       Completed: read failure       90%     21644         792577336
    # 5  Short offline       Completed: read failure       90%     21642         792577336
    # 6  Short offline       Completed: read failure       90%     21639         792577336
    # 7  Extended offline    Completed: read failure       90%     21639         792577336
    # 8  Short offline       Completed: read failure       90%     21639         792577336
    # 9  Short offline       Completed: read failure       90%     21586         792577336
    #10  Extended offline    Completed: read failure       90%     21546         792577336
    #11  Short offline       Completed without error       00%     21418         -
    #12  Short offline       Completed without error       00%     21250         -
    #13  Short offline       Completed without error       00%     21082         -
    #14  Short offline       Completed without error       00%     20914         -
    #15  Short offline       Completed without error       00%     20746         -
    #16  Short offline       Completed without error       00%     20578         -
    #17  Short offline       Completed without error       00%     20410         -
    #18  Short offline       Completed without error       00%     20242         -
    #19  Short offline       Completed without error       00%     20074         -
    #20  Short offline       Completed without error       00%     19906         -
    #21  Short offline       Completed without error       00%     19738         -
    SMART Selective self-test log data structure revision number 1
    SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
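For anyone comparing the attribute table in the smartctl -a output above with what smartd is expected to alert on, the following sketch pulls out the sector-related attributes and compares each normalized VALUE with its THRESH. Only a few rows from the table are copied in, the choice of watched IDs (5, 187, 197, 198) is illustrative rather than smartd's own list, and the column split assumes the layout shown here. The smartd.conf manual page does document that -a includes -C 197 and -U 198, so raw counts on those two attributes are what smartd tracks with the configuration in this post.

```python
# A few rows copied from the attribute table in the smartctl -a output above.
ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   085   085   000    Old_age   Always       -       15
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Illustrative watch list, not taken from smartd itself.
WATCHED = {5, 187, 197, 198}


def parse_attributes(text):
    """Map attribute ID to name, normalized value, threshold, and raw value."""
    rows = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) == 10 and fields[0].isdigit():
            rows[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
    return rows


def nonzero_watched(rows):
    """Return watched attributes whose raw counter is not zero."""
    return {
        attr_id: rows[attr_id]["raw"]
        for attr_id in sorted(rows)
        if attr_id in WATCHED and rows[attr_id]["raw"].split()[0] != "0"
    }


rows = parse_attributes(ATTRIBUTES)
for attr_id, raw in nonzero_watched(rows).items():
    row = rows[attr_id]
    below = row["value"] <= row["thresh"]
    print(f"{attr_id} {row['name']}: raw={raw}, value={row['value']}, "
          f"thresh={row['thresh']}, at_or_below_threshold={below}")
```

In these rows no normalized value has dropped to its threshold, which fits the overall-health result of PASSED even though the raw counts for 187, 197, and 198 are nonzero. That is why the missing alert concerns the raw pending/uncorrectable counts and the self-test failures rather than the overall health status.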