I came across an article by the sysadmin Zerox about monitoring disk S.M.A.R.T. data with Zabbix, but I could not get it working by following his post. So here I describe my own experience of setting up this essential piece of monitoring.
If you are here because you have started farming Chia, you can reach me through any of my contacts: I can help deploy this solution on your farms, and I already have experience doing so.
We will deploy a solution from GitHub. In essence, this post is just a translation with a few explanations. 🙂
I have put all the required components into an archive that can be downloaded from Yandex.Disk (if the link is broken, write in the comments, email me, or look on GitHub).
Solution features
This solution handles the following tasks:
Preparing the Zabbix server
All you need to do is add this wonderful template to your Zabbix.
Preparing the Zabbix agent on Windows
Installing smartmontools
Nothing unusual here: install smartmontools like any ordinary program. One caveat: I do not recommend changing the install path, otherwise you will have to change it in the agent config and in the script as well.
Configuring the agent
Create a scripts folder and put our script smartctl-disks-discovery.ps1 there.
Open zabbix_agentd.conf and edit it:
Then add a user parameter (custom check):
All that remains is to restart the agent and link our host to the template.
The data will arrive in about an hour. (For debugging you can shorten the discovery interval; I usually set it to 10 minutes by changing 1h to 10m. Just do not forget to change it back.)
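For orientation, a discovery script for Zabbix low-level discovery generally has to print a JSON document listing the devices it found. The sketch below is not the smartctl-disks-discovery.ps1 script from the archive; it is a minimal Python illustration that turns the text printed by `smartctl --scan` into such a document. The macro names {#DISKNAME} and {#DISKTYPE} and the sample scan output are assumptions made for the example.

```python
import json


def build_discovery(scan_output: str) -> str:
    """Convert `smartctl --scan` text into a Zabbix LLD JSON document."""
    disks = []
    for line in scan_output.splitlines():
        # Drop the trailing "# ..." comment that smartctl appends to each line.
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        # Macro names are hypothetical; a real template defines its own.
        entry = {"{#DISKNAME}": fields[0]}
        # The device type follows the "-d" option when smartctl reports one.
        if "-d" in fields[1:-1]:
            entry["{#DISKTYPE}"] = fields[fields.index("-d", 1) + 1]
        disks.append(entry)
    return json.dumps({"data": disks})


# Assumed sample of what `smartctl --scan` prints on a two-disk machine.
SAMPLE_SCAN = (
    "/dev/sda -d ata # /dev/sda, ATA device\n"
    "/dev/sdb -d scsi # /dev/sdb, SCSI device\n"
)

if __name__ == "__main__":
    print(build_discovery(SAMPLE_SCAN))
```

Keeping the parser separate from the call to smartctl makes it possible to check the output format on a machine that has no disks to scan.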
Result
So we have set up monitoring of SSD and HDD disks. This solution performs very well in production. For critical disks you can build informative graphs like these. I like it 🙂
Troubleshooting
I ran into this problem when I forgot to install smartmontools.
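A quick way to rule out a missing smartmontools installation is to check whether the smartctl binary can be found at all. The following is a minimal Python sketch of such a check, not part of the original solution; the Windows path in DEFAULT_CANDIDATES is an assumption about the default install location and should be adjusted if you installed elsewhere.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional

# Assumed default install location of smartmontools on Windows.
DEFAULT_CANDIDATES = (r"C:\Program Files\smartmontools\bin\smartctl.exe",)


def find_smartctl(
    name: str = "smartctl",
    candidates: Iterable[str] = DEFAULT_CANDIDATES,
) -> Optional[str]:
    """Return a path to smartctl, or None if it cannot be found."""
    # First look on PATH, then fall back to the known install locations.
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    location = find_smartctl()
    print(location or "smartctl not found: install smartmontools first")
```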
S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs), solid-state drives (SSDs), and eMMC drives.
Available solutions
Also available for: 5.0
SMART by Zabbix agent 2
Overview
For Zabbix version: 5.4 and higher.
The template monitors S.M.A.R.T. attributes of physical disks and works without any external scripts. It collects metrics via Zabbix agent 2 version 5.0 or later with Smartmontools version 7.1 or later. The disk discovery LLD rule finds all HDD, SSD, and NVMe disks with S.M.A.R.T. enabled. The attribute discovery LLD rule finds all Vendor Specific Attributes for each disk. If you want to skip some attributes, set regular expressions with disk names in {$SMART.DISK.NAME.MATCHES} and with attribute IDs in {$SMART.ATTRIBUTE.ID.MATCHES} macros on the host level.
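To get a feel for how such regular expressions partition discovered disks or attribute IDs, here is a small Python sketch. It only illustrates regex matching; the disk names, attribute IDs, and patterns are hypothetical, Zabbix evaluates its expressions with PCRE rather than Python's `re`, and which of the two groups is then skipped is decided by the template's overrides.

```python
import re
from typing import List, Sequence, Tuple


def split_by_pattern(values: Sequence[str], pattern: str) -> Tuple[List[str], List[str]]:
    """Split values into (matched, unmatched) for a regular expression."""
    regex = re.compile(pattern)
    matched = [value for value in values if regex.search(value)]
    unmatched = [value for value in values if not regex.search(value)]
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical discovered disk names and a pattern for a disk-name macro.
    print(split_by_pattern(["sda", "sdb", "nvme0"], r"^sd[ab]$"))
    # Hypothetical attribute IDs and a pattern for an attribute-ID macro.
    print(split_by_pattern(["5", "9", "194", "197"], r"^(5|197)$"))
```

Anchoring the pattern with `^` and `$` matters here: without the anchors, a pattern such as `5` would also match the ID `195`.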
This template was tested on:
Setup
Install Zabbix agent 2 and Smartmontools 7.1 or later.
Zabbix configuration
No specific Zabbix configuration is required.
Macros used
This macro is used in overrides of attribute discovery for filtering IDs. It can be overridden on the host or linked template level.
This macro is used in overrides of attribute discovery for filtering IDs. It can be overridden on the host or linked template level.
This macro is used for trigger expression. It can be overridden on the host or linked template level.
This macro is used for trigger expression. It can be overridden on the host or linked template level.
Template links
There are no template links in this template.
Discovery rules
Discovery SMART disks.
ZABBIX_PASSIVE
smart.disk.discovery
Overrides:
Discovery SMART Vendor Specific Attributes of disks.
ZABBIX_PASSIVE
smart.attribute.discovery
Overrides:
Items collected
Group
Name
Description
Type
Key and additional info
Zabbix_raw_items
SMART: Get attributes
DEPENDENT
smart.disk.model[{#NAME}]
Preprocessing:
DEPENDENT
smart.disk.sn[{#NAME}]
Preprocessing:
Whether the disk has passed the SMART self-test.
DEPENDENT
smart.disk.test[{#NAME}]
Preprocessing:
Current drive temperature.
DEPENDENT
smart.disk.temperature[{#NAME}]
Preprocessing:
Count of hours in power-on state. The raw value of this attribute shows the total count of hours (or minutes, or seconds, depending on the manufacturer) in power-on state. "By default, the total expected lifetime of a hard disk in perfect condition is defined as 5 years (running every day and night on all days). This is equal to 1825 days in 24/7 mode or 43800 hours." On some pre-2005 drives, this raw value may advance erratically and/or "wrap around" (reset to zero periodically). https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
DEPENDENT
smart.disk.hours[{#NAME}]
Preprocessing:
Contains a vendor-specific estimate of the percentage of NVM subsystem life used, based on the actual usage and the manufacturer's prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state).
DEPENDENT
smart.disk.percentage_used[{#NAME}]
Preprocessing:
This field indicates critical warnings for the state of the controller.
DEPENDENT
smart.disk.critical_warning[{#NAME}]
Preprocessing:
Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field.
DEPENDENT
smart.disk.media_errors[{#NAME}]
Preprocessing:
DEPENDENT
smart.disk.error[{#NAME},{#ID}]
Preprocessing:
DEPENDENT
smart.disk.attr.raw[{#NAME},{#ID}]
Preprocessing:
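The arithmetic behind two of the item descriptions above can be checked with a short Python sketch. It assumes the power-on raw value is reported in hours (the description notes that some manufacturers use minutes or seconds), and the function names are made up for the illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def expected_lifetime_hours(years: int = 5) -> int:
    """Expected lifetime in 24/7 mode: 5 years -> 1825 days -> 43800 hours."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


def lifetime_used_percent(power_on_hours: int, years: int = 5) -> float:
    """Share of the expected lifetime already spent powered on."""
    return 100.0 * power_on_hours / expected_lifetime_hours(years)


def encode_percentage_used(percent: int) -> int:
    """NVMe 'percentage used': values above 254 are reported as 255."""
    return 255 if percent > 254 else percent


if __name__ == "__main__":
    print(expected_lifetime_hours())      # 43800
    print(lifetime_used_percent(21900))   # 50.0
    print(encode_percentage_used(300))    # 255
```

Note that a "percentage used" of 100 or more is a valid reading and means the estimated endurance is consumed, not that the drive has failed.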
Triggers
Name
Description
Expression
Severity
Dependencies and additional info
SMART [{#NAME}]: Disk has been replaced (new serial number received)
Device serial number has changed. Ack to close.
last(/SMART by Zabbix agent 2/smart.disk.sn[{#NAME}],#1)<>last(/SMART by Zabbix agent 2/smart.disk.sn[{#NAME}],#2) and length(last(/SMART by Zabbix agent 2/smart.disk.sn[{#NAME}]))>0
INFO
last(/SMART by Zabbix agent 2/smart.disk.test[{#NAME}])="false"
HIGH
SMART [{#NAME}]: Average disk temperature is too high (over {$SMART.TEMPERATURE.MAX.WARN}°C for 5m)
Depends on:
- SMART [{#NAME}]: Average disk temperature is critical (over {$SMART.TEMPERATURE.MAX.CRIT}°C for 5m)
avg(/SMART by Zabbix agent 2/smart.disk.temperature[{#NAME}],5m)>{$SMART.TEMPERATURE.MAX.WARN}
AVERAGE
SMART [{#NAME}]: NVMe disk percentage using is over 90% of estimated endurance
last(/SMART by Zabbix agent 2/smart.disk.percentage_used[{#NAME}])>90
last(/SMART by Zabbix agent 2/smart.disk.error[{#NAME},{#ID}])
WARNING
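The logic of the serial-number and temperature triggers listed above can be paraphrased in Python as follows. This is only a sketch of the conditions, not how Zabbix evaluates them; the threshold values in the usage example are hypothetical, and the critical check is assumed to mirror the warning expression with the critical threshold.

```python
from statistics import mean
from typing import Sequence


def serial_changed(history: Sequence[str]) -> bool:
    """Latest serial differs from the previous one and is not empty.

    history is ordered oldest to newest.
    """
    if len(history) < 2:
        return False
    latest, previous = history[-1], history[-2]
    return latest != previous and len(latest) > 0


def temperature_state(samples_5m: Sequence[float], warn: float, crit: float) -> str:
    """Classify the average of the samples collected over the last 5 minutes."""
    average = mean(samples_5m)  # raises StatisticsError if there are no samples
    if average > crit:
        # The warning trigger depends on the critical one, so only this fires.
        return "critical"
    if average > warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(serial_changed(["WD-123", "WD-456"]))                     # True
    print(temperature_state([49.0, 52.0, 55.0], warn=50, crit=65))  # warning
```

Averaging over a window rather than comparing the last value keeps a single hot reading from raising an alert.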
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help with it on the Zabbix forums.
Also available for: 5.4
Template Module SMART by Zabbix agent 2
Overview
For Zabbix version: 5.0 and higher.
The template monitors S.M.A.R.T. attributes of physical disks and works without any external scripts. It collects metrics via Zabbix agent 2 version 5.0 or later with Smartmontools version 7.1 or later. The disk discovery LLD rule finds all HDD, SSD, and NVMe disks with S.M.A.R.T. enabled. The attribute discovery LLD rule finds all Vendor Specific Attributes for each disk. If you want to skip some attributes, set regular expressions with disk names in {$SMART.DISK.NAME.MATCHES} and with attribute IDs in {$SMART.ATTRIBUTE.ID.MATCHES} macros on the host level.
This template was tested on:
Setup
Install Zabbix agent 2 and Smartmontools 7.1 or later.
Zabbix configuration
No specific Zabbix configuration is required.
Macros used
This macro is used in overrides of attribute discovery for filtering IDs. It can be overridden on the host or linked template level.
This macro is used in overrides of attribute discovery for filtering IDs. It can be overridden on the host or linked template level.
This macro is used for trigger expression. It can be overridden on the host or linked template level.
This macro is used for trigger expression. It can be overridden on the host or linked template level.
Template links
There are no template links in this template.
Discovery rules
Discovery SMART disks.
ZABBIX_PASSIVE
smart.disk.discovery
Overrides:
Discovery SMART Vendor Specific Attributes of disks.
ZABBIX_PASSIVE
smart.attribute.discovery
Overrides:
Items collected
Group
Name
Description
Type
Key and additional info
Zabbix_raw_items
SMART: Get attributes
DEPENDENT
smart.disk.model[{#NAME}]
Preprocessing:
DEPENDENT
smart.disk.sn[{#NAME}]
Preprocessing:
Whether the disk has passed the SMART self-test.
DEPENDENT
smart.disk.test[{#NAME}]
Preprocessing:
Current drive temperature.
DEPENDENT
smart.disk.temperature[{#NAME}]
Preprocessing:
Count of hours in power-on state. The raw value of this attribute shows the total count of hours (or minutes, or seconds, depending on the manufacturer) in power-on state. "By default, the total expected lifetime of a hard disk in perfect condition is defined as 5 years (running every day and night on all days). This is equal to 1825 days in 24/7 mode or 43800 hours." On some pre-2005 drives, this raw value may advance erratically and/or "wrap around" (reset to zero periodically). https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
DEPENDENT
smart.disk.hours[{#NAME}]
Preprocessing:
Contains a vendor-specific estimate of the percentage of NVM subsystem life used, based on the actual usage and the manufacturer's prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state).
DEPENDENT
smart.disk.percentage_used[{#NAME}]
Preprocessing:
This field indicates critical warnings for the state of the controller.
DEPENDENT
smart.disk.critical_warning[{#NAME}]
Preprocessing:
Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field.
DEPENDENT
smart.disk.media_errors[{#NAME}]
Preprocessing:
DEPENDENT
smart.disk.error[{#NAME},{#ID}]
Preprocessing:
DEPENDENT
smart.disk.attr.raw[{#NAME},{#ID}]
Preprocessing:
Triggers
Name
Description
Expression
Severity
Dependencies and additional info
SMART [{#NAME}]: Disk has been replaced (new serial number received)
Device serial number has changed. Ack to close.
{TEMPLATE_NAME:smart.disk.test[{#NAME}].last()}="false"
HIGH
SMART [{#NAME}]: Average disk temperature is too high (over {$SMART.TEMPERATURE.MAX.WARN}°C for 5m)
Depends on:
- SMART [{#NAME}]: Average disk temperature is critical (over {$SMART.TEMPERATURE.MAX.CRIT}°C for 5m)
{TEMPLATE_NAME:smart.disk.temperature[{#NAME}].avg(5m)}>{$SMART.TEMPERATURE.MAX.WARN}
AVERAGE
SMART [{#NAME}]: NVMe disk percentage using is over 90% of estimated endurance
The value should be greater than THRESH.
Feedback
Please report any issues with the template at https://support.zabbix.com
You can also provide feedback, discuss the template, or ask for help with it on the Zabbix forums.