Purpose
This article provides information about the company's monitoring system and defines the guidelines, rules and standard framework for daily operations. All IT systems and devices should follow this document so that a common monitoring standard applies across the IT environment.
Scope
This article will provide information about:
- Opsview, the Nagios-based monitoring system currently implemented in several companies
- The monitoring operating model of Opsview
- Standard framework for daily monitoring
- Alert/Notification handling guidelines
This document will be updated regularly based on new requirements and
on the need to optimize the monitoring system.
Target audiences
This document is useful for members of the IT Operations Team, including:
- Helpdesk Team
- System and Database Team
- Network and Infrastructure Team
About Opsview (Nagios based monitoring system)
Nagios
Nagios® Core™ is an Open Source system and network monitoring
application. It watches hosts and services that you specify, alerting you when
things go bad and when they get better.
Nagios Core was originally designed to run under Linux, although it should work
under most other unices as well.
Some of the many features of Nagios Core include:
- Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
- Monitoring of host resources (processor load, disk usage, etc.)
- Simple plugin design that allows users to easily develop their own service checks
- Parallelized service checks
- Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
- Contact notifications when service or host problems occur and get resolved (via email, pager, or user-defined method)
- Ability to define event handlers to be run during service or host events for proactive problem resolution
- Automatic log file rotation
- Support for implementing redundant monitoring hosts
- Optional web interface for viewing current network status, notification and problem history, log file, etc.
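The "parent" host feature in the list above is what lets the monitoring engine tell a host that is down from one that is merely unreachable. The following is a minimal sketch of that idea, not Nagios's actual implementation: it assumes each host has at most one parent (Nagios Core allows several) and that the parent map forms a tree. The function name `classify_hosts` and the host names in the usage note are hypothetical.

```python
from typing import Dict, Optional, Set


def classify_hosts(parents: Dict[str, Optional[str]],
                   responding: Set[str]) -> Dict[str, str]:
    """Label every host UP, DOWN or UNREACHABLE.

    parents    maps each host to its parent host, or None when the host
               is attached directly to the monitoring server.
    responding is the set of hosts that answered their host check.
    """
    states: Dict[str, str] = {}

    def state_of(host: str) -> str:
        if host in states:
            return states[host]
        if host in responding:
            states[host] = "UP"
        else:
            parent = parents.get(host)
            # The host failed its check. If the path to it is healthy
            # (no parent, or the parent is UP), the host itself is DOWN.
            # Otherwise the failure lies upstream and the host is only
            # UNREACHABLE.
            if parent is None or state_of(parent) == "UP":
                states[host] = "DOWN"
            else:
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states
```

For example, with the hypothetical chain `core-switch` -> `dmz-router` -> `web01`, if only `core-switch` responds, `dmz-router` is classified as DOWN and `web01` as UNREACHABLE, so operators can focus on the router rather than on every host behind it.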
Opsview
Opsview is an enterprise-grade monitoring system for physical and virtual IT
infrastructure. Opsview sponsors a free, open-source
software version, Opsview Core. It sells Opsview Pro to SMBs and Opsview
Enterprise to larger organisations under a proprietary
license.
Opsview is built with the following technologies:
- Nagios Core: Provides the core set of monitoring and alerting capabilities in Opsview. Sometimes referred to as Opsview's monitoring engine.
- Perl: The primary programming language used for Opsview
- Catalyst: An MVC web application framework used for building the web application
- ExtJS: A JavaScript library used for building the dashboard in Opsview Pro and Opsview Enterprise
- MySQL: A relational database used for configuration, runtime and data warehouse databases
- Net-SNMP: Provides SNMP support
- RRDtool: Provides lightweight graphing
Opsview runs on Linux with official
support for the following distributions: CentOS, Debian, Red Hat Enterprise
Linux, SUSE and Ubuntu. It
also runs on Solaris 10.
Currently, the Opsview system in PPF VF uses the open-source version 3.3.2.
It is an old version, but it is the last one that keeps all the enterprise features that
we need. In later versions, Opsview removed some important features and moved
them into the Professional and Enterprise editions, which require a commercial license.
For a better overview of Opsview in the company, please read OpsviewMonitoring.
Opsview Manual
Please read OpsviewUserManual.
Monitoring operating design
The monitoring servers consist of a group of 2 Opsview hosts running on Nagios Core version 3.3.2:
- Master monitoring server - located in LAN
- Slave monitoring server - located in DMZ
Monitoring objects
We use many service checks to check the status of systems, including hardware, software and running services. The monitoring objects are listed below:
Object | Monitored items |
Server | - CPU load - Memory use (RAM, page file) - RAID array status - Drive utilisation on Windows or partition utilisation on Linux - Status of server |
Switch/Router | - Status of interfaces - CPU load - Memory utilisation - Status of devices |
Storage System | - Status of controller - Write latency - Read latency - I/O performance |
UPS | - Temperature - Battery lifetime - Battery capacity - Output voltage, frequency... |
Printer | - Status of printer |
Security Camera | - Status of camera |
Data Center | - Humidity - Temperature of racks |
Internet Line | - Fiber, ADSL, Leased line, VTN, Tunnel... |
Services | - Mail, Web, Proxy, DNS, Active Directory, Database... |
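Each monitored item above is measured by a service check, which in a Nagios-based system is a small plugin that prints one status line and reports its result through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The following is a minimal sketch of such a check for drive or partition utilisation. It is an illustration only, not one of the checks deployed in our Opsview: the function names and the 80%/90% thresholds are assumptions.

```python
import shutil
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a utilisation percentage to a plugin state (thresholds are assumed)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_pct = 100.0 * usage.used / usage.total
    code = evaluate(used_pct, warn, crit)
    # Text after the "|" is performance data: label=value;warn;crit
    line = (f"DISK {LABELS[code]} - {path} is {used_pct:.1f}% used"
            f" | used_pct={used_pct:.1f}%;{warn};{crit}")
    return code, line


def main(argv):
    """Print the status line and return the exit code for the given path."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(path)
    print(line)
    return code
```

When used as a real plugin, the script would end with `sys.exit(main(sys.argv))` so that the monitoring engine receives the state through the process exit code. Keeping the threshold logic in `evaluate` makes it easy to test separately from the measurement.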
Production, UAT and Testing Environments
Production Environments
System priority table (by critical level)
Users of Monitoring system
Monitoring operators
Monitoring administrators
Business users/owners
Third parties/Suppliers
To monitor or not to monitor
SLA for Production Environment
SLA for Testing Environment
SLA for UAT Environment
Life cycle of a host, service and monitoring
Working hours and non-working hours
Threshold and notification settings
- For internal services
+ Disk space
+ ...
- For public services
- For hosts
+ Internal host
+ Public host
Email notification settings
SMS notification settings
Monitor a generic Windows machine
Monitor a Windows Cluster
Monitor HP Dataprotector Cell Manager Server
Monitor a Windows-based Java Application Server
Monitor an MS SQL Server
Monitor an IIS Web Application Server
Monitor a Windows File Server
Monitor a Windows Domain Controller
Monitor a Microsoft Exchange Server
Monitor a Microsoft ISA Server
Monitor a Windows-based Symantec SEPM Server
Monitor a Microsoft Windows Update Server
Monitor a Linux/Unix machine
Monitor DHCP/DNS Server
Monitor Internet DNS Server
Monitor Mail Gateway Server
Monitor Squid Proxy Server
Monitor Pound SSL Reverse Proxy Server
Monitor Varnish Reverse Proxy Server
Monitor Linux-based Java Application Server
Monitor Linux-based Web Server
Monitor Zimbra Server
Monitor Linux-based Oracle Database Server
Monitor a network printer
Monitor a router/switch
Monitor a leased-line connectivity
Monitor an internal service (Active Directory, OWA, HTTP, SSH, etc.)
Monitor Active Directory Service
Monitor Exchange Publishing Service (OWA, ActiveSync, Outlook Anywhere, Exchange Web Service)
Monitor a public service (HTTP, SMTP, DNS, etc)
Standard alert handling process
Get to understand the dependencies between hosts and services
Services notification handling
Steps/questions to be examined before starting resolution
Steps/questions to be examined after the resolution is deployed
Volume-related notification (disk space, database table space, RAM, swap, bandwidth, etc.)
Performance-related notification (CPU utilization, disk I/O, ping latency, service response time, etc.)
Up/down or OK/error-related notification (interface up/down, line
reachable, RAID status, AD replication, Windows service, Linux service, etc.)
Communication and health check only - with or without performance data
Line up/down handling
Public service of PPF up/down handling (HTTP, HTTPS)
Public service of ISP (not PPF) up/down handling (external DNS resolving,
ISP's outage)
Internal service up/down handling (HTTP, HTTPS)
Host up/down handling
Zone up/down handling