Purpose

This article provides information about the company's monitoring system and defines the guidelines, rules and standard framework for daily operations. All IT systems and devices should follow this document so that a common monitoring standard applies across the IT environment.

Scope

This article provides information about:
  • Opsview, the Nagios-based monitoring system currently implemented in several companies
  • The monitoring operating model of Opsview
  • The standard framework for daily monitoring
  • Alert/notification handling guidelines

This document will be updated regularly based on new requirements and the need to optimize the monitoring system.

Target audiences

This document is useful for members of the IT Operations team, including:
  • Helpdesk Team
  • System and Database Team
  • Network and Infrastructure Team

About Opsview (Nagios based monitoring system)

Nagios

Nagios® Core™ is an Open Source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better. Nagios Core was originally designed to run under Linux, although it should work under most other unices as well. Some of the many features of Nagios Core include:
  • Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
  • Monitoring of host resources (processor load, disk usage, etc.)
  • Simple plugin design that allows users to easily develop their own service checks
  • Parallelized service checks
  • Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
  • Contact notifications when service or host problems occur and get resolved (via email, pager, or user-defined method)
  • Ability to define event handlers to be run during service or host events for proactive problem resolution
  • Automatic log file rotation
  • Support for implementing redundant monitoring hosts
  • Optional web interface for viewing current network status, notification and problem history, log file, etc.
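
To illustrate the "simple plugin design" mentioned above, here is a minimal sketch of a disk usage check written in Python. It relies on the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with one status line on standard output and optional performance data after a `|`. The function names and the 80/90 percent thresholds are assumptions chosen for illustration; they are not values defined by this document.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_usage(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios state (thresholds are hypothetical)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    state = classify_usage(used_percent, warn, crit)
    # Text after the "|" is performance data: label=value;warn;crit
    line = (f"DISK {STATE_NAMES[state]} - {path} is {used_percent:.1f}% full"
            f" | used_pct={used_percent:.1f}%;{warn};{crit}")
    return state, line


def main(argv):
    """Print the status line and return the exit code for the given path."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code

# As a plugin entry point, the script would end with: sys.exit(main(sys.argv))
```

The monitoring engine only looks at the exit code and the printed line, so the same pattern can be reused for other volume-type checks by swapping the measurement inside `check_disk`.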

Opsview

Opsview is an enterprise-grade monitoring system for physical and virtual IT infrastructure. Opsview sponsors a free, open-source software version, Opsview Core. It sells Opsview Pro to SMBs and Opsview Enterprise to larger organisations under a proprietary license. Opsview is built with the following technologies:
  • Nagios Core: Provides the core set of monitoring and alerting capabilities in Opsview. Sometimes referred to as Opsview's monitoring engine.
  • Perl: The primary programming language used for Opsview
  • Catalyst: An MVC web application framework used for building the web application
  • ExtJS: A JavaScript library used for building the dashboard in Opsview Pro and Opsview Enterprise
  • MySQL: A relational database used for configuration, runtime and data warehouse databases
  • Net-SNMP: Provides SNMP support
  • RRDtool: Provides lightweight graphing

Opsview runs on Linux with official support for the following distributions: CentOS, Debian, Red Hat Enterprise Linux, SUSE and Ubuntu. It also runs on Solaris 10.

Currently, the Opsview system at PPF VF uses the open-source version 3.3.2. It is an old version, but it is the last one that keeps all the enterprise features we need. Later versions removed some important features and moved them into the Professional and Enterprise editions, which require a commercial license. For a better overview of Opsview in the company, please read OpsviewMonitoring.

Opsview Manual

Please read OpsviewUserManual 

Monitoring operating design

The monitoring servers are a group of two Opsview hosts running on Nagios Core version 3.3.2:
  • Master monitoring server, located in the LAN
  • Slave monitoring server, located in the DMZ

Monitoring objects

We use many service checks to monitor the status of systems, including hardware, software and running services. The monitored objects are listed below:

  • Server: CPU load, memory use (RAM, page file), RAID array status, drive utilisation on Windows or partition utilisation on Linux, server status
  • Switch/Router: interface status, CPU load, memory utilisation, device status
  • Storage system: controller status, write latency, read latency, I/O performance
  • UPS: temperature, battery lifetime, battery capacity, output voltage, frequency...
  • Printer: printer status
  • Security camera: camera status
  • Data center: humidity, rack temperature
  • Internet line: fiber, ADSL, leased line, VTN, tunnel...
  • Services: mail, web, proxy, DNS, Active Directory, database...

Production, UAT and Testing Environments 

Production Environments 
System priority table (by critical level)
Users of Monitoring system
Monitoring operators
Monitoring administrators 
Business users/owners
Third parties/Suppliers 
To monitor or not to monitor 
SLA for Production Environment  
SLA for Testing Environment 
SLA for UAT Environment 
Life cycle of a host, service and monitoring 
Working hours and non-working hours 
Escalation and contact points

Monitoring standard framework 

Threshold and notification settings
- For internal services
  + Disk space
  + ...
- For public services
- For hosts
  + Internal host
  + Public host

Email notification settings
SMS notification settings
Monitor a generic Windows machine
Monitor a Windows Cluster
Monitor HP Dataprotector Cell Manager Server
Monitor a Windows-based Java Application Server
Monitor an MS SQL Server
Monitor an IIS Web Application Server
Monitor a Windows File Server
Monitor a Windows Domain Controller
Monitor a Microsoft Exchange Server
Monitor a Microsoft ISA Server
Monitor a Windows-based Symantec SEPM Server
Monitor a Microsoft Windows Update Server
Monitor a Linux/Unix machine
Monitor DHCP/DNS Server
Monitor Internet DNS Server
Monitor Mail Gateway Server
Monitor Squid Proxy Server
Monitor Pound SSL Reverse Proxy Server
Monitor Varnish Reverse Proxy Server
Monitor Linux-based Java Application Server
Monitor Linux-based Web Server
Monitor Zimbra Server
Monitor Linux-based Oracle Database Server
Monitor a network printer
Monitor a router/switch
Monitor leased-line connectivity
Monitor an internal service (Active Directory, OWA, HTTP, SSH, etc.)
Monitor Active Directory Service
Monitor Exchange Publishing Service (OWA, ActiveSync, Outlook Anywhere, Exchange Web Service)
Monitor a public service (HTTP, SMTP, DNS, etc.)

Standard alert handling process

Get to understand the dependencies between hosts and services
Services notification handling
Steps/questions to examine before starting resolution
Steps/questions to examine after the resolution is deployed
Volume related notification (disk space, database table space, RAM, swap, bandwidth, etc.)
Performance related notification (CPU utilization, disk I/O, ping latency, service response time, etc.)
Up/down or ok/error related notification (interface up/down, line reachable, RAID status, AD replication, Windows service, Linux service, etc.)
Communication and health check only, with or without performance data
Line up/down handling
Public service of PPF up/down handling (HTTP, HTTPS) 
Public service of ISP (not PPF) up/down handling (external DNS resolving, ISP's outage) 
Internal service up/down handling (HTTP, HTTPS) 
Host up/down handling 
Zone up/down handling 
Data center up/down handling 
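
The host, zone and data center up/down cases above rest on the distinction Nagios draws, through "parent" hosts, between a host that is DOWN and one that is merely UNREACHABLE. The following is a rough sketch of that rule, not Nagios's actual code: a host whose check fails is treated as DOWN if it has no parents or at least one parent is UP, and as UNREACHABLE otherwise. The host names are hypothetical, and the parent relationships are assumed to contain no cycles.

```python
def classify_hosts(parents, responding):
    """Return {host: "UP" | "DOWN" | "UNREACHABLE"}.

    parents:    dict mapping each host to a list of its parent hosts
                (empty list for hosts directly attached to the monitoring server)
    responding: set of hosts whose host check succeeded
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # No parents, or at least one parent UP: the host itself is the problem.
            if not host_parents or any(state_of(p) == "UP" for p in host_parents):
                states[host] = "DOWN"
            else:
                # Every path to the host is blocked by a failed parent.
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states


# Hypothetical topology: monitoring server -> core-switch -> dmz-fw -> web01
topology = {"core-switch": [], "dmz-fw": ["core-switch"], "web01": ["dmz-fw"]}
result = classify_hosts(topology, responding={"core-switch"})
```

In this example `dmz-fw` is classified as DOWN and `web01` as UNREACHABLE, so an operator would start with `dmz-fw` rather than chasing the hosts behind it.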
