My Profile: Monitoring all services in IT infrastructure

Purpose

This article provides information about the monitoring system in a company and defines guidelines, rules and a standard framework for daily operations. All IT systems and devices should follow this document so that a common monitoring standard is applied across the IT environment.

Scope

This article will provide information about:

- Opsview, the Nagios-based monitoring system currently implemented in several companies
- The monitoring operating model of Opsview
- The standard framework for daily monitoring
- Alert/notification handling guidelines

This document will be updated regularly based on new requirements and on the need to optimize the monitoring system.

Target audiences

This document is useful for members of the IT Operations team, including:

Helpdesk Team
System and Database Team
Network and Infrastructure Team

About Opsview (Nagios based monitoring system)

Nagios

Nagios® Core™ is an Open Source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better. Nagios Core was originally designed to run under Linux, although it should work under most other unices as well. Some of the many features of Nagios Core include:

Monitoring of network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
Monitoring of host resources (processor load, disk usage, etc.)
Simple plugin design that allows users to easily develop their own service checks
Parallelized service checks
Ability to define network host hierarchy using "parent" hosts, allowing detection of and distinction between hosts that are down and those that are unreachable
Contact notifications when service or host problems occur and get resolved (via email, pager, or user-defined method)
Ability to define event handlers to be run during service or host events for proactive problem resolution
Automatic log file rotation
Support for implementing redundant monitoring hosts
Optional web interface for viewing current network status, notification and problem history, log file, etc.
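The plugin design mentioned in the list above can be illustrated with a small sketch. A Nagios plugin is simply a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN); text after the "|" character is treated as performance data. The example below is only an illustration, not a check taken from this environment: the function names, the checked path and the 80%/90% thresholds are all assumptions.

```python
"""Minimal Nagios-style plugin sketch: disk usage of one path (illustrative only)."""
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin state (thresholds are assumed values)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the file system that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything that prevents the measurement is reported as UNKNOWN.
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    state = classify(used_pct, warn, crit)
    # Performance data format: label=value[unit];warn;crit
    line = (f"DISK {STATE_NAMES[state]} - {path} is {used_pct:.1f}% used"
            f" | used_pct={used_pct:.1f}%;{warn:g};{crit:g}")
    return state, line


def main(argv):
    """Print the status line and return the exit code for the given arguments."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code
```

To use this as a real plugin, the script would also import sys and end with sys.exit(main(sys.argv)), because the monitoring engine reads the state from the process exit code and not from the printed text.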

Opsview

Opsview is an enterprise-grade monitoring system for physical and virtual IT infrastructure. Opsview sponsors a free, open-source software version, Opsview Core. It sells Opsview Pro to SMBs and Opsview Enterprise to larger organisations under a proprietary license. Opsview is built with the following technologies:

Nagios Core: Provides the core set of monitoring and alerting capabilities in Opsview. Sometimes referred to as Opsview's monitoring engine.
Perl: The primary programming language used for Opsview
Catalyst: An MVC web application framework used for building the web application
ExtJS: A JavaScript library used for building the dashboard in Opsview Pro and Opsview Enterprise
MySQL: A relational database used for configuration, runtime and data warehouse databases
Net-SNMP: Provides SNMP support
RRDtool: Provides lightweight graphing

Opsview runs on Linux with official support for the following distributions: CentOS, Debian, Red Hat Enterprise Linux, SUSE and Ubuntu. It also runs on Solaris 10.

Currently, the Opsview system at PPF VF uses the open-source version 3.3.2. It is an old version, but it is the last one that keeps all the enterprise features we need. In later versions, Opsview removed some important features and moved them into the Professional and Enterprise editions, which require a commercial license. For a better overview of Opsview in the company, please read OpsviewMonitoring.

Opsview Manual

Please read OpsviewUserManual.

Monitoring operating design

The monitoring servers consist of a group of two Opsview hosts running on Nagios Core version 3.3.2:

Master monitoring server - located in LAN
Slave monitoring server - located in DMZ

Monitoring objects

We use many service checks to check the status of the system, including the hardware, software and services that are running. The monitored objects are listed below:

Server: CPU load, memory use (RAM, page file), RAID array status, drive utilisation on Windows or partition utilisation on Linux, server status
Switch/Router: interface status, CPU load, memory utilisation, device status
Storage system: controller status, write latency, read latency, I/O performance
UPS: temperature, battery lifetime, battery capacity, output voltage, frequency...
Printer: printer status
Security camera: camera status
Data center: humidity, rack temperature
Internet line: fiber, ADSL, leased line, VTN, tunnel...
Services: mail, web, proxy, DNS, Active Directory, database...

Production, UAT and Testing Environments

Production Environments
System priority table (by critical level)
Users of Monitoring system

Monitoring operators

Monitoring administrators

Business users/owners

Third parties/Suppliers

To monitor or not to monitor

SLA for Production Environment

SLA for Testing Environment

SLA for UAT Environment

Life cycle of a host, service and monitoring

Working hours and non-working hours

Escalation and contact points

Monitoring standard framework

Threshold and notification settings

- For internal services

+ Disk space

+ ...

- For public services

- For hosts

+ Internal host

+ Public host

Email notification settings
SMS notification settings
Monitor a generic Windows machine
Monitor a Windows Cluster
Monitor HP Dataprotector Cell Manager Server
Monitor a Windows-based Java Application Server
Monitor an MS SQL Server
Monitor an IIS Web Application Server
Monitor a Windows File Server
Monitor a Windows Domain Controller
Monitor a Microsoft Exchange Server
Monitor a Microsoft ISA Server
Monitor a Windows-based Symantec SEPM Server
Monitor a Microsoft Windows Update Server
Monitor a Linux/Unix machine
Monitor DHCP/DNS Server
Monitor Internet DNS Server
Monitor Mail Gateway Server
Monitor Squid Proxy Server
Monitor Pound SSL Reverse Proxy Server
Monitor Varnish Reverse Proxy Server
Monitor Linux-based Java Application Server
Monitor Linux-based Web Server
Monitor Zimbra Server
Monitor Linux-based Oracle Database Server
Monitor a network printer
Monitor a router/switch
Monitor a leased-line connectivity
Monitor an internal service (Active Directory, OWA, HTTP, SSH, etc.)
Monitor Active Directory Service
Monitor Exchange Publishing Service (OWA, ActiveSync, Outlook Anywhere, Exchange Web Service)
Monitor a public service (HTTP, SMTP, DNS, etc)

Standard alert handling process

Get to understand the dependencies between hosts and services
Services notification handling
Steps/questions to be examined before starting resolution
Steps/questions to be examined after the resolution is deployed
Volume related notification (disk space, database table space, RAM, swap, bandwidth, etc.)
Performance related notification (CPU utilization, disk I/O, ping latency, service response time, etc.)

Up/down or ok/error related notification (interface up/down, line reachable, RAID status, AD replication, Windows service, Linux service, etc.).

Communication and health check only - with or without performance data

Line up/down handling

Public service of PPF up/down handling (HTTP, HTTPS)

Public service of ISP (not PPF) up/down handling (external DNS resolving, ISP's outage)

Internal service up/down handling (HTTP, HTTPS)

Host up/down handling

Zone up/down handling

Data center up/down handling
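Host, zone and data center up/down handling all depend on the distinction Nagios makes between a host that is DOWN and one that is merely UNREACHABLE because everything between it and the monitoring server has failed. The sketch below is a simplified model of that idea, not the actual Nagios implementation: a host that does not respond is treated as DOWN if it has no parents or at least one parent is UP, and as UNREACHABLE otherwise. The host names and the topology are invented for illustration, and the parent relationships are assumed to contain no cycles.

```python
"""Simplified model of DOWN vs UNREACHABLE using parent hosts (illustrative only)."""


def host_states(parents, responding):
    """Classify every host as UP, DOWN or UNREACHABLE.

    parents:    dict mapping each host to the list of its parent hosts
    responding: set of hosts that currently answer their host check
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # No parents means the host sits next to the monitoring server,
            # so a failed check can only mean the host itself is down.
            if not host_parents or any(state_of(p) == "UP" for p in host_parents):
                states[host] = "DOWN"
            else:
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states


# Hypothetical topology: two servers behind a firewall behind a core switch.
topology = {
    "core-switch": [],
    "dmz-firewall": ["core-switch"],
    "web-01": ["dmz-firewall"],
    "mail-01": ["dmz-firewall"],
}

# Only the core switch answers: the firewall is the real problem (DOWN),
# and the servers behind it are reported as UNREACHABLE.
for name, state in host_states(topology, {"core-switch"}).items():
    print(f"{name}: {state}")
```

In practice this is why it helps to handle the DOWN host first: in the example, one failed firewall explains both unreachable servers, so operators need to act on a single alert and not on three.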

About me

This is a personal blog about connectivity, written for learning, fun, sharing and reference. It covers IT network infrastructure and its related components, such as new software and hardware from vendors like Cisco Systems, Microsoft, IBM, HP, CheckPoint and Juniper. Some posts also contain configuration examples from different network components that I have found useful. I created this blog to share my knowledge with other people, and hopefully others will share their knowledge with me. Most posts describe my own experiences at work.

Who am I? My name is Huynh Phi Long, and I currently work as an IT network administrator at PPF - Homecredit.

You can contact me by email: longhp@live.com
