CloudWatch is a monitoring service. For EC2 instances, it can monitor 2 types of status checks:
- System Status Checks
- Instance Status Checks
System Status Checks
These checks report whether the underlying AWS hardware or software has developed a fault. If any of these checks fails, the problem is something AWS is required to repair. Here are some examples of faults that cause this type of check to fail:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host
When one of these system checks fails, it can be fixed in one of 2 ways:
- Wait for AWS to fix the issue
- Stop and start the instance, or terminate and replace it. Behind the scenes, this has the effect of automatically moving your instance to working hardware.
Instance Status Checks
These checks report on the software and network configuration of your individual instance. They detect problems that you have to fix yourself, e.g. a faulty config file. Here are some examples of faults that cause this type of check to fail:
- Failed system status checks, i.e. instance status checks will also indicate when the underlying hardware/software is at fault.
- Incorrect networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel
Faults identified by this type of check require you to investigate and fix them.
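The split between the two check types can be sketched as a small triage function. This is only an illustration: the dict shape is assumed to mirror one entry of the "InstanceStatuses" list returned by EC2's DescribeInstanceStatus call, the instance ID is made up, and the advice strings are this sketch's own wording.

```python
# Sketch: decide who needs to act on a failed EC2 status check.
# Assumption: `status` looks like one "InstanceStatuses" entry from
# DescribeInstanceStatus, with "SystemStatus" and "InstanceStatus" keys.

def triage(status: dict) -> str:
    system = status.get("SystemStatus", {}).get("Status", "unknown")
    instance = status.get("InstanceStatus", {}).get("Status", "unknown")

    if system == "impaired":
        # Fault in the underlying host: AWS repairs it, or you move hosts.
        return "system check failed: wait for AWS, or stop and start the instance"
    if instance == "impaired":
        # Fault inside the instance: the owner has to investigate.
        return "instance check failed: investigate the instance configuration"
    if system == "ok" and instance == "ok":
        return "healthy"
    return "status not yet known"


example = {
    "InstanceId": "i-0123456789abcdef0",  # hypothetical instance ID
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "impaired"},
}
print(triage(example))
```

The system check is tested first because a host-level fault usually explains an instance-level failure as well.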
CloudWatch Alarms is a service that sends out a Simple Notification Service (SNS) message when a status check changes state. An alarm can have 3 states:
- OK - The metric is within the defined threshold
- ALARM - The metric is outside of the defined threshold
- INSUFFICIENT_DATA - A temporary state that a new alarm stays in for the first few minutes after it is created (it also applies whenever there is not enough data to evaluate the metric)
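As a rough illustration of the three states, here is a deliberately simplified model. Real CloudWatch alarms also take evaluation periods and missing-data settings into account, which this sketch ignores. The function name and the "every datapoint must breach" rule are assumptions made for the example.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float) -> str:
    """Simplified model of alarm evaluation (not the real CloudWatch algorithm)."""
    if not datapoints:
        # Nothing to evaluate yet, e.g. the alarm was only just created.
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in datapoints):
        # Every recent datapoint is outside the threshold.
        return "ALARM"
    return "OK"


print(alarm_state([], 80.0))            # INSUFFICIENT_DATA
print(alarm_state([91.0, 95.5], 80.0))  # ALARM
print(alarm_state([42.0, 95.5], 80.0))  # OK
```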
CloudWatch collects a few key metrics by default, and alarms can be set on any of them. These are the metrics that can be observed at the physical host level, e.g.:
- CPU utilisation
- network traffic
- disk usage (read and write activity)
- CPU credit usage/balance, i.e. how many CPU credits a burstable instance is spending and accruing
CloudWatch data for an instance remains available after the instance has been terminated, for as long as CloudWatch retains it.
You can also set alarms that fire if your instance reaches a certain cost to run, e.g. £50/day. This is a useful reminder if you forget to shut down an instance after you have finished working with it. Another alarm could be set if data transfer costs become too high, e.g. £20/day.
Alarms can do a lot more than just send out SNS notifications. They can also trigger actions to try to fix the issue, e.g. an alarm can trigger the creation of more instances to handle unexpectedly high load. Hence CloudWatch Alarms play a big part in enabling elasticity/auto-scaling.
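To make the idea of an alarm with an attached action more concrete, the sketch below builds the parameters for a high-CPU alarm without contacting AWS. The keys follow the PutMetricAlarm API as I understand it. The helper name, the 80% threshold, the instance ID and the action ARN are all made up for illustration.

```python
# Sketch: parameters for a CloudWatch alarm that fires on sustained high CPU.
# Nothing here talks to AWS; the function only assembles a plain dict.

def high_cpu_alarm_params(instance_id: str, action_arn: str,
                          threshold: float = 80.0) -> dict:
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                 # seconds; matches basic monitoring
        "EvaluationPeriods": 2,        # two consecutive periods must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Action to run when the alarm enters ALARM: an SNS topic or a
        # scaling policy ARN. The value passed below is a placeholder.
        "AlarmActions": [action_arn],
    }


params = high_cpu_alarm_params(
    "i-0123456789abcdef0",                             # hypothetical instance
    "arn:aws:sns:eu-west-1:123456789012:ops-alerts",   # hypothetical topic
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, a dict like this could presumably be passed to the CloudWatch client as `put_metric_alarm(**params)`. That call is left out here so the example stays runnable offline.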
Basic and Detailed Monitoring
CloudWatch generates near-real-time metrics by processing raw data from Amazon EC2, and stores them so you can analyse the history. Current versions of the service retain metric data for up to 15 months at progressively coarser granularity (it was originally two weeks). By default, Amazon EC2 metric data is automatically sent to CloudWatch in 5-minute periods (aka basic monitoring). However, you can enable more frequent monitoring on an Amazon EC2 instance, which sends data to CloudWatch in 1-minute periods (aka detailed monitoring). Price-wise, basic monitoring is free, but there is a charge for detailed monitoring.
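The practical difference between the two modes is simply how many datapoints you get per metric. A quick piece of arithmetic (the constant and function names are just for this example):

```python
BASIC_PERIOD_SECONDS = 300     # basic monitoring: one datapoint every 5 minutes
DETAILED_PERIOD_SECONDS = 60   # detailed monitoring: one datapoint every minute


def datapoints_per_day(period_seconds: int) -> int:
    """Number of datapoints a single metric produces in 24 hours."""
    return 24 * 60 * 60 // period_seconds


print(datapoints_per_day(BASIC_PERIOD_SECONDS))     # 288
print(datapoints_per_day(DETAILED_PERIOD_SECONDS))  # 1440
```

So detailed monitoring gives five times the resolution, which matters most when alarms need to react quickly to short spikes.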
Monitoring using scripts
Some EC2 instance metrics can only be collected by running monitoring scripts (provided by AWS) inside your EC2 instances, because they are not visible from the physical host. For example:
- Memory utilization ("free -m" command)
- Swap disk utilization (disk space acting as memory)
- Disk space utilization ("df -h" command)
All of these are displayed as time-based line graphs on the AWS console.
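The AWS-provided scripts are not reproduced here. As a stand-in, the following sketch shows how memory and disk space utilization could be computed from inside a Linux instance using only the Python standard library. It assumes a `/proc/meminfo` that includes a `MemAvailable` line (present on reasonably recent kernels), and the function names are illustrative.

```python
import shutil


def memory_utilization_percent(meminfo_text: str) -> float:
    """Percentage of memory in use, from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def disk_utilization_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# A tiny made-up sample so the parsing can be checked without a real host.
sample = "MemTotal:       1000 kB\nMemAvailable:    250 kB\n"
print(memory_utilization_percent(sample))  # 75.0
```

On a real instance you would read the file with `open("/proc/meminfo").read()` and call `disk_utilization_percent("/")`. Publishing the values to CloudWatch as custom metrics is a separate step not shown here.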