Overview


This article describes how to use CloudMonix to monitor Azure Cloud Services, i.e. Web and Worker Roles.


The article covers the following topics:

  • common use cases where CloudMonix can help with monitoring and automation

  • what monitoring agents are and how to select and configure one

  • what happens during the monitoring cycle

  • what is needed to connect to and monitor an Azure Web or Worker role

  • what metrics CloudMonix tracks, visualizes and monitors

  • what automated actions can be executed by CloudMonix

Azure Cloud Services are monitored using the Classic API. Refer to the article about the differences between the ARM and Classic APIs for more details.



Why use CloudMonix for Azure Cloud Services?


Popular uses of CloudMonix include:

  • Track and alert on any performance counter, per instance or in aggregate across a role

  • Track and alert on specific event logs

  • Automatically reboot failing instances

  • Automatically adjust the number of instances based on the actual demand, queue depths or according to a schedule (available during the Trial Period or in the Ultimate plan only)


Monitoring using Azure Diagnostics Extension vs. CloudMonix Agent


By default, CloudMonix uses the Azure Diagnostics Extension to monitor Azure Cloud Services. This is the recommended configuration. CloudMonix modifies the Azure Diagnostics Extension configuration in order to set up monitoring automatically. Users can prevent CloudMonix from modifying the configuration and manage it manually; in that case, they are responsible for ensuring that monitoring is set up correctly.


If the CloudMonix Agent is preferred, it’s important to understand the implications. Refer to the articles “What are the differences between Azure Diagnostics Extension and CloudMonix agent?” and “How do I add CloudMonix agent to Azure Cloud Services (Web/Worker roles)?” to learn more.


Monitoring Cycle


During each monitoring cycle, CloudMonix attempts to retrieve data from the Diagnostics Extension storage and the Azure Management API. Azure Diagnostics sends its monitored data to a Diagnostics Azure Storage account, and CloudMonix retrieves its diagnostic data from that storage account. No direct connections are made to the monitored Cloud Role instances.


After each monitoring cycle completes, CloudMonix evaluates and, where applicable, executes any automated actions and auto-scaling rules.


Configuration


Azure Cloud Services monitoring can be configured either via the Setup Wizard or by using the “Add New” button in the dashboard. Using the Setup Wizard is highly recommended when configuring permissions for the first time. The Setup Wizard simplifies certificate authorization and automatically instruments services in CloudMonix. Learn more about authorizing with Setup Wizard here.



CloudMonix provides defaults for both Web and Worker Roles; further customization is available after the Setup Wizard completes.


Metrics


Every diagnostic data point that CloudMonix retrieves from the monitored resource is considered a metric in CloudMonix. Refer to the Metrics article to learn more about metrics in general.


CloudMonix provides default templates with popular and useful metrics, alerts and actions recommended for typical configurations. Templates can be applied from the Configuration Template dropdown in the Resource definition dialog.


For Azure Cloud Roles there are two default templates:

  • Sample configuration for Azure Web Role

  • Sample configuration for Azure Worker Role


The primary difference between the two templates is that the Web Role template also tracks and alerts on IIS-specific performance counters and metrics.



The metrics can be added, removed and customized in the Metrics tab of the Azure Cloud Role resource configuration dialog.


Built-in Azure Cloud Roles Metrics


ResourceStatus

Identifies the last state of the monitored Azure Cloud Service instance. This metric tracks the state of each instance individually. This is a critical metric that is captured for most types of resources that CloudMonix tracks. It is used for Uptime tracking and should not be removed.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: string

  • Possible values: Ready, Down, Stopped, Unknown

  • Included in sample profile: yes, in all profiles tracked as a metric called Status

  • Included in default alerts: yes, in all profiles:

    • Resource Outage: Status == "Down": Raises an alert when the monitored instance is reported as not-Ready by Azure, or when no metrics come through from the diagnostic agents, for a sustained period of time.


Statuses are determined according to the following rules:

  • Ready - Azure Management API reports the monitored instance as Ready and diagnostic data is found for this instance.

  • Unknown - Azure Management API reports the instance as Ready but no diagnostic data is found, indicating a possible outage with the instance.

  • Stopped - Azure Management API reports the instance as Stopped.

  • Down - Azure Management API reports the instance as non-Ready and no diagnostic data is found.
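
The rules above can be read as a small decision function. The following is a minimal sketch, not CloudMonix's implementation: the function name and the simplified `api_state` input are hypothetical, and the rules do not say what happens when an instance is non-Ready but diagnostic data is still found, so the sketch assumes "Down" for that case.

```python
def resource_status(api_state, has_diagnostic_data):
    """Map an API-reported state and diagnostic-data presence to a status.

    api_state is a simplified, hypothetical input: "Ready", "Stopped",
    or any other string (treated as non-Ready).
    """
    if api_state == "Stopped":
        return "Stopped"
    if api_state == "Ready":
        # Ready with data is healthy; Ready without data hints at an outage.
        return "Ready" if has_diagnostic_data else "Unknown"
    # Non-Ready. The documented rule covers only the no-data case; returning
    # "Down" when data is still present is an assumption of this sketch.
    return "Down"
```

For example, `resource_status("Ready", False)` yields "Unknown", which matches the second rule.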



WindowsPerformanceCounter

Windows Performance Counter is one of the most popular metric types. Windows and the applications running on it publish a large number of performance counters that expose various aspects of performance, health, uptime, etc. To learn more about the most popular counters, refer to the Monitor Windows Server with Performance Counters article. The Performance Counter class documentation explains how to define and consume custom counters, should there be a need for CloudMonix to track user-generated diagnostic data.


CloudMonix can track any published performance counter. Each performance counter that CloudMonix should track must be defined as an individual metric in the Resource Configuration dialog.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: double

  • Included in sample profile: yes

Metrics included in all sample profiles:

  • CPUTime: Processor(_Total)\% Processor Time

  • CpuTime30MinAverage: aggregated metric; average value of CPUTime over a 30-minute aggregation period

  • DiskFreeSpaceTotal: LogicalDisk(_Total)\Free Megabytes

  • DiskIdleTime: PhysicalDisk(_Total)\% Idle Time

  • DiskReadSpeed: PhysicalDisk(_Total)\Avg. Disk sec/Read

  • DiskWriteSpeed: PhysicalDisk(_Total)\Avg. Disk sec/Write

  • MemoryCommittedPct: Memory\% Committed Bytes In Use

  • MemoryFree: Memory\Available MBytes
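
To illustrate what an aggregated metric such as CpuTime30MinAverage represents, here is a minimal sketch of a windowed average. It assumes samples arrive as (timestamp, value) pairs; the function name and the sample layout are hypothetical and are not how CloudMonix stores data.

```python
from datetime import datetime, timedelta


def window_average(samples, now, window=timedelta(minutes=30)):
    """Average the values of (timestamp, value) samples within `window` of `now`.

    Returns None when no sample falls inside the window.
    """
    recent = [value for ts, value in samples if now - window <= ts <= now]
    return sum(recent) / len(recent) if recent else None
```

Samples older than the window are ignored, so a single CPU spike from an hour ago does not affect the current average.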

Metrics included in the Sample configuration for Azure Web Role template:

  • AspNetApplicationRestarts: ASP.NET\Application Restarts

  • AspNetBytesOut: ASP.NET Applications(__Total__)\Request Bytes Out Total

  • AspNetErrors: ASP.NET Applications(__Total__)\Errors Total/Sec

  • AspNetRequests: ASP.NET Applications(__Total__)\Requests/Sec

  • AspNetRequestsQueued: ASP.NET\Requests Queued

  • AspNetRequestsRejected: ASP.NET\Requests Rejected

  • AspNetRequestWaitTime: ASP.NET\Request Wait Time

  • Included in default alerts: yes

Alerts included in all sample profiles:

  • High CPU (Warning): CpuTime30MinAverage > 70: Raises an alert when average CPU utilization for the last 30 minutes across all instances is over 70%.

  • Low Memory (Warning): MemoryFree < 100: Raises an alert if the amount of available physical memory falls below 100 MB for a sustained amount of time.

  • Slow Disk (Warning): DiskReadSpeed > 0.025 || DiskWriteSpeed > 0.025 || DiskIdleTime < 20: Raises an alert if the average disk read or write time exceeds 25 milliseconds, or if the disk is idle less than 20% of the time, sustained for 5 minutes. For mission-critical servers, disk read and write times should not exceed 10 milliseconds.

Alerts included in the Sample configuration for Azure Web Role template:

  • Requests are Queueing Up (Warning): AspNetRequestsQueued > 10: Raises an alert when the number of queued requests exceeds 10, sustained for 5 minutes. Queued requests indicate that IIS or backend processes are not able to process requests quickly enough.
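
The default alert conditions listed above can be restated as plain predicates over a snapshot of metric values. This is only an illustrative sketch: the function name and the dictionary layout are assumptions, and the sustained-period handling that CloudMonix applies before raising an alert is deliberately left out.

```python
def matching_default_alerts(m):
    """Return the names of default alerts whose condition holds for snapshot `m`.

    `m` is a hypothetical dict of metric name -> latest value.
    """
    fired = []
    if m["CpuTime30MinAverage"] > 70:
        fired.append("High CPU")
    if m["MemoryFree"] < 100:
        fired.append("Low Memory")
    # Disk read/write times are in seconds, so 0.025 means 25 milliseconds.
    if m["DiskReadSpeed"] > 0.025 or m["DiskWriteSpeed"] > 0.025 or m["DiskIdleTime"] < 20:
        fired.append("Slow Disk")
    # Only the Web Role template tracks this metric; default to 0 when absent.
    if m.get("AspNetRequestsQueued", 0) > 10:
        fired.append("Requests are Queueing Up")
    return fired
```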


WindowsPerformanceCounterMultiInstance



Performance counter categories can be either single-instance or multi-instance. A single-instance category has only one machine-wide value for each counter (e.g. the System category in Windows). A multi-instance category can have an unlimited number of values for each counter (e.g. the Process category, which has a counter for each process, or DiskFreeSpace, which has a counter for each disk).


The WindowsPerformanceCounterMultiInstance metric is similar to WindowsPerformanceCounter; however, it tracks multi-instance counters, i.e. it tracks values for all instances of a counter. It returns an array of PerformanceCounterInstance objects, one for each counter instance.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: array of objects with the following properties:

    • Instance (string): instance name

    • Value (double): counter value for the given instance

  • Can be accessed only through aggregation using Expressions, as described in the “Evaluating data in sets\arrays (advanced)” section of the “Working with Expressions” article.

  • Included in sample profile: no

  • Included in default alerts: no
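
Because the metric is an array rather than a single number, it has to be reduced before it can be compared against a threshold. The sketch below shows that idea in Python, assuming objects shaped like the Instance/Value pairs described above; the class and helper names are hypothetical, and in CloudMonix itself this reduction is written with Expressions.

```python
from dataclasses import dataclass


@dataclass
class PerformanceCounterInstance:
    instance: str  # counter instance name, e.g. a disk or process name
    value: float   # counter value for that instance


def instances_below(readings, threshold):
    """Return the names of counter instances whose value is below `threshold`."""
    return [r.instance for r in readings if r.value < threshold]


def lowest_value(readings):
    """Return the smallest value across all instances, or None if empty."""
    return min((r.value for r in readings), default=None)
```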



AzureCloudRoleInstanceDetails

Tracks detailed information about Azure role instances as a list.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: object with the following properties:

    • Instance (string): instance name

    • Size (string): size of the VM on which the role instance is running; possible values: ExtraSmall, Small, Medium, Large, ExtraLarge, etc.

    • State (string): latest status of the instance

    • StateDetails (string): message about the status of the instance

    • ErrorCode (string): empty or error message

  • Can be accessed only through aggregation using Expressions, as described in the “Evaluating data in sets\arrays (advanced)” section of the “Working with Expressions” article.

  • Included in sample profile: yes,  in all profiles tracked as a metric:

    • InstanceList: detailed status of monitored cloud role instances

  • Included in default alerts: yes, in all profiles:

    • Role has NO Ready Instances (Error): No instances with ReadyRole status have been detected for a sustained period of time (5 min.).

    • Role has some Non-Ready Instances (Warning): Any(InstanceList, "State != \"ReadyRole\"") && Count(InstanceList, "State == \"ReadyRole\"") > 0: A mix of instances with ReadyRole and non-ReadyRole statuses has been detected for a sustained period of time (5 min.).
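
The two alert conditions above can be mirrored in Python to make their logic explicit. This is a sketch under assumptions: `any_match` and `count_match` are hypothetical stand-ins for the Any and Count expression functions, instances are modelled as dicts with a `State` key, and the condition for the "NO Ready Instances" alert is inferred from its description rather than taken from a documented expression.

```python
def any_match(items, predicate):
    """True if at least one item satisfies the predicate."""
    return any(predicate(item) for item in items)


def count_match(items, predicate):
    """Number of items that satisfy the predicate."""
    return sum(1 for item in items if predicate(item))


def is_ready(instance):
    return instance["State"] == "ReadyRole"


def role_has_some_non_ready(instance_list):
    # Mirrors: Any(State != "ReadyRole") && Count(State == "ReadyRole") > 0
    return (any_match(instance_list, lambda i: not is_ready(i))
            and count_match(instance_list, is_ready) > 0)


def role_has_no_ready(instance_list):
    # Assumed condition: no instance reports ReadyRole.
    return count_match(instance_list, is_ready) == 0
```

Note that the two conditions are mutually exclusive: once no instance is Ready, only the Error-level alert applies.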


ResourceInstanceCount

Tracks the most recent number of instances in the cloud role, as reported by the Azure Management API.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: int

  • Included in sample profile: no

  • Included in default alerts: no


AzureVirtualMachineState

Specifies the current status of a role instance as reported by Azure. Possible values for this metric are listed in this MSDN article in the RoleInstanceList section.


  • Tracked by Azure Diagnostics Extension and CloudMonix agent.

  • Data Type: string

  • Possible values: CreatingVM, StartingVM, StoppingVM, StoppedVM, DeletingVM, FailedStartingVM, CreatingRole, StartingRole, ReadyRole, BusyRole, StoppingRole, RestartingRole, CyclingRole, FailedStartingRole, UnresponsiveRole, StoppedDeallocated, Preparing, Unknown.

  • Included in sample profile: no

  • Included in default alerts: no


WindowsEventLogEntry

Tracks entries from the Windows Event Log.

  • Tracked by Azure VM Diagnostics Extension and CloudMonix agent.

  • Data Type: object with the following properties:

    • EventId (int): ID of the Event Log entry

    • MachineName (string): host name of the server that generated the error

    • Message (string): actual message of the event

    • Source (string): application/service that generated the event

    • UserName (string): user under whose credentials the log entry was generated

    • EntryType (string): Information, Warning or Error

    • Timestamp (datetime): local time when the log entry was generated

  • Can be accessed only through aggregation using Expressions, as described in the “Evaluating data in sets\arrays (advanced)” section of the “Working with Expressions” article.

  • Included in sample profiles: yes, in all sample profiles tracked as metrics:

    • ApplicationEventLogs: Tracks entries from the Windows Event Log (Application source)

    • SystemEventLogs: Tracks entries from the Windows Event Log (System source)

  • Included in default alerts: no
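
Since event log metrics are lists of entries, alerting on them comes down to filtering and counting. The following sketch shows that in Python, assuming entries are dicts keyed by the property names listed above; the helper name and sample values are hypothetical, and in CloudMonix the equivalent logic is written with Expressions.

```python
def filter_entries(entries, entry_type=None, source=None):
    """Return the entries matching the given EntryType and/or Source.

    A filter left as None is not applied.
    """
    return [
        e for e in entries
        if (entry_type is None or e["EntryType"] == entry_type)
        and (source is None or e["Source"] == source)
    ]
```

An alert condition could then be as simple as checking whether the filtered list is non-empty, or whether its length exceeds some count.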


WindowsUpdatesDrivers

Tracks available Windows Driver Updates. Used for ensuring all important updates are installed regularly. CloudMonix limits the time allowed for retrieving these metrics to 10 seconds and does not retrieve them more often than once every 6 hours.

  • Available for Classic and ARM VMs.

  • Tracked by CloudMonix agent.

  • Data Type: WindowsUpdatesSoftware list, with the following properties:

    • Title - (string)

    • Url - (string)

    • Mandatory - (bool)

    • Priority - (string)

    • Date - (DateTime)

  • Included in sample profile: no

  • Included in default alerts: no


WindowsUpdatesSoftware

Tracks available Windows Updates. Used for ensuring all important updates are installed regularly. CloudMonix limits the time allowed for retrieving these metrics to 10 seconds and does not retrieve them more often than once every 6 hours.

  • Available for Classic and ARM VMs.

  • Tracked by CloudMonix agent.

  • Data Type: WindowsUpdatesSoftware list, with the following properties:

    • Title - (string)

    • Url - (string)

    • Mandatory - (bool)

    • Priority - (string)

    • Date - (DateTime)

  • Included in sample profile: no

  • Included in default alerts: no



Alerts

Users can create alerts based on changes in any value tracked by CloudMonix (including custom metrics). Each resource template includes alerts which are suitable for a given resource. The predefined alerts for Azure Cloud Services are listed in the Metrics section. Refer to the Alerts article to learn more about alerts in general.


Alerts are available during the Trial period or in Professional and Ultimate plans only.


The conditions for alerts can be evaluated per role or per individual instance. Users can select the suitable option on the alert definition tab. Certain metrics are well distributed across instances, so it may make more sense to alert on them across the role (e.g. Requests/sec), while others are best evaluated per instance (e.g. RAM consumption).
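
The difference between the two evaluation modes can be seen in a short sketch. It assumes that role-level evaluation compares the average across instances against the threshold, which is an assumption made for illustration; the function names and instance names are hypothetical.

```python
from statistics import mean


def role_level_alert(values_by_instance, threshold):
    """Evaluate the condition once, on the average across all instances."""
    return mean(values_by_instance.values()) > threshold


def instance_level_alerts(values_by_instance, threshold):
    """Evaluate the condition per instance; return the instances that exceed it."""
    return sorted(name for name, value in values_by_instance.items() if value > threshold)
```

With one overloaded instance and two idle ones, the role-level check can stay quiet while the instance-level check flags the overloaded instance, which is why the choice of mode matters.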




Auto-scaling and Actions


Automation features (Actions) allow users to set up powerful reactive, proactive and scheduled actions and auto-scaling rules. CloudMonix can execute actions and scale adjustments when a specific monitoring condition occurs or according to a schedule. Refer to the Actions article to learn more about automating Cloud Role instance reboots and to the Auto-scaling article to learn more about Auto-scaling Cloud Role instances.


Automation features are available during the Trial period or in the Ultimate plan only.


Special notes:

  • When defining actions, ensure that they are evaluated and executed on an instance level, as this is the most common scenario.



  • The default monitoring templates for both Web and Worker roles come pre-packaged with two disabled actions that can reboot instances on a daily basis and when they run low on RAM. Those actions need to be explicitly enabled.

  • Because Role Instances can be rebooted by Azure or by CloudMonix automation rules, it is important that these instances handle reboots properly by handling the RoleEnvironment.Stopping event. Azure itself reboots instances regularly for updates, so this is good practice on Azure in general. Refer to the “The Right Way to Handle Azure OnStop Events” article to learn more.

  • Auto-scaling generates a TopologyChangeEvent that can sometimes cause instances to reboot unnecessarily. It is a good idea to properly handle TopologyChangeEvent so that this does not happen. Refer to the Responding to Role Topology Changes article to learn more.



As a general rule, every new action should specify appropriate Suspended period and Sustained period values. See the Automating Actions article to learn more about those settings.
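
The interplay of the two settings can be sketched as a small gate in front of an action. This sketch assumes that a Sustained period means the condition must hold continuously for that long before the action fires, and that a Suspended period is the minimum time between two executions; the `ActionGate` class is hypothetical and is not part of CloudMonix.

```python
from datetime import datetime, timedelta


class ActionGate:
    """Decide whether an action may fire, given sustained/suspended periods."""

    def __init__(self, sustained, suspended):
        self.sustained = sustained      # how long the condition must hold
        self.suspended = suspended      # minimum gap between executions
        self.condition_since = None     # when the condition first became true
        self.last_fired = None          # when the action last executed

    def should_fire(self, condition_met, now):
        if not condition_met:
            self.condition_since = None  # any break resets the sustained clock
            return False
        if self.condition_since is None:
            self.condition_since = now
        if now - self.condition_since < self.sustained:
            return False
        if self.last_fired is not None and now - self.last_fired < self.suspended:
            return False
        self.last_fired = now
        return True
```

For instance, `ActionGate(timedelta(minutes=5), timedelta(hours=1))` would let an action fire only after the condition has held for five minutes, and then at most once per hour.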


Built-in Azure Cloud Roles Actions


AzureCloudServiceInstanceReboot


CloudMonix will request Azure to reboot the instance of a role.


  • Available for Azure Web and Worker Roles.

  • Supported by Azure Diagnostics Extensions and CloudMonix agents.

  • Included in default actions: yes, included in all sample profiles, but must be explicitly enabled

    • Daily reboot (Information): Reboots cloud role instances once per day. The reboot happens when the instance's index matches the current clock hour (in UTC). For example, the 1st instance is rebooted at UTC midnight, the 2nd instance at 1am UTC, etc. For deployments with 25+ instances, this action reboots every 24th instance. For example, for a deployment with 100 instances, the 1st, 25th, 49th, 73rd and 97th instances are rebooted at UTC midnight; the 2nd, 26th, 50th, 74th and 98th instances are rebooted at 1am UTC; etc. More information here.

    • Low Ram Reboot (Warning): Reboots the Cloud Role instance if available memory drops below 100 MB for 5 minutes sustained. This action will not be executed more than once per hour due to the Suspended period setting.
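
The daily reboot schedule described above can be modelled with modular arithmetic. This is a sketch that assumes instances carry a zero-based index and are selected when that index modulo 24 equals the current UTC hour; the function name is hypothetical and the actual CloudMonix condition may be expressed differently.

```python
def instances_to_reboot(utc_hour, instance_count):
    """Return the 1-based positions of the instances rebooted at `utc_hour`.

    Assumes zero-based instance indexes, selected when index % 24 == utc_hour.
    """
    return [index + 1 for index in range(instance_count) if index % 24 == utc_hour]
```

Under this assumption, a deployment with fewer than 24 instances reboots at most one instance per hour, and larger deployments reboot every 24th instance in the same hour.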



AzureCloudServiceInstanceReimage


CloudMonix will request Azure to reimage the Cloud Role.

  • Available for Azure Web and Worker Roles.

  • Supported by Azure Diagnostics Extensions and CloudMonix agents.

  • Included in default actions: no


PowershellRestartService


CloudMonix will restart a specified Windows Service. The predefined PowerShell script executes the command: Restart-Service serviceName.

  • Supported by CloudMonix agent only.

  • Included in default actions: no


CustomPowershellScript

CloudMonix will execute the custom PowerShell script specified in the action definition on the target VM. This action is especially useful when combined with other CloudMonix features, such as metrics or schedules.

  • Supported by CloudMonix agent only.

  • Included in default actions: no


Built-in Azure Cloud Roles Auto-scaling rules


Every resource that uses CloudMonix auto-scaling should also define Scale-down cooling period and Scale-up cooling period or Sustained period values.


Auto-scaling rules included in all sample profiles:

Scale Ranges:

  • Overall Scaling Limit (Warning): The minimum number of instances is 2. Disabled by default, has to be explicitly enabled in the Resource Configuration dialog in the Scale Ranges tab.


Scale Adjustments:


  • Scale Up (Warning): CpuTime30MinAverage > 70: Disabled by default; must be explicitly enabled in the Resource Configuration dialog in the Scale Adjustments tab. Scales up the number of instances by 1 when the 30-minute average CPU utilization across all instances within the monitored Cloud Role exceeds 70%.

  • Scale Down (CPU) (Information): CpuTime30MinAverage < 20: Disabled by default; must be explicitly enabled in the Resource Configuration dialog in the Scale Adjustments tab. Scales down the Cloud Role when the 30-minute average CPU utilization across all instances has been under 20% for 10 minutes sustained.
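
Taken together, the default scale range and scale adjustments amount to a simple decision rule. The sketch below is illustrative only: the function name and the optional `maximum` parameter are hypothetical, scaling down by exactly 1 instance is an assumption (the rule above does not state the step), and cooling and sustained periods are not modelled.

```python
def desired_instance_count(current, cpu_30min_avg, minimum=2, maximum=None):
    """Return the target instance count for the default CPU-based rules."""
    target = current
    if cpu_30min_avg > 70:
        target = current + 1          # Scale Up: add one instance
    elif cpu_30min_avg < 20:
        target = current - 1          # Scale Down: step of 1 is assumed
    # The scale range always wins over the adjustments.
    target = max(target, minimum)
    if maximum is not None:
        target = min(target, maximum)
    return target
```

Applying the range last means a scale-down rule can never take the role below the configured minimum of 2 instances.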