AppScale 3.5: collecting performance data with Hermes

Posted by Dmitrii Calzago on 4/5/18 10:46 AM

With the AppScale 3.5 release, the internal system for gathering system-wide performance statistics, called Hermes (after the messenger of the Greek gods), is coming out of the shadows and becoming useful in a growing number of use cases. These include monitoring, alerting, and debugging. Hermes has been part of AppScale for a few releases (it was initially added to inform the auto-scaling logic about the incoming request rate), but the addition of new performance statistics and a rich HTTP-based API has made it much more useful in 3.5. While we will continue to improve the API and add integrations with outside tools, such as Elastic Stack, the core design of Hermes has stabilized for the foreseeable future.

Overview

Hermes is a generic, topology-aware transport for delivering performance statistics from all AppScale nodes to the edge. Let us deconstruct that sentence:

  • Generic means there are no structural constraints on the data: any JSON document can be transported by Hermes, based on the needs of the use case.
  • Topology-aware means Hermes knows nodes and their roles. On some nodes – load balancers – Hermes aggregates data from all other nodes. The roles also determine what measurements Hermes performs locally on the node.
  • Transport means that Hermes moves data but does not store it, except for the last value and possibly some averages over a short window of time. This implies a stateless design and means that other systems must be used for long-term storage and analysis of performance data.
  • Edge is the subset of nodes in a deployment that are reachable from outside the deployment. Usually these are the load balancers. Performance statistics can be pushed or pulled from these nodes to an outside system for storage and analysis.

In the current implementation, Hermes runs on all AppScale nodes, but behaves differently depending on the roles assigned to the node. If a node does not have the "load_balancer" role, it polls node-local sources only and caches the latest values obtained. The Hermes process makes those latest values available via an HTTP-based API (explained below).

If a node does have the "load_balancer" role, it polls both node-local sources and all other nodes in the deployment and caches the latest values from all. Hermes processes on load balancers can be queried for performance statistics of all nodes in the deployment. For example, on some of our installations we have used metricbeat daemons on load balancers to relay statistics to an Elastic Stack installation. Such a configuration is shown in the following diagram, with nodes of different roles being polled by a single load balancer (not all of the sources for statistics shown are implemented yet):

[Figure: Hermes architecture diagram]

The Hermes daemon is aware of the roles assigned to the node where it’s running. Based on that, the daemon collects performance statistics using data producers specific to each role, as well as general node statistics. For example, Hermes invokes the psutil library on every node to get CPU, memory, network, and disk measurements, some for individual processes and some for the node overall. On load balancers, Hermes polls HAProxy by sending a "show stat" command into a Unix socket. On nodes running RabbitMQ, Hermes obtains service statistics from a TCP socket.

Hermes API

Hermes offers an HTTP-based read-only API on TCP ports 4378 (within the deployment) and 17441 (for queries from outside, using HTTPS). For authorization, the custom header "AppScale-Secret" must contain the secret token. The token is the same for all nodes and can be found in the file /etc/appscale/secret.key. It is generated and distributed during deployment (i.e., 'appscale up' on a new set of nodes). Thus, on any AppScale node one can query Hermes from the command line as follows:

SECRET=`cat /etc/appscale/secret.key`
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/node

Responses from Hermes are JSON documents. A request with an invalid path results in a "404 Not Found" reply. A request without the "AppScale-Secret" header or with a wrong secret results in a "403 Bad secret" reply. A request with an empty path returns the health status of the service:

{"status": "up"}
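The same query can be issued from a script. The following is a minimal sketch in Python 3 using only the standard library; the function names (build_request, query_hermes, read_secret) are made up for illustration and are not part of Hermes, while the port, header name, and secret file location are the ones given above.

```python
import json
import urllib.error
import urllib.request

SECRET_PATH = "/etc/appscale/secret.key"  # location stated in the post


def read_secret(path=SECRET_PATH):
    """Read the deployment secret, dropping any trailing newline."""
    with open(path) as secret_file:
        return secret_file.read().strip()


def build_request(path, secret, host="localhost", port=4378):
    """Build an authenticated GET request for a Hermes path (hypothetical helper)."""
    url = f"http://{host}:{port}{path}"
    # HTTP header names are case-insensitive; urllib normalizes the
    # capitalization of the name before sending it.
    return urllib.request.Request(url, headers={"AppScale-Secret": secret})


def query_hermes(path, secret, host="localhost", port=4378, timeout=10):
    """Return the decoded JSON reply, raising RuntimeError on an HTTP error."""
    request = build_request(path, secret, host, port)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # Per the post: 403 for a missing or wrong secret, 404 for a bad path.
        raise RuntimeError(f"Hermes replied {error.code} {error.reason}") from error
```

On an AppScale node, query_hermes("/stats/local/node", read_secret()) would then return the node statistics as a Python dictionary, and query_hermes("/", read_secret()) the health status.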

Node-local statistics are available from Hermes services on all nodes. Valid paths for local statistics are of the form "/stats/local/type", where type can be "node" or one of the four other types described in detail below. Responses are dictionaries of statistics, each a data structure specific to the values being measured. For example, the reply to "/stats/local/push_queues" (Task Queue statistics) can be:

{
    "queues": [
        {
            "messages": 0,
            "name": "appscaledashboard___default"
        },
        {
            "messages": 2,
            "name": "guestbook___default"
        }
    ],
    "utc_timestamp": 1522675603
}
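As an illustration of how such a reply might be consumed for alerting, here is a small Python sketch that lists queues with pending messages. The helper name backlogged_queues and its threshold parameter are assumptions made for this example; only the "queues", "name", and "messages" fields come from the reply shown above.

```python
import json


def backlogged_queues(push_queues_doc, threshold=1):
    """Return {queue name: message count} for queues at or above threshold."""
    return {
        queue["name"]: queue["messages"]
        for queue in push_queues_doc["queues"]
        if queue["messages"] >= threshold
    }


# The reply from the example above, as it would arrive over HTTP.
sample = json.loads("""
{
    "queues": [
        {"messages": 0, "name": "appscaledashboard___default"},
        {"messages": 2, "name": "guestbook___default"}
    ],
    "utc_timestamp": 1522675603
}
""")

print(backlogged_queues(sample))  # {'guestbook___default': 2}
```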

Cluster-wide statistics are available from Hermes services on load balancers, which aggregate node-local statistics and return them together. Nodes that could be queried successfully for a statistic are grouped under "stats", indexed by their IP addresses. If any nodes could not be queried, they are listed under "failures", also indexed by IP address. For example, the reply to "/stats/cluster/push_queues" on the same installation as above can be:

{
    "failures": {},
    "stats": {
        "10.10.9.245": {
            "queues": [
                {
                    "messages": 0,
                    "name": "appscaledashboard___default"
                },
                {
                    "messages": 2,
                    "name": "guestbook___default"
                }
            ],
            "utc_timestamp": 1522676219
        }
    }
}

Note the same content as in the earlier example, but packaged under the IP of the Task Queue node (10.10.9.245) in the "stats" section (for successful replies). The "failures" section is empty when Hermes processes are functioning normally.
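A consumer of cluster-wide replies typically wants to treat the two sections separately. The sketch below, in Python, shows one way to do that; the function names are hypothetical, and because the post does not describe what is stored for each entry under "failures", the code relies only on the IP keys there.

```python
def split_cluster_reply(cluster_doc):
    """Return (per-node stats dict, sorted list of IPs that failed)."""
    stats = cluster_doc.get("stats", {})
    failed_ips = sorted(cluster_doc.get("failures", {}))
    return stats, failed_ips


def total_queued_messages(cluster_doc):
    """Sum "messages" over all queues on all nodes that replied."""
    stats, _ = split_cluster_reply(cluster_doc)
    return sum(
        queue["messages"]
        for node_doc in stats.values()
        for queue in node_doc.get("queues", [])
    )


# The cluster-wide reply from the example above.
sample = {
    "failures": {},
    "stats": {
        "10.10.9.245": {
            "queues": [
                {"messages": 0, "name": "appscaledashboard___default"},
                {"messages": 2, "name": "guestbook___default"},
            ],
            "utc_timestamp": 1522676219,
        }
    },
}

stats, failed_ips = split_cluster_reply(sample)
print(sorted(stats), failed_ips, total_queued_messages(sample))
```

A non-empty list of failed IPs is itself a useful alerting signal, since it means the load balancer could not reach Hermes on those nodes.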

A request for cluster-wide statistics sent to a node without the load_balancer role results in a "404 Only master node provides cluster stats" reply.

Statistics Types

Whether at a node level or aggregated for the whole deployment, Hermes in 3.5 supports five types of statistics, each returning a unique data structure. Below we touch upon the most important measurements that are available from each type:

1. The 'node' type (or 'nodes', for a cluster-wide request) returns CPU load, memory use, and disk use measurements for the server(s). In addition to point-in-time measurements, CPU load is also available as a 5-minute load average. An example of a 'nodes' response for one node is:

{
    "cpu": {
        "count": 2,
        "percent": 51.4
    },
    "loadavg": {
        "last_5min": 1.14
    },
    "memory": {
        "available": 2900672512,
        "total": 3837366272
    },
    "partitions_dict": {
        "/": {
            "free": 5321445376,
            "used": 4167401472
        },
        "/mnt": {
            "free": 20686467072,
            "used": 46206976
        }
    },
    "utc_timestamp": 1522677773.0
}
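The raw byte counts are usually easier to alert on once converted to percentages. The following Python sketch derives a few such figures from the response above. The helper names are made up for this example, and computing partition capacity as used plus free is an assumption (filesystems may reserve space that appears in neither number).

```python
def memory_used_percent(node_doc):
    """Share of memory not reported as available, in percent."""
    memory = node_doc["memory"]
    return 100.0 * (memory["total"] - memory["available"]) / memory["total"]


def partition_used_percent(node_doc):
    """Map each mount point to used / (used + free), in percent."""
    result = {}
    for mount, usage in node_doc["partitions_dict"].items():
        capacity = usage["used"] + usage["free"]  # assumed capacity
        result[mount] = 100.0 * usage["used"] / capacity if capacity else 0.0
    return result


def load_per_cpu(node_doc):
    """5-minute load average divided by the number of CPUs."""
    return node_doc["loadavg"]["last_5min"] / node_doc["cpu"]["count"]


# The node response from the example above.
sample = {
    "cpu": {"count": 2, "percent": 51.4},
    "loadavg": {"last_5min": 1.14},
    "memory": {"available": 2900672512, "total": 3837366272},
    "partitions_dict": {
        "/": {"free": 5321445376, "used": 4167401472},
        "/mnt": {"free": 20686467072, "used": 46206976},
    },
    "utc_timestamp": 1522677773.0,
}

print(round(memory_used_percent(sample), 1))
print({m: round(p, 1) for m, p in partition_used_percent(sample).items()})
print(load_per_cpu(sample))
```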

2. The 'processes' type returns CPU, memory, and TCP port information about all Unix processes (heavyweight processes, not threads) on a node that were spawned as part of the AppScale deployment. This includes application servers, AppScale processes such as admin_server, third-party systems such as Zookeeper and Cassandra, as well as the Hermes daemon itself. Note that the name of a deployed application may appear in several places: in each application server process, in each API server deployed for the application, and in the child process of the Celery daemon dedicated to the application. An example of a 'processes' response for one daemon is:

{
    "application_id": null,
    "children_stats_sum": {
        "cpu": {
            "percent": 0,
            "system": 0.0,
            "user": 0.0
        },
        "memory": {
            "resident": 0,
            "unique": 0,
            "virtual": 0
        }
    },
    "cpu": {
        "percent": 0.0,
        "system": 77.39,
        "user": 71.44
    },
    "memory": {
        "resident": 127700992,
        "unique": 112345088,
        "virtual": 3555422208
    },
    "monit_name": "zookeeper",
    "port": null,
    "unified_service_name": "zookeeper"
}
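Because several processes can belong to the same service, it can help to roll per-process entries up by "unified_service_name". The Python sketch below does so for resident memory. It assumes the caller has already extracted a list of per-process entries shaped like the one above (the post does not show the envelope around them), the function name is hypothetical, and the numbers in the second entry are invented for illustration.

```python
from collections import defaultdict


def resident_memory_by_service(process_entries):
    """Sum resident memory (bytes) per unified_service_name."""
    totals = defaultdict(int)
    for entry in process_entries:
        totals[entry["unified_service_name"]] += entry["memory"]["resident"]
    return dict(totals)


# First entry: trimmed from the example above. Second: invented values.
entries = [
    {
        "application_id": None,
        "memory": {"resident": 127700992, "unique": 112345088, "virtual": 3555422208},
        "monit_name": "zookeeper",
        "unified_service_name": "zookeeper",
    },
    {
        "application_id": None,
        "memory": {"resident": 1000, "unique": 800, "virtual": 5000},
        "monit_name": "zookeeper",
        "unified_service_name": "zookeeper",
    },
]

print(resident_memory_by_service(entries))  # {'zookeeper': 127701992}
```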

3. The 'proxies' type returns request statistics from HAProxy for each AppScale service that is invoked via the proxy. Both external requests to the application and API requests made by the application to internal services pass through HAProxy. Thus, request rates and queue lengths measured at proxies are good indicators of the request load experienced by a particular service. Five types of services can be examined:

  • taskqueue - requests related to Task Queue API calls
  • blobstore - requests related to Blobstore API calls
  • datastore - requests related to Datastore API calls
  • uaserver - requests to the User App Server, which implements other App Engine API calls not covered by the services above
  • application - HTTP requests from outside to applications, with separate entries for each application

Within each type of service, measurements are further broken down into "frontend" and "backend". These are provided by HAProxy and refer to statistics for requests and connections entering the proxy (front-end statistics) and made by the proxy to another service (back-end statistics). An example of a 'proxies' response for one service of type 'application' is:

{
    "application_id": "appscaledashboard_default_v1",
    "backend": {
        "hrsp_5xx": 0,
        "qcur": 0,
        "qtime": 0,
        "rtime": 1110,
        "scur": 0
    },
    "frontend": {
        "bin": 323463,
        "bout": 191625,
        "hrsp_4xx": 0,
        "hrsp_5xx": 0,
        "rate": 0,
        "req_rate": 0,
        "req_tot": 1022,
        "scur": 0,
        "smax": 2
    },
    "name": "gae_appscaledashboard_default_v1",
    "servers_count": 2,
    "unified_service_name": "application"
}

In this example, "scur" captures the current number of open connections (listed separately for the front and the back), while "qcur" captures how many of the connections opened on the front are queued up on the back, that is, not yet forwarded due to load. Since 5XX errors received by the proxy from the back do not map one-to-one to 5XX errors returned by the proxy to the client, their count ("hrsp_5xx") is tracked separately in the two sections. Note that any cumulative statistics, such as the 5XX error count, the request count ("req_tot"), or the byte counts ("bin" and "bout"), are reset to zero when the HAProxy process is restarted and shouldn't be relied on to represent long periods of time.
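One common way to cope with counters that reset is to work with differences between consecutive samples and to treat a decrease as a restart. The Python sketch below illustrates that idea; it is not something Hermes does itself, and the function names and the default list of counters are assumptions for this example.

```python
def counter_delta(previous, current):
    """Growth of a cumulative counter between two samples.

    A smaller current value is taken to mean the counter restarted from
    zero, so the current value is returned. Anything counted between the
    previous sample and the restart is lost, making this a lower bound.
    """
    if current < previous:
        return current
    return current - previous


def frontend_deltas(previous_doc, current_doc,
                    counters=("req_tot", "bin", "bout", "hrsp_4xx", "hrsp_5xx")):
    """Per-counter deltas for the "frontend" section of two proxy entries."""
    return {
        name: counter_delta(previous_doc["frontend"][name],
                            current_doc["frontend"][name])
        for name in counters
    }


print(counter_delta(1022, 1100))  # 78
print(counter_delta(1022, 15))    # 15
```

Gauges such as "scur", "qcur", and "rate" describe the current state and should be used as they are rather than differenced.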

4. The 'rabbitmq' type returns the health status of the RabbitMQ service: boolean values indicating whether the service has run out of memory or disk. An example of a 'rabbitmq' response for a node is:

{
    "disk_free_alarm": false,
    "mem_alarm": false,
    "name": "rabbit@appscale-image1",
    "utc_timestamp": 1522678116
}
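For alerting, these two flags can be checked across the whole deployment in one pass. The Python sketch below assumes a cluster-wide reply that follows the "stats" by IP layout described in the Hermes API section, with each node entry shaped like the example above; the function name is hypothetical.

```python
ALARM_KEYS = ("mem_alarm", "disk_free_alarm")


def rabbitmq_alarms(cluster_doc):
    """Map node IP to the list of alarm flags that are currently raised."""
    alarms = {}
    for ip, node_doc in cluster_doc.get("stats", {}).items():
        raised = [key for key in ALARM_KEYS if node_doc.get(key)]
        if raised:
            alarms[ip] = raised
    return alarms
```

An empty result means no node that replied reported a memory or disk alarm; nodes listed under "failures" are not covered and need a separate check.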

5. The 'push_queues' type returns the state of all Push Task Queues: their names and the number of queued-up requests. An example 'push_queues' response can be seen above, in the Hermes API section.

Summary

A good variety of performance statistics, which can be used for monitoring, alerting, and debugging, is available from Hermes. As of AppScale 3.5, the complete list of queries that can be performed on a load-balancing node using the command-line tool 'curl' is as follows:

SECRET=`cat /etc/appscale/secret.key`
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/node
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/proxies
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/rabbitmq
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/push_queues
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/local/processes
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/cluster/nodes
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/cluster/proxies
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/cluster/rabbitmq
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/cluster/push_queues
curl --header "AppScale-Secret: $SECRET" http://localhost:4378/stats/cluster/processes
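The list above follows a regular pattern, which a polling script could generate instead of hard-coding. A minimal Python sketch follows; the function name hermes_paths is an assumption for this example, while the paths themselves are the ones listed above (note that the local 'node' type becomes 'nodes' in cluster-wide queries).

```python
STAT_TYPES = ("node", "proxies", "rabbitmq", "push_queues", "processes")


def hermes_paths(is_load_balancer):
    """Return the Hermes paths that can be queried on a node."""
    paths = [f"/stats/local/{stat_type}" for stat_type in STAT_TYPES]
    if is_load_balancer:
        # Cluster-wide paths are served only by nodes with the
        # load_balancer role.
        for stat_type in STAT_TYPES:
            cluster_type = "nodes" if stat_type == "node" else stat_type
            paths.append(f"/stats/cluster/{cluster_type}")
    return paths


for path in hermes_paths(is_load_balancer=True):
    print(f"http://localhost:4378{path}")
```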
