## The problem with current AppSync metrics

If you're building a GraphQL API with [[AWS AppSync|AppSync]], there are a few [[AWS CloudWatch|CloudWatch]] metrics you can create alarms on so that you're notified when something goes wrong with your API in production. The following metrics in particular are useful for this: `4XXError`, `5XXError`, `Latency`.

However, unlike in RESTful APIs (if you're using [[AWS API Gateway|API Gateway]], say), the `4XXError` and `5XXError` metrics are less useful here, because the GraphQL specification does not use these HTTP status codes to denote field-level errors when processing a GraphQL request. Instead, a 200 status code is returned and an `errors` array is included in the JSON response body, with each entry's `path` pointing to the field with the issue. The problem is that there is no way to be alerted of field-level errors using the built-in AppSync metrics (at the time of writing, 2023-09-01).

Some examples of events that can cause field-level errors to be returned are:

- Bug in the VTL request/response template that throws an unhandled error
- Bug in the resolver's Lambda function data source that throws an unhandled error
- Timeout in the resolver's Lambda function data source
- Explicit returning of an error from VTL or a Lambda function because of an invalid client request

An important distinction in the above events is that the first 3 are all "server-side" errors (roughly equivalent to a 500 Internal Server Error in RESTful APIs) while the last one is a "client-side" one (equivalent to a 400 Bad Request in RESTful APIs). In terms of alerting, you probably want to always be notified about any Internal Server Errors, whereas you may set a higher threshold before an alarm triggers on client validation errors.
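To make the problem concrete, here is a minimal Python sketch of how a failed field looks from the client's side. The sample body and the helper names (`field_errors`, `is_failure`) are made up for illustration and are not taken from a real AppSync response. The only assumption is the standard GraphQL response shape, with a top-level `errors` array next to `data`.

```python
import json

# Hypothetical response: the HTTP status is 200 even though the field failed.
SAMPLE_STATUS = 200
SAMPLE_BODY = json.dumps({
    "data": {"getPost": None},
    "errors": [
        {
            "path": ["getPost"],
            "errorType": "Lambda:Unhandled",
            "message": "Task timed out after 3.00 seconds",
        }
    ],
})


def field_errors(body: str) -> list[dict]:
    """Return the entries of the top-level `errors` array, or [] if absent."""
    payload = json.loads(body)
    return payload.get("errors") or []


def is_failure(status: int, body: str) -> bool:
    """Treat a response as failed if the HTTP status or any field reports an error."""
    return status >= 400 or bool(field_errors(body))


if __name__ == "__main__":
    # A status-code-only check (like the 4XXError/5XXError metrics) sees nothing wrong.
    print("status looks healthy:", SAMPLE_STATUS < 400)
    # Inspecting the body reveals the field-level error.
    for err in field_errors(SAMPLE_BODY):
        print(".".join(str(p) for p in err["path"]), "->", err["message"])
```

A check based only on the status code would count this request as a success, which is exactly the blind spot of the built-in metrics.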
## Solution: Use a log filter to emit custom metrics

While we cannot use any built-in metrics, CloudWatch does allow you to create [custom metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html). We can then create an alarm based on that metric.

### Pre-req

This requires that you [enable logging on your AppSync API](https://docs.aws.amazon.com/appsync/latest/devguide/monitoring.html) with "Include verbose content" enabled and the field-level logging level set to `ERROR` (it can also be set to `ALL`, but this could get very costly and isn't needed for this alarm).

### Create custom metric filter

Our metric filter will watch the AppSync API's CloudWatch Log Group for a specific pattern and then emit a metric value of 1 (with unit `Count`) every time it finds this pattern. The CloudFormation YAML to create this metric filter is below:

```yml
MyGraphQLApiResolverServerErrorsMetric:
  Type: AWS::Logs::MetricFilter
  Properties:
    FilterPattern: '{ ($.logType = "RequestMapping" || $.logType = "ResponseMapping") && ($.fieldInError IS TRUE) && ($.context.error.message = "*") }'
    LogGroupName: <YOUR_APP_SYNC_API_LOG_GROUP_NAME>
    MetricTransformations:
      - MetricName: ServerFieldErrors
        MetricNamespace: AppSyncResolvers
        MetricValue: "1"
        Unit: Count
```

The `FilterPattern` here looks for `fieldInError` (to exclude debug/info level logs) and also checks for the presence of the `context.error.message` field, which seems to be present only when server-side errors are returned. If you like, you could create a second `ResolverClientErrorsMetric` by negating this last clause of the pattern.

Note that we need to specify the `LogGroupName` above. So if you have multiple AppSync APIs to monitor, you may wish to create multiple metrics using the same filter pattern but with a different namespace.
(It may also be possible to use a single metric with the AppSync API ID as a metric dimension, but I haven't tried this out.)

### Create alarm

To wire up the alarm to this metric, the following alarm definition will trigger if any errors are returned within a 1 minute period.

```yml
ResolverServerErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: MyGraphQLApiResolverServerErrorsAlarm
    Namespace: AppSyncResolvers
    MetricName: ServerFieldErrors
    # No Dimensions here: the metric filter above publishes ServerFieldErrors
    # without dimensions, and an alarm only matches a metric whose dimension
    # set is identical.
    ComparisonOperator: GreaterThanOrEqualToThreshold
    EvaluationPeriods: 1
    Period: 60
    Statistic: Sum
    Threshold: 1
    AlarmActions:
      - <YOUR_SNS_TOPIC_ARN>
```

---

## References

- [AppSync Metrics and Dimensions](https://docs.aws.amazon.com/appsync/latest/devguide/monitoring.html#cw-metrics)
- [Yan Cui's AppSync MasterClass video course which covers AppSync monitoring in detail](https://school.theburningmonk.com/courses/appsync-masterclass-premium)