How to Solve It: AWS Copilot's Progress Tracker
Towards the end of last year, I picked up How to Solve It by George Pólya, a book that studies the methods of problem solving, and I fell in love with it. While most of the examples are about solving math problems, the mental operations apply just as well to practical problems. This post walks through the process of solving a modest practical problem, AWS Copilot’s progress tracker, and arriving at a design using the recommendations presented in the book.
Background
AWS Copilot is an open source CLI to build, release and operate containerized apps on AWS. Prior to v1.2.0, customers that deployed a service with Copilot only saw a single spinner to signal that the deployment is in progress:
We knew that displaying only a spinner for an operation that can take several minutes did not meet the delightfulness bar for Copilot. So we started exploring how we can provide a better UX and arrived at the following solution:
Mathematical vs. practical problems
In a perfectly stated mathematical problem all data and all clauses of the condition are essential and must be taken into account. In practical problems we have a multitude of data and conditions; we take into account as many as we can but we are obliged to neglect some.
- George Pólya, How to Solve It
Take our example with the progress tracker. There are multiple APIs that we can use to get information about resources being deployed and their state. Do we need to use DescribeChangeSet to get a list of proposed changes or is DescribeStackEvents enough to display progress? Is there interesting information from DescribeServices? How about calling the Describe APIs of other resources?
In solving a practical problem, we are often obliged to start from rather hazy ideas; then, the clarification of the concepts may become an important part of the problem.
- George Pólya, How to Solve It
90% of solving really hard problems is deciding which set of constraints you should ignore. - @_joemag_
Unlike math problems, we seldom have a precise definition at the start. We iterate on the requirements. As we explore the problem space, we get a better understanding of which requirements are real and which ones are “nice to have”. For example, does the progress tracker need to render updates instantaneously or is it okay to simplify and display information periodically? The answer is pretty clear for this problem. Separating the lifecycle for fetching and rendering data significantly simplifies the problem and gets rid of an unnecessary requirement.
The design of design
I highlighted in italics the suggestions or questions from Pólya that apply to solving this practical problem.
Understanding the problem
What is the unknown?
A clear problem definition.
- What are our requirements or constraints? What should the progress tracker achieve?
- What is a UI that satisfies these requirements?
- Do we have the data to create the UI?
Can you visualize the problem? Can you draw a figure? Do not concern yourself with the implementation for the moment.
Yes. Here are a few low-fidelity mocks.
What are the conditions?
Here are some rough requirement ideas:
- transparency: Indicate the operation is still in progress and not stuck.
- transparency: Give an understanding of what is being performed so that users gain confidence the right operations are happening.
- education: Explain why a resource is being mutated, since one of Copilot’s goals is to also explain the basic building blocks of AWS.
Are the conditions sufficient to determine the unknown?
No. The conditions are too vague and insufficient. For example, it’s not clear whether “operation” refers to the Copilot deployment or a CloudFormation resource. There is also no condition around troubleshooting errors. Here is a second draft that’s more precise:
- transparency: Indicate that the operation is not stuck and is still in progress. Copilot should signal that it is still tracking the CloudFormation stack. CloudFormation resources that take a long time to create should provide additional information.
- transparency: Give an idea of how much work remains. Users should be able to decide whether to keep paying attention to the CLI or focus on another task.
- transparency: List AWS resources. Users need to see the created resources so that they know the CLI is not allocating extra costly resources.
- education: Explain AWS resources. Our progress tracker needs to educate users on low-level AWS primitives so that they get familiar with AWS.
- transparency: Clear error messages for troubleshooting. Error messages in case of a failure should be surfaced and be as specific as possible.
- scalability: Scales with new operations. We should be able to provide the same detailed UX as new resources or workload types are added to the CLI.
What is the data? Is it possible to satisfy the condition?
Partially.
We can get the data for the transparency conditions with the following APIs:
- Call DescribeChangeSet to get resources that will be updated. Sample output.
- Call DescribeStackEvents to get the latest status and error messages for a resource. Sample output.
- Call DescribeServices to get ECS deployment information and service events. Sample output.
For the education and scalability conditions, an option is to add comments in the CloudFormation template to associate the resource with a human-friendly description.
Service:
  # An ECS service to run and maintain your tasks in the environment cluster
This would provide the data for education and make sure that we meet scalability by adding a comment for new resources introduced in the templates.
However, we now have the following new unknowns:
- transparency: How can we keep the data and UI in sync while maintaining a modular codebase?
- education: How can we parse the template so that we know a comment is associated with a resource?
Devising a plan
Now that we have a better understanding of the problem, we need to find the connection between the data and the unknown that satisfies the constraints.
Look at the unknown! Think of familiar solutions having the same or similar unknown.
Here are related solutions to our transparency unknown:
- AWS Amplify.
- AWS CDK.
- Docker Compose: ECS integration.
- Ink is a Node.js library that provides React-like components for CLIs.
- The observer and model-view-controller patterns provide ways of keeping data and UI in sync while maintaining a modular codebase.
Could you use it? Could you use its result? its method?
Yes!
From the CDK, we can use the method for collecting CloudFormation stack events triggered from a ChangeSet: stack-activity-monitor.ts. The idea is to get the CreationTime of a ChangeSet and only stream events after the timestamp.
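To make the timestamp idea concrete, here is a minimal sketch in Go. It is not the CDK's implementation: StackEvent, eventsSince, and the at helper are hypothetical names introduced for illustration, and the only assumption about the real API is that DescribeStackEvents lists events newest first.

```go
package main

import (
	"fmt"
	"time"
)

// StackEvent is a simplified, hypothetical stand-in for a CloudFormation stack event.
type StackEvent struct {
	LogicalID string
	Status    string
	Timestamp time.Time
}

// at is a small helper for building example timestamps from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// eventsSince returns the events that happened at or after start, oldest first.
// It assumes the input is ordered newest first, as DescribeStackEvents lists events.
func eventsSince(events []StackEvent, start time.Time) []StackEvent {
	var out []StackEvent
	for i := len(events) - 1; i >= 0; i-- {
		if events[i].Timestamp.Before(start) {
			continue // belongs to an earlier deployment
		}
		out = append(out, events[i])
	}
	return out
}

func main() {
	changeSetCreated := at(100) // stands in for the ChangeSet's CreationTime
	events := []StackEvent{     // newest first
		{"Service", "CREATE_COMPLETE", at(160)},
		{"Service", "CREATE_IN_PROGRESS", at(110)},
		{"Cluster", "UPDATE_COMPLETE", at(40)}, // from a previous deployment
	}
	for _, ev := range eventsSince(events, changeSetCreated) {
		fmt.Printf("%s %s %s\n", ev.Timestamp.Format(time.RFC3339), ev.LogicalID, ev.Status)
	}
}
```

Filtering on the client side keeps the tracker from replaying events that belong to earlier deployments of the same stack.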
From Docker Compose, we can use both a result and a method. Compose displays a timestamp next to each operation showing how long the operation takes. This would help us indicate that Copilot is not stuck and is still watching the CloudFormation stack.
We can also look at tty.go to see how Compose keeps the data and UI in sync. Compose buffers events and then writes them to the terminal every 100ms.
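The buffer-then-flush approach can be sketched in Go as follows. This is an illustration of the idea rather than Compose's actual code: renderLoop and its flush callback are hypothetical names, and the 100ms interval is taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// renderLoop buffers incoming events and hands them to flush on every tick.
// Events still buffered when the events channel closes are flushed one last time.
func renderLoop(events <-chan string, ticks <-chan time.Time, flush func(batch []string)) {
	var buf []string
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				if len(buf) > 0 {
					flush(buf)
				}
				return
			}
			buf = append(buf, ev)
		case <-ticks:
			if len(buf) > 0 {
				flush(buf)
				buf = nil // start a fresh batch
			}
		}
	}
}

func main() {
	events := make(chan string)
	go func() {
		defer close(events)
		for _, ev := range []string{"Cluster CREATE_IN_PROGRESS", "Cluster CREATE_COMPLETE", "Service CREATE_IN_PROGRESS"} {
			events <- ev
			time.Sleep(60 * time.Millisecond) // simulate a slow data source
		}
	}()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	renderLoop(events, ticker.C, func(batch []string) {
		fmt.Printf("rendering %d event(s): %v\n", len(batch), batch)
	})
}
```

Because rendering only happens on a tick, the data fetchers never have to know how or when the terminal is redrawn.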
From Ink, we can use the idea of components: a tableComponent to render the task counts of an ECS deployment, or a textComponent for displaying basic text info. We can build more sophisticated components from these building blocks.
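A rough Go sketch of what such components could look like is below. The Renderer interface, the group type, and the renderToString helper are assumptions made for this example. Only the textComponent and tableComponent names come from the text above, and their fields are invented here.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
	"text/tabwriter"
)

// Renderer is a hypothetical interface shared by all components.
type Renderer interface {
	Render(w io.Writer) error
}

// textComponent prints a single line of text.
type textComponent struct{ Text string }

func (c textComponent) Render(w io.Writer) error {
	_, err := fmt.Fprintln(w, c.Text)
	return err
}

// tableComponent prints a header and rows in aligned columns.
type tableComponent struct {
	Header []string
	Rows   [][]string
}

func (c tableComponent) Render(w io.Writer) error {
	tw := tabwriter.NewWriter(w, 0, 8, 2, ' ', 0)
	fmt.Fprintln(tw, strings.Join(c.Header, "\t"))
	for _, row := range c.Rows {
		fmt.Fprintln(tw, strings.Join(row, "\t"))
	}
	return tw.Flush()
}

// group composes components by rendering them one after the other.
type group []Renderer

func (g group) Render(w io.Writer) error {
	for _, r := range g {
		if err := r.Render(w); err != nil {
			return err
		}
	}
	return nil
}

// renderToString is a convenience helper for inspecting a component's output.
func renderToString(r Renderer) string {
	var sb strings.Builder
	_ = r.Render(&sb) // writes to a strings.Builder do not fail
	return sb.String()
}

func main() {
	deployment := group{
		textComponent{Text: "- An ECS service to run and maintain your tasks"},
		tableComponent{
			Header: []string{"Desired", "Running", "Pending"},
			Rows:   [][]string{{"2", "1", "1"}},
		},
	}
	if err := deployment.Render(os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Since group is itself a Renderer, larger components can be nested out of smaller ones without the caller needing to know the difference.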
We can adapt the observer pattern to use Go channels so that the UI can receive events from data fetchers.
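As an illustration, the observer pattern over Go channels could look something like this. Event, Publisher, and its methods are hypothetical names for this sketch, not Copilot's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical update emitted by a data fetcher.
type Event struct {
	Resource string
	Status   string
}

// Publisher fans events out to every subscribed channel.
type Publisher struct {
	mu   sync.Mutex
	subs []chan Event
}

// Subscribe registers an observer and returns the channel it should read from.
func (p *Publisher) Subscribe() <-chan Event {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan Event, 16) // buffered so a briefly busy observer does not stall the fetcher
	p.subs = append(p.subs, ch)
	return ch
}

// Publish sends ev to every subscriber. It blocks if a subscriber's buffer is full.
func (p *Publisher) Publish(ev Event) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, ch := range p.subs {
		ch <- ev
	}
}

// Done closes every subscriber channel to signal that no more events will arrive.
func (p *Publisher) Done() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, ch := range p.subs {
		close(ch)
	}
	p.subs = nil
}

func main() {
	var p Publisher
	updates := p.Subscribe()

	// The data fetcher runs on its own goroutine and knows nothing about the UI.
	go func() {
		p.Publish(Event{Resource: "Cluster", Status: "CREATE_IN_PROGRESS"})
		p.Publish(Event{Resource: "Cluster", Status: "CREATE_COMPLETE"})
		p.Done()
	}()

	// The UI simply ranges over its channel until the fetcher is done.
	for ev := range updates {
		fmt.Printf("%s: %s\n", ev.Resource, ev.Status)
	}
}
```

The channel replaces the callback that a classic observer would register, which keeps fetchers and renderers decoupled.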
Look at the unknown! Can you think of an analogous problem?
Yes, we dealt with an analogous problem to the education unknown while implementing the feature for Additional AWS Resources (addons).
These problems are similar because both of them need to parse CloudFormation YAML templates to read specific fields. We observed that the Go YAML library provides a type, yaml.Node, that stores the comments associated with a node in the template. We can use the node.FootComment field to retrieve the description of the resource.
Side note: during implementation, we discovered that CloudFormation provides a Metadata attribute for resources. This provided a more robust and generic solution than comments for adding arbitrary information to our resources:
Service:
  Metadata:
    'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
Carrying out the plan
At this point we are pretty confident that we can start implementing.
Takeaways
While working on this feature, a few heuristics stood out to me as being highly effective for solving a practical problem:
- Decompose and Recombine, or divide and conquer. Trying to tackle all the unknowns at once can be overwhelming. Instead consider each unknown one at a time. Find out how to connect the data to the unknown while meeting the requirements, and then move bottom up until all requirements are met.
- Analogy, and Do you know a related problem? Exploring solutions that faced a similar problem, or shared a similar aspect, was hugely beneficial. Standing on the shoulders of other solutions allowed us to make incremental improvements. We ended up incorporating both the results of related solutions and their methods (implementations) in our solution.
- What is the unknown? Practical problems will have multiple unknowns. Write down each unknown every step of the way, and focus your attention on how to link the data to the unknown.