Tech Talk: Building a serverless architecture in AWS with lambda function interaction (Part 2)

Images: https://www.agilytic.be/blog/tech-talk-serverless-architecture-aws-lambda-function-interaction

In the first part of this article series, we discussed how the Lambda function serves as the foundation of serverless architectures in the cloud. Now, we’ll explore how interactions between Lambda functions and queues are a natural next step toward building a robust serverless architecture.

Simple Queue Service (SQS) in AWS

We ended the first part of this article faced with a bottleneck: our events couldn’t be handled by the simple serverless architecture in place. The simplest way to fix the problem of the missing messages is to leverage a message queue service. AWS calls this service Simple Queue Service, or SQS for short. It solves the bottleneck by sending, receiving, and storing messages between software components without losing messages and without requiring the receiving service to be available. If the second Lambda function in the previous example is busy with the first 1,000 jobs it has received, SQS patiently holds the next 4,000 jobs until ‘ListenerLambda’ can ingest them.

We can send a message to SQS programmatically from the Lambda function using the AWS SDK:

import boto3

client = boto3.client('sqs')

client.send_message(QueueUrl='string', MessageBody='{"file": "path"}')

where ‘QueueUrl’ is the URL of the particular queue we want to use (again, check the documentation for the multitude of other options we are leaving out in this example). For the second Lambda function to listen to this queue, we can, once again, go to the console screen shown in Figure 4, click on ‘Add trigger,’ and select the SQS queue of interest.
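Putting the sending side together, a small helper along these lines keeps the call testable (the message schema, helper name, and queue URL below are placeholders of our own choosing, not part of any AWS API):

```python
import json

def send_file_event(sqs_client, queue_url, path):
    """Send one file-processing job to the queue and return its MessageId.

    `sqs_client` is typically boto3.client("sqs"); taking it as a parameter
    keeps the helper easy to exercise with a stub client.
    """
    response = sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"file": path}),  # hypothetical message schema
    )
    return response["MessageId"]

# Usage against real AWS (queue URL below is a placeholder; copy yours
# from the SQS console):
#   client = boto3.client("sqs")
#   send_file_event(client,
#                   "https://sqs.eu-west-1.amazonaws.com/123456789012/my-jobs-queue",
#                   "data/input-001.csv")
```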

Simple as that, we have solved the bottleneck problem. When the event is processed successfully by the listener function, it is automatically removed from the queue so it is not sent again. If, for any reason, the listener Lambda does not finish processing the event (for example, it raises an error), the queue will retry sending it later.
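On the receiving side, an SQS trigger hands the Lambda a batch of messages under ‘Records’. A minimal sketch of the listener’s handler could look like this (the body schema and `process_file` placeholder are our own assumptions; a real SQS event carries many more fields):

```python
import json

def process_file(path):
    """Placeholder for the real processing logic; here it just echoes the path."""
    print(f"processing {path}")
    return path

def handler(event, context):
    """Listener Lambda entry point for an SQS trigger (sketch).

    SQS delivers messages in a batch under event['Records']; if this
    function raises, the batch returns to the queue and is retried later.
    """
    processed = []
    for record in event["Records"]:
        body = json.loads(record["body"])  # the MessageBody we sent, e.g. {"file": "path"}
        processed.append(process_file(body["file"]))
    return {"processed": processed}
```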

This is great… or is it? Things can get more complicated in many cases. Imagine ‘ListenerLambda’ can only successfully process 99% of the events it receives from the first Lambda, and that an event that has failed once will always fail. In this situation, the message gets returned to the queue, which sends it to ‘ListenerLambda,’ which fails, so the message gets returned to the queue… you see it, right? To avoid this loop of despair, SQS comes equipped with a retention period that automatically erases messages that have not been processed within that period (Figure 5).

Figure 5: Configuration of SQS

Imagine you have estimated that all your jobs finish within two days if they run without error. Then you can assume that any message still sitting in the queue after two days corresponds to a failed job and can be safely erased. So you set your retention period to two days. This again does the job, but it is not good practice: how can you debug and improve your code if you have erased all the information about the events that caused the error in the first place?
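The retention period can also be set programmatically: SQS expects ‘MessageRetentionPeriod’ in seconds, between 60 seconds and 14 days. A small conversion helper makes the two-day setting explicit (the helper name and the queue URL in the usage comment are placeholders of our own):

```python
def retention_seconds(days):
    """Convert days to the seconds value SQS expects for
    MessageRetentionPeriod (must be between 60 seconds and 14 days)."""
    seconds = days * 24 * 60 * 60
    if not 60 <= seconds <= 14 * 24 * 60 * 60:
        raise ValueError("SQS retention must be between 60 seconds and 14 days")
    return seconds

# Usage (queue URL is a placeholder):
#   sqs = boto3.client("sqs")
#   sqs.set_queue_attributes(
#       QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/my-jobs-queue",
#       Attributes={"MessageRetentionPeriod": str(retention_seconds(2))},
#   )
```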

Again, no worries: we are not the first to think about this. Scrolling down in the AWS console screen, you will find the option to set a Dead-Letter Queue for your SQS queue. This is just another SQS queue to which problematic events are redirected. By setting a long retention period on the Dead-Letter Queue, you can inspect failed events at any moment without worrying that events your code cannot handle are being continuously re-sent to your Lambda function.
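The same redirection can be configured in code through the queue’s ‘RedrivePolicy’ attribute: after a message has been received (and not deleted) a given number of times, SQS moves it to the dead-letter queue instead of redelivering it forever. A sketch, with placeholder queue URL and ARN:

```python
import json

def redrive_policy(dlq_arn, max_receives=3):
    """Build the RedrivePolicy attribute for the main queue: after
    max_receives failed receives, SQS moves the message to the DLQ."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receives,
    })

# Usage (URL and ARN below are placeholders):
#   sqs = boto3.client("sqs")
#   sqs.set_queue_attributes(
#       QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/my-jobs-queue",
#       Attributes={"RedrivePolicy": redrive_policy(
#           "arn:aws:sqs:eu-west-1:123456789012:my-jobs-dlq")},
#   )
```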

Congratulations! With this, we managed to make two Lambda functions robustly talk to each other, handling bottlenecks and problematic events with SQS and Dead-Letter Queues.

If you look back at your use case and find that you have tens of Lambda functions ready for deployment, with non-sequential logic connecting them, and that you have to handle error catching and retries, you may think the tools at your disposal are limited. That is true, but only because we haven’t introduced AWS Step Functions yet.

AWS Step Functions

AWS Step Functions is an orchestrator for serverless functions. It lets you visually coordinate Lambdas and other AWS serverless services quickly. The key term here is ‘visually,’ since its graphical interface clarifies what is happening at every moment, be it one Lambda function sending an event to the next, catching an error and retrying, or implementing if-then-else logic, to give a few examples. Figure 6 presents a screenshot from the configuration page, where you can find several example templates to get started and get to know the possibilities that this service allows.

Figure 6: Lambda function orchestration with AWS Step Functions

It does all this while handling the events between serverless functions automatically; no need for all that yadda-yadda you just read! As such, it is a very powerful tool for orchestrating your Lambdas. Why not always use it, then, to simplify our code and save time reading this article? The reason is that this solution comes at a price, so you need to balance how much you want to pay against how much managing you are willing to handle yourself. For moderately complicated workflows, it is definitely worth it.
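Under the hood, a Step Functions workflow is defined in the Amazon States Language, a JSON dialect. A hypothetical two-step machine chaining our Lambdas, with a retry and an error catch, could look something like this (region, account ID, and function names are placeholders):

```json
{
  "Comment": "Hypothetical two-step workflow with retry and error handling",
  "StartAt": "ProcessFile",
  "States": {
    "ProcessFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ProcessFile",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }
      ],
      "Next": "ListenerLambda"
    },
    "ListenerLambda": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ListenerLambda",
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:NotifyFailure",
      "End": true
    }
  }
}
```

The Retry and Catch fields replace the manual retry and dead-letter plumbing we built by hand in the SQS section.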

Minimal permissions with AWS

Figure 7: Allowing a Lambda to interact with SQS

We end this article with the most important topic of them all: security. Following tutorials and templates online is a great way to learn how to deploy infrastructure, but they usually favor rapid deployment of a pipeline over considerations such as security. After all, it is a tutorial, and you are going to destroy all the resources created (the cloud is a dangerous place: always destroy the resources you will not be using). In our context, this means that to make a Lambda function interact with an SQS queue, you might be tempted to go to the IAM console in AWS and assign a built-in security policy to the Lambda function’s role, often the full-access one, as in Figure 7.

After assigning this policy to the Lambda role, you can find its content in JSON format in the IAM console, where it reads something like:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "sqs:*"
           ],
           "Resource": "*"
       }
   ]
}

Figure 8: Restricting the Lambda’s SQS permissions with the SQS queue execution role

The security problem with this kind of policy resides in the wildcards ‘*’. In particular, "sqs:*" allows the Lambda to interact with the SQS queue without restriction: it can even delete the queue or create more messages for itself to consume, entering an infinite loop! Worse still, the wildcard in "Resource": "*" lets the Lambda function interact with any queue, so there is a real possibility of invoking the serverless function with events destined for a different function or, even worse, a completely independent project.

You can alleviate the first of these issues by choosing the SQS Queue Execution role instead, as seen in Figure 8. In this case, the set of actions the Lambda can perform on SQS is limited. The JSON summary of the allowed actions now reads:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "sqs:ReceiveMessage",
               "sqs:DeleteMessage",
               "sqs:GetQueueAttributes"
           ],
           "Resource": "*"
       }
   ]
}

This allows receiving messages, deleting messages, and getting queue attributes (no more deleting the queue or creating new messages by accident). The only remaining weak point is the "Resource": "*" term that allows the Lambda to interact with any queue. We can fix this by limiting the policy to a particular resource: the SQS queue that the Lambda is supposed to read messages from. To do this, you need the ARN of the queue, which you can find in the AWS console, as shown in Figure 9.

Figure 9: Getting the ARN of an SQS queue

By writing this ARN instead of the wildcard ‘*’ in the Resource field of the policy, you ensure that the Lambda function can only receive messages, delete messages, and get queue attributes from that very queue.

A final remark comes from an obvious question: what if I want to read messages from one queue and send messages to a different one, for example, to invoke a third Lambda function? Wouldn’t limiting the resource to the receiving queue be counterproductive then? The answer is straightforward: you can attach more policies to your Lambda role. For example, you can add permission to send messages to a different queue, along the lines of:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "sqs:ChangeMessageVisibility",
               "sqs:DeleteMessage",
               "sqs:GetQueueAttributes",
               "sqs:ReceiveMessage"
           ],
           "Resource": "ARN_OF_QUEUE_TO_READ_MESSAGES_FROM"
       },
       {
           "Effect": "Allow",
           "Action": [
               "sqs:SendMessage"
           ],
           "Resource": "ARN_OF_QUEUE_TO_SEND_MESSAGES_TO"
       }
    ]
}

Always limit your policy’s allowed actions and resources to the minimal set you need. You will avoid running into trouble later when scaling your application to production.

Conclusion

We have seen how to build robust and efficient serverless architectures in the cloud using Lambda functions, SQS queues, Step Functions, and a restrictive permissions setup.

For workloads that are spiky or have long idle times, this offers a cost-effective and easy-to-manage alternative to deploying monolithic code on virtual machines.

In this article, we have focused on the AWS environment for illustration, but these comments translate directly to other cloud service providers like MS Azure (see article: description of a use-case) or Google Cloud.
