Saturday 17 February 2024

What is AWS CloudWatch? Metrics | Alarms | Logs | Custom Metrics

In this blog we discuss Amazon CloudWatch: what it is, why monitoring, logging, and alerting matter, and how CloudWatch metrics, alarms, logs, and custom metrics work.

Imagine being a startup hosting a popular social media website. In just a few months the site takes off, we expand into other regions, and in no time it spreads across the globe. We are chilling on the beach as if nothing could ever go wrong, and then one day half the data centres crash and users start experiencing problems with the application. We have literally no idea what went wrong, because we completely overlooked some of the most important parts of a global deployment. Then we realize: what if we had a proper logging mechanism? What if we could monitor what is going on and have an alerting mechanism in place that notified us in time to act? What if we had a service that could collect data points, monitor them, act on them automatically, and let us analyse the data to avoid future disasters? That is exactly what we are going to discuss today.

Hello everyone, this is Waqas Bin Khursheed, and yes, we are going to start off with AWS CloudWatch. It is going to be a lengthy session, so read it carefully.

In today’s blog we will obviously be talking about CloudWatch, but along with that we will try to understand the need for monitoring, logging, and alerting, check out some of the features of CloudWatch, see how CloudWatch actually works, and go through some hands-on examples. Before moving on with CloudWatch, let’s understand the importance of having a proper mechanism in place that can help you in a time of crisis. When we design an application for an audience or demographic, we are never sure how it is going to reach millions of users, so we just try to expand along the way. As the impact grows, we might end up in situations that cause problems for our users, and that is not always related to the thinking behind the design. Sometimes it is short-sightedness, sometimes a lack of budget, and sometimes we simply don’t have a proper understanding of how things actually work. When things are working fine we never question the design and don’t necessarily think about the problems we might face, and that is where the trouble begins.

In a microservice architecture you are not dealing with just one, two, or ten APIs; you might be working with a thousand APIs working in tandem with each other. There could be failures related to an API or a service, authentication can fail, or CPU utilization and memory consumption can cause the application or server to crash. When you don’t have additional services giving your team the information it needs to debug issues, you end up with problems that affect you and your users in a very bad way. Previously we have learnt about features that AWS provides, such as auto scaling, resource scheduling, and batching, but the decision on how much to expand or scale depends on the state of the environment at a given point in time. For that we need a service that can collect data points and logs, help us monitor the current state (or the state over a period of time), help us create action items to mitigate issues, and let us analyse the data we have in order to avoid such issues in the future. That is where CloudWatch comes into the picture, so now is the right time to talk about it. When you think of CloudWatch, remember one thing very clearly: it is a service you could ignore without any impact on the overall performance of your actual application, but if it is not used properly and effectively, you will surely end up with issues that are hard to debug and resolve.

That is why it is called CloudWatch. Along with a feature set where you can send logs for the services you are consuming using log streams, and use its dashboards to create reports on the performance of your services, it can also help you analyse data points to understand where your application could break. As AWS rightly puts it, CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, and visualizes it using automated dashboards, so you get a unified view of your AWS resources, applications, and services that run in AWS and on-premises. There are terms here that might confuse you a bit; when that happens, try to isolate exactly what you did not understand. I am sure you know what logs are, but let me say something about metrics. Think about this: if you invest a hundred dollars and get 105 back, you have a profit of five dollars, and if you get 95 back, you have a loss of five dollars, because you invested a hundred, isn’t it? In AWS terms, if you have deployed a service on a t2.micro and the CPU utilization goes above 95 percent, and after some changes it drops to 75 percent, there is a considerable boost in performance; that is a performance metric. The way you measure and take a quantitative approach to a data point over a given period of time gives you a metric, based on which you can analyse how your services are performing. AWS CloudWatch is built on four pillars that provide visibility into your cloud resources and applications, and I might repeat them a few times, so please bear with me: collect, monitor, act, and analyse.

So, the basic idea of using CloudWatch is to send logs from the resources you are working with, which may be service logs, application logs, load balancer logs, default service logs, instance logs, or any other form of logs you wish to send using the CloudWatch agent; with resources like EC2, Lambda, and S3 this is the most common use. When it comes to monitoring, you can use the CloudWatch dashboard to create visualizations and alerts for changes in your data points, and that works cross-region as well. The third pillar, act, is the most interesting part, because based on the insight you have, you can create events that trigger resources to meet the demands of your application, such as EC2 or container auto scaling, using CloudWatch Events. And with CloudWatch you can analyse data over a short or long period of time, with granularity down to one second, in real time. These four pillars help you with application monitoring, system-wide visibility, resource optimization, and unified operational health, and that is just the tip of the iceberg; CloudWatch is much more when you use it effectively. Don’t worry about some of these terms, we will talk about them shortly. No matter what kind of application you are working with or which region it belongs to, you can create log streams and send application and resource logs to CloudWatch so that you can analyse the metrics and logs and act quickly to resolve issues. Applications like these mostly deal with real-time data; they are critical, must run all the time, and cannot afford downtime.
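As a hedged illustration of the "collect" pillar, here is a minimal boto3 sketch that writes an application log line into a CloudWatch Logs log stream. The log group and stream names are made up for this example; in practice the CloudWatch agent usually creates and writes these for you.

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Hypothetical names for this example; the CloudWatch agent normally manages these.
LOG_GROUP = "my-social-app/application"
LOG_STREAM = "web-server-1"

# Create the log group and stream if they do not exist yet.
try:
    logs.create_log_group(logGroupName=LOG_GROUP)
except logs.exceptions.ResourceAlreadyExistsException:
    pass
try:
    logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

# Push a single log event; timestamps are milliseconds since the epoch.
logs.put_log_events(
    logGroupName=LOG_GROUP,
    logStreamName=LOG_STREAM,
    logEvents=[{"timestamp": int(time.time() * 1000),
                "message": "User signup completed in 120 ms"}],
)
```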

So, business-critical applications need data and real-time analysis to keep downtime to a minimum, and as a solutions architect it is your job to make sure that is in place. You already know what application monitoring is, so let’s look at system-wide visibility: you may have applications hosted in the AWS cloud or on-premises, and you can’t ignore a few resources or services just because you don’t like them. If you are working on a multi-tier application, you cannot ignore the database just because you are only storing data there; that is not going to work. With CloudWatch you can monitor and collect data about all the tiers of your application, so you don’t miss out on anything. As for resource optimization, to auto scale instances when there is a peak CPU utilization of over 95 percent, you trigger events that increase the number of instances and reduce them again when CPU utilization drops. These things help the system maintain unified operational health by making sure you have alerts and notifications in place, based on the events you wish to trigger.

And you know what? If you have a trigger and you have created an alert, you can send a notification to an SNS topic and get notified on your phone or by email as well. That is the overall picture, so as I told you before, and I repeat once again: when you think of implementing CloudWatch, think of the four pillars, collect, monitor, act, and analyse. Now let’s move ahead with your favourite part, how AWS CloudWatch actually works. The first thing to remind yourself is that it is basically a metrics repository. Many AWS services publish metrics to it, and if you want, you can also create your own custom metrics in CloudWatch using the PutMetricData API, so remember that.
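Since PutMetricData came up, here is a small, hedged boto3 sketch of publishing a custom metric. The namespace, metric name, and dimension are invented for this example.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish one data point for a hypothetical custom metric.
cloudwatch.put_metric_data(
    Namespace="MySocialApp/Backend",          # custom namespace, must not start with "AWS/"
    MetricData=[{
        "MetricName": "SignupLatencyMs",      # hypothetical metric name
        "Dimensions": [{"Name": "Environment", "Value": "production"}],
        "Value": 120.0,
        "Unit": "Milliseconds",
    }],
)
```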

The basic idea here is that if I want to judge the current state of an instance or resource, I need a benchmark, isn’t it? For example, if I tell you that when the CPU utilization of an instance goes above 85 percent I want to scale out a new resource, what would you say? My benchmark is CPU utilization and my threshold value is 85 percent: 85 percent becomes the threshold, and the benchmark on which I am judging the resource state is CPU utilization. So remember that. If I have a metric called CPUUtilization and data is flowing into that metric to keep track of the current state of the instance, I can create an alarm on that metric and judge whether it has reached 85 percent or not, because that is my threshold value. Using the CPUUtilization metric I can set a threshold and create alarm states based on that particular threshold on that particular metric. One more very important thing to remember: the metrics you create are regionally scoped, but you can use CloudWatch cross-region functionality to bring them together in one place. And I can write a condition that, if the threshold is breached, uses the auto scaling policy to spin up a new instance.
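To make that concrete, here is a hedged boto3 sketch of creating such an alarm on the EC2 CPUUtilization metric with an 85 percent threshold. The instance ID, SNS topic, and scaling policy ARNs are placeholders, not real resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-above-85",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,                 # evaluate in 5-minute windows
    EvaluationPeriods=3,        # look at the 3 most recent periods
    DatapointsToAlarm=3,        # all 3 must breach before the alarm fires
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        # Placeholder ARNs for an SNS topic and an Auto Scaling policy.
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",
        "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:policy-id:autoScalingGroupName/my-asg:policyName/scale-out",
    ],
)

# Later, you can check which of the three states the alarm is in:
state = cloudwatch.describe_alarms(AlarmNames=["high-cpu-above-85"])
print(state["MetricAlarms"][0]["StateValue"])   # OK, ALARM, or INSUFFICIENT_DATA
```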

I can further enhance it to send notifications to users through an SNS topic, or perform any operation we need, and that is what we see here: we have the services connected to CloudWatch for metric and log push, we have the CloudWatch alarm which uses the metrics to scale instances through the auto scaling group, and SNS helps send out the messages or notifications. Hopefully you can relate to this in a way that helps you visualize how things actually work. This is basically the overall picture: you can see the collect part, the monitor part, where we act upon a particular event, and how we try to analyse the resources we have. That is the overall picture, nothing more, but we still have a few more concepts to cover, because you know, you learn once and hopefully never forget, isn’t it? Now let’s understand more about alarms and events, and let’s take another example, because you all love real-world examples: the biggest thing right now for gamers is getting their hands on a PlayStation 5.

This is a hypothetical real-world scenario, so let us assume it is the case. My friend wants one as well, so what I did was write a Lambda function to fetch the records of the currently available stock (this is just an example, don’t actually do this, you will get banned). I created a metric that keeps count of HTTP 200 status codes, and I created an alarm on it with the condition that if the count reaches the threshold of five or more, a notification goes to the user. Whenever the Lambda function is called, it fetches the stock count from the website, and once the HTTP status code is 200 it sends the metric data and carries on with the cycle. What if I wanted this to be scheduled? I added a CloudWatch event, a time-based CloudWatch event that triggers the Lambda function every one minute. With CloudWatch Events I can create a scheduled event and point it at a target, which here is my Lambda function. That way I can just sit back and get notified whenever the PlayStation 5 is in stock, and so can my friend. As you can see, CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. When the time-based CloudWatch event fires every minute it invokes the target, and the alarm performs one or more actions based on the value of the metric or expression relative to the threshold over a number of time periods; the threshold value here is five, so whenever it is reached, the notification is sent out.
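Here is a hedged boto3 sketch of wiring up that schedule: a CloudWatch Events (EventBridge) rule that fires every minute and targets a Lambda function. The function name and ARN are placeholders for this example.

```python
import boto3

events = boto3.client("events", region_name="us-east-1")
lambda_client = boto3.client("lambda", region_name="us-east-1")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check-ps5-stock"  # placeholder

# 1. Create a time-based rule that fires every minute.
rule = events.put_rule(
    Name="check-ps5-stock-every-minute",
    ScheduleExpression="rate(1 minute)",
    State="ENABLED",
)

# 2. Allow CloudWatch Events to invoke the Lambda function.
lambda_client.add_permission(
    FunctionName="check-ps5-stock",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# 3. Point the rule at the Lambda function as its target.
events.put_targets(
    Rule="check-ps5-stock-every-minute",
    Targets=[{"Id": "ps5-checker", "Arn": FUNCTION_ARN}],
)
```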

CloudWatch alarms, as we have discussed before, can automatically initiate actions on your behalf based on your metrics. An alarm watches a single metric and keeps track of the change in the behaviour of your resources over a period of time, such as five minutes, ten minutes, one hour, four hours, or one day, and metric data is retained for up to 15 months, so it can keep track of your metrics for that long. When we create an alarm, we need to think about three settings. The first is the period, a single frame of time (in seconds or minutes) over which the metric is evaluated. The second is the evaluation period, the number of most recent periods used to determine the state of the alarm. The third is datapoints to alarm, the number of data points within the evaluation period that must breach the threshold for the alarm to go into the ALARM state. Don’t worry, I’ll explain this with an example, so just keep these three settings in mind. Let’s take this example and understand how to read it: suppose the evaluation period is three and the datapoints to alarm is also three, which means that if three data points breach the threshold, the alarm goes into the ALARM state. In the chart, the blue line is your threshold value, and the other line is the actual value, the current state of the resource.

So, it starts at 1, takes a peak, and crosses the threshold somewhere between 2 and 3, isn’t it? Once it has crossed, it reaches the fourth unit, but the threshold we set is 3 and it has already been breached, so that is where the datapoints to alarm begin. From that point we count how many data points have breached: this is one, this is the second, and this is the third, so clearly we have three breaching data points, and that is when the alarm is triggered. As you can see, after three periods over the threshold an action is invoked, because the value has crossed the threshold for three data points across three time periods. After those three points the value drops below the threshold, and although it rises again between 6 and 7 it does not breach the threshold; it peaks at 5.5 but quickly drops back to 2 and then to 1 or 0, so that is only one period over the threshold and no action is invoked. In other words, if you have a threshold value of three, you are saying: let the resource cross three units, I don’t have a problem with that, but only if it stays above three for three consecutive periods, one, two, three consecutive data points, do I consider it a breach, count those as the datapoints to alarm, and take action. Imagine you are playing Call of Duty: when you get hit and take cover, your health regenerates, isn’t it?

Suppose the game is programmed so that if you get hit three times consecutively you die, otherwise you recover. If you get hit once and go into cover, you recover to full health, a hundred percent, but if you run out of luck and get hit three times in a row, you die, isn’t it? That is when the action is invoked, and that is what the datapoints to alarm relate to. That is how the alarm works, and when you get a chart like this and want to analyse it, you just have to consider the threshold and the actual value, the evaluation period, and the datapoints to alarm, because those are the important things when you are reading or evaluating an alarm. Finally, an alarm has three states.

The first is ALARM, which happens when the metric or expression is outside the defined threshold; if you have a threshold of 5 units and the datapoints to alarm are matched, the alarm moves to the ALARM state, and that is when the action takes place. The second is OK, which means everything is fine and the metric is within the threshold. The third is INSUFFICIENT_DATA, which you get when data points are not available yet or there is not enough data to evaluate the alarm. So when you are working with alarms, please keep an eye on these states. These are the concepts we need to understand while using CloudWatch: namespaces, metrics, events, and alarms. We have already discussed three of them, so the one that is left is the namespace. A namespace is a container that uniquely identifies a group of metrics in CloudWatch; it lets you separate one group of metrics from another, and when you create a custom metric, you specify a namespace for the metric data.

There are namespaces used by a lot of AWS services as well; for example, EC2 uses AWS/EC2 and Spot instances use AWS/EC2Spot. If you want to create one for your application you can, for example by using the name of your service, and there is a naming convention you should follow, which you can check in the documentation. Just remember that, to avoid conflicts with AWS service namespaces, you should not specify a namespace that begins with "AWS/". The services that have namespaces scoped in AWS are all listed in the documentation, so before creating any metrics you can reuse an existing namespace if it fits, or create one for yourself. We have already covered most of the relevant services, but there are a lot of services in AWS, so don’t worry about that.
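As a hedged illustration of namespaces, here is a small boto3 sketch that lists the CPUUtilization metrics in the AWS/EC2 namespace and compares that with the hypothetical custom namespace used earlier. Only the first page of results is shown for brevity.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Metrics published by the EC2 service live in the AWS/EC2 namespace.
for metric in cloudwatch.list_metrics(Namespace="AWS/EC2",
                                      MetricName="CPUUtilization")["Metrics"]:
    print(metric["Namespace"], metric["MetricName"], metric["Dimensions"])

# Custom metrics live in whatever namespace you chose when publishing them,
# e.g. the hypothetical "MySocialApp/Backend" namespace from the earlier sketch.
for metric in cloudwatch.list_metrics(Namespace="MySocialApp/Backend")["Metrics"]:
    print(metric["Namespace"], metric["MetricName"])
```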

Now let’s check an example of how we manage auto scaling for EC2 with CloudWatch. The requirement for this design is simple: if the CPU utilization of an instance goes above 90 percent, we add another instance using the auto scaling group and its policy, and if it drops below 40 percent, we scale the instances down. Simple, isn’t it? When CPU utilization goes above 90 percent we scale up, and when it goes below 40 percent we scale down, so let’s start. We have the users who are using the services that are part of our AWS cloud infrastructure, and we have our load balancer connected to the instances, with an auto scaling group in place. In order to achieve metric-driven auto scaling, we use the EC2 CPUUtilization metric and create a CloudWatch alarm with a threshold of 90, and for that our instances have the CloudWatch agent installed to capture the information and send a real-time data stream to the metric for analysis. Once the metric value reaches 90 percent, the alarm triggers the auto scaling group, which in turn scales the instances up for us and scales them back down when they are not needed. That is how we can design our applications to make sure our users don’t run into failures. As you can see, let me repeat this once again: we have users who are using the application or service that is part of our AWS cloud infrastructure.

We have our load balancer connected to the instances, with an auto scaling group in place. To achieve metric-driven auto scaling, we use the EC2 CPUUtilization metric and create a CloudWatch alarm with a threshold of 90; the instances have the CloudWatch agent installed to capture the information and send a real-time data stream to the metric, and once the metric value reaches 90, the alarm triggers the auto scaling group, which scales the instances for us. The red dots you can see in the diagram are the data points being sent to CloudWatch, based on which we trigger the alarm. That is how we can design our applications to ensure our users don’t encounter failures. This is very simple but very effective; most of the time we don’t make use of CloudWatch alarms, logs, events, and metrics effectively, and so we don’t get the real fruit out of them, and that is what I wanted to help you understand here. There may be a few concepts that are still not clear to you, and there is a lot more we could discuss about CloudWatch, but it is not appropriate right now; if you still have any doubts or problems, practice again.
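Here is a hedged boto3 sketch of the wiring described above: two simple scaling policies on a hypothetical Auto Scaling group, plus the two CloudWatch alarms (CPU above 90 percent to scale out, below 40 percent to scale in) that invoke them. The group name and values are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

ASG_NAME = "web-asg"   # hypothetical Auto Scaling group

# Simple scaling policies: add one instance on scale-out, remove one on scale-in.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME, PolicyName="scale-out",
    PolicyType="SimpleScaling", AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1, Cooldown=300,
)
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME, PolicyName="scale-in",
    PolicyType="SimpleScaling", AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1, Cooldown=300,
)

def cpu_alarm(name, threshold, comparison, policy_arn):
    """Alarm on the group's average CPUUtilization and invoke a scaling policy."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/EC2", MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
        Statistic="Average", Period=300, EvaluationPeriods=2,
        Threshold=threshold, ComparisonOperator=comparison,
        AlarmActions=[policy_arn],
    )

cpu_alarm("asg-cpu-above-90", 90.0, "GreaterThanThreshold", scale_out["PolicyARN"])
cpu_alarm("asg-cpu-below-40", 40.0, "LessThanThreshold", scale_in["PolicyARN"])
```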

I think this was a really long blog on CloudWatch, and we have discussed a lot of the concepts. I have decided to make a separate blog for the CloudWatch demo, which I’ll be uploading shortly, so please make sure you check that out as well. This was a very interesting session and I really enjoyed it; if you did too, please submit feedback on what you liked and what you didn’t like in the blog. If you felt it was worth it, follow me on Instagram so that we can be friends, I would love that, and check out all the links that support the blog; it really helps the blog grow. I wish you all the success in life, stay safe and stay healthy, and I hope to see you next time, same place, with another blog on AWS. Until next time.

Friday 16 February 2024

What are the steps involved in a CloudFormation Solution?

 


Creating a solution using AWS CloudFormation involves several steps. Here's an overview of the typical process:

1. **Designing the Infrastructure**: Before you start creating a CloudFormation template, you need to design the architecture of your AWS infrastructure. Determine the resources you need, their relationships, configurations, and any dependencies.

2. **Writing the CloudFormation Template**: CloudFormation templates are written in JSON or YAML format. In this step, you define the resources, their properties, relationships, and any other configurations needed for your infrastructure. You can use the AWS CloudFormation Designer, a visual tool, or write the template manually.

3. **Validating the Template**: After writing the template, it's essential to validate it to ensure there are no syntax errors or logical issues. You can use the AWS Management Console, AWS CLI, or AWS SDKs to validate your CloudFormation template.

4. **Creating a Stack**: Once your template is validated, you can create a CloudFormation stack. A stack is a collection of AWS resources managed as a single unit. During stack creation, CloudFormation provisions the resources defined in your template.

5. **Monitoring Stack Creation**: While CloudFormation creates the stack, you can monitor the progress using the AWS Management Console, AWS CLI, or AWS SDKs. CloudFormation provides status updates and indicates whether the creation process is successful or if there are any errors.

6. **Updating the Stack (Optional)**: As your requirements change, you may need to update your CloudFormation stack. You can update the stack by modifying the template and applying the changes. CloudFormation handles updates by making the necessary changes to existing resources or creating new resources and deleting outdated ones.

7. **Deleting the Stack**: If you no longer need the resources provisioned by CloudFormation, you can delete the stack. Deleting the stack removes all the resources associated with it, helping you avoid unnecessary costs.

8. **Troubleshooting and Iteration**: Throughout the process, it's essential to troubleshoot any issues that arise. CloudFormation provides logs and events to help you diagnose problems. If necessary, iterate on your template and stack to address any issues or make improvements.

By following these steps, you can effectively create and manage infrastructure using AWS CloudFormation, enabling infrastructure as code and streamlining your deployment process.
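To make the flow above a little more concrete, here is a hedged boto3 sketch that validates a tiny template, creates a stack, waits for creation to finish, and shows where deletion would happen. The template and stack name are minimal examples, not a production design.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# A deliberately tiny example template (a single S3 bucket with a generated name).
TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DemoBucket:
    Type: AWS::S3::Bucket
"""

# Step 3: validate the template before using it.
cfn.validate_template(TemplateBody=TEMPLATE)

# Step 4: create the stack.
cfn.create_stack(StackName="demo-stack", TemplateBody=TEMPLATE)

# Step 5: monitor creation by waiting until the stack is complete.
cfn.get_waiter("stack_create_complete").wait(StackName="demo-stack")

# Step 7: delete the stack (and its resources) when no longer needed.
# cfn.delete_stack(StackName="demo-stack")
```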

What is geo-targeting in CloudFront?

 


Geo-targeting in Amazon CloudFront refers to the ability to customize content delivery based on the geographic location of the user making the request. CloudFront is a content delivery network (CDN) service provided by Amazon Web Services (AWS) that helps deliver content to users with low latency and high transfer speeds by caching content at edge locations around the world.

Geo-targeting allows you to tailor the content delivered to users based on their geographical location. Common use cases include:

1. Language localization: Serving content in the preferred language of the user's region.
2. Regional content variations: Delivering different versions of content based on the user's location, such as regional pricing or promotions.
3. Compliance: Ensuring that content complies with local regulations or restrictions.
4. Performance optimization: Directing users to the nearest edge location for faster content delivery.

CloudFront provides several methods for implementing geo-targeting, including:

1. **Lambda@Edge**: AWS Lambda functions that run at CloudFront edge locations can be used to inspect incoming requests and customize responses based on the geographic location of the user.
2. **Headers**: CloudFront can add headers to requests that contain information about the geographic location of the user, which can then be used by the origin server to customize responses.
3. **Cookies**: CloudFront can use cookies to store geographic location information and customize responses based on the values in these cookies.
4. **Query strings**: Similar to cookies, CloudFront can use query strings to pass geographic location information to the origin server for customization.

By leveraging geo-targeting capabilities in CloudFront, you can improve user experience, compliance, and performance for your global audience.
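As a hedged sketch of the header-based approach, here is a minimal Lambda@Edge-style handler in Python that reads the CloudFront-Viewer-Country header (which CloudFront can be configured to add) and rewrites the request URI to a country-specific path. The path layout and country list are invented for this example.

```python
def lambda_handler(event, context):
    """Origin-request Lambda@Edge sketch: route requests by viewer country."""
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # CloudFront adds this header when it is configured to forward it to the origin.
    country_header = headers.get("cloudfront-viewer-country")
    country = country_header[0]["value"] if country_header else "US"

    # Hypothetical layout: serve region-specific content from /eu/<country-code>/...
    if country in ("DE", "FR", "ES"):
        request["uri"] = "/eu/" + country.lower() + request["uri"]

    return request
```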

 What is auto scaling? 

Auto Scaling in cloud computing (in AWS often referred to as AWS Auto Scaling or EC2 Auto Scaling) refers to the ability of a system to automatically adjust its compute resources (such as virtual machines, containers, or serverless functions) based on changing demand. This feature allows applications to dynamically scale up or down in response to fluctuations in workload traffic or resource utilization, ensuring optimal performance, availability, and cost-efficiency.
 
The main objectives of auto scaling are: 
 
Maintaining Performance: Auto Scaling ensures that an application can handle varying levels of traffic or workload without experiencing performance degradation or downtime. By automatically provisioning additional resources when demand increases and removing resources when demand decreases, the system can adapt to changing conditions in real-time. 
 
Improving Availability: Auto Scaling helps enhance the availability of applications by distributing traffic across multiple instances or resources. In the event of failures or disruptions, the system can automatically replace unhealthy instances and redistribute the workload to healthy ones, thereby minimizing downtime and maintaining service availability. 
 
Optimizing Costs: Auto Scaling enables organizations to optimize their cloud resource costs by dynamically adjusting the number of resources based on actual demand. By scaling resources up during peak periods and scaling down during off-peak periods, organizations can avoid over-provisioning and reduce unnecessary expenses associated with idle resources. 
 
Auto Scaling typically involves defining scaling policies or rules that specify under what conditions additional resources should be provisioned or removed. These policies can be based on various metrics such as CPU utilization, memory usage, network traffic, or custom application-specific metrics. 
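For example, with EC2 Auto Scaling such a policy can be expressed as a target tracking rule on average CPU utilization. Here is a hedged boto3 sketch; the group name is a placeholder and the 60 percent target is just an illustrative value.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep the group's average CPU utilization around 60%; the service adds or
# removes instances automatically to track that target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",            # placeholder group name
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,
    },
)
```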
 
Cloud providers offer Auto Scaling services as part of their platform offerings, providing tools and APIs to automate the provisioning and management of resources based on predefined scaling policies. Examples include AWS Auto Scaling, Google Cloud Autoscaler, and Azure Autoscale. 
 
Overall, Auto Scaling plays a crucial role in enabling cloud-native applications to be more agile, resilient, and cost-effective by automatically adapting to changing workload demands. 
