In this article we look at what could happen to your monthly AWS spend if you have not optimised your infrastructure to accommodate the volume of changes.


You can also watch this as a video case study.


When you first onboard Audit Trail it is tempting to select the lowest-cost components in order to keep your costs down. In theory this is the right approach. In practice, however, selecting a database capacity that becomes overloaded can end up costing you more than a more powerful database would have. Let's take a look at how that can happen.


Let's take The Adam James Foundation (a fictitious organisation) as an example. They set up their database with a size of medium when prompted to do so.



In AWS, this corresponds to a DocumentDB instance class of db.t3.medium.


The AJF thought that this would be sufficient for both quiet times and busy times. What they did not realise was that the database was becoming "overheated". Normally a database can keep all the data it needs to work with in memory. However, the greater the number of changes, the more memory it needs. If there is not enough memory, it will swap to disk, i.e. instead of keeping items in RAM, it will write them to disk and read them back when it needs them.


AWS charges not only for the hourly use of the database but also for input/output (IO) usage. Because the database was having to write so much to disk, the IO usage went up enormously.


We can see this in Cost Explorer.



The first thing to note is that, if you have TechSoup AWS credits (or any other credits), it is worthwhile removing them from the view. In the right-hand filter, select the option to exclude credits:



I have also set up this view with my date range, a monthly granularity and grouping by Usage type. To focus on the database, I am filtering on Service, selecting only DocumentDB (with MongoDB compatibility).
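
If you prefer to pull the same figures programmatically rather than through the console, something like the following boto3 sketch reproduces that view. The date range is just an example, and the exact Cost Explorer service name string and the us-east-2 region are assumptions based on what appears in this article - adjust them to match your own account.

```python
import boto3

# Cost Explorer sketch: monthly DocumentDB costs grouped by usage type,
# with credits excluded. The dates and the service name string are
# assumptions - check Cost Explorer in your own account for the exact values.
ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is a global (us-east-1) endpoint

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # example range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    # Only DocumentDB, and exclude credits from the totals
    Filter={
        "And": [
            {"Dimensions": {"Key": "SERVICE",
                            "Values": ["Amazon DocumentDB (with MongoDB compatibility)"]}},
            {"Not": {"Dimensions": {"Key": "RECORD_TYPE", "Values": ["Credit"]}}},
        ]
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for month in response["ResultsByTime"]:
    print(month["TimePeriod"]["Start"])
    for group in month["Groups"]:
        usage_type = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {usage_type}: ${amount:,.2f}")
```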


The vast majority of the cost here is the blue item - USE2-StorageIOUsage. This is the input/output cost of the swap disk. To lower this cost, we can move the database to a higher instance class. You can do this either via the Audit Trail Configuration application or directly in AWS. I selected "Large", which corresponds to a db.r6g.large instance. This costs almost four times as much per hour, but the idea is that this will be more than offset by a much lower IO charge.
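
If you make the change directly in AWS, the console works fine, but for reference this is roughly what the same change looks like in boto3. This is a sketch: "atinstance-1" is the instance identifier we bring up in the next section, and the us-east-2 region is assumed from the USE2- prefix on the usage type above. ApplyImmediately makes the resize happen straight away rather than waiting for the next maintenance window, so expect a short interruption while the instance restarts.

```python
import boto3

# Rough boto3 equivalent of bumping the instance class in the console.
# Substitute your own instance identifier and region.
docdb = boto3.client("docdb", region_name="us-east-2")

docdb.modify_db_instance(
    DBInstanceIdentifier="atinstance-1",
    DBInstanceClass="db.r6g.large",
    ApplyImmediately=True,  # resize now; otherwise it waits for the maintenance window
)
```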


Another indicator...


There is another indicator that things were not going as planned...


Let's bring up the database in AWS. In the search box, search for atinstance-1. Under the Resources area of the results, you should see atinstance-1.
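
If you prefer the command line to the console search box, a quick boto3 call will confirm which instance class you are currently running. Again this is a sketch, assuming the atinstance-1 identifier and the us-east-2 region.

```python
import boto3

# Look up the Audit Trail DocumentDB instance and print its current class.
docdb = boto3.client("docdb", region_name="us-east-2")

instance = docdb.describe_db_instances(
    DBInstanceIdentifier="atinstance-1"
)["DBInstances"][0]

print(instance["DBInstanceClass"])   # e.g. db.t3.medium before the change
print(instance["DBInstanceStatus"])  # "available" once any resize has finished
```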



Under the Monitoring tab there are a lot of graphs. We are going to look at two of them. First, CPU Utilization.


This shows us that, for much of the time, the CPU was working hard, and on quite a few occasions it was up at 100%.


The next graph is the Buffer Cache Hit Ratio in Percent. 



This graph shows us how efficiently the in-memory data cache is being utilised. A high percentage indicates that most read operations are being served from the in-memory cache, which is faster than reading from disk. The higher the value the better; ideally we want to be up towards 100%.


Once we changed the database class to large (on the right of the graph), the ratio sat consistently at 100%.
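
You can also pull these two graphs as raw numbers from CloudWatch rather than eyeballing the console. A minimal sketch, assuming the AWS/DocDB namespace, the atinstance-1 identifier and the us-east-2 region used above:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Fetch the two DocumentDB instance metrics discussed above:
# CPUUtilization and BufferCacheHitRatio, as hourly averages over the last week.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")
now = datetime.now(timezone.utc)

for metric in ("CPUUtilization", "BufferCacheHitRatio"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/DocDB",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "atinstance-1"}],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=3600,            # one datapoint per hour
        Statistics=["Average"],
    )
    print(metric)
    for point in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
        print(f"  {point['Timestamp']:%Y-%m-%d %H:%M}  {point['Average']:.1f}%")
```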


How come we are producing so much data?


It can be difficult to recognise how many changes you are actually producing. On a basic level you may think of just those members of staff who go into records and update them, in which case that may be just 100 records per day. However, when you consider other changes, such as imports or global changes, and then also think about automated processes that just happen, e.g. connections to third-party applications that add or update data, the figures can soon mount up.


If you are seeing delays in data being added to the Viewer then you should also be wary. (If you are not seeing any data at all then that is another story: there could be an API key issue, or your database search may be timing out - a larger database instance class would certainly help here too.)


It is possible to check the "queue". Whenever data comes in from Blackbaud, it is added to a queue - a FIFO SQS queue to be precise - so that it does not overwhelm the AWS infrastructure. Let's bring up this queue.


In the AWS search box, enter ATQueue.fifo. Again, under the Resources area, select the entry for ATQueue.fifo.


Go to the Monitoring tab and look at some of the graphs shown there.



This shows the approximate number of messages on the queue. Each message corresponds to one field change. This is a worrying picture. Within the space of a few hours the numbers increase dramatically from around 370k up to around 470k. There may be a very good reason for this, e.g. a large import of data. What is important is to see the numbers fall again. If, when we look at the activity later on, we can see the numbers falling, then we know that our database can cope with the rate at which messages are being taken off the queue.
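
If you want to keep an eye on this without opening the console, the queue depth is available as an attribute on the queue itself. A minimal sketch, assuming the ATQueue.fifo queue name from above and the same us-east-2 region:

```python
import boto3

# Check how many messages are waiting on the Audit Trail FIFO queue.
sqs = boto3.client("sqs", region_name="us-east-2")

queue_url = sqs.get_queue_url(QueueName="ATQueue.fifo")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessages",           # waiting to be processed
        "ApproximateNumberOfMessagesNotVisible", # currently being processed
    ],
)["Attributes"]

print(f"Waiting:   {attrs['ApproximateNumberOfMessages']}")
print(f"In flight: {attrs.get('ApproximateNumberOfMessagesNotVisible', '0')}")
```

Run this a few times over the course of an afternoon: if the waiting figure keeps climbing, the database is not keeping up with what is coming in.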