Recently I stumbled upon an interesting issue in one of our services. While going through our Grafana dashboards, I discovered that the time spent in GC for this service was abnormally high. I immediately started investigating. The investigation was pretty quick and the issue was rather simple, but fixing it gave us a significant performance boost. So I decided to share it with you and shed some light on how adaptive heap sizing in the JVM works and how it can sometimes be a problem.
First, I want to tell you a little more about the service I was working with. In my opinion, it is the most complicated service our team has created, mostly because it uses three databases:
- MongoDB (aggregated statistics for a single notification),
- Cassandra (event store), and
- ElasticSearch (full-text search over notifications).
It also uses an internal queue in front of the DBs for reliability purposes.
The business purpose of this service is to track all notifications sent from our infrastructure. It aggregates a few different events:
- When the notification was sent
- When the notification failed to send
- When the user opened a notification
- When the user clicked any of the links inside the notification
During the day we observe traffic of around a few thousand rps (requests per second). Here is an example of traffic from a typical day:
And the normal p99 response time from the tracker service:
As you can see, it is disturbingly high.
The tracking service runs on 10 instances in the production environment. Every instance has 3 GB of memory committed and uses 2 processor cores. In total that gives us 30 GB of memory and 20 processor cores, which I think is a lot for the work it has to do.
Looking through the Grafana dashboards, I discovered that most tracking service instances were spending 30 seconds out of every minute in GC. Here is how it looked on the charts:
Also, the number of GC collections was really high, almost 15k collections per 10 minutes:
I began the investigation by downloading gc.log for one instance. Using JClarity's Censum, I discovered that the allocation rate was around 1 GB per second, with peaks of 1.8 GB per second, which was pretty high. Moreover, there was a huge percentage of premature promotions, 52% to be exact. I assumed this was caused by the high allocation rate.
In the next step I created and downloaded a heap dump for this instance to check the contents of the heap. It was quite a surprise to discover that its size was only 42 MB. With this allocation rate, it was no surprise that the JVM was spending thirty seconds out of every minute in GC. The percentage of premature promotions was so high because the heap was only 42 MB in size while the object allocation rate was more than 1 GB per second.
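A quick back-of-the-envelope calculation shows why such a small heap cannot absorb this allocation rate. The eden size used here (32 MB) is a hypothetical figure for illustration only; the article tells us the whole heap was ~42 MB, not how it was split:

```java
// Rough estimate of young-GC frequency: every time eden fills up, a young
// collection runs, so frequency is roughly allocation rate divided by eden size.
public class GcMath {
    static double collectionsPerSecond(double allocMbPerSec, double edenMb) {
        return allocMbPerSec / edenMb;
    }

    public static void main(String[] args) {
        // ~1 GB/s of allocation into a hypothetical ~32 MB eden:
        // roughly 32 young collections every second
        System.out.println(collectionsPerSecond(1024, 32));
    }
}
```

With dozens of collections per second, it is easy to see how half of every minute ends up being spent in GC.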
I could not believe we had deliberately set such a small heap, so I checked the JVM properties in our configuration, and the only heap size property that was set was -Xmx2G. My first reflex was to add the -Xms property, which I did right away.
Adding -Xms alongside -Xmx
I set -Xms1g and -Xmx1g, and here are the results of this change:
As we can see, time spent in GC was reduced significantly. Moreover, thanks to that, response times from our service were also reduced.
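The effect of these flags can be verified from inside the process itself. A minimal sketch (the class name is mine; run it with the flags you want to check, e.g. `java -Xms1g -Xmx1g HeapInfo`):

```java
// Prints the heap limits the running JVM actually picked up, so you can
// confirm that -Xms / -Xmx from your configuration really took effect.
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() reflects the currently committed heap (starts near -Xms),
        // maxMemory() reflects the -Xmx limit.
        System.out.printf("committed heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("max heap:       %d MB%n", rt.maxMemory() / (1024 * 1024));
    }
}
```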
Why did this help, and why was the heap previously only ~40 MB?
When we don't set the -Xms flag explicitly, the initial heap size is set to 1/64 of the available memory, which in our case was around 31 MB. We can control how fast the JVM resizes its heap with the -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio flags, which default to 40% and 70% respectively. This means the JVM will enlarge its heap after a full GC when the percentage of free space is less than 40%. On the other hand, the heap will shrink after a full GC if more than 70% of the space is free.
The JVM is less eager to reduce heap size because it is a more complicated operation that takes more time and can significantly affect application performance. This is the default resizing policy, and because the objects allocated by our application were dying young and most of them were removed by each full GC, we could never reach the MinHeapFreeRatio threshold. This is a fragment of our GC log:
In line 14 we can see that live objects occupied 77049 KB out of 178220 KB of available heap memory, which means almost 60 percent of the space was free. This was a rather standard situation after each full GC.
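The resize decision described above can be modeled in a few lines. This is a hedged sketch of the free-ratio policy, not the actual HotSpot code; the numbers in `main` are taken from the GC log fragment:

```java
// Simplified model of the default free-ratio resize policy after a full GC:
// grow when free space falls below MinHeapFreeRatio, shrink when it exceeds
// MaxHeapFreeRatio, otherwise leave the heap size alone.
public class FreeRatioPolicy {
    static String decisionAfterFullGc(long usedKb, long heapKb,
                                      int minFreeRatio, int maxFreeRatio) {
        double freePct = 100.0 * (heapKb - usedKb) / heapKb;
        if (freePct < minFreeRatio) return "grow";   // too little free space
        if (freePct > maxFreeRatio) return "shrink"; // too much free space
        return "keep";                               // in between: no resize
    }

    public static void main(String[] args) {
        // 77049 KB live out of 178220 KB -> ~56.8% free,
        // between the 40% and 70% defaults, so the heap is never resized
        System.out.println(decisionAfterFullGc(77049, 178220, 40, 70));
    }
}
```

This is exactly why our heap stayed tiny: after every full GC the free ratio landed comfortably between the two thresholds, so the JVM saw no reason to grow it.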
This problem occurred because we were using the Concurrent Mark Sweep collector. CMS does not support the adaptive sizing policy, which is the default for the G1 collector. With adaptive sizing, three criteria are taken into account when deciding whether the heap should be resized:
- Desired maximum GC pause goal - if the GC pause time is greater than the pause time goal then reduce the sizes of the generations to better attain the goal.
- Desired application throughput goal - if the pause time goal is being met then consider the application's throughput goal. If the application's throughput goal is not being met, then increase the sizes of the generations to better attain the goal.
- Minimum footprint - if both the pause time goal and the throughput goal are being met, then the sizes of the generations are decreased to reduce footprint.
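The priority order of the three criteria above can be sketched as a simple decision function. This is a simplified illustrative model, not G1's actual implementation; the method and parameter names are mine:

```java
// Sketch of the adaptive-sizing priority order: the pause goal is checked
// first, then the throughput goal, and only when both are met does the
// collector shrink the generations to minimize footprint.
public class AdaptiveSizing {
    static String decide(double pauseMs, double pauseGoalMs,
                         double throughput, double throughputGoal) {
        if (pauseMs > pauseGoalMs) return "shrink generations";     // pauses too long
        if (throughput < throughputGoal) return "grow generations"; // throughput too low
        return "shrink for footprint";                              // both goals met
    }

    public static void main(String[] args) {
        // pauses exceed a hypothetical 200 ms goal -> generations are shrunk
        System.out.println(decide(250, 200, 0.99, 0.98));
    }
}
```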
However, since we were using the CMS collector, we could not rely on adaptive sizing. But we could compute how large a heap we needed.
Looking at GC logs and finding the right values for Eden and tenured space
We have now set our heap size to 1 GB, but we have not specified the sizes of the generations. Because CMS creates a really small Eden by default, there would still be a lot of young-generation collections. If we look again at the allocation rates, we can see that most of the time we are allocating around 800 MB per second. It would be nice to have no more than one collection per second. Our service uses around 200-300 MB of memory for Spring and configuration, and the rest of the objects have really short lifespans. So we can set the young generation size to 600 MB and the old generation to 400 MB. Here are the results of this change:
Typical young collection time (0.01 sec):
Typical old collection pause time (0.21 sec):
Number of collections per 10 minutes:
Time spent in GC per minute:
Also, overall processor usage was reduced thanks to these changes (first change at 11:00 and second at 16:00):
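The sizing arithmetic used earlier can be sketched as a small helper. Note the 100 MB headroom on top of the ~300 MB long-lived set is my assumption to land on the 600/400 split described above; in practice the resulting young-generation size would be applied with the standard -Xmn flag:

```java
// Split a fixed heap into young and old generations: the old generation must
// hold the long-lived working set (Spring + configuration) plus some headroom,
// and everything else goes to the young generation, where short-lived
// objects can die without ever being promoted.
public class GenerationSplit {
    static long youngGenMb(long totalHeapMb, long longLivedMb, long headroomMb) {
        long oldGenMb = longLivedMb + headroomMb;
        return totalHeapMb - oldGenMb;
    }

    public static void main(String[] args) {
        // ~1 GB heap, ~300 MB long-lived set, 100 MB headroom
        // -> 600 MB young / 400 MB old, the split used in the article
        System.out.println(youngGenMb(1000, 300, 100));
    }
}
```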
To sum up, CMS does not have an adaptive sizing policy, and all changes to heap size are based only on the free space after a full GC. Because of that, it is good practice to set not only the max heap size but also the min heap size, based on how your application allocates memory and how much it needs for normal functioning. Another thing you should keep an eye on is how often GC runs and how long it takes to clean the heap of garbage. Moreover, I want to point out how important it is to look at GC metrics, analyze gc.log from time to time, and adjust JVM tuning to your actual workload. All the tuning I did for this JVM is fine right now, but after a few months the application or traffic could change and we would end up with a poorly tuned JVM again.