This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.
Without any further ado, lets get started with our next set of blogs and videos, chop…chop…! This time its Martin Thompson’s blog posts and talks. Martin’s first post on Java Garbage collection distilled basically distils the GC process and the underlying components including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth busting shaabang about mechanical sympathy, what people correctly believe in and the misconceptions. In the talk on performance testing, Martin takes its further and fuses Java, OS and the hardware to show how understanding of all these aspects can help write better programs.
Java Garbage Collection Distilled by Martin Thompson
There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.
throughput (-XX:GCTimeRatio=99), latency (-XX:MaxGCPauseMillis=<n>) and memory (-Xmx<n>) are the key variables that the collectors depend upon. It is important to note that Hotspot often cannot achieve the above targets. If a low-latency application goes unresponsive for more than a few seconds it can spill disaster. Tradeoffs play out as they
- provide more memory to GC algorithms
- GC can be reduced by containing the live set and keeping heap size small
- frequency of pauses can be reduced by managing heap and generation sizes & controlling application’s object allocation rate
- frequency of large pauses can be reduced by running GC concurrently
GC algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. Experimentation has shown that generational garbage collectors support much better throughput than non-generational ones – hence used in server JVMs.
For GC to occur it is necessary that all the threads of a running application must pause – garbage collectors do this by signaling the threads to stop when they come to a safe-point. Time to safe-point is an important consideration in low-latency applications and can be found using the ‑XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags.
When a STW events occurs a system will undergo significant scheduling pressure as the threads resume when released from safe-points, hence less STWs makes an application more efficient.
Heap Organisation in Hotspot
Java heap is divided in various regions, an object is created in Eden, and moved into the survivor spaces, and eventually into tenured. PermGen was used to store runtime objects such as classes and static strings. Collectors take help of Virtual spaces to meet throughput & latency targets, and adjusting the region sizes to reach the targets.
TLAB (Thread Local Allocation Buffer) is used to allocate objects in Java, which is cheaper than using malloc (takes 10 instructions on most platforms). Rate of minor collection is directly proportional to the rate of object allocation. Large objects (-XX:PretenureSizeThreshold=n) may have to be allocated in Old Gen, but if the threshold is set below TLAB size, then they will not be created in old gen – (note) does not apply to the G1collector.
Minor collection occurs when Eden becomes full, objects are promoted to the tenured space from eden once they get old i.e. cross the threshold (-XX:MaxTenuringThreshold). In minor collection, live reachable objects with known GC roots are copied to the survivor space. Hotspot maintains cross-generational references using a card table. Hence size of the old generation is also a factor in the cost of minor collections. Collection efficiency can be received by adjusting the size of Eden to the number of objects to be promoted. These are prone to STW and hence problematic in recent times.
Major collections collect the old generation so that objects from the young gen can be promoted. Collectors track a fill threshold for the old generation and begin collection when the threshold is passed. To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>). Heap resizing can be avoided using the –Xms and -Xmx flags. Compaction of old gen causes one of the largest STW pauses an application can experience and directly proportion to the number of live objects in old gen. Tenure space can be filled up slower, by adjusting the survivor space sizes and tenuring threshold but this in turn can cause longer minor collection pause times as a result of increased copying costs between the survivor spaces.
It is the simplest collector with the smallest footprint (-XX:+UseSerialGC) and uses a single thread for both minor and major collections.
Comes in two forms (-XX:+UseParallelGC) and (-XX:+UseParallelOldGC) and uses multiple threads for minor collections and a single thread for major collections – since Java 7u4 uses multiple threads for both type of collections. Parallel Old performs very well on a multi-processor system, suitable for batch applications. This collector can be helped by providing more memory, larger but fewer collection pauses. Weigh your bets between the Parallel Old and Concurrent collector depending on how much pause your application can withstand (expect 1 to 5 seconds pauses per GB of live data on modern hardware while old gen is compacted).
Concurrent Mark Sweep (CMS) Collector
CMS (-XX:+UseConcMarkSweepGC) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. CMS is not a compacting collector causing fragmentation in Old gen over time. Promotion failure will trigger FullGC when a large object cannot fit in Old gen. CMS runs alongside your application taking CPU time. CMS can suffer “concurrent mode failures” when it fails to collect at a sufficient rate to keep up with promotion.
Garbage First (G1) Collector
G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and now officially supported in Java 7. It is a generational collector with a partially concurrent collecting algorithm, compacts the Old gen with smaller incremental STW pauses. It divides the heap into fixed sized regions of variable purpose. G1 is target driven on latency (–XX:MaxGCPauseMillis=<n>, default value = 200ms). Collection on the humongous regions can be very costly. It uses “Remembered Sets” to keep track of references to objects from other regions. There is a lot of cost involved with book keeping and maintainig “Remembered Sets”. Similar to CMS, G1 can suffer from an evacuation failure (to-space overflow).
Alternative Concurrent Collectors
Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing are alternative concurrent collectors. Zing according to the author is the only Java collector that strikes a balance between collection, compaction, and maintains a high-throughput rate for all generations. Zing is concurrent for all phases including during minor collections, irrespective of heap size. For all the concurrent collectors targeting latency you have to give up throughput and gain footprint. Budget for heap size at least 2 to 3 times the live set for efficient operation.
Garbage Collection Monitoring & Tuning
Important flags to always have enabled to collect optimum GC details:
Use applications like Chewiebug, JVisualVM (with Visual GC plugin) to study the behaviour of your application as a result of GC actions. Run representative load tests that can be executed repeatedly (as you gain knowledge of the various collectors), keep experimenting with different configuration until you reach your throughput and latency targets. jHiccup helps track pauses within the JVM. As known to us it’s a difficult challenge to strike a balance between latency requirements, high object allocation and promotion rates combined, and that sometimes choosing a commercial solution to achieve this might be a more sensible idea.
Conclusion: GC is a big subject by itself and has a number of components, some of them are constantly replaced and it’s important to know what each one stands for. The GC flags are as much important as the component to which they are related and it’s important to know them and how to use them. Enabling some standard GC flags to record GC logs does not have any significant impact on the performance of the JVM. Using third-party freeware or commercial tools help as long as you follow the authors methodology.
— Highly reading the article multiple times, as Martin has covered lots of details about GC and the collectors which requires close inspection and good understanding. —
He classifies myth into three categories – possible, plausible and busted! In order to get the best out of the hardware you own, you need TO KNOW your hardware. Make tradeoffs as you go along, changing knobs, it’s not as scary as you may think.
Good question: do normal developers understand the hardware they program on? Or cannot understand what’s going on? Or do we have the discipline and make the effort to understand the platform we work with?
CPUs are not getting faster – clock speed isn’t everything, Sandy bridge architecture is the faster breed. 6 ports to support parallelism (6 ops/cycle). Haswell has 8 ports! Code doing division operations perform slower than any other arithmetic operations. CPU has both front-end and back-end cycles. It’s getting faster as we are feeding them faster – PLAUSIBLE
Memory provides us random access – CPU registers and buffers, internal caches (L1, L2, L3) and memory – mentioned in the order in which speed of access to these areas increase respectively. CPUs have been doing things to bring down its operational temperature by performing direct access operations. Writes are less hassle than reads – buffer misses are costly. L1 is organised into cache-lines containing code that the processor will execute – efficiency is achieved by not having cache-line misses. Pre-fetchers help reduce latency and helps during reading streaming and predictable data. TLB misses can also cost efficiency (4K = size of memory page). In short reading memory isn’t anywhere close to reading it randomly but SEQUENTIALLY due to the way the underlying hardware works. Writing highly branched code can cause slow down in execution of the program – keep things together that’ is cohesive is the key to efficiency. – BUSTED
Note: TLAB and TLB are two different concepts!
HDD provides random access – spinning disc and an arm moves about reading data. More sectors are placed in the outer tracks than the inner tracks (zone bit recording). Spinning the discs faster isn’t the way to increase HDD performance. 4K is the minimum you can read or write. Seek time in the best disc is 3-6 ms, laptop drives are slower (15 ms). Rotational latency takes some time. Data transfers takes 100-220 Mbytes/sec. Adding a cache can improve writing data into the disc, not so much for reading data from disks. – BUSTED
SSDs provides random access – Reads and writes are great, works very fast (4K at a time). Delete is not like the real deletes, it’s marked deleted and not really deleted – as you can’t erase at high resolution hence a whole block needs to be erased at a time (hence marked deleted). All this can cause fragmentation, and GC and compaction is required. Reads are smooth, writes are hindered by fragmentation, GC, compaction, etc…, also to be ware of write-amplification. A few disadvantages when using SSD but overall quite performant. – PLAUSIBLE
Can we understand all of this and write better code?
Conclusion: do not take everything in the space for granted just because it’s on the tin, examine, inspect and investigate for yourself the internals where possible before considering it to be possible, or plausible – in order to write good code and take advantage of these features.
— Great talk and good coverage of the mechanical sympathy topic with some good humour, watch the video for performance statistics gathered for each of the above hardware components —
“Performance Testing Java Applications” by Martin Thompson
Things we use like “How to use a profiler?” or ”How to use a debugger?”
What is Performance? Could mean two things like throughput or bandwidth (how much can you get through) and latency (how quickly the system responds).
Response time changes as we apply more load on the system. Before we design any system, we need performance requirements i.e. what is the throughput of the system or how fast you want the system to respond (latency)? Does your system scale economically with your business?
Developer time is expensive, hardware is cheap! For anything, you need a transaction budget (with a good break-down of the different components and processes the system is going to go through or goes through).
How can we do performance testing? Apply load to a system and see if the throughput go up or down? And what is the response time of the system as we apply load. Stress testing is different from load testing (see Load testing), stress testing (see Stress testing) is a point where things break (collapse of a system), an ideal system will continue with a flat line. Also it’s important to perform load testing not just from one point but from multiple points and concurrently. Most importantly high duration testing is very important – which bring a lot of anomalies to the surface i.e. memory leaks, etc…
Building up a test suite is a good idea, a suite made up of smaller parts. We need to know the basic building blocks of the system we use and what we can get out of it. Do we know the different threshold points of our systems and how much its components can handle? Very important to know the algorithms we use, know how to measure them and use it accordingly – reverse out using real data.
When should we test performance?
What does optimisation mean? Knowing and choosing your data and working around it for performance. New development practices: we need to test early and often!
From a Performance perspective, “test first” practice is very important, and then design the system gradually as changes can cost a lot in the future.
Red – Green – Debug – Profile – Refactor, a new way of “test first” performance methodology as opposed to Red-Green-Refactor methodology only! Earlier and shorter feedback cycle is better than finding something out in the future.
Use “like live” pairing stations, Mac is a bad example to work on if you are working in the Performance space – a linux machine is a better option.
Performance tests can fail the build – and it should fail a build in your CI system! What should a micro benchmark look like (i.e. calliper)?Division in your code can be very expensive, instead use a mask operator!
What about concurrency testing? Is it just about performance? Invariants? Contention?
What about system performance tests? Should we be able to test big and small clients with different ranges. It’s fun to know deep-dive details of the system you work on. A business problem is the core and the most important one to solve and NOT to discuss on whatframeworks to use to build it. Please do not use Java serialisation as it is not designed for on the-wire-protocol! Measure performance of a system using a observer rather than measure it from inside the system.
Performance Testing Lessons – lots of technical stuff and lots of cultural stuff. Technical Lessons – learn how to measure, check out histograms! Do not sample a system, we miss out when things when the system go strange, outliers, etc… – histograms help! Knowing what is going on around the areas where the system takes a long time is important! Capturing time from the OS is a very important as well.
With time you get – accuracy, precision and resolution, and most people mix all of these up. On machines with dual sockets, time might not be synchronised. Quality of the time information is very dependent on the OS you are using. Time might be an issue on virtualisedsystems, or between two different machines. This issue can be resolved, is to round-trip an operation between the two systems (note the start and stop clock times) and half them to get a more accurate time.
Know your system, its underlying components – get the metrics and study them ! Use a linux tool like perstat, will give lots of performance and statistics related information on your CPU and OS – branch predictions and cache-misses!
RDTSC is not an order-instructions execution system, x86 is an ordered instruction systems and operations do not occur in a unordered fashion.
Theory of constraints! – Always start with number 1, the one that takes most of the time – bottleneck of your system, the remaining sequence of issues might be dependent on number 1 and not separate issues!
Trying to create a performance team is an anti-pattern – make the experts help bring the skills out to the rest of the team, and stretch-up their abilities!
Beware of YAGNI – about doing performance tests – smell of excuse!
Commit builds > 3.40 mins = worrying, same for acceptance test build > 15 mins = lowers team confidence.
Test environment should equal to production environment! Easy to get exactly similar hardware these days!
Conclusion: Start with a “test first” performance testing approach when writing applications that are low latency dependent. Know your targets and work towards it. Know your underlying systems all the way from hardware to the development environment. Its not just technical things that matter, cultural things also do matter as much when it comes to performance testing. Share and spread the knowledge across the team rather than isolating it to one or two people i.e. so called experts in the team. Its everyone’s responsibility not just a few seniors in the team. Learn more about time across various hardware and operating systems, and between systems.
As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey.
- Are your GC logs speaking to you, the G1GC edition by Kirk Pepperdine – Slides – Video
- Performance Special Interest Group discussion – moderated by Richard Warburton (video)
- Caching in: understand, measure and use your CPU Cache more effectively” by @RichardWarburto – (video & slides)
- Article on Atomic I/O operations (Linux) by Jonathan Corbet
- Articles and Presentations about Azul Zing, Low Latency GC & OpenJDK by Gil Tene (videos & slides)
- Lock-Free Algorithms For Ultimate Performance by Martin Thompson
- Performance Java User’s Group – “For expert Java developers who want to push their systems to the next level”
- Tuning the Size of your thread pool by Kirk Pepperdine
- How NOT to measure Latency by Gil Tene
- Understanding Java Garbage Collection and What You Can Do about It by Gil Tene
- Vanilla #Java Understanding how Core Java really works can help you write simpler, faster applications by Peter Lawrey
- Profiling Java In Production – by Kaushik Srenevasan
- HotSpot JVM garbage collection options cheat sheet (v3) by Alexey Ragozin
- Optimizing Google’s Warehouse Scale Computers: The NUMA Experience – authors from Univ. of Cal (SD) & Google!
- MegaPipe: A New Programming Interface for Scalable Network I/O by several authors!
- What Every Programmer Should Know About Memory by Ulrich Drepper
- Memory Barriers: a Hardware View for Software Hackers – Paul E. McKenney (Linux Technology Center – IBM Beaverton)