It’s two in the morning. Kamila, a site reliability engineer (SRE) at a large e-commerce company, is roused from sleep by the piercing sound of an alert from the monitoring system. Her pulse speeds up. She knows what it means. The main transaction processing system, the heart of the entire platform, built on Java, has slowed dramatically. The response time, which is normally 200 milliseconds, has jumped to 10 seconds. The virtual “war room” on the company’s chat platform fills up within minutes. The familiar chaotic ritual begins. Developers frantically search through gigabytes of logs for any errors. The operations team stares at the CPU and memory charts, which, frustratingly, look normal. Database administrators check the slow queries, but nothing points to an obvious culprit. Everyone sees only a small piece of the puzzle. An hour passes, then another. The company loses thousands of zlotys every minute, its reputation melts away, and the team, despite tremendous effort, is still at square one, wandering in a fog of uncorrelated data.
This scenario is the nightmare of any organization whose business relies on complex, critical applications. In modern, distributed architectures, where a single user request can flow through dozens of services, databases and external systems, traditional monitoring methods based on analyzing logs and infrastructure metrics are like trying to diagnose a complex disease with a simple thermometer. They give a signal that something is wrong, but don’t tell you what or why. Fortunately, that era of reactive firefighting is coming to an end. This article is a journey into the world of modern Application Performance Management (APM) systems. We will show how they are fundamentally changing the way problems are diagnosed and how the Symptom Driven Diagnostics (SDD) approach, implemented in our proprietary Polish Flopsar Suite tool from ARDURA Consulting, can reduce diagnosis time from hours to tens of seconds.
Why is traditional monitoring (logs and metrics) insufficient for diagnosing modern Java applications?
“Premature optimization is the root of all evil.”
- Donald Knuth, “Structured Programming with go to Statements”
For years, application monitoring rested on two pillars: application logs and system metrics. Developers put log.info() and log.error() statements in the code, and administrators watched the graphs of CPU, memory and disk usage. In the days of simple, monolithic applications running on a few servers, this approach often sufficed. But in today’s world of Java-based enterprise applications - running on powerful application servers like WebLogic or JBoss, in microservices architectures, in containers and in the cloud - this model is no longer adequate.
1. Lack of context and correlation: Logs, server metrics and database metrics are three separate, isolated worlds. When a slowdown problem occurs, we see three separate pictures: logs show that certain operations are taking a long time; CPU monitoring shows that nothing wrong is happening; and database monitoring shows that certain queries are slow. But what was the cause and what was the effect? Did the application slow down because the database is running slow, or did the database slow down because the application is flooding it with inefficient queries? Traditional tools can’t answer this question because they lack the transaction context that connects all these events.
2. The “needle in the haystack” problem: Modern applications generate gigabytes or even terabytes of logs per day. Manually sifting through such a huge amount of data in search of the cause of a problem is extremely difficult and time-consuming. It’s like looking for one particular needle in a giant haystack, often under tremendous time pressure.
3. “Unknown unknowns”: Traditional monitoring allows us to track what we know needs to be tracked. We measure the response time of a particular method because we know it is critical. But what if the problem lies in a completely different, unexpected part of the code? What if the cause is a thread lock, a memory leak or inefficient synchronization that generates no errors in the logs? Traditional tools are “blind” to problems that we have not explicitly configured them to monitor.
4. Complexity of Java Enterprise environments: The Java Enterprise ecosystem (now Jakarta EE) is extremely powerful, but also complex. Applications run inside application servers (such as Oracle WebLogic Server, JBoss Application Server, Tomcat, IBM WebSphere Application Server) that manage thread pools, database connections, transactions and many other aspects. The performance problem may lie not in the application code, but in the poor configuration of the application server itself. Traditional logs often don’t provide any insight into what’s going on “under the hood.”
These limitations make diagnosing performance problems a reactive, tedious and often fruitless process. What is needed is a tool that can look at an application holistically, from start to finish, and automatically connect the dots.
What is Application Performance Management (APM) and what problems does it solve?
Application Performance Management (APM) is a discipline and category of tools that aims to monitor and manage application performance and availability from an end-user perspective. Unlike traditional tools that look at individual, isolated components (server, database), APM looks at the entire application ecosystem as a cohesive whole.
A fundamental innovation that APM systems have introduced is the ability to **automatically track and contextualize every single transaction** that flows through the system, from the moment a user clicks a button in the browser, to the last query to the database and back again.
What problems does APM solve?
1. Provides end-to-end visibility: APM gives you a single, consistent view of what’s happening in your application. It allows you to see the entire request path, from the front-end to all microservices to calls to external APIs and databases. This eliminates guesswork and allows you to immediately identify which component is the bottleneck.
2. Drastically reduces diagnosis time (MTTD/MTTR): Instead of spending hours manually correlating logs, APM can pinpoint the root cause of a problem in seconds. It shows not only that the application is slow, but why it is slow - pointing to a specific slow method in the code, a problematic SQL query or a long wait for a response from an external service.
3. Enables proactive detection of problems: Modern APM platforms, using machine learning algorithms, learn the “normal” behavior of an application (known as the baseline) and can automatically detect anomalies - subtle deviations from the norm that can be an early sign of an impending failure before it even affects users.
4. Links technical performance to business results: APM allows you to correlate technical metrics (response time, error rate) with business metrics (number of transactions, conversion, revenue). This gives business and technical leaders a common language and allows them to make decisions based on real business impact. You can answer the question, “How much money are we losing due to the slowdown in the payment process?”
5. Supports DevOps and SRE culture: APM is a key tool for DevOps and SRE (Site Reliability Engineering) teams. It provides them with the data they need to define and monitor Service Level Objectives (SLOs) and build a data-driven culture. It gives developers immediate insight into how their code behaves in production, blurring the line between “Dev” and “Ops.”
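To make the idea of a learned baseline from point 3 concrete, here is a deliberately simple sketch. It assumes a static baseline built from a sample of normal response times and flags values more than `k` standard deviations from the mean. The class name `BaselineDetector` and the choice of a mean/standard-deviation rule are illustrative assumptions; real APM platforms use considerably more sophisticated models.

```java
import java.util.List;

// Hypothetical sketch: a static mean/standard-deviation baseline,
// not the algorithm of any particular APM product.
final class BaselineDetector {
    private final double mean;
    private final double stdDev;
    private final double k; // number of standard deviations treated as anomalous

    BaselineDetector(List<Double> normalResponseTimesMs, double k) {
        if (normalResponseTimesMs.isEmpty()) {
            throw new IllegalArgumentException("baseline sample must not be empty");
        }
        int n = normalResponseTimesMs.size();
        double sum = 0.0;
        for (double v : normalResponseTimesMs) {
            sum += v;
        }
        this.mean = sum / n;
        double squaredDiffs = 0.0;
        for (double v : normalResponseTimesMs) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        this.stdDev = Math.sqrt(squaredDiffs / n); // population standard deviation
        this.k = k;
    }

    // True when the observation deviates from the baseline mean by more than k * stdDev.
    boolean isAnomalous(double responseTimeMs) {
        return Math.abs(responseTimeMs - mean) > k * stdDev;
    }
}
```

With a baseline gathered around 200 ms, a 10-second response would be flagged immediately, while a 205 ms response would not.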
In short, APM is like going from examining a patient with a stethoscope to using an MRI. It gives deep, detailed and multidimensional insight into the inner workings of an application, enabling precise and rapid diagnosis.
How does distributed tracing technology work in the Java world?
The heart and magic of any modern APM system is distributed tracing technology. It’s what allows you to build a consistent picture of a single request as it travels through a complex maze of microservices and components. In the Java world, this technology is implemented in an extremely elegant and non-invasive way.
On-the-fly code instrumentation: The key is that you don’t have to change anything in the code to track it. APM systems use a mechanism known as **bytecode instrumentation**.
- Java Agent: A special APM “agent” is run on the application server running the application. It is a simple .jar file that is added to the startup parameters of the Java virtual machine (JVM) with a single argument (-javaagent).
- In-memory code modification: When the JVM loads application classes into memory, the agent intercepts this process. Before the class is run, the agent “injects” additional, very lightweight monitoring code in key places (at the beginning and end of methods). This process is done on the fly, in memory, and does not modify the original .class files on disk.
- Data collection: The injected code measures the execution time of each method, captures parameters, records exceptions and collects other contextual information.
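The hook that makes this possible is the standard `java.lang.instrument` API. The sketch below shows only the interception point, under stated assumptions: the class names `TimingAgent` and `ClassMatchingTransformer`, the JAR name and the `com/mycompany/` prefix are hypothetical, and the transformer merely counts matching classes instead of rewriting them. A real agent would modify the bytecode at the marked spot, typically with a bytecode manipulation library.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

// Start the JVM with: java -javaagent:timing-agent.jar -jar app.jar
// The agent JAR's manifest must name this class in its Premain-Class attribute.
// (Package-private only to keep the sketch in one file; agent classes are usually public.)
final class TimingAgent {
    // Called by the JVM before the application's own main method runs.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassMatchingTransformer("com/mycompany/"));
    }
}

final class ClassMatchingTransformer implements ClassFileTransformer {
    private final String internalNamePrefix; // e.g. "com/mycompany/" (slashes, not dots)
    private final AtomicInteger matched = new AtomicInteger();

    ClassMatchingTransformer(String internalNamePrefix) {
        this.internalNamePrefix = internalNamePrefix;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        if (className != null && className.startsWith(internalNamePrefix)) {
            matched.incrementAndGet();
            // A real agent would return a rewritten copy of classfileBuffer here,
            // with timing code added at method entry and exit.
        }
        return null; // null tells the JVM to load the class unchanged
    }

    int matchedCount() {
        return matched.get();
    }
}
```

Because the transformer sees classes as they are loaded and returns new bytes in memory, the .class files on disk are never touched, which is exactly the property described above.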
Context propagation: To link calls between different services, the APM agent uses a **context propagation** mechanism.
- Assigning an identifier: When a request first enters the system (e.g., as an HTTP request to the first service), the APM agent assigns it a unique global transaction identifier (Trace ID).
- Header injection: When Service A is about to call Service B (e.g., via a REST API), Service A’s agent automatically “injects” this Trace ID into the headers of the outgoing HTTP request.
- Context reading: The agent running in Service B reads the Trace ID from the incoming request. This lets it know that the work it is doing is part of the same overarching transaction.
- Building a call tree: This process is repeated for each subsequent call, creating a complete dependency tree that shows how the request flowed through the entire system.
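The assign, inject and read steps above can be sketched in a few lines. This is a simplified illustration, not how any particular agent is implemented: the header name `X-Trace-Id` and the class `TraceContext` are hypothetical, headers are modeled as a plain case-sensitive map (real HTTP header names are case-insensitive), and a random UUID stands in for the Trace ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class TraceContext {
    // Hypothetical header name chosen for this sketch.
    static final String TRACE_HEADER = "X-Trace-Id";

    // On an incoming request: reuse the caller's Trace ID, or start a new trace
    // if this service is the entry point.
    static String extractOrCreate(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        return UUID.randomUUID().toString();
    }

    // On an outgoing call: return a copy of the headers carrying the Trace ID.
    static Map<String, String> inject(Map<String, String> outgoingHeaders, String traceId) {
        Map<String, String> copy = new HashMap<>(outgoingHeaders);
        copy.put(TRACE_HEADER, traceId);
        return copy;
    }
}
```

Service A would call `extractOrCreate` on the first request and `inject` before calling Service B; Service B’s call to `extractOrCreate` then returns the same identifier, which is what lets the two pieces of work be joined into one trace.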
Benefits of this approach:
- No interference with the code: This is the biggest advantage. APM implementation does not require developers to change a single line of application code. It is a completely transparent process.
- Low overhead: Modern APM agents are extremely optimized and designed so that their impact on the performance of the monitored application is minimal (typically within 1-3%).
- Comprehensive insight: Instrumentation includes not only application code, but also calls to standard libraries and frameworks (e.g., Spring, Hibernate), JDBC drivers and HTTP clients, giving a complete picture of what the application spends its time on.
With this powerful technology, APM is able to build a coherent, comprehensible history of each individual transaction out of the chaos of thousands of independent operations.
What is the innovative Symptom Driven Diagnostics (SDD) approach built into the Flopsar Suite?
Traditional APM tools, while powerful, often overwhelm the user with the enormity of the data. They present a complete call tree for each transaction, which can consist of thousands of methods. Analyzing such a tree in search of the cause of a problem can still be time-consuming and require a great deal of expertise. One needs to know what one is looking for.
At ARDURA Consulting, based on our years of experience in diagnosing performance problems in the largest Java systems in Poland and around the world, we have developed a unique and innovative approach that we call Symptom Driven Diagnostics (SDD). It is the heart of our proprietary Flopsar Suite tool.
SDD’s philosophy is simple but revolutionary: instead of showing the user everything and leaving them to search for the cause, the system itself automatically analyzes the signs (symptoms) of the problem and pinpoints its most likely root cause.
How does Symptom Driven Diagnostics work? SDD is a multi-stage analytical algorithm that works on data collected by an APM agent.
- Identifying the symptom: The process begins with identifying the main symptom of the problem. The most common symptom is long transaction response times.
- Automatic thread analysis: The system automatically analyzes what the application server threads were doing during the execution of this slow transaction. It classifies their state every millisecond: whether they were executing code on the CPU, whether they were waiting for an input/output (I/O) operation from the database, whether they were blocked, waiting for another thread, or whether they were sleeping.
- Detection of performance anti-patterns: Based on this analysis, the SDD algorithm automatically searches for known anti-patterns that are common causes of performance problems in Java applications:
  - Long database queries: Identifies the specific SQL query that took the longest.
  - Locks and resource contention (lock contention): Detects situations where multiple threads are trying to access the same synchronized resource, and indicates the exact object and line of code where the locking occurs.
  - Inefficient waits (waits/sleeps): Finds places in code where a thread is unnecessarily “sleeping” or waiting.
  - CPU-intensive usage: Indicates the specific methods that consume the most CPU time.
- Aggregation and cause indication: The system aggregates this information from all threads involved in the transaction and presents the user with a simple, unambiguous conclusion, such as: “85% of this transaction’s response time (8.5 seconds out of 10) was spent waiting for a response from the database to the query ‘SELECT * FROM …’”. Along with this conclusion, the user is provided with a full call stack (stack trace) that leads them to the exact line of code from which this query was executed.
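The aggregation step can be illustrated with a small sketch. It is not the Flopsar Suite algorithm, whose internals are not described here; it only shows the arithmetic behind a conclusion like “85% of the time was spent waiting”. The enum `ThreadActivity` and the class `SymptomSummary` are hypothetical names, and the samples are assumed to have been collected at a fixed interval so that sample counts are proportional to time.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical categories mirroring the thread states named in the text.
enum ThreadActivity { CPU, IO_WAIT, BLOCKED, SLEEPING }

final class SymptomSummary {
    // Counts how many samples fall into each activity.
    static Map<ThreadActivity, Integer> countByActivity(List<ThreadActivity> samples) {
        Map<ThreadActivity, Integer> counts = new EnumMap<>(ThreadActivity.class);
        for (ThreadActivity s : samples) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the activity with the most samples, i.e. the dominant symptom.
    static ThreadActivity dominant(List<ThreadActivity> samples) {
        ThreadActivity best = null;
        int bestCount = 0;
        for (Map.Entry<ThreadActivity, Integer> e : countByActivity(samples).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null when there are no samples
    }

    // Share of samples spent in the given activity, as a percentage.
    static double sharePercent(List<ThreadActivity> samples, ThreadActivity activity) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        return 100.0 * countByActivity(samples).getOrDefault(activity, 0) / samples.size();
    }
}
```

For a transaction with 17 samples in `IO_WAIT` and 3 on the CPU, `dominant` returns `IO_WAIT` and `sharePercent` gives 85, matching the proportions in the example conclusion above (8.5 seconds out of 10).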
With SDD, the diagnosis process ceases to be an art reserved for experts. It becomes an automated, repeatable and extremely fast scientific process. It is this technology that allows us to deliver on our promise: diagnosing a problem in less than 30 seconds.
What does it look like in practice to diagnose a performance problem in 30 seconds using Flopsar Suite?
Let’s re-imagine the scenario from the beginning of the article. It’s 2:05 in the morning. SRE Kamila receives an alert about a slowdown in the transaction processing system. But this time, instead of opening ten different terminals and tools, she logs into one system - the Flopsar Suite dashboard.
Seconds 0-5: Identify the problem. On the main screen, Kamila immediately sees that the response time graph for the key service “ProcessPayment” has skyrocketed. With one click, she goes to the list of the slowest transactions recorded in the last few minutes.
Seconds 5-15: Transaction selection and SDD analysis. She selects from the list one of the transactions that lasted 12 seconds. The system immediately, in the background, runs the Symptom Driven Diagnostics algorithm on the data collected for that transaction. Instead of presenting her with a giant tree of thousands of calls, Flopsar Suite immediately shows her a concise summary.
Seconds 15-25: Reading the root cause. A clear and unambiguous message appears on the screen, generated by SDD: “The analysis showed that 92% of the transaction execution time (11 seconds out of 12) was spent in the BLOCK (LOCK) state. The threads were waiting for the monitor to be released on the class object com.mycompany.ecommerce.PaymentGatewayLock.” Below the message, the system displays the exact call stack (stack trace) of the thread that held the lock, and the call stacks of the threads that waited for it.
Seconds 25-30: Understand and take action. Kamila immediately understands what happened. The problem is not in the database or the infrastructure. The problem lies in inefficient synchronization in the Java code, in the PaymentGatewayLock class. She takes a screenshot, creates a critical ticket in Jira and assigns it to the appropriate development team, attaching all the detailed data from Flopsar Suite.
The entire process - from alert, to identification, to pinpointing the root cause at the code line level - took her less than half a minute. At 2:10 a.m., the developer on call already has all the information needed to prepare the fix. Instead of hours of chaos and waste, there is a calm, precise and instantaneous repair process.
This is not a science-fiction scenario. This is the daily reality of hundreds of teams that use Flopsar Suite to monitor their critical Java applications. This is the power of Symptom Driven Diagnostics in practice.
How do you integrate the APM tool into your DevOps culture and CI/CD process to proactively prevent problems?
Implementing a powerful APM tool and using it only for reactive firefighting in production is realizing only a fraction of its potential. The true value of APM is unleashed when it becomes an integral part of the DevOps culture and is embedded throughout the software development lifecycle, helping teams proactively prevent problems, not just fix them.
“Shift-Left” Performance Insights: APM is traditionally associated with the “right side” of the life cycle (production). However, its data and capabilities can and should be “shifted to the left.”
- Performance testing in CI/CD: An APM tool, such as Flopsar Suite, should be installed in the performance testing environment. After each deployment to that environment, the CI/CD pipeline can automatically run a set of load tests. APM data allows for automatic analysis of the results. Quality gates can be set to automatically block a deployment if the response time of a key transaction has deteriorated by more than 10% compared to the previous version, or if the new version generates an excessive number of SQL queries.
- Developer access: Every developer should have access to APM data from development and test environments. This allows them to analyze the performance of their code long before it goes into production. They can check for themselves how many database queries a new function generates, whether it creates unnecessary objects that overload the Garbage Collector, etc.
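The response-time quality gate described above comes down to a relative comparison between two measurements. A minimal sketch, assuming the pipeline can obtain the previous and the new response time of a key transaction from the APM tool; the class `PerformanceGate` and its parameters are hypothetical names, not part of any product API:

```java
final class PerformanceGate {
    // Returns true when the candidate build is within the allowed regression.
    // maxRegression is a fraction, e.g. 0.10 for the 10% threshold from the text.
    static boolean passes(double baselineMs, double candidateMs, double maxRegression) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        double regression = (candidateMs - baselineMs) / baselineMs;
        return regression <= maxRegression;
    }
}
```

For example, with a previous response time of 200 ms, a new build measuring 215 ms (a 7.5% regression) passes a 10% gate, while one measuring 230 ms (15%) would block the deployment.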
APM as a common language for Dev and Ops: APM is a great tool for breaking down the wall between developers and operations.
- One source of truth: When a problem arises, everyone looks at the same data and the same dashboards. The blame-shifting (“it’s not our code, it’s your servers!”) ends.
- Shared objectives (SLOs): APM provides objective data (SLIs - Service Level Indicators) that are the basis for defining common Service Level Objectives (SLOs) for Dev and Ops, e.g., “99% of requests to the login API must be serviced in less than 200 ms.”
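An SLO such as the login example above is checked by computing the corresponding SLI from measured latencies. The sketch below assumes the latencies for a time window are already available as a list of milliseconds; the class `SloCheck` is a hypothetical name, and treating an empty window as compliant is a simplifying assumption.

```java
import java.util.List;

final class SloCheck {
    // SLI: fraction of requests served in strictly less than thresholdMs.
    static double fractionUnder(List<Long> latenciesMs, long thresholdMs) {
        if (latenciesMs.isEmpty()) {
            return 1.0; // assumption: no traffic in the window counts as compliant
        }
        long good = 0;
        for (long latency : latenciesMs) {
            if (latency < thresholdMs) {
                good++;
            }
        }
        return (double) good / latenciesMs.size();
    }

    // SLO: the SLI must reach the target, e.g. 0.99 for "99% under 200 ms".
    static boolean meets(List<Long> latenciesMs, long thresholdMs, double target) {
        return fractionUnder(latenciesMs, thresholdMs) >= target;
    }
}
```

If 99 out of 100 login requests finish under 200 ms, the objective is met; at 98 out of 100 it is missed.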
Proactive optimization and planning: APM data is a mine of knowledge that should be used not only for firefighting, but also for strategic planning.
- Identification of “hot spots”: Regular analysis of APM data identifies the most stressed or slowest parts of the application, which are the best candidates for refactoring and optimization.
- **Capacity Planning:** Analyzing historical trends (e.g., growth in the number of transactions and resource consumption) allows for more accurate forecasting of future infrastructure needs.
Integrating APM with DevOps culture transforms it from a tool for specialists into a daily, democratic tool for the entire team. It becomes a compass that helps you not only get back on course quickly when you deviate from it, but more importantly, stay on the right course at all times.
In addition to solving problems faster, what are the business benefits of implementing APM?
While the dramatic reduction in incident resolution time (MTTR) is the most spectacular benefit of APM implementation, its impact on the business is much broader and more strategic. A mature APM practice translates directly into key business metrics and becomes a source of competitive advantage.
1. Increased revenue and conversions: In the digital world, performance is a feature. Numerous studies (conducted by Google, Amazon and Walmart, among others) show a clear relationship between page load time and conversion rate. Every 100 milliseconds of delay can result in a decrease in conversions by several percent. By providing high and stable performance, APM directly contributes to maximizing revenue.
2. Improved customer satisfaction and loyalty (CX): Nothing frustrates a user more than a slow or unstable application. This leads to churn (abandonment) and loss of trust in the brand. By helping to provide an excellent digital experience, APM is a key investment in building long-term customer loyalty.
3. Increased productivity for IT teams: Every hour that developers and engineers spend reactively putting out fires and painstakingly searching for the cause of a problem is an hour they are not spending creating new, innovative features that bring revenue to the company. By automating and shortening the diagnosis process, APM frees up a company’s most valuable resource - the time and creativity of its best engineers.
4. Reduced operating costs:
   - Infrastructure optimization: APM accurately identifies “oversized” resources and inefficient code, leading to better infrastructure utilization and, in the case of the cloud, direct savings from FinOps practices.
   - Reduced support costs: Fewer errors and performance problems in production mean fewer support calls.
5. Better business and technology decision-making: APM provides hard data to make informed decisions. Instead of relying on intuition, questions can be answered: “Did the investment in refactoring module X actually improve the customer experience?”, “How did new feature Y affect the load on our database?”.
APM implementation is not an IT cost. It is an investment in product quality, customer satisfaction and the efficiency of the entire organization. It is one of the investments with the highest and fastest achievable return (ROI) in the entire technology stack.
The evolution of Java application monitoring: from logs to predictive analytics
The table below shows the evolution of approaches to monitoring and diagnosing Java applications, demonstrating how modern, AI-based solutions are changing the game, reducing diagnosis time from days to seconds.
| Maturity stage | Methodology | Key tools | Main question | Mean time to diagnosis (MTTD) |
| --- | --- | --- | --- | --- |
| **Stage 1: Reactive (Primary Monitoring).** | Manual correlation. "Hunting" for errors in logs after an incident has occurred. | Grep, SSH, basic scripts. Separate dashboards for CPU/memory. | "Do you see any errors in the logs?" | Hours - Days |
| **Stage 2: Proactive (Traditional APM).** | Transaction tracking (tracing). Analysis of call trees. Setting alerts on thresholds. | Traditional APM tools (e.g., Dynatrace, AppDynamics in older versions). | "Which component in the call chain is the slowest?" | Minutes - Hours |
| **Stage 3: Predictive (APM with AIOps and SDD).** | Automated root cause analysis. Anomaly detection. Prediction of problems. | Modern APM platforms with embedded AI, such as **Flopsar Suite**. | "What is the exact root cause of the problem at the code level?" | Seconds - Minutes |
How does the ARDURA Consulting team support organizations in implementing and maximizing value from APM?
At ARDURA Consulting, we understand that application performance and reliability are the lifeblood of modern business. Our many years of global experience in developing and maintaining critical systems for major companies has allowed us not only to become experts in Application Performance Management, but also to create our own unique solution - Flopsar Suite.
Our support for customers in the APM area is comprehensive and multidimensional:
1. Implementing and configuring Flopsar Suite: As the developers of Flopsar Suite, we have unparalleled knowledge of its capabilities. Our team of experts helps you deploy the tool quickly and seamlessly in your environment, whether you use Oracle WebLogic, JBoss, Tomcat, or IBM WebSphere. We ensure optimal configuration and integration with your processes.
2. Performance troubleshooting as a service: We know that sometimes you need immediate expert help. Our team of “performance firefighters” is ready to support your team at the critical moment to diagnose and resolve the most difficult performance problems. Using the power of Flopsar Suite and our experience, we are able to identify the causes of problems that remain invisible to others.
3. Strategic consulting on observability and SRE: Implementing the tool is just the beginning. As a trusted advisor, we help you build a mature performance management culture and practice. We advise you on how to implement SRE processes, how to define meaningful service level objectives (SLOs), and how to integrate APM into your DevOps culture to transform observability from a reactive to a proactive process.
4. Training and competence building: We share our knowledge. We organize dedicated trainings and workshops for your development and operations teams, teaching them not only how to use the tools, but more importantly, how to think about performance and diagnose problems in a modern, efficient way.
At ARDURA Consulting, we don’t just sell software. We deliver peace of mind. Our goal is to give you peace of mind that your critical Java applications are running at peak performance, and when problems do arise, your team has the tools and support to resolve them in seconds, not hours.
If you’re tired of late-night phone calls and multi-hour “war rooms,” talk to us about your project. Let us show you the future of performance management.