New Relic Runs on Insights: Customer Usage

From New Relic, posted in New Relic News and Using Our Products, 18 February 2015

This New Relic Insights use case is taken from New Relic Runs on Insights, a new whitepaper collecting Insights use cases from across New Relic. It highlights some of the many ways the product can help businesses leverage their data in real time, without needing developer resources, and make faster, more informed decisions. Download the New Relic Runs on Insights whitepaper now!

It’s not often that a product manager gets the opportunity to use the product they manage to make a positive impact on both their role and their company. But that dream scenario came true for Jim Kutz, technical product manager for New Relic Insights, who was able to leverage the product to become more effective at his job.

Understanding how customers use products and their individual components and features is at the heart of product management. But getting detailed and accurate information on actual customer usage is a perpetual challenge for many companies.

Establishing the traditional customer feedback loop, for example, can take months of effort after a new product version is released, and even then may not provide enough detail.

To speed things up, Kutz decided to take advantage of his own product to get the immediate, detailed information he needed to guide future product development strategies and ultimately improve the customer experience. “Instead of running a focus group, conducting surveys, or interviewing a self-limiting number of customers, I realized that I could see, in real time, how our customers are using New Relic Insights,” says Kutz. “All I needed to do was write a few queries.”


Quick and easy insights into New Relic Insights usage

Kutz quickly built a New Relic Insights dashboard to display the usage data he and the rest of the Insights product team needed to drive better decisions. On a daily basis, he says, “I can see how customers are engaging with the product… I’m not only counting the amount of usage, but I can now understand the context of how our customers are using our product.”

Kutz got to experience firsthand just how easy it is to use New Relic Insights: “It literally took me minutes to build a basic dashboard,” he recalls.

Kutz and other members of the product management and development team use the information gleaned from New Relic Insights to drive development efforts and continuously improve the customer experience for Insights users. For example, Kutz and team have used their newfound insights to:

  • Target ideal candidates for a usability study for a new product feature
  • Discover common use cases across customers
  • Identify which new attributes to include out-of-the-box

Dramatically accelerating the feedback loop

“The entire team feels more engaged with customers because they can see that people are using the product,” Kutz says. “It’s a great motivational tool because it dramatically shortens the customer feedback loop.”

Insights is an excellent tool for continually monitoring site performance and customer experience. Product managers for many New Relic products now use Insights to glean insight about customer usage. The product manager for New Relic APM, for example, uses Insights to identify pages in the application that provide less satisfactory experiences so the engineering team can focus on improving them first.

And that’s only the beginning. “There are so many different directions we can go with this product,” Kutz says. “Understanding how people are using features and validating usage helps us determine the best roadmap.”

To learn more, download the New Relic Runs on Insights whitepaper now!


A Quick Tour of Transaction Management

From Crittercism, December 3, 2014

When your app crashes or is slow, it impacts both your company’s brand and its revenue.  It impacts your brand because users can leave bad reviews/ratings, and impacts your revenue because important workflows like a checkout, sign up, or check deposit could be failing.  Crittercism Transaction Management shows how crashes and lags directly affect your bottom line. This post provides insight into what Transaction Management can offer and how to get started.

 Mobile Transactions are App Workflows That Impact Business

Let’s start with an overview of what mobile transactions are and how they are modeled.  A mobile transaction is any mobile workflow or use case in an app that has business impact.  Examples of mobile transactions include:

  • sign-up, registration, or account creation

  • importing user contacts or uploading / sharing content

  • checkout, billing, or mPOS

  • inventory or order management

  • buying / selling stock, depositing a check, or transferring funds

  • booking a hotel room or flight

These workflows usually involve multiple screens and multiple back-end or cloud API interactions.  For the workflow or transaction to be successful, each screen or back-end interaction in the process must also work.

Monitoring a mobile transaction is easy with Crittercism. A developer simply marks the start, the end, and the value associated with the transaction. The value is the monetary value you would consider at risk if the transaction fails. Typically it is something like the shopping cart value, the LTV of a new customer, or a worker’s productivity cost in B2E use cases.
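
To make the start/end/value pattern concrete, here is a minimal, self-contained sketch of the idea in Java. It is a conceptual model only: the class and method names below are illustrative assumptions, not the actual Crittercism SDK API.

    import java.time.Duration;
    import java.time.Instant;

    // Conceptual model of "mark the start, the end, and the value" for a mobile
    // transaction. Names and structure are illustrative, not an SDK reference.
    final class MobileTransaction {
        private final String name;            // e.g. "Checkout" or "Sign Up"
        private final long valueAtRiskCents;  // monetary value lost if this fails
        private final Instant started = Instant.now();
        private Duration elapsed;
        private boolean failed;

        MobileTransaction(String name, long valueAtRiskCents) {
            this.name = name;
            this.valueAtRiskCents = valueAtRiskCents;
        }

        // Mark the end of a successful transaction.
        void end() { elapsed = Duration.between(started, Instant.now()); }

        // Mark the transaction as failed (crash, timeout, or app-reported failure).
        void fail() { end(); failed = true; }

        String name() { return name; }
        Duration elapsed() { return elapsed; }
        boolean isFailed() { return failed; }
        long valueAtRiskCents() { return valueAtRiskCents; }
    }

In an app, something like new MobileTransaction("Checkout", cartTotalCents) would be created when the user starts checking out, end() called on success, and fail() called from the crash or timeout path.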

Now that you have an idea of the type of transactions you can monitor, let’s take a quick tour of Transaction Management dashboards that allow you to connect performance metrics to business metrics.  The dashboards are designed to help you monitor the overall revenue risk, prioritize what to fix first, and troubleshoot the issue quickly.

Monitor Overall Revenue Risk with the Transaction Summary

[Screenshot: Transaction Summary (top)]

The most basic question to answer is how much revenue is at risk at any given time.  With the Transaction Summary dashboard, you can see the aggregate revenue at risk ($715.7K over the last 7 days for this example).  Revenue risk is the aggregated value for the failed transactions.  You can also see the top failed transactions over time, which gives you valuable insight into how app performance issues like crashes or timeouts are hurting your business.
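
The arithmetic behind that headline number is straightforward, and the short sketch below (with made-up figures, not data from the screenshot) shows one way to compute it: sum the values attached to every failed transaction in the reporting window.

    import java.util.List;

    // Illustrative sketch of the revenue-at-risk aggregation described above.
    public class RevenueAtRisk {
        record Txn(String name, long valueCents, boolean failed) {}

        static long revenueAtRiskCents(List<Txn> window) {
            return window.stream()
                    .filter(Txn::failed)          // crashes, timeouts, app-reported failures
                    .mapToLong(Txn::valueCents)
                    .sum();
        }

        public static void main(String[] args) {
            List<Txn> lastSevenDays = List.of(
                    new Txn("Checkout", 7_500, true),
                    new Txn("Checkout", 4_200, false),
                    new Txn("Deposit Check", 25_000, true));
            System.out.printf("Revenue at risk: $%.2f%n",
                    revenueAtRiskCents(lastSevenDays) / 100.0);
        }
    }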

 Prioritize What To Fix First with the Transaction List

[Screenshot: Transaction Summary (transaction list)]

 Once you know there is an issue affecting your business, you need to quickly identify which transaction to fix.  In the transaction list, you can see each mobile transaction being monitored.  The summary for each transaction includes volume, failures, and revenue.   Sorting by revenue risk shows the transaction with the largest revenue risk first (Checkout in this case).

 Drill into Failures with Transaction Details

[Screenshot: Transaction Details (top)]

The Transaction Details dashboard gives you information to troubleshoot a specific transaction.  The dashboard allows you to easily see the Success metrics and average time.  The Revenue metrics break down the average risk per transaction as compared to the total risk.

It’s also important to quickly understand the type of failure causing most issues for the transaction: crashes, timeouts, or something marked as failed by the developer. As shown in this screenshot, the Failure Rate is being driven by timeouts.

 Troubleshoot Issues with the Root Cause Analysis Tab

[Screenshot: Transaction Details, Root Cause Analysis tab]

If crashes are the primary culprit, then the place to look is the Root Cause Analysis tab. It lists the highest-occurring crash groups. Most often you would start with the highest and work your way down. Clicking on a crash group takes you to the Crash dashboards, which give you all the detailed diagnostic information you need to troubleshoot the issue quickly.

 View Exact Events Leading Up to a Failure With the Transaction Trace Tab

[Screenshot: Transaction Details, Transaction Trace tab]

When you are trying to figure out why a particular transaction failed, it is very helpful to see the exact steps the user took that led to the problem. The Transaction Trace tab does just that: it shows the history of network, service, view, background/foreground, and crash events for a user’s transaction. In this example, we can easily see the specific events leading to a crash. After clicking Checkout, the user switches from EDGE to Wi-Fi connectivity, loses the connection, and then calls a checkout service, at which point the app crashes. Ideally the app would handle this gracefully and ensure the user can complete the checkout when they reconnect.
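
The sketch below illustrates one way an app might handle that scenario gracefully: defer the checkout call while the device is offline and replay it when connectivity returns. Everything here (ConnectivityProbe, CheckoutRequest) is a hypothetical illustration, not Crittercism functionality or any specific platform API.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical sketch: queue checkout requests while offline instead of
    // letting the network call fail (and possibly crash), then replay them
    // when connectivity is restored.
    public class ResilientCheckout {
        interface ConnectivityProbe { boolean isOnline(); }
        record CheckoutRequest(String cartId) {}

        private final Deque<CheckoutRequest> pending = new ArrayDeque<>();
        private final ConnectivityProbe probe;

        ResilientCheckout(ConnectivityProbe probe) { this.probe = probe; }

        void submit(CheckoutRequest request) {
            if (probe.isOnline()) {
                callCheckoutService(request);
            } else {
                pending.add(request);   // defer instead of failing outright
            }
        }

        // Invoke from a connectivity-change callback (e.g. when Wi-Fi reconnects).
        void onConnectivityRestored() {
            while (probe.isOnline() && !pending.isEmpty()) {
                callCheckoutService(pending.poll());
            }
        }

        private void callCheckoutService(CheckoutRequest request) {
            // Placeholder for the real network call; errors here should be caught
            // and re-queued rather than allowed to crash the app.
            System.out.println("POST /checkout for cart " + request.cartId());
        }
    }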

Using these Mobile Transaction dashboards you can monitor the overall revenue risk, prioritize what to fix first, and troubleshoot the issue quickly.

What’s Next

Hopefully this tour gives you a good overview of Crittercism Mobile Transaction Management.  The capabilities provide your mobile team the data needed to understand exactly how mobile performance issues are impacting your bottom line.  I encourage you to kick the tires on the solution by signing up for a free trial or contacting us for a demo today!



How to accurately identify the impact of system issues on end-user response time

From the Compuware APM Blog, 4 June 2013

Triggered by our current load projections for our community portal, our Apps Team was tasked with running a stress test on our production system to verify whether we can handle 10 times the load we currently experience on our existing infrastructure. In order to have the least impact in the event the site crumbled under the load, we decided to run the first test on a Sunday afternoon. Before we ran the test we gave our Operations Team a heads-up: they could expect significant load during a two-hour window, with the potential to affect other applications that also run on the same environment.

During the test, with both the Ops and Application Teams watching the live performance data, we all saw end-user response time go through the roof and the underlying infrastructure running out of resources when we hit a certain load level. What was very interesting in this exercise is that both the Application and Ops teams looked at the same data but examined the results from different angles. However, they both relied on the recently announced Compuware PureStack Technology, the first solution that, in combination with dynaTrace PurePath, exposes how IT infrastructure impacts the performance of critical business applications in heavy production environments.

Bridging the Gap between Ops and Apps Data by adding Context: One picture that shows the Hotspots of “Horizontal” Transaction as well as the “Vertical” Stack.


The root cause of the poor performance in our scenario was CPU exhaustion on the main server machine hosting both the Web and App Server, which caused us to miss our load goal. This turned out to be both an IT provisioning and an application problem. Let me explain the steps these teams took and how they came up with their list of action items to improve system performance before the second scheduled test.

Step 1: Monitor and Identify Infrastructure Health Issues

Operations Teams like having the ability to look at their list of servers and quickly see that all critical indicators (CPU, Memory, Network, Disk, etc.) are green. But when our load test reached its peak, their dashboard showed them that two of their machines were having problems:

The core server for our community portal shows problems with the CPU and is impacting one of the applications that run on it.


Step 2: What is the actual impact on the hosted applications?

Clicking on the Impacted Applications Tab shows us the applications that run on the affected machine and which ones are currently impacted:

The increased load not only impacts the Community Portal but also our Support Portal


Already the load test has taught us something: As we expect higher load on the community in the future, we might need to move the support portal to a different machine to avoid any impact.

Examined in isolation, operations-oriented monitoring data would not be that telling. But when it is placed in a context that relates it to the data the Applications team cares about (end-user response time, user experience, and so on), both teams gain more insight. This is a good start, but there is still more to learn.

Step 3: What is the actual impact on the critical transactions?

Clicking on the Community Portal application link shows us the transactions and pages that are actually impacted by the infrastructure issue, but there are still two critical unanswered questions:

  • Are these the transactions that are critical to our successful operation?
  • How badly are these transactions and individual users impacted by the performance issues?

The automatic baseline tells us that the response time for our main community pages shows significant performance impact. This also includes our homepage, which is the most valuable page for us.


Step 4: Visualizing the impact of the infrastructure issue on the transaction

The transaction-flow diagram is a great way to get both the Ops and App Teams on the same page and view data in its full context, showing the application tiers involved, the physical and virtual machines they are running on, and where the hotspots are.

The Ops and Apps Teams have one picture that tells them where the hotspots are, both in the “Horizontal” Transaction and in the “Vertical” Stack.


We knew that our pages are very heavy on content (images, JavaScript, and CSS), with up to 80% of the transaction time spent in the browser. Seeing that this performance hotspot is now down to 50% of the overall page load time, we immediately know that more of the transaction time has shifted to the new hotspot: the server side. The good news is that there is no problem with the database (it shows only a 1% response time contribution), as this entire performance hotspot shift seems to be related to the Web and App Servers, both of which run on the same machine, the one with the CPU health issues.

Step 5: Pinpointing the host health issue on the problematic machine

Drilling to the Host Health Dashboard shows what is wrong on that particular server:

The Ops Team immediately sees that the CPU consumption is mainly coming from one Java App Server. There are also some unusual spikes in Network, Disk and Page Faults that are all correlated in time.


Step 6: Process Health dashboards show slow app server response

We see that the two main processes on that machine are IIS (Web Server) and Tomcat (Application Server). A closer look shows how they are doing over time:

We are not running out of worker threads. Transfer Rate is rather flat. This tells us that the Web Server is waiting on the response from the Application Server.


It appears that the Application Server is maxing out on CPU. The incoming requests from the load testing tool queue up as the server can’t process them in time. The number of processed transactions actually drops.


Step 7: Pinpointing heavy CPU usage

Our Apps Team is now interested in figuring out what consumes all this CPU and whether this is something we can fix in the application code or whether we need more CPU power:

The Hotspot shows two layers of the Application that are heavy on CPU. Let’s drill down further.


Our sometimes rather complex pages, with lots of Confluence macros, cause the majority of the CPU usage.


Exceptions that capture stack trace information for logging are caused by missing resources and problems with authentication.

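That last hotspot is worth a closer look, because constructing a throwable forces the JVM to capture a full stack trace, which is expensive when it happens on every page render. A rough sketch of the pattern and a cheaper alternative, assuming a hypothetical resource lookup:

    import java.io.InputStream;

    public class ResourceLookup {

        // Expensive pattern: an exception is constructed just to log a condition,
        // so a full stack trace is captured on every miss, which adds up under load.
        static InputStream loadWithException(String path) {
            InputStream in = ResourceLookup.class.getResourceAsStream(path);
            if (in == null) {
                new IllegalStateException("missing resource " + path).printStackTrace();
            }
            return in;
        }

        // Cheaper pattern: treat "resource missing" as an expected condition and
        // log it without a throwable, so no stack trace is captured.
        static InputStream loadQuietly(String path) {
            InputStream in = ResourceLookup.class.getResourceAsStream(path);
            if (in == null) {
                System.err.println("missing resource " + path);
            }
            return in;
        }
    }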

Ops and Apps teams now easily prioritize both Infrastructure and app fixes

So as mentioned, ‘context is everything’. But it’s not enough simply to have data; context relies on the ability to intelligently correlate all of the data into a coherent story. When the “horizontal” transactional data for end-user response-time analysis is connected to the “vertical” infrastructure stack information, it becomes easy to get both teams reading from the same page and to prioritize the fixes that address the issues with the greatest negative impact on the business.

This exercise allowed us to identify several action items:

  • Deploy our critical applications on different machines when the applications impact each other negatively
  • Optimize the way our pages are built to reduce CPU usage
  • Increase CPU power on these virtualized machines to handle more load

AppDynamics releases powerful database monitoring solution, extends visibility beyond the application layer

1st APM vendor to bridge application and database performance with a single view.

AppDynamics, the next-generation Application Performance Management solution that simplifies the management of complex apps, has announced the release of AppDynamics for Databases to help enterprises troubleshoot and tune database performance problems. This new AppDynamics solution is available immediately and offers unmatched insight and visibility into how SQL and stored procedures execute within databases such as Oracle, SQL Server, DB2, Sybase, MySQL and PostgreSQL.

AppDynamics for Databases addresses the challenges that application support teams such as Developers and Operations face in trying to identify the cause of application performance issues that relate to database performance.  As many as 50% of application problems are the result of slow SQL calls and stored procedures invoked by applications—yet until now, databases have been a “black box” for application support teams.

“Giving our customers critical visibility and troubleshooting capability into the cause of database problems makes AppDynamics absolutely unique in the APM space,” said Jyoti Bansal, founder and CEO of AppDynamics. “Application support teams constantly wrestle with database performance problems in attempting to ensure uptime and availability of their mission-critical applications, but they usually lack the visibility they need to resolve problems. We’ve equipped them with a valuable new solution for ensuring application performance, and it will enable them to collaborate with their Database Administrator colleagues even more closely than before.”

With its new database monitoring solution, AppDynamics has applied its “secret sauce” from troubleshooting Java and .NET application servers to databases, allowing enterprises to pinpoint slow user transactions and identify the root cause of SQL and stored procedure queries. AppDynamics for Databases also offers universal database diagnostics covering Oracle, SQL Server, DB2, Sybase, PostgreSQL, and MySQL database platforms.

AppDynamics Pro for Databases includes the following features:

  • Production Ready: Less than 1% overhead in most production environments.
  • Application to Database drill-down: Ability to troubleshoot business transaction latency from the application right into the database and storage tiers.
  • SQL explain/execution plans: Allows developers and database administrators to pinpoint inefficient operations and logic, as well as diagnose why queries are running slowly.
  • Historical analysis: Monitors and records database activity 24/7 to allow users to analyze performance slowdowns in the database tier.
  • Top database wait states: Provides insights and visibility into database wait and CPU states to help users understand database resource contention and usage.
  • Storage visibility for NetApp: Provides the ability to correlate database performance with performance on NetApp storage.

“It is great to have a tightly integrated way to monitor, troubleshoot and optimize the performance of our key applications and the databases that support them,” said Nadine Thomson, Group IT Operations Manager at STA Travel. “We’re enthusiastic about the ability to use deep database, Java, and .NET performance information all from within a single AppDynamics product.”

AppDynamics for Databases is available now; get a free trial here.


Quick video intro from Compuware APM: Unified network and application performance solution for today’s data centers

A new video introduction from the Compuware APM team, published on YouTube, titled ‘Unified Network & Application Performance Solution for Today’s Data Centers.’


Application performance monitoring: Why alerts suck and monitoring solutions need to become smarter

From APM Thought Leadership at AppDynamics

I have yet to meet anyone in Dev or Ops who likes alerts. I’ve also yet to meet anyone who was fast enough to acknowledge an alert in time to prevent an application from slowing down or crashing. In the real world, alerts just don’t work: nobody has the time or patience anymore, alerts are truly evil, and no one trusts them. The most efficient alert today is an angry end-user phone call, because Dev and Ops physically hear and feel the pain of someone suffering.

Why? There is little or no intelligence in how a monitoring solution determines what is normal or abnormal for application performance. Today, monitoring solutions are only as good as the users that configure them, which is bad news because humans make mistakes, configuration takes time, and time is something many of us have little of.

It’s therefore no surprise to learn that behavioral learning and analytics are becoming key requirements for modern application performance monitoring (APM) solutions. In fact, Will Capelli from Gartner recently published a report on IT Operational Analytics and pattern-based strategies in the data center. The report covered the role of Complex Event Processing (CEP), behavior learning engines (BLEs) and analytics as a means for monitoring solutions to deliver better intelligence and quality information to Dev and Ops. Rather than just collect, store and report data, monitoring solutions must now learn and make sense of the data they collect, thus enabling them to become smarter and deliver better intelligence back to their users.

Change is constant for applications and infrastructure thanks to agile cycles, therefore monitoring solutions must also change so they can adapt and stay relevant. For example, if the performance of a business transaction is 2.5 seconds one week and drops to 200ms the week after because of a development fix, then 200ms should become the new performance baseline for that transaction; otherwise the monitoring solution won’t learn or alert on any performance regression. If the end-user experience of a business transaction goes from 2.5 seconds to 200ms, then end-user expectations change instantly, and users become used to an instant response. Monitoring solutions have to keep up with user expectations, otherwise IT will become blind to the one thing that impacts customer loyalty and experience the most.

So what do behavioral learning and analytics actually do, and how do they help someone in IT? Let’s look at some key Dev and Ops use cases that benefit from such technology.

#1 Problem Identification – Do I have a problem?

Alerts are only as good as the thresholds which trigger them. A key benefit of behavioral learning technology is the ability to automate the process of discovering and applying relevant performance thresholds to an application, its business transactions and infrastructure, all without human intervention. It does this by automatically learning the normal response time of an application, its business transactions and infrastructure at different hours of the day, week and month, ensuring these references create an accurate and dynamic baseline of what normal application performance is over time.

A performance baseline which is dynamic over time is significantly more accurate than one which is static. For example, having a static baseline threshold which assumes application performance is OK if all response times are less than 2 seconds is naive and simplistic. All user requests and business transactions are unique; they have distinct flows across the application infrastructure, which vary depending on what data is requested, processed and packaged up as a response.

Take, for example, a credit card payment business transaction. Would these requests normally take less than 2 seconds for a typical web store application? Not really; they can vary between 2 and 10 seconds. Why? There is often a delay whilst an application calls a remote 3rd-party service to validate credit card details before the payment can be authorized and confirmed. In comparison, a product search business transaction is relatively simple and localized to an application, meaning it often returns sub-second response times 24/7 (like Google, for example). Applying a 2-second static threshold to multiple business transactions like “credit card payment” and “search” will trigger alert storming (false and redundant alerts). To avoid this without behavioral learning, users must manually define individual performance thresholds for every business transaction in an application. This is bad because, as I said earlier, nobody in IT has the time to do this, so most users resort to applying thresholds which are static and global across an application. Don’t believe me? Ask your Ops people whether they get enough alerts today; chances are they’ll smile or snarl.

The screenshot below shows the average response time of a production application over time, with spikes representing peak load during weekend evening hours. You can see that on weekdays normal performance is around 100ms, yet under peak load it’s normal to experience application performance of up to several seconds. Applying a static threshold of 1 or 2 seconds in this scenario would basically cause alert storming at the weekend, even though it’s normal to see such performance spikes. This application could therefore benefit from behavioral learning technology so the correct performance baseline is applied for the correct hour and day.
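
A minimal sketch of that idea: keep a separate learnt baseline for each hour of the week so weekend peaks get their own notion of normal. The smoothing factor and the 3x deviation multiplier below are illustrative assumptions, not values from any particular product.

    import java.time.ZonedDateTime;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a dynamic, hour-of-week baseline with simple deviation detection.
    public class DynamicBaseline {
        private static final double ALPHA = 0.1;      // EWMA smoothing factor (assumption)
        private static final double DEVIATION = 3.0;  // flag responses > 3x baseline (assumption)

        private final Map<Integer, Double> baselineMsByHourOfWeek = new HashMap<>();

        private static int hourOfWeek(ZonedDateTime t) {
            return (t.getDayOfWeek().getValue() - 1) * 24 + t.getHour();
        }

        // Records a response time and reports whether it deviates from the learnt norm
        // for that hour of the week.
        public boolean recordAndCheck(ZonedDateTime when, double responseMs) {
            int bucket = hourOfWeek(when);
            Double baseline = baselineMsByHourOfWeek.get(bucket);
            boolean breach = baseline != null && responseMs > DEVIATION * baseline;
            double updated = (baseline == null)
                    ? responseMs
                    : ALPHA * responseMs + (1 - ALPHA) * baseline;
            baselineMsByHourOfWeek.put(bucket, updated);
            return breach;
        }
    }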

Another key limitation with alerts and traditional monitoring solutions is that they lack business context. They’re typically tied to infrastructure health rather than the health of the business, making it impossible for anyone to understand the business impact of an alert or problem. It can be the difference between “Server CPU Utilization is above 90%” and “22% of Credit Card Payments are stalling”. You can probably guess the latter alert is more important to troubleshoot than pulling up a terminal console, logging onto a server and typing prstat to view processes and CPU usage. Behavioral learning combined with business context allows a monitoring solution to alert on the performance and activity of the business, rather than, say, the performance and activity of its infrastructure. This ensures Dev and Ops have the correct context to understand and be aligned with the business services.

Analytics can also play a critical role in how monitoring data is presented to the user to help them troubleshoot. If a business transaction is slow or has breached its threshold, the user needs to understand the severity of the problem. For example, were a few or a lot of user transactions impacted? How many returned errors, or actually stalled and timed out? Everything is relative; Dev and Ops don’t have the time to investigate every user transaction breach, so it’s important to prioritize by business impact before jumping in to troubleshoot.

If we look at the screenshot of AppDynamics Pro below, you can see how behavioral learning and analytics can help a user identify a problem in production. We can see the checkout business transaction has breached its performance baseline (which was learnt automatically). We can also see the severity of the breach: no errors, 10 slow requests, 13 very slow, and no stalls. 23 of the 74 user requests (calls) were impacted, meaning this is a critical problem for Dev and Ops to troubleshoot.
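
The per-call severity breakdown in that screenshot can be thought of as a simple classification against the learnt baseline. The multipliers and stall cutoff in the sketch below are illustrative assumptions, not AppDynamics defaults.

    // Sketch of classifying each call relative to the learnt baseline so the team
    // can see how many requests were slow, very slow, or stalled.
    public class SeverityClassifier {
        enum Severity { NORMAL, SLOW, VERY_SLOW, STALL }

        private final double baselineMs;
        private final double stallCutoffMs;

        SeverityClassifier(double baselineMs, double stallCutoffMs) {
            this.baselineMs = baselineMs;
            this.stallCutoffMs = stallCutoffMs;
        }

        Severity classify(double responseMs) {
            if (responseMs >= stallCutoffMs) return Severity.STALL;       // timed out entirely
            if (responseMs > 10 * baselineMs) return Severity.VERY_SLOW;  // assumed multiplier
            if (responseMs > 3 * baselineMs) return Severity.SLOW;        // assumed multiplier
            return Severity.NORMAL;
        }
    }

Counting the output of such a classification over the 74 calls in the example yields the "10 slow, 13 very slow" style of breakdown shown in the dashboard.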

#2 Problem Isolation – Where is my problem?

Once a user has identified abnormal application performance, the next step is to isolate where that latency is spent in the application infrastructure. A key problem today is that most monitoring solutions collect and report data, but they don’t process or visualize it in a way that automates problem isolation for the user. The data exists, but it’s down to individual users to drill down and piece it together so they can find what they’re looking for. This is made difficult by the fact that performance data can be fragmented across multiple silos and monitoring toolsets, making it impossible for Dev or Ops to get a consistent end-to-end view of application performance and business activity. To solve this data fragmentation problem, many monitoring solutions use time-based correlation or Complex Event Processing (CEP) engines to piece together data and events from multiple sources, so they can look for patterns or key trends which may help a user isolate where a problem or latency exists in an application.

For example, if a user’s credit card payment business transaction took 9 seconds to execute, where exactly was that 9 seconds spent in the application infrastructure? If you look at performance data from an OS, app server, database or network perspective, you’ll end up with four different views of performance, none of which relate to that individual credit card payment business transaction which took 9 seconds. Using time-based correlation won’t help either; knowing the database was running at 90% CPU whilst the credit card payment transaction executed is about as helpful as a poke in the eye. Time-based correlation is effectively a guess, and given the complexity and distribution of applications today, the last thing you want to be doing is guessing where a problem might be in your application infrastructure. Infrastructure metrics tell you how an application is consuming system resources; they don’t have the granularity to tell you where an individual user business transaction is slow in the infrastructure.

Behavioral learning can also be used to learn and track how business transactions flow across distributed application infrastructure. If a monitoring solution is able to learn the journey of a business transaction, then it can monitor the real execution flow of those transactions across and inside the distributed application infrastructure. By visualizing the entire journey and latency of a business transaction at each hop in the infrastructure, monitoring solutions can make it simple for Dev and Ops to isolate problems in seconds. If you want to travel from San Francisco to LA by car, the easiest way to understand that journey is to visualize it on Google Maps. In the same way, the easiest way for Dev or Ops to isolate a slow user business transaction is to visualize its journey across the application infrastructure. For example, take the screenshot below, which shows the distributed transaction flow of a “Checkout” business transaction which took 10 seconds across its application infrastructure. You can see that 99.8% of its response time is spent making a JDBC call to the Oracle database. Isolating problems this way is much faster and more efficient than tailing log files or asking system, network or database administrators whether their silos are performing correctly.
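
A toy sketch of the underlying idea: tag the business transaction with a correlation ID, record the time spent at each hop, and the tier to isolate first falls out of a simple comparison. The tier names and timings below are made up to mirror the example, not taken from the product.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.UUID;

    // Sketch of following one business transaction's journey across tiers.
    public class TransactionFlow {
        private final String correlationId = UUID.randomUUID().toString();
        private final Map<String, Long> millisPerTier = new LinkedHashMap<>();

        void recordHop(String tier, long millis) {
            millisPerTier.merge(tier, millis, Long::sum);
        }

        String slowestTier() {
            return millisPerTier.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("none");
        }

        public static void main(String[] args) {
            TransactionFlow checkout = new TransactionFlow();
            checkout.recordHop("Web tier", 12);
            checkout.recordHop("App tier", 8);
            checkout.recordHop("JDBC call to Oracle", 9_980); // ~99.8% of a 10s transaction
            System.out.println("Correlation ID: " + checkout.correlationId);
            System.out.println("Isolate first: " + checkout.slowestTier());
        }
    }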

You can also apply dynamic baselining and analytics to the performance and flow execution of a business transaction. This means a monitoring solution can effectively highlight to the user which application infrastructure tier is responsible for a performance breach and baseline deviation. Take, for example, the screenshot below, which visualizes the flow of a business transaction in a production environment and highlights the breach for the application tier “Security Server”, which has deviated from its normal performance baseline of 959ms.

Behavioral learning and analytics can therefore be a key enabler to automating problem isolation in large, complex, distributed applications.

#3 Problem Resolution – How do I fix my problem?

Once Dev or Ops has isolated where the problem is in the application infrastructure, the next step is to identify the root cause. Many monitoring solutions today can collect diagnostic data that relates to the activity of components within an application tier such as a JVM, CLR or database. For example, a Java profiler might show you thread activity, and a database tool might show you the top N SQL statements. What these tools lack is the ability to tie diagnostic data to the execution of real user business transactions which are slow or breaching associated performance thresholds. When Ops picks up the phone to an angry user, users don’t complain about CPU utilization, thread synchronization or garbage collection. Users complain about specific business transactions they are trying to complete, like login, search or purchase.

As I outlined above in the Problem Isolation section, monitoring solutions can leverage behavioral learning technology to monitor the flow execution of business transactions across distributed application infrastructure. This capability can also be extended inside an application tier, so monitoring solutions can learn, and monitor, the relevant code execution of a slow or breaching business transaction.

For example, here is a screenshot which shows the complete code execution (diagnostic data) of a distributed Checkout business transaction which took 10 seconds. We can see in the top dialogue the code execution from the initial Struts action all the way through to the remote Web Service call which took 10 seconds. From this point we can drill inside the offending web service to its related application tier and see its code execution, before finally pinpointing the root cause of the problem, which is a slow SQL statement as shown.

Without behavioral learning and analytics,  monitoring solutions lack intelligence on what diagnostic data to collect. Some solutions try to collect everything, whilst others limit what data they collect so that their agent overhead doesn’t become intrusive in production environments. The one thing you need when trying to identify root cause is complete visibility, otherwise you begin to make assumptions or guess what might be causing things to run slow. If you only have 10% visibility into the application code in production, then you’ve only got a 10% probability of finding the actual root cause of an issue. This is why users of most legacy application monitoring solutions struggle to find root cause – because they have to balance application code visibility with monitoring agent overhead.

Monitoring today isn’t about collecting everything; it’s about collecting what is relevant to business impact, so any business impact can be resolved as quickly as possible. You can have all the diagnostic data in the world, but if that data isn’t provided in the right context for the right problem to the right user, it becomes about as useful as a chocolate teapot.

With applications becoming ever more complex, agile, virtual and distributed, Dev and Ops no longer have the time to monitor and analyze everything. Behavioral learning and analytics must help Dev and Ops monitor what’s relevant in an application, so they can focus on managing real business impact instead of infrastructure noise. Monitoring solutions must become smarter so Dev and Ops can automate problem identification, isolation and resolution. The more monitoring solutions rely on human intervention to configure and analyze, the more they will continue to fail.

If you want to experience how behavioral learning and analytics can automate the way you manage application performance, take a trial of AppDynamics Pro and see for yourself.


New blog launched by OPNET: Application Performance Matters.

A new blog has been launched by OPNET called “Application Performance Matters”. OPNET aims to develop a new forum for discussion of APM concepts, techniques, challenges, and directions.

APM, and more generally IT Service Assurance, is an area of utmost importance to business because virtually every enterprise today is driven by processes and information. Software applications, which started out as a means to enhance productivity and enforce policies, have now evolved into the very embodiment of these processes and the reference model of an organisation’s approach to conducting business.

Today, the question of whether a change to organisational practices can be implemented is virtually inseparable from the question “can we do that in our systems?”

Given the fundamental role of applications, and the increasing complexity and sophistication of application architectures, managing performance has become a hotbed of activity. There is much information to share in this area. The technologies of APM continue to evolve rapidly, as does the entire IT environment.

Many enterprises at varying stages of adopting APM wish to learn about approaches that would enable them to reap the most benefit. The blog aims to cover a full spectrum of topics, ranging from detailed technical problem solving all the way to organisational best practices.

Their first post is intended to define an initial set of terms to serve as a basis for future discussion. Download the “APM: An Evolving Lexicon” whitepaper. OPNET hopes you enjoy participating in “Application Performance Matters” and find it useful to your initiatives and daily activities in APM.