New Relic Runs on Insights: Customer Usage

From New Relic, posted in New Relic News, Using Our Products, 18 February 2015

This New Relic Insights use case is taken from New Relic Runs on Insights, a new whitepaper that collects Insights use cases from across New Relic. It highlights some of the many ways the product helps businesses leverage their data in real time, without needing developer resources, so they can make faster, more informed decisions. Download the New Relic Runs on Insights whitepaper now!

It’s not often that a product manager gets the opportunity to use the product they manage to make a positive impact on their role and their company. But that dream scenario came true for Jim Kutz, technical product manager for New Relic Insights, who was able to leverage the product to be more effective at his job.

Understanding how customers use products, and their individual components and features, is at the heart of product management. But getting detailed and accurate information on actual customer usage is a perpetual challenge for many companies.

Establishing the traditional customer feedback loop, for example, can take months of effort after a new product version is released, and even then may not provide enough detail.

To speed things up, Kutz decided to take advantage of his own product to get the immediate, detailed information he needed to guide future product development strategies and ultimately improve the customer experience. “Instead of running a focus group, conducting surveys, or interviewing a self-limiting number of customers, I realized that I could see, in real time, how our customers are using New Relic Insights,” says Kutz. “All I needed to do was write a few queries.”
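
To make that concrete, here is a minimal sketch of pulling usage data out of Insights with a NRQL query sent to the Insights query API. The account ID, query key, event type, and attribute names below are hypothetical placeholders, not the actual queries Kutz ran, and the endpoint details should be checked against the current Insights API documentation.

    # Minimal sketch: running a NRQL query against the New Relic Insights
    # query API. Account ID, query key, event type, and attributes are
    # illustrative placeholders -- substitute your own.
    import requests

    ACCOUNT_ID = "12345"             # hypothetical account ID
    QUERY_KEY = "YOUR_QUERY_KEY"     # Insights query key for that account

    # Count page views per URL over the last week, as one example of a
    # usage question a product manager might ask.
    nrql = ("SELECT count(*) FROM PageView "
            "FACET pageUrl SINCE 1 week ago")

    resp = requests.get(
        "https://insights-api.newrelic.com/v1/accounts/%s/query" % ACCOUNT_ID,
        params={"nrql": nrql},
        headers={"X-Query-Key": QUERY_KEY, "Accept": "application/json"},
    )
    resp.raise_for_status()
    print(resp.json())               # the query results as JSON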

Quick and easy insights into New Relic Insights usage

Kutz quickly built a New Relic Insights dashboard to display the usage data he and the rest of the Insights product team needed to drive better decisions. On a daily basis, he says, “I can see how customers are engaging with the product… I’m not only counting the amount of usage, but I can now understand the context of how our customers are using our product.”

Kutz got to experience firsthand just how easy it is to use New Relic Insights: “It literally took me minutes to build a basic dashboard,” he recalls.

Kutz and other members of the product management and development team use the information gleaned from New Relic Insights to drive development efforts and continuously improve the customer experience for Insights users. For example, Kutz and team have used their newfound insights to:

  • Target ideal candidates for a usability study for a new product feature
  • Discover common use cases across customers
  • Identify which new attributes to include out-of-the-box

Dramatically accelerating the feedback loop

“The entire team feels more engaged with customers because they can see that people are using the product,” Kutz says. “It’s a great motivational tool because it dramatically shortens the customer feedback loop.”

Insights is an excellent tool for continually monitoring site performance and customer experience. Product managers for many New Relic products now use Insights to glean insight about customer usage. The product manager for New Relic APM, for example, uses Insights to identify pages in the application that provide less satisfactory experiences so the engineering team can focus on improving them first.

And that’s only the beginning. “There are so many different directions we can go with this product,” Kutz says. “Understanding how people are using features and validating usage helps us determine the best roadmap.”

To learn more, download the New Relic Runs on Insights whitepaper now!


SharePoint System Performance Check beyond CPU and Memory

From apmblog.dynatrace.com, February 11, 2015

If you are responsible for keeping your SharePoint deployment healthy, I assume that “traditional” system monitoring – whether via SCOM, the Performance Monitor, or other tools – is at the top of your list. But if your first reaction to constantly high CPU, exhausted memory, or full disks is to ask for more hardware, then your actions are “too traditional”. Adding more hardware will surely make your system healthier – but it comes with a price tag that might not be necessary.

In this first blog about SharePoint Sanity Checks, I show you that there are ways to figure out which sites, pages, views, custom or 3rd party Web Parts (from AvePoint, K2, Nintex, Metalogix …) in your SharePoint environment are wasteful with resources so that you can fix the root cause and not just fight the symptom.

Feel free to follow along with all my steps, either using your own tools or using the Dynatrace Free Trial with our SharePoint FastPack.

Step #1: Server System Health Check

The first question must always be: How healthy are the Windows Servers that run your SharePoint Sites?

Not only must you look at Windows OS Metrics such as CPU, Memory, Disk and Network Utilization, you also need to monitor the individual SharePoint AppPool worker processes (w3wp.exe) to figure out whether you have individual sites that overload this server. The following is a screenshot that shows this information on a single dashboard.

A Dynatrace Host Health Dashboard shows key OS health metrics (CPU, Memory, Disk, Network) and the key SharePoint AppPools and their resource usage.
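
If you want to reproduce a basic version of this check without a dedicated monitoring product, the following sketch lists overall host utilization and the resource usage of each w3wp.exe AppPool worker process. It assumes Python with the psutil package is available on the SharePoint server and is meant as a spot check, not a replacement for continuous monitoring.

    # Spot check: overall host utilization plus per-AppPool (w3wp.exe) usage.
    # Assumes the psutil package is installed; run directly on the server.
    import psutil

    print("CPU %:", psutil.cpu_percent(interval=1))
    print("Memory %:", psutil.virtual_memory().percent)
    print("Disk %:", psutil.disk_usage("C:\\").percent)

    print("\nSharePoint AppPool worker processes (w3wp.exe):")
    # Note: the first cpu_percent() sample per process can read 0.0; run the
    # loop a second time for a meaningful per-process value.
    for proc in psutil.process_iter(["pid", "name", "cpu_percent", "memory_info"]):
        if (proc.info["name"] or "").lower() == "w3wp.exe":
            mem_mb = proc.info["memory_info"].rss / (1024 * 1024)
            print("  pid=%d cpu=%.1f%% rss=%.0f MB" % (
                proc.info["pid"], proc.info["cpu_percent"], mem_mb))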

Let me give you some recommendations on what to look out for and what to do in each case.

Bad AppPools

If some of your SharePoint AppPools consume too many resources on a machine, you may want to consider deploying them to a different server. You don’t want to cram too many heavily utilized SharePoint sites onto a single server and suffer from the cross-impact between these sites.

Storage Problems

If you see high disk utilization it is important to check what is causing it. I typically look closer at:

  • IIS: Is the web server busy serving too much static content? If that’s the case, make sure you have configured resource caching. That reduces static requests from users who use SharePoint often. Also check the IIS log settings and the modules loaded by IIS. Make sure you only log what you really need.
  • SQL Server: Is SQL Server running on the same machine as SharePoint, and maybe even hosting other databases? Talk with your DBA about checking the SQL Server configuration, and discuss a better deployment scenario, such as putting the SharePoint content database on its own SQL Server.
  • SharePoint: Check the generated log files. I often see people increasing log levels for different reasons but then forgetting to turn them back to the defaults, resulting in large amounts of data that nobody looks at anyway.

CPU Utilization

The first thing I look at is whether the CPU is consumed by one of the SharePoint AppPools or by other services running on that same machine.

  • SharePoint: This correlates to what I wrote under Bad AppPools. If the reason is too much load on an AppPool, consider deploying it on a different machine. Before you do this, please follow my additional recommendations later in this blog to verify whether fixable configuration or coding issues might be to blame.
  • SQL Server: Do you have SharePoint sites or individual pages that cause extremely high utilization on the SQL Server? If that is the case, follow my recommendations on how to identify bad pages or Web Parts that access the database excessively. In general you should talk with the DBA to do a performance sanity check.
  • Other processes: Do you have other services running on that box that spike CPU? Some batch or reporting jobs that could be deployed on a different server?

Network Utilization

It comes down to the same suspects as above:

  • IIS: Analyze how “heavy” your SharePoint pages are. Follow the general best practices on Web Performance Optimization by making your sites “slimmer”. Make sure you have content compression features turned on and content caching properly configured.
  • SharePoint: Besides talking to the database – what other services does your SharePoint instance interact with? Do you have Web Parts communicating with an external service? If that is the case, make sure that these remote service calls are optimized, e.g., cache already-fetched data or only query data that you really need.
  • SQL Server: Analyze which SharePoint Sites/Services request data as well as which other applications request data. Optimize data access or consider redeploying SQL Server to optimize the data transfer between the application and the database server.

Step #2: IIS Health Check

I already covered some IIS metrics in Step 1, but I want you to take a closer look at IIS-specific metrics such as current load, available vs. used worker threads, and bandwidth requirements:

A Dynatrace IIS Process Health dashboard makes it easy to see whether IIS is running low on threads, serving large content, or maxing out available bandwidth.

These are the metrics I always check to validate how healthy the IIS deployment is:

  • Average Page Response Size: If you have bloated websites, your IIS is serving too much data. That not only clogs the network, it also makes end users wait longer for these pages to load. Keep an eye on the average page size, especially after deploying an update, to make sure pages don’t get too big. I suggest performing constant Web Performance Sanity Checks on your top pages (a quick way to spot-check this metric from the IIS logs is sketched after this list).
  • Thread Utilization: Have you sized your IIS correctly in terms of worker threads? Are all the busy threads really busy, or just waiting on slow-performing SharePoint requests? Check out the Top Web Server and Top App Server Metrics sections of my recent Load Testing Best Practices blog.
  • Bandwidth Requirement: Is your outbound network pipe already a bottleneck? If that’s the case, do not blindly upgrade your infrastructure; first check whether you can optimize your page sizes as explained earlier.
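
As a rough, tool-agnostic way to check the first of these metrics, you can compute the average response size straight from the IIS W3C logs. The sketch below assumes W3C logging with the sc-bytes field enabled (it is not always on by default) and uses a hypothetical log file path that you will need to adjust.

    # Rough average response size computed from an IIS W3C log file.
    # Assumes sc-bytes logging is enabled; adjust LOG_FILE to your server.
    LOG_FILE = r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex150211.log"  # hypothetical path

    fields = []
    total_bytes = 0
    responses = 0

    with open(LOG_FILE, "r") as f:
        for line in f:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]      # column names for the data lines
                continue
            if line.startswith("#") or not line.strip():
                continue
            values = line.split()
            if "sc-bytes" not in fields or len(values) != len(fields):
                continue
            total_bytes += int(values[fields.index("sc-bytes")])
            responses += 1

    if responses:
        print("Average response size: %.1f KB over %d responses" % (
            total_bytes / 1024.0 / responses, responses))
    else:
        print("No sc-bytes data found - check the enabled IIS logging fields.")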

Step #3: Component Health Check

What I mentioned in the first 2 steps actually falls into “traditional” system monitoring with some additional insight on metrics that go beyond normal resource utilization monitoring. If resources are maxed out I always want to find out which components are actually using these resources. Why? Because we should first try to optimize these components before we give them more resources. I look at the following dashboard for a quick sanity check:

The Dynatrace SharePoint Performance Dashboard tells me whether resource usage is already causing performance spikes, and whether they are caused by wasteful memory usage, individual pages, or problematic Web Parts.

A good SharePoint health metric is the response time of SharePoint pages. If I see spikes, I know we are jeopardizing user adoption of SharePoint, and I know I need to treat this as a high priority. I look at the following metrics and data points to figure out what causes these spikes, which most often correlate directly with higher consumption of memory, CPU, disk, and network:

  • Memory Usage and Garbage Collection Impact: High memory usage alone is not necessarily a problem. The problem is when more memory is requested and the Garbage Collector needs to kick in and clear out a lot of old memory. That’s why I always keep an eye on overall memory usage patterns and the amount of time spent in Garbage Collection (GC). GC impacts response time and consumes a lot of CPU (a simple command-line spot check for GC time is sketched after this list).
  • Which Pages are Slow? Trying to figure out why individual pages are slow is often easier than trying to figure out why the system is slower on average. I don’t waste time, though, on a single slow page that is used by a single user. Instead I focus on pages that are slower than expected but also used by a lot of users. Optimizing them gives me more improvement for a larger audience.
  • Problematic Web Parts? SharePoint is built on Web Parts, whether they come from Microsoft, well-known 3rd-party providers (AvePoint, K2, Nintex, Metalogix …), or your own development team. Knowing which Web Parts are used and how slow they are allows you to focus even better. Too many times I have seen “Web Parts Gone Wild” caused by bad configuration or bad implementation. Check out my Top 5 SharePoint Performance Mistakes and you will understand why that is a big problem.
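
As promised above, one quick way to spot-check GC impact on a Windows/.NET server such as SharePoint is to sample the .NET CLR “% Time in GC” performance counter with the built-in typeperf utility. The short sketch below wraps that call in Python; the counter and category names are assumptions that can vary with locale and .NET version, so verify them on your own server first (for example with typeperf -q).

    # Quick GC-impact spot check using the built-in Windows typeperf tool.
    # Counter names are assumptions -- verify them with "typeperf -q" first.
    # The _Global_ instance aggregates all managed processes; individual
    # w3wp instances can be sampled the same way.
    import subprocess

    counter = r"\.NET CLR Memory(_Global_)\% Time in GC"
    # -sc 5: take five samples at the default one-second interval.
    subprocess.run(["typeperf", counter, "-sc", "5"], check=True)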

Slow Web Parts and pages can be caused by bad deployments, wrong configuration, or simply bad coding. This is what I am going to focus on in my next blog post!

Next Steps: Fix the Problem; Don’t Just Buy More Hardware

I am interested to hear what you think about these metrics, and please share the ones you use with me. In the next blog I will cover how to go deeper into SharePoint to identify the root cause of an unhealthy or slow system. Our first action should never be to just throw more hardware at the problem, but rather to understand the issue and optimize the situation.

If you want to see some of Andreas’ steps in action watch his 15 Minute SharePoint System Performance Sanity Check Video.


A Quick Tour of Transaction Management

From Crittercism  on December 3, 2014

When your app crashes or is slow, it impacts both your company’s brand and its revenue.  It impacts your brand because users can leave bad reviews/ratings, and impacts your revenue because important workflows like a checkout, sign up, or check deposit could be failing.  Crittercism Transaction Management shows how crashes and lags directly affect your bottom line. This post provides insight into what Transaction Management can offer and how to get started.

 Mobile Transactions are App Workflows That Impact Business

Let’s start with an overview of what mobile transactions are and how they are modeled.  A mobile transaction is any mobile workflow or use case in an app that has business impact.  Examples of mobile transactions include:

  • sign-up, registration, or account creation

  • importing user contacts or uploading / sharing content

  • checkout, billing, or mPOS

  • inventory or order management

  • buying / selling stock, depositing a check, or transferring funds

  • booking a hotel room or flight

These workflows usually involve multiple screens and multiple back-end or cloud API interactions.  For the workflow or transaction to be successful, each screen or back-end interaction in the process must also work.

 Monitoring a mobile transaction is easy with Crittercism.  A developer simply marks the start, the end, and the value associated with the transaction.  The value is the monetary value you would consider at risk if the transaction fails.  Typically it is something like the shopping cart value, LTV of a new customer, or a worker’s productivity cost in the case of B2E use cases.
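
To make the start/end/value model concrete, here is a small, purely illustrative sketch of marking a transaction and aggregating the value at risk from failures. It is written in Python for readability and is not the Crittercism mobile SDK, which exposes equivalent calls in the app’s native language.

    # Illustrative only: a toy tracker showing the start / value / end-or-fail
    # pattern described above. This is NOT the Crittercism SDK.
    class TransactionTracker:
        def __init__(self):
            self.open = {}            # transaction name -> value in flight
            self.revenue_at_risk = 0.0

        def begin(self, name, value):
            """Mark the start of a transaction and the monetary value at risk."""
            self.open[name] = value

        def end(self, name):
            """Mark the transaction as successfully completed."""
            self.open.pop(name, None)

        def fail(self, name, reason="failed"):
            """Mark the transaction as failed (crash, timeout, or explicit failure)."""
            value = self.open.pop(name, 0.0)
            self.revenue_at_risk += value
            print("transaction %r %s: $%.2f added to revenue at risk" % (name, reason, value))

    tracker = TransactionTracker()
    tracker.begin("checkout", value=49.99)   # e.g. the shopping cart value
    tracker.fail("checkout", reason="timeout")
    print("Total revenue at risk: $%.2f" % tracker.revenue_at_risk)

Summed across all users, these per-failure values are what roll up into the aggregate revenue-at-risk figure shown on the Transaction Summary dashboard described below.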

Now that you have an idea of the type of transactions you can monitor, let’s take a quick tour of Transaction Management dashboards that allow you to connect performance metrics to business metrics.  The dashboards are designed to help you monitor the overall revenue risk, prioritize what to fix first, and troubleshoot the issue quickly.

Monitor Overall Revenue Risk with the Transaction Summary

The most basic question to answer is how much revenue is at risk at any given time.  With the Transaction Summary dashboard, you can see the aggregate revenue at risk ($715.7K over the last 7 days for this example).  Revenue risk is the aggregated value for the failed transactions.  You can also see the top failed transactions over time, which gives you valuable insight into how app performance issues like crashes or timeouts are hurting your business.

 Prioritize What To Fix First with the Transaction List

 Once you know there is an issue affecting your business, you need to quickly identify which transaction to fix.  In the transaction list, you can see each mobile transaction being monitored.  The summary for each transaction includes volume, failures, and revenue.   Sorting by revenue risk shows the transaction with the largest revenue risk first (Checkout in this case).

 Drill into Failures with Transaction Details

The Transaction Details dashboard gives you information to troubleshoot a specific transaction.  The dashboard allows you to easily see the Success metrics and average time.  The Revenue metrics break down the average risk per transaction as compared to the total risk.

It’s also important to quickly understand the type of failure causing the most issues for the transaction–crashes, timeouts, or something marked as failed by the developer.  As demonstrated in this screenshot, the Failure Rate is being driven by timeouts.

 Troubleshoot Issues with the Root Cause Analysis Tab

If crashes are the primary culprit, then the place to look is the Root Cause Analysis tab. It lists the highest-occurring crash groups.  Most often you would start with the highest and work your way down.  Clicking on a crash group takes you to the Crash dashboards, which give you all the detailed diagnostic information you need to troubleshoot the issue quickly.

 View Exact Events Leading Up to a Failure With the Transaction Trace Tab

When you are trying to figure out why a particular transaction failed, it is very helpful to see the exact steps a user took that led to the problem; the Transaction Trace tab does just that.  It shows the history of network, service, view, background/foreground, and crash events for a user’s transaction.  In this example, we can easily see the specific events leading to a crash.  After tapping Checkout, the user switches from EDGE to Wi-Fi connectivity, loses the connection, and then calls a checkout service, at which point the app crashes.  Ideally the app would handle this gracefully and ensure the user can complete the checkout when they reconnect.

Using these Mobile Transaction dashboards you can monitor the overall revenue risk, prioritize what to fix first, and troubleshoot the issue quickly.

What’s Next

Hopefully this tour gives you a good overview of Crittercism Mobile Transaction Management.  The capabilities provide your mobile team the data needed to understand exactly how mobile performance issues are impacting your bottom line.  I encourage you to kick the tires on the solution by signing up for a free trial or contacting us for a demo today!


Announcing a New Site: Industry-First Mobile Benchmarks

From Crittercism  on October 22, 2014

What is data.crittercism.com?

We are excited to introduce data.crittercism.com, which gives you up-to-date insight into the state of the mobile industry. These insights are based on aggregated (and anonymized) data from our mobile application performance management solution, which monitors over 1B users and tens of thousands of mobile apps.

Data.crittercism.com provides detailed information about the mobile landscape, including benchmark data about:

– Crash rates
– Latencies / responsiveness
– Adoption

The reports also break out factors such as platform, geography, and device that all affect end-user experience. Example graphs and reports include:

– Crash rates and usage across the major mobile platforms
– How the top devices are performing
– How responsiveness and latency vary by location and by carrier or Wi-Fi network

Who should use this?

Since Crittercism began releasing data, such as the recent iOS 8 Performance Data and Mobile Benchmark Reports, we have seen tremendous interest from many folks in the mobile industry.  We built data.crittercism.com to help provide market-level insight for practitioners, including:

– Industry experts interested in how experience and performance on the latest devices, mobile platforms, or networks are trending
– Mobile developers, engineers, and operations teams that want to benchmark their app’s performance
– Mobile app business owners prioritizing mobile investments and trying to understand the relative health of their mobile portfolio
– …and anybody curious about the mobile industry who wants to understand global trends in mobile performance and experience

What’s next?

Data.crittercism.com is available now.  Hop over to the site today and explore the mobile benchmark data.  We also plan to keep updating it over time with new reports that provide insight into new trends in mobile.

In addition, this data is the tip of the iceberg.  If you would like to get actionable insights about how your own mobile application is performing, then try Crittercism’s mobile application performance management solution today.


What APM vendors can learn from building supercars

From AppDynamics Blog 3 June 2013

McLaren this year will launch their P1 Supercar, which will turn the average driver into a track day hero.

What’s significant about this particular car is that it relies on modern-day technology and innovation to transform a driver’s ability to accelerate, corner, and stop faster than any other car on the planet–because it has:

  1. 903bhp on tap derived from a combined V8 Twin Turbo and KERS setup, meaning it has a better power/weight ratio than a Bugatti Veyron
  2. Active aerodynamics & DRS to control the airflow so it remains stable under acceleration and braking without incurring drag
  3. Traction control and brake steer to minimize slip and increase traction in and out of corners
  4. 600Kg of downforce at 150mph so it can corner on rails up to 2G
  5. Lightness–everything exists for a purpose so there is less weight to transfer under braking and acceleration

You don’t have to be Lewis Hamilton or Michael Schumacher to drive it fast. The P1 creates enormous amounts of mechanical grip, traction, acceleration and feedback so the driver feels “confident” in their ability to accelerate, corner and stop, without losing control and killing themselves. I’ve been lucky enough to sit in the driver’s seat of a McLaren MP4-12C and it’s a special experience – you have a steering wheel, some dials and some pedals – that’s really it, with none of the bells and whistles that you normally get in a Mercedes or Porsche. It’s “Focused” and “Pure” so the driver has complete visibility to drive as fast as possible, which is ultimately the whole purpose of the car.

How does this relate to Application Performance Monitoring (APM)?

Well, how many APM solutions today allow a novice user to solve complex application performance problems? Erm, not many. You need to be an uber geek with most because they’ve been written for developers by developers. Death by drill-down is a common symptom because novice APM users have no idea how to interpret metrics or what to look for. It would be like McLaren putting their F1 wheel with a thousand buttons in the new P1 road car for us novice drivers to play with.

It’s actually a lot worse than that though, because many APM vendors sell these things called “suites” that are enormously complex to install, configure and use. Imagine if you paid $1.4m and McLaren delivered you a P1 in 5 pieces and you had to assemble the engine, gearbox, chassis, suspension and brakes yourself. You’d have no choice but to pay McLaren engineers to assemble it for you in your own configuration. This is pretty much how most APM vendors have sold APM over the past decade–hence why they have hundreds of consultants. The majority of customers have spent more time and effort maintaining APM than using it to solve performance issues in their business. It’s kinda like buying a supercar and not driving it.

Fortunately, a few vendors like AppDynamics have succeeded in delivering APM through a single product that combines End User Monitoring, Application Discovery and Mapping, Transaction Profiling, Deep Diagnostics and Analytics. You download it, install it and you solve your performance issues in minutes–it just works out-of-the-box. What’s even better is that you can lease the APM solution through annual subscriptions instead of buying it outright with expensive perpetual licenses and annual maintenance.

If you want an APM solution that lets you manage application performance, then make sure it does just that for you. If you don’t get value from an APM solution in the first 20 minutes, then put it in the trash can because that’s 20 minutes of your time you’ve wasted not managing application performance. Sign up for a free trial of AppDynamics and find out how easy APM can be. If APM vendors built their solutions like car manufacturers build supercars, then the world would be a faster place (no pun intended).


How to accurately identify impact of system issues on end-user response time

From Compuware APM Blog, 4 June 2013

Triggered by current load projections for our community portal, our Apps Team was tasked with running a stress test on our production system to verify whether we can handle 10 times the load we currently experience on our existing infrastructure. In order to have the least impact in the event the site crumbled under the load, we decided to run the first test on a Sunday afternoon. Before we ran the test we gave our Operations Team a heads-up: they could expect significant load during a two-hour window, with the potential to affect other applications that also run on the same environment.

During the test, with both the Ops and Application Teams watching the live performance data, we all saw end user response time go through the roof and the underlying infrastructure running out of resources when we hit a certain load level. What was very interesting in this exercise is that both the Application and Ops teams looked at the same data but examined the results from a different angle. However, they both relied on the recently announced Compuware PureStack Technology, the first solution that – in combination with dynaTrace PurePath – exposes how IT infrastructure impacts the performance of critical business applications in heavy production environments.

Bridging the Gap between Ops and Apps Data by adding Context: One picture that shows the Hotspots of the “Horizontal” Transaction as well as the “Vertical” Stack.

The root cause of the poor performance in our scenario was CPU exhaustion on the main server machine hosting both the Web and App Server, which caused us to miss our load goal. This turned out to be both an IT provisioning problem and an application problem. Let me explain the steps these teams took and how they came up with their list of action items to improve the current system performance and do better in the second scheduled test.

Step 1: Monitor and Identify Infrastructure Health Issues

Operations Teams like having the ability to look at their list of servers and quickly see that all critical indicators (CPU, Memory, Network, Disk, etc.) are green. But when they looked at the server landscape as our load test reached its peak, their dashboard showed them that two of their machines were having problems:

The core server for our community portal shows problems with the CPU and is impacting one of the applications that run on it.

Step 2: What is the actual impact on the hosted applications?

Clicking on the Impacted Applications Tab shows us the applications that run on the affected machine and which ones are currently impacted:

The increased load not only impacts the Community Portal but also our Support Portal

Already the load test has taught us something: As we expect higher load on the community in the future, we might need to move the support portal to a different machine to avoid any impact.

When examined independently, operations-oriented monitoring would not be that telling. But when it is placed in a context that relates it to data (end user response time, user experience, …) important to the Applications team, both teams gain more insight.  This is a good start, but there is still more to learn.

Step 3: What is the actual impact on the critical transactions?

Clicking on the Community Portal application link shows us the transactions and pages that are actually impacted by the infrastructure issue, but there still are two critical unanswered questions:

  • Are these the transactions that are critical to our successful operation?
  • How badly are these transactions and individual users impacted by the performance issues?

The automatic baseline tells us that our response time for our main community pages shows significant performance impact. This also includes our homepage which is the most valuable page for us.

Step 4: Visualizing the impact of the infrastructure issue on the transaction

The transaction-flow diagram is a great way to get both the Ops and App Teams on the same page and view data in its full context, showing the application tiers involved, the physical and virtual machines they are running on, and where the hotspots are.

The Ops and Apps Teams have one picture that tells them where the Hotspots are, both in the “Horizontal” Transaction and in the “Vertical” Stack.

We knew that our pages are very heavy on content (Images, JavaScript and CSS), with up to 80% of the transaction time spent in the browser. Seeing that this performance hotspot is now down to 50% in relation to the overall page load time we immediately know that more of the transaction time has shifted to the new hotspot: the server side. The good news is that there is no problem with the database (only shows 1% response time contribution) as this entire performance hotspot shift seems to be related to the Web and App Servers, both of which run on the same machine – the one that has these CPU Health Issues.

Step 5: Pinpointing host health issue on the problematic machine

Drilling to the Host Health Dashboard shows what is wrong on that particular server:

The Ops Team immediately sees that the CPU consumption is mainly coming from one Java App Server. There are also some unusual spikes in Network, Disk and Page Faults that all correlate in time.

Step 6: Process Health dashboards show slow app server response

We see that the two main processes on that machine are IIS (Web Server) and Tomcat (Application Server). A closer look shows how they are doing over time:

We are not running out of worker threads. Transfer Rate is rather flat. This tells us that the Web Server is waiting on the response from the Application Server.

It appears that the Application Server is maxing out on CPU. The incoming requests from the load testing tool queue up as the server can’t process them in time. The number of processed transactions actually drops.

Step 7: Pinpointing heavy CPU usage

Our Apps Team is now interested in figuring out what consumes all this CPU and whether this is something we can fix in the application code or whether we need more CPU power:

The Hotspot shows two layers of the Application that are heavy on CPU. Let’s drill down further.

Our sometimes rather complex pages with lots of Confluence macros cause the major CPU Usage.

Exceptions that capture stack trace information for logging are caused by missing resources and problems with authentication.

Ops and Apps teams now easily prioritize both Infrastructure and app fixes

So as mentioned, ‘context is everything’.  But it’s not simply enough to have data – context relies on the ability to intelligently correlate all of the data into a coherent story.  When the “horizontal” transactional data for end-user response-time analysis is connected to the “vertical” infrastructure stack information, it becomes easy to get both teams reading from the same page and to prioritize the fixes for the problems that have the greatest negative impact on the business.

This exercise allowed us to identify several action items:

  • Deploy our critical applications on different machines when the applications impact each other negatively
  • Optimize the way our pages are built to reduce CPU usage
  • Increase CPU power on these virtualized machines to handle more load

APM as a Service: 4 steps to monitor real user experience in production

From Compuware APM Blog, 15 May 2013

With our new service platform and the convergence of dynaTrace PurePath Technology with the Gomez Performance Network, we are proud to offer an APMaaS solution that sets a higher bar for complete user experience management, with end-to-end monitoring technologies that include real-user, synthetic, third-party service monitoring, and business impact analysis.

To showcase the capabilities we used the free trial on our own about:performance blog as a demonstration platform. It is based on the popular WordPress technology which uses PHP and MySQL as its implementation stack. With only 4 steps we get full availability monitoring as well as visibility into every one of our visitors and can pinpoint any problem on our blog to problems in the browser (JavaScript, slow 3rd party, …), the network (slow network connectivity, bloated website, ..) or the application itself (slow PHP code, inefficient MySQL access, …).

Before we get started, let’s have a look at the Compuware APMaaS architecture. In order to collect real user performance data, all you need to do is install a so-called Agent on the Web and/or Application Server. The data gets sent in an optimized and secure way to the APMaaS Platform. Performance data is then analyzed through the APMaaS Web Portal, with drilldown capabilities into the dynaTrace Client.

Compuware APMaaS is a secure service to monitor every single end user on your application end-to-end (browser to database)

4 Steps to setup APMaaS for our Blog powered by WordPress on PHP

From a high-level perspective, joining Compuware APMaaS and setting up your environment consists of four basic steps:

  1. Sign up with Compuware for the Free Trial
  2. Install the Compuware Agent on your Server
  3. Restart your application
  4. Analyze Data through the APMaaS Dashboards

In this article, we assume that you’ve successfully signed up, and will walk you through the actual setup steps to show how easy it is to get started.

After signing up with Compuware, the first sign of your new Compuware APMaaS environment will be an email notifying you that a new environment instance has been created:

Follow the steps explained in the Welcome Email to get started

While you can immediately take a peek into your brand new APMaaS account at this point, there’s not much to see: Before we can collect any data for you, you will have to finish the setup in your application by downloading and installing the agents.

After installation is complete and the Web Server is restarted the agents will start sending data to the APMaaS Platform – and with dynaTrace 5.5, this also includes the PHP agent which gives insight into what’s really going on in the PHP application!

Agent Overview shows us that we have both the Web Server and PHP Agent successfully loaded

Now we are ready to go!

For Ops & Business: Availability, Conversions, User Satisfaction

Through the APMaaS Web Portal, we start with some high level web dashboards that are also very useful for our Operations and Business colleagues. These show Availability, Conversion Rates as well as User Satisfaction and Error Rates. To show the integrated capabilities of the complete Compuware APM platform, Availability is measured using Synthetic Monitors that constantly check our blog while all of the other values are taken from real end user monitoring.

Operations View: Automatic Availability and Response Time Monitoring of our Blog

Business View: Real Time Visits, Conversions, User Satisfaction and Errors

For App Owners: Application and End User Performance Analysis

Through the dynaTrace client we get a richer view of the real end user data. The PHP agent we installed is a full equivalent to the dynaTrace Java and .NET agents, and features like the application overview, together with our self-learning automatic baselining, will just work the same way regardless of the server-side technology:

Application level details show us that we had a response time problem and that we currently have several unhappy end users

Before drilling down into the performance analytics, let’s have a quick look at the key user experience metrics such as where our blog users actually come from, the browsers they use, and whether their geographical location impacts user experience:

The UEM Key Metrics dashboards give us the key metrics of web analytics tools and tie them together with performance data. Visitors from remote locations are obviously impacted in their user experience

If you are responsible for User Experience and interested in some of our best practices I recommend checking our other UEM-related blog posts – for instance: What to do if A/B testing fails to improve conversions?

Going a bit deeper – What impacts End User Experience?

dynaTrace automatically detects important URLs as so-called “Business Transactions.” In our case we have different blog categories that visitors can click on. The following screenshot shows us that we automatically get dynamic baselines calculated for these identified business transactions:

Dynamic Baselining detects a significant violation of the baseline during a 4.5-hour period last night
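
For readers unfamiliar with the term, the sketch below is a toy illustration of what a dynamic baseline check conceptually does: it learns “normal” from recent samples and flags values that deviate far from it. It is not dynaTrace’s actual baselining algorithm; the window size and threshold are arbitrary assumptions.

    # Toy illustration of "dynamic baselining": flag response times that
    # deviate strongly from recent history. NOT dynaTrace's actual algorithm.
    from collections import deque
    from statistics import mean, stdev

    WINDOW = 50            # number of recent samples that define "normal"
    THRESHOLD_SIGMAS = 3   # how far outside normal counts as a violation

    history = deque(maxlen=WINDOW)

    def check(response_time_ms):
        """Return True if this sample violates the current baseline."""
        violation = False
        if len(history) >= 10:                 # need some history before judging
            baseline = mean(history)
            spread = stdev(history) or 1.0
            violation = response_time_ms > baseline + THRESHOLD_SIGMAS * spread
        history.append(response_time_ms)
        return violation

    # Example: a stream of mostly normal samples with one obvious spike.
    for t in [320, 310, 340, 300, 335, 325, 315, 330, 305, 320, 1900]:
        if check(t):
            print("baseline violation: %d ms" % t)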

Here we see that our overall response time for requests by category slowed down on May 12. Let’s investigate what happened here, and move to the transaction flow which visualizes PHP transactions from the browser to the database and maps infrastructure health data onto every tier that participated in these transactions:

The Transaction Flow shows us a lot of interesting points such as Errors that happen both in the browser and the WordPress instance. It also shows that we are heavy on 3rd party but good on server health

Since we are always striving to improve our users’ experience, the first troubling thing on this screen is that we see errors happening in browsers – maybe someone forgot to upload an image when posting a new blog entry? Let’s drill down to the Errors dashlet to see what’s happening here:

3rd Party Widgets throw JavaScript errors and thereby impact end user experience.

Apparently, some of the third party widgets we have on the blog caused JavaScript errors for some users. Using the error message, we can investigate which widget causes the issue, and where it’s happening. We can also see which browsers, versions and devices this happens on to focus our optimization efforts. If you happen to rely on 3rd party plugins you want to check the blog post You only control 1/3 of your Page Load Performance.

PHP Performance Deep Dive

We will analyze the performance problems on the PHP server side in a follow-up blog post, where we will show you the steps to identify problematic PHP code. In our case it actually turned out to be a problematic plugin that helps us identify bad requests (requests from bots, …).

Conclusion and Next Steps

Stay tuned for more posts on this topic, or try Compuware APMaaS out yourself by signing up here for the free trial!