New Relic Runs on Insights: Customer Usage

From New Relic, posted in New Relic News and Using Our Products, 18 February 2015

This New Relic Insights use case is taken from New Relic Runs on Insights, a new whitepaper collecting Insights use cases from across New Relic. It highlights some of the many ways the product can help businesses leverage their data in real time – without needing developer resources – and make faster, more informed decisions. Download the New Relic Runs on Insights whitepaper now!

It’s not often that a product manager gets the opportunity to use the product they manage to make a positive impact on their role and their company. But that dream scenario came true for Jim Kutz, technical product manager for New Relic Insights, who was able to leverage the product to become more effective at his job.

Understanding how customers use products and their individual components and features is at the heart of product management. But getting detailed and accurate information on actual customer usage is a perpetual challenge for many companies.

Establishing the traditional customer feedback loop, for example, can take months of effort after a new product version is released, and even then may not provide enough detail.

To speed things up, Kutz decided to take advantage of his own product to get the immediate, detailed information he needed to guide future product development strategies and ultimately improve the customer experience. “Instead of running a focus group, conducting surveys, or interviewing a self-limiting number of customers, I realized that I could see, in real time, how our customers are using New Relic Insights,” says Kutz. “All I needed to do was write a few queries.”
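For readers curious what “a few queries” might look like in practice, usage questions like Kutz’s can be posed to Insights in NRQL and submitted through the Insights query API. The sketch below is illustrative only – the account ID, query key, event type, and attribute names are placeholders, not details taken from this article:

```python
import urllib.parse
import urllib.request

def build_query_url(account_id, nrql):
    """Build an Insights query API URL for a given NRQL query."""
    base = "https://insights-api.newrelic.com/v1/accounts/%s/query" % account_id
    return base + "?" + urllib.parse.urlencode({"nrql": nrql})

def run_query(account_id, query_key, nrql):
    """Send the NRQL query; returns the raw JSON response body as text."""
    req = urllib.request.Request(
        build_query_url(account_id, nrql),
        headers={"X-Query-Key": query_key, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# A usage-style question: which accounts viewed dashboards most this week?
# (Event type and attribute names are hypothetical.)
nrql = "SELECT count(*) FROM PageView FACET accountName SINCE 1 week ago"
# result = run_query("12345", "YOUR_QUERY_KEY", nrql)
```

Swapping in different event types and `FACET` attributes is all it takes to turn the same skeleton into the kind of usage dashboard described below.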


Quick and easy insights into New Relic Insights usage

Kutz quickly built a New Relic Insights dashboard to display the usage data he and the rest of the Insights product team needed to drive better decisions. On a daily basis, he says, “I can see how customers are engaging with the product… I’m not only counting the amount of usage, but I can now understand the context of how our customers are using our product.”

Kutz got to experience firsthand just how easy it is to use New Relic Insights: “It literally took me minutes to build a basic dashboard,” he recalls.

Kutz and other members of the product management and development team use the information gleaned from New Relic Insights to drive development efforts and continuously improve the customer experience for Insights users. For example, Kutz and team have used their newfound insights to:

  • Target ideal candidates for a usability study for a new product feature
  • Discover common use cases across customers
  • Identify which new attributes to include out-of-the-box

Dramatically accelerating the feedback loop

“The entire team feels more engaged with customers because they can see that people are using the product,” Kutz says. “It’s a great motivational tool, because it dramatically shortens the customer feedback loop.”

Insights is an excellent tool for continually monitoring site performance and customer experience. Product managers for many New Relic products now use Insights to glean insight about customer usage. The product manager for New Relic APM, for example, uses Insights to identify pages in the application that provide less satisfactory experiences so the engineering team can focus on improving them first.

And that’s only the beginning. “There are so many different directions we can go with this product,” Kutz says. “Understanding how people are using features and validating usage helps us determine the best roadmap.”

To learn more, download the New Relic Runs on Insights whitepaper now!


What APM vendors can learn from building supercars

From AppDynamics Blog 3 June 2013

McLaren this year will launch their P1 Supercar, which will turn the average driver into a track day hero.

What’s significant about this particular car is that it relies on modern-day technology and innovation to transform a driver’s ability to accelerate, corner and stop faster than any other car on the planet–because it has:

  1. 903bhp on tap derived from a combined V8 Twin Turbo and KERS setup, meaning it has a better power/weight ratio than a Bugatti Veyron
  2. Active aerodynamics & DRS to control the airflow so it remains stable under acceleration and braking without incurring drag
  3. Traction control and brake steer to minimize slip and increase traction in and out of corners
  4. 600Kg of downforce at 150mph so it can corner on rails up to 2G
  5. Lightness–everything exists for a purpose so there is less weight to transfer under braking and acceleration

You don’t have to be Lewis Hamilton or Michael Schumacher to drive it fast. The P1 creates enormous amounts of mechanical grip, traction, acceleration and feedback so the driver feels “confident” in their ability to accelerate, corner and stop without losing control and killing themselves. I’ve been lucky enough to sit in the driver’s seat of a McLaren MP4-12C and it’s a special experience – you have a steering wheel, some dials and some pedals – that’s really it, with none of the bells or whistles you normally get in a Mercedes or Porsche. It’s “Focused” and “Pure” so the driver has complete visibility to drive as fast as possible, which is ultimately the whole purpose of the car.

How does this relate to Application Performance Monitoring (APM)?

Well, how many APM solutions today allow a novice user to solve complex application performance problems? Erm, not many. You need to be an uber geek with most because they’ve been written for developers by developers. Death by drill-down is a common symptom because novice APM users have no idea how to interpret metrics or what to look for. It would be like McLaren putting their F1 wheel with a thousand buttons in the new P1 road car for us novice drivers to play with.

It’s actually a lot worse than that, though, because many APM vendors sell these things called “suites” that are enormously complex to install, configure and use. Imagine if you paid $1.4m and McLaren delivered your P1 in 5 pieces and you had to assemble the engine, gearbox, chassis, suspension and brakes yourself. You’d have no choice but to pay McLaren engineers to assemble it with your own configuration. This is pretty much how most APM vendors have sold APM over the past decade–which is why they have hundreds of consultants. The majority of customers have spent more time and effort maintaining APM than using it to solve performance issues in their business. It’s kinda like buying a supercar and not driving it.

Fortunately, a few vendors like AppDynamics have succeeded in delivering APM through a single product that combines End User Monitoring, Application Discovery and Mapping, Transaction Profiling, Deep Diagnostics and Analytics. You download it, install it and solve your performance issues in minutes–it just works out-of-the-box. What’s even better is that you can lease the APM solution through annual subscriptions instead of buying it outright with expensive perpetual licenses and annual maintenance.

If you want an APM solution that lets you manage application performance, then make sure it does just that for you. If you don’t get value from an APM solution in the first 20 minutes, then put it in the trash can because that’s 20 minutes of your time you’ve wasted not managing application performance. Sign up for a free trial of AppDynamics and find out how easy APM can be. If APM vendors built their solutions like car manufacturers build supercars, then the world would be a faster place (no pun intended).


How to accurately identify impact of system issues on end-user response time

From Compuware APM Blog, 4 June 2013

Triggered by current load projections for our community portal, our Apps Team was tasked to run a stress test on our production system to verify whether we could handle 10 times the load we currently experience on our existing infrastructure. To have the least impact in the event the site crumbled under the load, we decided to run the first test on a Sunday afternoon. Before we ran the test we gave our Operations Team a heads-up: they could expect significant load during a two-hour window, with the potential to affect other applications that run on the same environment.

During the test, with both the Ops and Application Teams watching the live performance data, we all saw end user response time go through the roof and the underlying infrastructure running out of resources when we hit a certain load level. What was very interesting in this exercise is that both the Application and Ops teams looked at the same data but examined the results from a different angle. However, they both relied on the recently announced Compuware PureStack Technology, the first solution that – in combination with dynaTrace PurePath – exposes how IT infrastructure impacts the performance of critical business applications in heavy production environments.

Bridging the Gap between Ops and Apps Data by adding Context: One picture that shows the Hotspots of “Horizontal” Transaction as well as the “Vertical” Stack.


The root cause of the poor performance in our scenario was CPU exhaustion on the main server machine hosting both the Web and App Server, which caused us to miss our load goal. This turned out to be both an IT provisioning and an application problem. Let me explain the steps these teams took and the action items they came up with to improve current system performance and do better in the second scheduled test.

Step 1: Monitor and Identify Infrastructure Health Issues

Operations Teams like having the ability to look at their list of servers and quickly see that all critical indicators (CPU, Memory, Network, Disk, etc) are green. But when they looked at the server landscape when our load test reached its peak, their dashboard showed them that two of their machines were having problems:

The core server for our community portal shows problems with the CPU and is impacting one of the applications that run on it.


Step 2: What is the actual impact on the hosted applications?

Clicking on the Impacted Applications Tab shows us the applications that run on the affected machine and which ones are currently impacted:

The increased load not only impacts the Community Portal but also our Support Portal


Already the load test has taught us something: As we expect higher load on the community in the future, we might need to move the support portal to a different machine to avoid any impact.

When examined independently, operations-oriented monitoring would not be that telling. But when it is placed in a context that relates it to data important to the Applications team (end user response time, user experience, …), both teams gain more insight. This is a good start, but there is still more to learn.

Step 3: What is the actual impact on the critical transactions?

Clicking on the Community Portal application link shows us the transactions and pages that are actually impacted by the infrastructure issue, but there still are two critical unanswered questions:

  • Are these the transactions that are critical to our successful operation?
  • How badly are these transactions and individual users impacted by the performance issues?

The automatic baseline tells us that our response time for our main community pages shows significant performance impact. This also includes our homepage which is the most valuable page for us.


Step 4: Visualizing the impact of the infrastructure issue on the transaction

The transaction-flow diagram is a great way to get both the Ops and App Teams on the same page and view data in its full context, showing the application tiers involved, the physical and virtual machines they are running on, and where the hotspots are.

The Ops and Apps Teams have one picture that tells them where the Hotspots are, both in the “Horizontal” Transaction and in the “Vertical” Stack.


We knew that our pages are very heavy on content (Images, JavaScript and CSS), with up to 80% of the transaction time spent in the browser. Seeing that this performance hotspot is now down to 50% of the overall page load time, we immediately know that more of the transaction time has shifted to the new hotspot: the server side. The good news is that there is no problem with the database (it contributes only 1% of response time); the entire performance hotspot shift seems to be related to the Web and App Servers, both of which run on the same machine – the one with the CPU Health Issues.
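The arithmetic behind that conclusion is worth spelling out (the absolute times below are invented for illustration): if the browser-side time stays roughly constant while its share of the total drops from 80% to 50%, the server-side time must have grown several-fold.

```python
def server_time(browser_time, browser_share):
    """Given a fixed browser-side time and its share of the total page
    load, return the implied server-side time."""
    total = browser_time / browser_share
    return total - browser_time

browser = 2.0  # seconds spent in the browser (hypothetical, held constant)

before = server_time(browser, 0.80)      # 0.5 s server-side before the test
under_load = server_time(browser, 0.50)  # 2.0 s server-side under load
# The server-side contribution quadrupled even though the browser's
# share "only" fell from 80% to 50%.
```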

Step 5: Pinpointing host health issue on the problematic machine

Drilling to the Host Health Dashboard shows what is wrong on that particular server:

The Ops Team immediately sees that the CPU consumption is mainly coming from one Java App Server. There are also some unusual spikes in Network, Disk and Page Faults that are all correlated in time.


Step 6: Process Health dashboards show slow app server response

We see that the two main processes on that machine are IIS (Web Server) and Tomcat (Application Server). A closer look shows how they are doing over time:

We are not running out of worker threads. Transfer Rate is rather flat. This tells us that the Web Server is waiting on the response from the Application Server.


It appears that the Application Server is maxing out on CPU. The incoming requests from the load testing tool queue up as the server can’t process them in time. The number of processed transactions actually drops.

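The queuing effect described here is easy to reproduce with a toy model (the rates below are invented): once arrivals exceed what the CPU-bound server can process per interval, the backlog grows every interval while throughput is pinned at the service rate. A real server usually does even worse past saturation, since context switching and garbage collection eat into the service rate itself – something this simple model does not capture.

```python
def simulate(arrival_rate, service_rate, ticks):
    """Toy discrete-time queue: each tick, `arrival_rate` requests arrive
    and at most `service_rate` are processed.
    Returns (final queue length, total completed requests)."""
    queue = 0
    completed = 0
    for _ in range(ticks):
        queue += arrival_rate
        served = min(queue, service_rate)
        queue -= served
        completed += served
    return queue, completed

# Below saturation: everything is served, no backlog builds up.
q1, done1 = simulate(arrival_rate=80, service_rate=100, ticks=60)
# Above saturation (CPU maxed out): the backlog grows every tick and
# throughput is capped at the service rate.
q2, done2 = simulate(arrival_rate=150, service_rate=100, ticks=60)
```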

Step 7: Pinpointing heavy CPU usage

Our Apps Team is now interested in figuring out what consumes all this CPU and whether this is something we can fix in the application code or whether we need more CPU power:

The Hotspot view shows two layers of the Application that are heavy on CPU. Let’s drill down further.


Our sometimes rather complex pages with lots of Confluence macros cause the majority of the CPU usage.


Exceptions that capture stack trace information for logging are caused by missing resources and problems with authentication.


Ops and Apps teams now easily prioritize both Infrastructure and app fixes

So as mentioned, ‘context is everything’. But it’s not enough simply to have data – context relies on the ability to intelligently correlate all of the data into a coherent story. When the “horizontal” transactional data for end-user response-time analysis is connected to the “vertical” infrastructure stack information, it becomes easy to get both teams reading from the same page and to prioritize the fixes for the issues with the greatest negative impact on the business.

This exercise allowed us to identify several action items:

  • Deploy our critical applications on different machines when the applications impact each other negatively
  • Optimize the way our pages are built to reduce CPU usage
  • Increase CPU power on these virtualized machines to handle more load

APM as a Service: 4 steps to monitor real user experience in production

From Compuware APM Blog, 15 May 2013

With our new service platform and the convergence of dynaTrace PurePath Technology with the Gomez Performance Network, we are proud to offer an APMaaS solution that sets a higher bar for complete user experience management, with end-to-end monitoring technologies that include real-user, synthetic, third-party service monitoring, and business impact analysis.

To showcase the capabilities, we used the free trial on our own about:performance blog as a demonstration platform. It is based on the popular WordPress technology, which uses PHP and MySQL as its implementation stack. With only 4 steps we get full availability monitoring as well as visibility into every one of our visitors, and can trace any problem on our blog to the browser (JavaScript, slow 3rd party, …), the network (slow network connectivity, bloated website, …) or the application itself (slow PHP code, inefficient MySQL access, …).

Before we get started, let’s have a look at the Compuware APMaaS architecture. To collect real user performance data, all you need to do is install a so-called Agent on the Web and/or Application Server. The data gets sent in an optimized and secure way to the APMaaS Platform. Performance data is then analyzed through the APMaaS Web Portal, with drill-down capabilities into the dynaTrace Client.

Compuware APMaaS is a secure service to monitor every single end user on your application end-to-end (browser to database)


4 Steps to setup APMaaS for our Blog powered by WordPress on PHP

From a high-level perspective, joining Compuware APMaaS and setting up your environment consists of four basic steps:

  1. Sign up with Compuware for the Free Trial
  2. Install the Compuware Agent on your Server
  3. Restart your application
  4. Analyze Data through the APMaaS Dashboards

In this article, we assume that you’ve successfully signed up, and will walk you through the actual setup steps to show how easy it is to get started.

After signing up with Compuware, the first sign of your new Compuware APMaaS environment will be an email notifying you that a new environment instance has been created:

Follow the steps explained in the Welcome Email to get started


While you can immediately take a peek into your brand new APMaaS account at this point, there’s not much to see: Before we can collect any data for you, you will have to finish the setup in your application by downloading and installing the agents.

After installation is complete and the Web Server is restarted the agents will start sending data to the APMaaS Platform – and with dynaTrace 5.5, this also includes the PHP agent which gives insight into what’s really going on in the PHP application!

Agent Overview shows us that we have both the Web Server and PHP Agent successfully loaded


Now we are ready to go!

For Ops & Business: Availability, Conversions, User Satisfaction

Through the APMaaS Web Portal, we start with some high level web dashboards that are also very useful for our Operations and Business colleagues. These show Availability, Conversion Rates as well as User Satisfaction and Error Rates. To show the integrated capabilities of the complete Compuware APM platform, Availability is measured using Synthetic Monitors that constantly check our blog while all of the other values are taken from real end user monitoring.

Operations View: Automatic Availability and Response Time Monitoring of our Blog


Business View: Real Time Visits, Conversions, User Satisfaction and Errors

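The synthetic monitors mentioned above are, at their simplest, scheduled HTTP probes. A minimal sketch follows – the URL and scheduling are placeholders, not a description of how the Gomez network actually works:

```python
import urllib.error
import urllib.request

def check_once(url, timeout=10):
    """One synthetic probe: True if the page answers with HTTP 2xx/3xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def availability(results):
    """Availability percentage over a list of probe outcomes."""
    return 100.0 * sum(results) / len(results) if results else 0.0

# A scheduler (cron, or a loop with time.sleep) would call check_once
# against e.g. "https://example.com/blog" every few minutes and feed
# the outcomes into availability() for dashboards like the one above.
```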

For App Owners: Application and End User Performance Analysis

Through the dynaTrace client we get a richer view of the real end user data. The PHP agent we installed is a full equivalent to the dynaTrace Java and .NET agents, and features like the application overview, together with our self-learning automatic baselining, work the same way regardless of the server-side technology:

Application level details show us that we had a response time problem and that we currently have several unhappy end users


Before drilling down into the performance analytics, let’s have a quick look at the key user experience metrics such as where our blog users actually come from, the browsers they use, and whether their geographical location impacts user experience:

The UEM Key Metrics dashboards give us the key metrics of web analytics tools as well as tying it together with performance data. Visitors from remote locations are obviously impacted in their user experience


If you are responsible for User Experience and interested in some of our best practices I recommend checking our other UEM-related blog posts – for instance: What to do if A/B testing fails to improve conversions?

Going a bit deeper – What impacts End User Experience?

dynaTrace automatically detects important URLs as so-called “Business Transactions.” In our case we have different blog categories that visitors can click on. The following screenshot shows us that we automatically get dynamic baselines calculated for these identified business transactions:

Dynamic Baselining detects a significant violation of the baseline during a 4.5-hour period last night

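A dynamic baseline of this kind can be sketched as a rolling statistic with a tolerance band; anything outside the band is flagged as a violation. The window size and threshold below are arbitrary, and dynaTrace’s actual baselining algorithm is certainly more sophisticated:

```python
from statistics import mean, stdev

def baseline_violations(series, window=10, n_sigmas=3):
    """Flag indices whose value falls outside mean +/- n_sigmas * stdev
    of the preceding `window` samples (a simple dynamic baseline)."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        m, s = mean(hist), stdev(hist)
        if abs(series[i] - m) > n_sigmas * max(s, 1e-9):
            flagged.append(i)
    return flagged

# Response times steady around 200 ms, then a sustained slowdown:
times = [200, 205, 198, 202, 199, 201, 204, 197, 203, 200, 600, 610, 605]
# baseline_violations(times) flags the onset of the slowdown.
```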

Here we see that our overall response time for requests by category slowed down on May 12. Let’s investigate what happened here, and move to the transaction flow which visualizes PHP transactions from the browser to the database and maps infrastructure health data onto every tier that participated in these transactions:

The Transaction Flow shows us a lot of interesting points such as Errors that happen both in the browser and the WordPress instance. It also shows that we are heavy on 3rd party but good on server health


Since we are always striving to improve our users’ experience, the first troubling thing on this screen is that we see errors happening in browsers – maybe someone forgot to upload an image when posting a new blog entry? Let’s drill down to the Errors dashlet to see what’s happening here:

3rd Party Widgets throw JavaScript errors and with that impact end user experience.


Apparently, some of the third party widgets we have on the blog caused JavaScript errors for some users. Using the error message, we can investigate which widget causes the issue, and where it’s happening. We can also see which browsers, versions and devices this happens on to focus our optimization efforts. If you happen to rely on 3rd party plugins you want to check the blog post You only control 1/3 of your Page Load Performance.

PHP Performance Deep Dive

We will analyze the performance problems on the PHP server side in a follow-up blog post, where we will show you the steps to identify problematic PHP code. In our case it actually turned out to be a problematic plugin that helps us identify bad requests (requests from bots, …).

Conclusion and Next Steps

Stay tuned for more posts on this topic, or try Compuware APMaaS out yourself by signing up here for the free trial!


AppDynamics releases powerful database monitoring solution, extends visibility beyond the application layer

1st APM vendor to bridge application and database performance with a single view.

AppDynamics, the next-generation Application Performance Management solution that simplifies the management of complex apps, has announced the release of AppDynamics for Databases to help enterprises troubleshoot and tune database performance problems. This new AppDynamics solution is available immediately and offers unmatched insight and visibility into how SQL and stored procedures execute within databases such as Oracle, SQL Server, DB2, Sybase, MySQL and PostgreSQL.

AppDynamics for Databases addresses the challenges that application support teams such as Developers and Operations face in trying to identify the cause of application performance issues that relate to database performance.  As many as 50% of application problems are the result of slow SQL calls and stored procedures invoked by applications—yet until now, databases have been a “black box” for application support teams.

“Giving our customers critical visibility and troubleshooting capability into the cause of database problems makes AppDynamics absolutely unique in the APM space,” said Jyoti Bansal, founder and CEO of AppDynamics. “Application support teams constantly wrestle with database performance problems in attempting to ensure uptime and availability of their mission-critical applications, but they usually lack the visibility they need to resolve problems. We’ve equipped them with a valuable new solution for ensuring application performance, and it will enable them to collaborate with their Database Administrator colleagues even more closely than before.”

With its new database monitoring solution, AppDynamics has applied its “secret sauce” from troubleshooting Java and .NET application servers to databases, allowing enterprises to pinpoint slow user transactions and identify the root cause of SQL and stored procedure queries. AppDynamics for Databases also offers universal database diagnostics covering Oracle, SQL Server, DB2, Sybase, PostgreSQL, and MySQL database platforms.

AppDynamics Pro for Databases includes the following features:

  • Production Ready: Less than 1% overhead in most production environments.
  • Application to Database drill-down: Ability to troubleshoot business transaction latency from the application right into the database and storage tiers.
  • SQL explain/execution plans: Allows developers and database administrators to pinpoint inefficient operations and logic, as well as diagnose why queries are running slowly.
  • Historical analysis: Monitors and records database activity 24/7 to allow users to analyze performance slowdowns in the database tier.
  • Top database wait states: Provides insights and visibility into database wait and CPU states to help users understand database resource contention and usage.
  • Storage visibility for NetApp: Provides the ability to correlate database performance with performance on NetApp storage.
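The value of the explain-plan feature is easy to demonstrate with any engine that exposes its plans. The sketch below uses SQLite (not one of the engines listed above) purely because it ships with Python: the plan shows at a glance whether a query scans the whole table or uses an index – exactly the kind of inefficiency a developer or DBA would hunt down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany("INSERT INTO orders (customer) VALUES (?)",
                 [("c%d" % i,) for i in range(1000)])

def plan(sql):
    """Return the execution-plan detail strings for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # last column is the plan detail

query = "SELECT * FROM orders WHERE customer = 'c42'"
before = plan(query)  # full table scan: slow on large tables
conn.execute("CREATE INDEX idx_customer ON orders (customer)")
after = plan(query)   # index search: the fix a DBA would apply
```

The same before/after comparison is what an APM drill-down automates across every slow SQL call it records.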

“It is great to have a tightly integrated way to monitor, troubleshoot and optimize the performance of our key applications and the databases that support them,” said Nadine Thomson, Group IT Operations Manager at STA Travel. “We’re enthusiastic about the ability to use deep database, Java, and .NET performance information all from within a single AppDynamics product.”

AppDynamics for Databases is available now; get a free trial here.


Compuware unveils 2013 application performance management best practices and trends

Compuware has just published the first volume of its new Application Performance Management (APM) Best Practices collection titled: “2013 APM State-of-the-Art and Trends.” Written by Compuware’s APM Center of Excellence thought leaders and experts, the collection features 10 articles on the technology topics shaping APM in 2013.

For organisations that depend on high-performance applications, the collection provides an easy-to-absorb overview of the evolution of APM technology, best practices, methodology and techniques to help manage and optimize application performance. Download the APM Best Practices collection here.

The APM Best Practices: 2013 APM State-of-the-Art and Trends collection helps IT professionals and business stakeholders keep pace with these changes and learn how application performance techniques will develop over the new year. The collection not only explores APM technology but also examines the related business implications and provides recommendations for how best to leverage APM.

Topics covered in this collection include:

  • managing application complexity across the edge and cloud;
  • top 10 requirements for creating an APM culture;
  • quantifying the financial impact of poor user experience;
  • sorting myth from reality in real-user monitoring; and
  • lessons learned from real-world big data implementations.

To download the APM Best Practices collection, click here.

“This collection is a source of knowledge, providing valuable information about application performance for all business and technical stakeholders,” said Andreas Grabner, Leader of the Compuware APM Center of Excellence. “IT professionals can use the collection to help implement leading APM practices in their organizations and to set direction for proactive performance improvements. Organisations not currently using APM can discover how other companies are leveraging APM to solve business and technology problems, and how these solutions might apply to their own situations.”

More volumes of the APM Best Practices collection, covering additional topics, will become available throughout the year.

With more than 4,000 APM customers worldwide, Compuware is recognised as a leader in the “Magic Quadrant for Application Performance Monitoring” report. To read more about Compuware’s leadership in the APM market, click here.


It takes more than a tool! Swarovski’s 10 requirements for creating an APM culture

By Andreas Grabner at blog.dynatrace.com

Swarovski – the world’s leading producer of cut crystal – relies on its eCommerce store as much as any other company in the highly competitive eCommerce environment. Swarovski’s story is no different from others in this space: they started with “Let’s build a website to sell our products online” a couple of years ago and quickly progressed to “We sell to 60 million annual visitors across 23 countries in 6 languages”. There were bumps along the road, and they realized that it takes more than just a bunch of servers and tools to keep the site running.

Why APM, and why is a tool not enough?

Swarovski relies on Intershop’s eCommerce platform and faced several challenges as they rapidly grew. Their challenges required them to apply Application Performance Management (APM) practices to ensure they could fulfill the business requirements to keep pace with customer growth while maintaining an excellent user experience. The most insightful comment I heard was from René Neubacher, Senior eBusiness Technology Consultant at Swarovski: “APM is not just about software. APM is a culture, a mindset and a set of business processes.  APM software supports that.”

René recently discussed their journey to APM: what their initial problems were and what requirements they ended up having for APM and for the tools needed to support their APM strategy. By now they have reached the next level of maturity by establishing a Performance Center of Excellence. This allows them to tackle application performance proactively throughout the organization instead of reactively putting out fires in production.

This blog post describes the challenges they faced, the questions that arose and the new generation APM requirements that paved the way forward in their performance journey:

The Challenge!

Swarovski had traditional system monitoring in place on all the systems across their delivery chain, including web servers, application servers, SAP, database servers, external systems and the network. Knowing that each individual component is up and running 99.99% of the time is great, but no longer sufficient. How might these individual component outages impact the user experience of their online shoppers? WHO is actually responsible for the end user experience, and HOW should you monitor the complete delivery chain and not just the individual components? These and other questions came up as the eCommerce site attracted more customers, which was quickly followed by more complaints about their user experience:

APM includes getting a holistic view of the complete delivery chain and requires someone to be responsible for end user experience.

Questions that had no answers

In addition to “Who is responsible in case users complain?” the other questions that needed to be urgently addressed included:

  • How often is the service desk called before IT knows that there is a problem?
  • How much time is spent in searching for system errors versus building new features?
  • Do we have a process to find the root-cause when a customer reports a problem?
  • How do we visualize our services from the customer's point of view?
  • How much revenue, brand image and productivity are at risk or lost while IT is searching for the problem?
  • What do we do when someone says "it's slow"?

The 10 Requirements

These unanswered questions triggered the need to move away from traditional system monitoring and develop the requirements for new generation APM and user experience management.

#1: Support State-of-the-Art Architecture

Based on their current system architecture, it was clear that Swarovski needed an approach that worked in their architecture, now and in the future. The rise of more interactive Web 2.0 and mobile applications had to be factored in, to allow monitoring end users across many different devices, regardless of whether they used a web application or a native mobile application as their access point.

Transactions need to be followed from the browser all the way back to the database. It is important to support distributed transactions. This approach also helps to spot architectural and deployment problems immediately.

#2: 100% transactions and clicks – No Averages

Based on their experience, Swarovski knew that looking at average values or sampled data would not be helpful when customers complained about bad performance. Responding to a customer complaint with “Our average user has no problem right now – sorry for your inconvenience” is not what you want your helpdesk engineers to use as a standard phrase. Averages or sampling also hides the real problems you have in your system. Check out the blog post Why Averages Suck by Michael Kopp for more detail.
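A quick sketch, using entirely hypothetical response times, shows why the average hides a slow tail while a percentile exposes it:

```python
# Hypothetical response times in ms: 95 fast requests, 5 very slow ones.
times = [120] * 95 + [4000] * 5

mean = sum(times) / len(times)               # barely moved by the outliers
p95 = sorted(times)[int(len(times) * 0.95)]  # exposes the slow tail

print(f"mean: {mean:.0f} ms")  # 314 ms - looks acceptable
print(f"p95:  {p95} ms")       # 4000 ms - 5% of users wait 4 seconds
```

Five percent of users waiting four seconds is invisible in the mean, which is exactly why capturing every transaction matters.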

Measuring end user performance of every customer interaction allows for quick identification of regional problems with CDNs, 3rd parties or latency.

Having 100% of user interactions and transactions available makes it easy to identify the root cause for individual users.

#3: Business Visibility

As the business had a growing interest in the success of the eCommerce platform, IT had to demonstrate to the business what it took to fulfill their requirements, and how those requirements are impacted by the availability of, or lack of, investment in the application delivery chain.

Correlating the number of visits and performance with incoming orders illustrates the measurable impact of performance on revenue and what it takes to support business requirements.
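The kind of correlation shown to the business can be sketched in a few lines; the hourly figures below are purely hypothetical and only illustrate how a performance/revenue relationship might be quantified:

```python
# Hypothetical hourly samples: average page response time (s) and
# orders placed in that hour. Slower hours tend to see fewer orders.
response_time = [1.0, 1.2, 1.5, 2.0, 2.8, 3.5]
orders        = [420, 410, 380, 330, 260, 190]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(response_time, orders)
print(f"correlation: {r:.2f}")  # strongly negative: slower pages, fewer orders
```

A strongly negative coefficient is the sort of concrete number that turns "performance matters" from an IT opinion into a business argument.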

#4: Impact of 3rd Parties and CDNs

It was important to not only track transactions involving their own Data Center but ALL user interactions with their web site – even those delivered through CDNs or 3rd parties. All of these interactions make up the user experience and therefore ALL of it needs to be analyzed.

Seeing the actual load impact of 3rd party components or content delivered from CDNs enables IT to pinpoint user experience problems that originate outside their own data center.

#5: Across the lifecycle – supporting collaboration and tearing down silos

The APM initiative was started because Swarovski was reacting to problems happening in production. Fixing these problems in production is only the first step. Their ultimate goal is to become proactive by finding and fixing problems in development or testing, before they spill over into production. Instead of relying on different sets of tools with different capabilities, the requirement is a single solution designed to be used across the application lifecycle (developer workstation, continuous integration, testing, staging and production). This makes it easier to share application performance data between lifecycle stages, allowing individuals not only to look at data from other stages but also to compare data and verify the impact and behavior of code changes between version updates.

Continuously catching regressions in development by analyzing unit and performance tests allows application teams to become more proactive.

Pinpointing integration and scalability issues, continuously, in acceptance and load testing makes testing more efficient and prevents problems from reaching production.

#6: Down to the source code

In order to speed up problem resolution, Swarovski's operations and development teams require as much code-level insight as possible — not only for their own engineers who are extending the Intershop eCommerce platform, but also for Intershop to improve their product. Knowing which part of the application code is not performing well, with which input parameters and under which specific load on the system, eliminates tedious reproduction of the problem. The requirement is to lower the Mean Time To Repair (MTTR) from as much as several days down to only a couple of hours.

The SAP Connector turned out to have a performance problem. This method-level detailed information was captured without changing any code.

#7: Zero/Acceptable overhead

“Who are we kidding? There is nothing like zero overhead, especially when you need 100% coverage!” Those were René’s words on this requirement. And he is right: once you start collecting information from a production system, you add a certain amount of overhead. A better term for this would be “imperceptible overhead”: overhead so small that you don’t notice it.

What is the exact number? It depends on your business and your users. The number should be worked out from the impact on the end user experience, rather than additional CPU, memory or network bandwidth required in the data center. Swarovski knew they had to achieve less than 2% overhead on page load times in production, as anything more would have hurt their business; and that’s what they achieved.
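That budget can be expressed as a simple relative-delta check; the page load numbers below are hypothetical, but the calculation mirrors the 2% target described above:

```python
# Hypothetical median page load times (ms), measured with the
# monitoring agent disabled and then enabled.
baseline_ms = 1500.0
with_agent_ms = 1524.0

# Overhead expressed as a percentage of the user-facing load time,
# not as data-center CPU/memory/network cost.
overhead_pct = (with_agent_ms - baseline_ms) / baseline_ms * 100
print(f"overhead: {overhead_pct:.1f}%")  # 1.6% - within the 2% budget

assert overhead_pct < 2.0, "monitoring overhead exceeds the agreed budget"
```

Framing the check against end-user page load times, rather than server resource usage, keeps the budget tied to what the business actually cares about.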

#8: Centralized data collection and administration

Running a distributed eCommerce application that may be extended to additional geographic locations requires an APM system with centralized data collection and administration. It is not feasible to collect different types of performance information from different systems, servers or even data centers: that would require either multiple analysis tools or transforming the data into a single format before proper analysis could begin.

Instead, Swarovski required a single unified APM system. Central administration is equally important, as it eliminates the need to rely on remote IT administrators to make changes to the monitored system, for example, simple tasks such as changing the level of captured data or upgrading to a new version.

Storing and accessing performance data in a single, centralized repository enables fast and powerful analytics and visualization. For example, system metrics such as CPU utilization can be correlated with end-user response time or database execution time, all displayed on one single dashboard.

#9: Auto-Adapting Instrumentation without digging through code

As the majority of the application code is not developed in-house but provided by Intershop, it is mandatory to get insight into the application without doing any manual code changes. The APM system must auto-adapt to changes so that no manual configuration change is necessary when a new version of the application is deployed.

This means Swarovski can focus on making their applications contribute positively to business outcomes, rather than spending time maintaining IT systems.

#10: Ability to extend

Their application is part of an ever-growing, ever-changing IT environment. What is deployed on physical boxes today might be moved to virtualized environments, or even into a public cloud, tomorrow.

Whatever the extension may be, the APM solution must be able to adapt to these changes and be extensible enough to consume new types of data sources — e.g., performance metrics from Amazon cloud services, VMware, Cassandra or other big data solutions, or even legacy mainframe applications — bringing these metrics into the centralized data repository to provide new insights into the application’s performance.

Extending the application monitoring capabilities to Amazon EC2, Microsoft Windows Azure, or a public or private cloud enables the analysis of the performance impact of these virtualized environments on end user experience.

The Solution and the Way Forward

Needless to say, Swarovski took the first step by implementing APM as a new process and mindset in their organization. They are now in the next phase: implementing a Performance Center of Excellence. This allows them to move from Reactive Performance Troubleshooting to Proactive Performance Prevention.

Stay tuned for more blog posts on the Performance Center of Excellence and how you can build one in your own organization. The key message is that it is not just about using a bunch of tools. It is about living and breathing performance throughout the organization. If you are interested in this, check out the blog posts by Steve Wilson: Proactive vs Reactive: How to prevent problems instead of fixing them faster and Performance in Development is the Chief Cornerstone.