SharePoint System Performance Check beyond CPU and Memory

From apmblog.dynatrace.com, February 11, 2015

If you are responsible for keeping your SharePoint deployment healthy, I assume that “traditional” system monitoring – whether via SCOM, Performance Monitor or other tools – is at the top of your list. But if your first reaction to constantly high CPU, exhausted memory or full disks is to ask for more hardware, then your actions are “too traditional”. Adding more hardware will surely make your system healthier – but it comes with a price tag that might not be necessary.

In this first blog about SharePoint Sanity Checks, I show you that there are ways to figure out which sites, pages, views, custom or 3rd party Web Parts (from AvePoint, K2, Nintex, Metalogix …) in your SharePoint environment are wasteful with resources so that you can fix the root cause and not just fight the symptom.

Feel free to follow all my steps by either using your own tools or the Dynatrace Free Trial with our SharePoint FastPack.

Step #1: Server System Health Check

The first question must always be: How healthy are the Windows Servers that run your SharePoint Sites?

Not only must you look at Windows OS Metrics such as CPU, Memory, Disk and Network Utilization, you also need to monitor the individual SharePoint AppPool worker processes (w3wp.exe) to figure out whether you have individual sites that overload this server. The following is a screenshot that shows this information on a single dashboard.

A Dynatrace Host Health Dashboard shows key OS health metrics (CPU, Memory, Disk, Network) and the key SharePoint AppPools and their resource usage.
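If you want to script a quick version of this check yourself, a few lines of Python with the psutil library will list the IIS worker processes and their resource usage per AppPool (IIS passes the AppPool name to w3wp.exe via the -ap command-line switch). This is a minimal sketch for an ad-hoc check, not a replacement for a monitoring dashboard:

    # List SharePoint AppPool worker processes (w3wp.exe) with CPU and memory usage.
    # Requires "pip install psutil"; run directly on the SharePoint/IIS server.
    import psutil

    def apppool_name(cmdline):
        # IIS starts worker processes as: w3wp.exe -ap "AppPoolName" ...
        for i, arg in enumerate(cmdline):
            if arg == "-ap" and i + 1 < len(cmdline):
                return cmdline[i + 1].strip('"')
        return "<unknown>"

    for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
        if (proc.info["name"] or "").lower() == "w3wp.exe":
            cpu = proc.cpu_percent(interval=1.0)       # sample CPU over one second
            rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
            print(f"{apppool_name(proc.info['cmdline'] or []):30} "
                  f"pid={proc.pid:6} cpu={cpu:5.1f}% mem={rss_mb:8.1f} MB")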

Let me give you some recommendations on what to look out for and what to do in each case.

Bad AppPools

If some of your SharePoint AppPools consume too many resources on a machine, consider deploying them to a different server. You don’t want to cram too many heavily utilized SharePoint sites onto a single server and suffer from the cross-impact of these sites.

Storage Problems

If you see high disk utilization it is important to check what is causing it. I typically look closer at:

  • IIS: Is the web server busy serving too much static content? If that’s the case, make sure you have configured resource caching; that reduces static requests from users who use SharePoint often (see the log-analysis sketch after this list). Also check the IIS log settings and the modules loaded by IIS, and make sure you only log what you really need.
  • SQL Server: Is SQL Server running on the same machine as SharePoint and maybe even hosting other databases? Talk with your DBA about checking the proper configuration of SQL Server, and discuss a better deployment scenario such as putting the SharePoint Content Database on its own SQL Server.
  • SharePoint: Check the generated log files. I often see people increasing log levels for different reasons but then forgetting to turn them back to the default, resulting in large amounts of data that nobody looks at anyway.
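To get a feeling for how much of your IIS traffic is static content and whether client-side caching works, you can analyze the IIS W3C logs directly. The sketch below assumes the default W3C log location and the #Fields: header that IIS writes; adjust the path for your site:

    # Rough static-vs-dynamic request breakdown from IIS W3C logs.
    # Path and field names assume the default IIS W3C logging format.
    import glob
    from collections import Counter

    STATIC_EXT = (".js", ".css", ".png", ".jpg", ".gif", ".ico", ".woff")
    stats = Counter()

    for path in glob.glob(r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log"):
        fields = []
        with open(path, encoding="ascii", errors="replace") as f:
            for line in f:
                if line.startswith("#Fields:"):
                    fields = line.split()[1:]
                    uri_i = fields.index("cs-uri-stem")
                    status_i = fields.index("sc-status")
                    continue
                if line.startswith("#") or not fields:
                    continue
                cols = line.split()
                if len(cols) != len(fields):
                    continue
                stats["total"] += 1
                if cols[uri_i].lower().endswith(STATIC_EXT):
                    stats["static"] += 1
                    if cols[status_i] == "304":
                        stats["static_304"] += 1       # answered from client cache

    if stats["total"]:
        print(f"static: {stats['static'] / stats['total']:.0%} of {stats['total']} "
              f"requests; {stats['static_304'] / max(stats['static'], 1):.0%} "
              f"of static requests were 304 (cached)")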

CPU Utilization

The first thing I look at is whether the CPU is consumed by one of the SharePoint AppPools or by other services running on that same machine.

  • SharePoint: This correlates to what I wrote under Bad AppPools. If the reason is too much load on an AppPool, consider deploying it on a different machine. Before you do this, please follow my additional recommendations later in this blog to verify whether fixable configuration or coding issues might be to blame.
  • SQL Server: Do you have SharePoint Sites or individual Pages that cause extra-high utilization on the SQL Server? If that is the case, follow my recommendations on how to identify bad pages or Web Parts that have excessive access to the database. In general you should talk with the DBA to do a performance sanity check.
  • Other processes: Do you have other services running on that box that spike CPU? Batch or reporting jobs that could be deployed on a different server?

Network Utilization

It comes down to the same suspects as above:

  • IIS: Analyze how “heavy” your SharePoint pages are. Follow the general best practices on Web Performance Optimization by making your sites “slimmer”. Make sure you have content compression features turned on and content caching properly configured (a quick compression check follows this list).
  • SharePoint: Besides talking to the database – what other services does your SharePoint instance interact with? Do you have Web Parts communicating with an external service? If that is the case, make sure that these remote service calls are optimized, e.g. cache already fetched data or only query data that you really need.
  • SQL Server: Analyze which SharePoint Sites/Services request data as well as which other applications request data. Optimize data access or consider redeploying SQL Server to optimize the data transfer between the application and the database server.
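A quick way to verify that compression is actually enabled for a given page is to request it with an Accept-Encoding header and inspect the response; urllib does not transparently decompress, so the byte count below is the size actually transferred. The URL is a placeholder for one of your SharePoint pages:

    # Check whether a page is actually served compressed (gzip/deflate).
    import urllib.request

    url = "https://sharepoint.example.com/sites/team/default.aspx"  # placeholder
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(req) as resp:
        # urllib does not decompress for us, so len() is the transferred size
        print("Content-Encoding:", resp.headers.get("Content-Encoding", "none"))
        print("bytes on the wire:", len(resp.read()))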

Step #2: IIS Health Check

I already covered some IIS metrics in Step 1, but I want you to have a closer look at IIS-specific metrics such as current load, available vs. used worker threads and bandwidth requirements:

A Dynatrace IIS Process Health dashboard makes it easy to see whether IIS is running low on threads, serving large content or maxing out available bandwidth.

These are the metrics I always check to validate how healthy the IIS deployment is:

  • Average Page Response Size: If you have bloated websites, your IIS is serving too much data. That not only clogs the network, it also makes end users wait longer for these pages to load. Keep an eye on the average page size, especially after deploying an update, to make sure pages don’t get too big. I suggest performing constant Web Performance Sanity Checks on your top pages (see the sketch after this list).
  • Thread Utilization: Have you sized your IIS correctly in terms of worker threads? Are all the busy threads really busy, or just waiting on slow-performing SharePoint requests? Check out the Top Web Server and Top App Server Metrics sections of my recent Load Testing Best Practices blog.
  • Bandwidth Requirement: Is your outbound network pipe already a bottleneck? If that’s the case, do not blindly upgrade your infrastructure; first check whether you can optimize your page sizes as explained earlier.
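As promised above, here is a sketch for deriving the average response size per URL from the same IIS W3C logs used in Step 1. It assumes the sc-bytes field is enabled in your IIS logging configuration (it is not always on by default):

    # Average response size per URL from IIS W3C logs (needs sc-bytes enabled).
    import glob
    from collections import defaultdict

    totals = defaultdict(lambda: [0, 0])               # url -> [bytes, hits]

    for path in glob.glob(r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log"):
        fields = []
        with open(path, encoding="ascii", errors="replace") as f:
            for line in f:
                if line.startswith("#Fields:"):
                    fields = line.split()[1:]
                    if "sc-bytes" not in fields:
                        break                          # sc-bytes logging is off
                    uri_i = fields.index("cs-uri-stem")
                    bytes_i = fields.index("sc-bytes")
                    continue
                if line.startswith("#") or not fields:
                    continue
                cols = line.split()
                if len(cols) == len(fields) and cols[bytes_i].isdigit():
                    totals[cols[uri_i]][0] += int(cols[bytes_i])
                    totals[cols[uri_i]][1] += 1

    # Ten URLs with the largest average response size
    by_avg = sorted(totals.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for url, (b, n) in by_avg[:10]:
        print(f"{b / n / 1024:8.1f} KB avg  {n:6} hits  {url}")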

Step #3: Component Health Check

What I mentioned in the first 2 steps actually falls into “traditional” system monitoring with some additional insight on metrics that go beyond normal resource utilization monitoring. If resources are maxed out I always want to find out which components are actually using these resources. Why? Because we should first try to optimize these components before we give them more resources. I look at the following dashboard for a quick sanity check:

Dynatrace SharePoint Performance Dashboard tells me whether resource usage already causes performance spikes, whether it is caused by wasteful memory usage, individual pages or problematic Web Parts

A good SharePoint health metric is response time of SharePoint pages. If I see spikes, I know we jeopardize user adoption of SharePoint and I know I need to treat this with high priority. I look at the following metrics and data points to figure out what causes these spikes which most often directly correlate to higher resource consumption such as Memory, CPU, Disk and Network:

  • Memory Usage and Garbage Collection Impact: High memory usage alone is not necessarily a problem. The problem arises when more memory is requested and the Garbage Collector needs to kick in and clear out a lot of old memory. That’s why I always keep an eye on overall memory usage patterns and the amount of time spent in Garbage Collection (GC). GC both impacts response time and consumes a lot of CPU (a minimal sampling sketch follows this list).
  • Which Pages are Slow? Trying to figure out why individual pages are slow is often easier than trying to figure out why the system is slower on average. I don’t waste time, though, focusing on a single slow page that is used by a single user. Instead I focus on those pages that are slower than expected but also used by a lot of users. Optimizing them gives me more improvement for a larger audience.
  • Problematic Web Parts? SharePoint is built on Web Parts, whether they come from Microsoft, well known 3rd party providers (AvePoint, K2, Nintex, Metalogix …) or your own development team. Knowing which Web Parts are used and how slow they are allows you to focus even better. Too many times I have seen “Web Parts Gone Wild” caused by bad configuration or bad implementation. Check out my Top 5 SharePoint Performance Mistakes and you will understand why that is a big problem.
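For the GC metric mentioned in the first bullet, Windows ships the typeperf utility, which can sample the standard .NET CLR performance counters; the counter name and the 40% threshold below are assumptions you should verify against your environment. A minimal sampling sketch:

    # Sample "% Time in GC" for all managed processes via the built-in typeperf.
    import subprocess

    counter = r"\.NET CLR Memory(*)\% Time in GC"
    # -sc 5: five samples at the default one-second interval, CSV on stdout
    out = subprocess.run(["typeperf", counter, "-sc", "5"],
                         capture_output=True, text=True, check=True).stdout

    rows = [line for line in out.splitlines() if line.startswith('"')]
    header = rows[0].split('","')                      # counter instance names
    for row in rows[1:]:
        values = row.split('","')
        for name, value in zip(header[1:], values[1:]):
            value = value.strip().strip('"')
            if value:
                pct = float(value)
                if pct > 40.0:                         # assumed "unhealthy" threshold
                    print("high GC:", name.strip('"'), f"{pct:.1f}%")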

Slow Web Parts and pages can be caused by bad deployments, wrong configuration or simply bad coding. This is what I am going to focus on in my next blog post!

Next Steps: Fix the Problem; Don’t Just Buy More Hardware

I am interested to hear what you think about these metrics – and please share the ones you use. In the next blog I will cover how to go deeper into SharePoint to identify the root cause of an unhealthy or slow system. Our first action should never be to just throw more hardware at the problem, but rather to understand the issue and optimize the situation.

If you want to see some of Andreas’ steps in action watch his 15 Minute SharePoint System Performance Sanity Check Video.


How to accurately identify impact of system issues on end-user response time

From Compuware APM Blog, 4 June 2013

Triggered by current load projections for our community portal, our Apps Team was tasked to run a stress test on our production system to verify whether we can handle 10 times the load we currently experience on our existing infrastructure. In order to have the least impact in the event the site crumbled under the load, we decided to run the first test on a Sunday afternoon. Before we ran the test we gave our Operations Team a heads-up: they could expect significant load during a two-hour window, with the potential to affect other applications that also run in the same environment.

During the test, with both the Ops and Application Teams watching the live performance data, we all saw end user response time go through the roof and the underlying infrastructure running out of resources when we hit a certain load level. What was very interesting in this exercise is that both the Application and Ops teams looked at the same data but examined the results from a different angle. However, they both relied on the recently announced Compuware PureStack Technology, the first solution that – in combination with dynaTrace PurePath – exposes how IT infrastructure impacts the performance of critical business applications in heavy production environments.

Bridging the Gap between Ops and Apps Data by adding Context: One picture that shows the Hotspots of “Horizontal” Transaction as well as the “Vertical” Stack.

The root cause of the poor performance in our scenario was CPU exhaustion on the main server machine hosting both the Web and App Server, which caused us to miss our load goal. This turned out to be both an IT provisioning and an application problem. Let me explain the steps these teams took and how they came up with their list of action items to improve system performance and do better in the second scheduled test.

Step 1: Monitor and Identify Infrastructure Health Issues

Operations Teams like having the ability to look at their list of servers and quickly see that all critical indicators (CPU, Memory, Network, Disk, etc.) are green. But when they looked at the server landscape as our load test reached its peak, their dashboard showed them that two of their machines were having problems:

The core server for our community portal shows problems with the CPU and is impacting one of the applications that run on it.

Step 2: What is the actual impact on the hosted applications?

Clicking on the Impacted Applications Tab shows us the applications that run on the affected machine and which ones are currently impacted:

The increased load not only impacts the Community Portal but also our Support Portal

Already the load test has taught us something: As we expect higher load on the community in the future, we might need to move the support portal to a different machine to avoid any impact.

When examined independently, operations-oriented monitoring data is not that telling. But when it is placed in a context that relates it to data important to the Applications team (end user response time, user experience, …), both teams gain more insight. This is a good start, but there is still more to learn.

Step 3: What is the actual impact on the critical transactions?

Clicking on the Community Portal application link shows us the transactions and pages that are actually impacted by the infrastructure issue, but there still are two critical unanswered questions:

  • Are these the transactions that are critical to our successful operation?
  • How badly are these transactions and individual users impacted by the performance issues?

The automatic baseline tells us that our response time for our main community pages shows significant performance impact. This also includes our homepage which is the most valuable page for us.
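The product’s exact baselining algorithm is not described here, but the underlying idea is simple enough to sketch: learn the normal range from a sliding window of recent response times and flag samples that fall far outside it. A toy illustration with made-up data:

    # Toy dynamic baseline: flag response times far above the recent norm.
    # Real APM baselining is more elaborate (percentiles, seasonality),
    # but the principle is the same.
    from collections import deque
    from statistics import mean, stdev

    def baseline_violations(samples, window=50, k=3.0):
        history = deque(maxlen=window)
        for t, rt in samples:                  # (timestamp, response time in ms)
            if len(history) >= 10:             # need some history before judging
                m, s = mean(history), stdev(history)
                if rt > m + k * s:
                    print(f"t={t}: {rt:.0f} ms violates baseline "
                          f"({m:.0f} +/- {k * s:.0f} ms)")
            history.append(rt)

    # Made-up data: steady ~200 ms, then a spike like the one in the screenshot
    data = [(i, 200 + (i % 7)) for i in range(60)] + [(60, 900)]
    baseline_violations(data)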

Step 4: Visualizing the impact of the infrastructure issue on the transaction

The transaction-flow diagram is a great way to get both the Ops and App Teams on the same page and view data in its full context, showing the application tiers involved, the physical and virtual machines they are running on, and where the hotspots are.

The Ops and Apps Teams have one picture that tells them where the Hotspots are, both in the “Horizontal” Transaction and in the “Vertical” Stack.

We knew that our pages are very heavy on content (Images, JavaScript and CSS), with up to 80% of the transaction time spent in the browser. Seeing that this performance hotspot is now down to 50% in relation to the overall page load time we immediately know that more of the transaction time has shifted to the new hotspot: the server side. The good news is that there is no problem with the database (only shows 1% response time contribution) as this entire performance hotspot shift seems to be related to the Web and App Servers, both of which run on the same machine – the one that has these CPU Health Issues.

Step 5: Pinpointing host health issue on the problematic machine

Drilling to the Host Health Dashboard shows what is wrong on that particular server:

The Ops Team immediately sees that the CPU consumption is mainly coming from one Java App Server. There are also some unusual spikes in Network, Disk and Page Faults that are all correlated in time.

Step 6: Process Health dashboards show slow app server response

We see that the two main processes on that machine are IIS (Web Server) and Tomcat (Application Server). A closer look shows how they are doing over time:

We are not running out of worker threads. Transfer Rate is rather flat. This tells us that the Web Server is waiting on the response from the Application Server.

It appears that the Application Server is maxing out on CPU. The incoming requests from the load testing tool queue up as the server can’t process them in time. The number of processed transactions actually drops.

Step 7: Pinpointing heavy CPU usage

Our Apps Team is now interested in figuring out what consumes all this CPU and whether this is something we can fix in the application code or whether we need more CPU power:

The Hotspot shows two layers of the Application that are heavy on CPU. Let’s drill down further.

Our sometimes rather complex pages with lots of Confluence macros cause the major CPU Usage.

Exceptions that capture stack trace information for logging are caused by missing resources and problems with authentication.
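The application in this story is Java, but the cost of walking and formatting a stack trace for every logged error is easy to demonstrate in any language. A small Python timing sketch:

    # Demonstrate the CPU cost of capturing a stack trace on every "error".
    import timeit
    import traceback

    def log_plain():
        return "resource not found"                    # message only

    def log_with_stack():
        return "".join(traceback.format_stack())       # message plus stack walk

    n = 10_000
    t_plain = timeit.timeit(log_plain, number=n)
    t_stack = timeit.timeit(log_with_stack, number=n)
    print(f"plain message : {t_plain:.3f}s for {n} calls")
    print(f"with stack    : {t_stack:.3f}s for {n} calls "
          f"({t_stack / t_plain:.0f}x slower)")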

Ops and Apps teams now easily prioritize both Infrastructure and app fixes

So as mentioned, ‘context is everything’. But it’s not enough simply to have data – context relies on the ability to intelligently correlate all of the data into a coherent story. When the “horizontal” transactional data for end-user response-time analysis is connected to the “vertical” infrastructure stack information, it becomes easy to get both teams reading from the same page and to prioritize the fixes for the issues that have the greatest negative impact on the business.

This exercise allowed us to identify several action items:

  • Deploy our critical applications on different machines when the applications impact each other negatively
  • Optimize the way our pages are built to reduce CPU usage
  • Increase CPU power on these virtualized machines to handle more load

APM as a Service: 4 steps to monitor real user experience in production

From Compuware APM Blog, 15 May 2013

With our new service platform and the convergence of dynaTrace PurePath Technology with the Gomez Performance Network, we are proud to offer an APMaaS solution that sets a higher bar for complete user experience management, with end-to-end monitoring technologies that include real-user, synthetic, third-party service monitoring, and business impact analysis.

To showcase the capabilities we used the free trial on our own about:performance blog as a demonstration platform. It is based on the popular WordPress technology which uses PHP and MySQL as its implementation stack. With only 4 steps we get full availability monitoring as well as visibility into every one of our visitors and can pinpoint any problem on our blog to problems in the browser (JavaScript, slow 3rd party, …), the network (slow network connectivity, bloated website, …) or the application itself (slow PHP code, inefficient MySQL access, …).

Before we get started, let’s have a look at the Compuware APMaaS architecture. To collect real user performance data, all you need to do is install a so-called Agent on the Web and/or Application Server. The data gets sent in an optimized and secure way to the APMaaS Platform. Performance data is then analyzed through the APMaaS Web Portal, with drilldown capabilities into the dynaTrace Client.

Compuware APMaaS is a secure service to monitor every single end user on your application end-to-end (browser to database)

4 Steps to setup APMaaS for our Blog powered by WordPress on PHP

From a high-level perspective, joining Compuware APMaaS and setting up your environment consists of four basic steps:

  1. Sign up with Compuware for the Free Trial
  2. Install the Compuware Agent on your Server
  3. Restart your application
  4. Analyze Data through the APMaaS Dashboards

In this article, we assume that you’ve successfully signed up, and will walk you through the actual setup steps to show how easy it is to get started.

After signing up with Compuware, the first sign of your new Compuware APMaaS environment will be an email notifying you that a new environment instance has been created:

Following the steps as explained in the Welcome Email to get started

While you can immediately take a peek into your brand new APMaaS account at this point, there’s not much to see: Before we can collect any data for you, you will have to finish the setup in your application by downloading and installing the agents.

After installation is complete and the Web Server is restarted the agents will start sending data to the APMaaS Platform – and with dynaTrace 5.5, this also includes the PHP agent which gives insight into what’s really going on in the PHP application!

Agent Overview shows us that we have both the Web Server and PHP Agent successfully loaded

Now we are ready to go!

For Ops & Business: Availability, Conversions, User Satisfaction

Through the APMaaS Web Portal, we start with some high level web dashboards that are also very useful for our Operations and Business colleagues. These show Availability, Conversion Rates as well as User Satisfaction and Error Rates. To show the integrated capabilities of the complete Compuware APM platform, Availability is measured using Synthetic Monitors that constantly check our blog while all of the other values are taken from real end user monitoring.

Operations View: Automatic Availability and Response Time Monitoring of our Blog

Business View: Real Time Visits, Conversions, User Satisfaction and Errors

For App Owners: Application and End User Performance Analysis

Through the dynaTrace client we get a richer view of the real end user data. The PHP agent we installed is a full equivalent to the dynaTrace Java and .NET agents, and features like the application overview together with our self-learning automatic baselining work the same way regardless of the server-side technology:

Application level details show us that we had a response time problem and that we currently have several unhappy end users

Before drilling down into the performance analytics, let’s have a quick look at the key user experience metrics such as where our blog users actually come from, the browsers they use, and whether their geographical location impacts user experience:

The UEM Key Metrics dashboards give us the key metrics of web analytics tools as well as tying it together with performance data. Visitors from remote locations are obviously impacted in their user experience.

If you are responsible for User Experience and interested in some of our best practices I recommend checking our other UEM-related blog posts – for instance: What to do if A/B testing fails to improve conversions?

Going a bit deeper – What impacts End User Experience?

dynaTrace automatically detects important URLs as so-called “Business Transactions.” In our case we have different blog categories that visitors can click on. The following screenshot shows that we automatically get dynamic baselines calculated for these identified business transactions:

Dynamic Baselining detects a significant violation of the baseline during a 4.5 hour period last night

Here we see that our overall response time for requests by category slowed down on May 12. Let’s investigate what happened here, and move to the transaction flow which visualizes PHP transactions from the browser to the database and maps infrastructure health data onto every tier that participated in these transactions:

The Transaction Flow shows us a lot of interesting points such as Errors that happen both in the browser and the WordPress instance. It also shows that we are heavy on 3rd party but good on server health

Since we are always striving to improve our users’ experience, the first troubling thing on this screen is that we see errors happening in browsers – maybe someone forgot to upload an image when posting a new blog entry? Let’s drill down to the Errors dashlet to see what’s happening here:

3rd Party Widgets throw JavaScript errors and with that impact end user experience.

Apparently, some of the third party widgets we have on the blog caused JavaScript errors for some users. Using the error message, we can investigate which widget causes the issue, and where it’s happening. We can also see which browsers, versions and devices this happens on to focus our optimization efforts. If you rely on 3rd party plugins, you may want to check the blog post You only control 1/3 of your Page Load Performance.

PHP Performance Deep Dive

We will analyze the performance problems on the PHP server side in a follow-up blog, where we will show the steps to identify problematic PHP code. In our case it actually turned out to be a problematic plugin that helps us identify bad requests (requests from bots, …).

Conclusion and Next Steps

Stay tuned for more posts on this topic, or try Compuware APMaaS out yourself by signing up here for the free trial!


Compuware unveils 2013 application performance management best practices and trends

Compuware has just published the first volume of its new Application Performance Management (APM) Best Practices collection titled: “2013 APM State-of-the-Art and Trends.” Written by Compuware’s APM Center of Excellence thought leaders and experts, the collection features 10 articles on the technology topics shaping APM in 2013.

For organisations that depend on high-performance applications, the collection provides an easy-to-absorb overview of the evolution of APM technology, best practices, methodology and techniques to help manage and optimize application performance. Download the APM Best Practices collection here.

The APM Best Practices: 2013 APM State-of-the-Art and Trends collection helps IT professionals and business stakeholders keep pace with these changes and learn how application performance techniques will develop over the new year. The collection not only explores APM technology but also examines the related business implications and provides recommendations for how best to leverage APM.

Topics covered in this collection include:

  • managing application complexity across the edge and cloud;
  • top 10 requirements for creating an APM culture;
  • quantifying the financial impact of poor user experience;
  • sorting myth from reality in real-user monitoring; and
  • lessons learned from real-world big data implementations.

To download the APM Best Practices collection, click here.

“This collection is a source of knowledge, providing valuable information about application performance for all business and technical stakeholders,” said Andreas Grabner, Leader of the Compuware APM Center of Excellence. “IT professionals can use the collection to help implement leading APM practices in their organizations and to set direction for proactive performance improvements. Organisations not currently using APM can discover how other companies are leveraging APM to solve business and technology problems, and how these solutions might apply to their own situations.”

More volumes of the APM Best Practices collection will become available throughout the year.

With more than 4,000 APM customers worldwide, Compuware is recognised as a leader in the “Magic Quadrant for Application Performance Monitoring” report. To read more about Compuware’s leadership in the APM market, click here.


It takes more than a tool! Swarovski’s 10 requirements for creating an APM culture

By Andreas Grabner at blog.dynatrace.com

Swarovski – the world’s leading producer of cut crystal – relies on its eCommerce store much like other companies in the highly competitive eCommerce environment. Swarovski’s story is no different from others in this space: they started with “Let’s build a website to sell our products online” a couple of years ago and quickly progressed to “We sell to 60 million annual visitors across 23 countries in 6 languages”. There were bumps along the road, and they realized that it takes more than just a bunch of servers and tools to keep the site running.

Why APM, and why is a tool alone not enough?

Swarovski relies on Intershop’s eCommerce platform and faced several challenges as they rapidly grew. Their challenges required them to apply Application Performance Management (APM) practices to ensure they could fulfill the business requirements to keep pace with customer growth while maintaining an excellent user experience. The most insightful comment I heard was from René Neubacher, Senior eBusiness Technology Consultant at Swarovski: “APM is not just about software. APM is a culture, a mindset and a set of business processes.  APM software supports that.”

René recently discussed their journey to APM: what their initial problems were, and the requirements they ended up with for APM and the tools needed to support their APM strategy. By now they have reached the next level of maturity by establishing a Performance Center of Excellence. This allows them to tackle application performance proactively throughout the organization instead of reactively putting out fires in production.

This blog post describes the challenges they faced, the questions that arose and the new generation APM requirements that paved the way forward in their performance journey:

The Challenge!

Swarovski had traditional system monitoring in place on all systems across their delivery chain, including web servers, application servers, SAP, database servers, external systems and the network. Knowing that each individual component is up and running 99.99% of the time is great, but no longer sufficient. How might these individual component outages impact the user experience of their online shoppers? WHO is actually responsible for the end user experience, and HOW should you monitor the complete delivery chain and not just the individual components? These and other questions came up as the eCommerce site attracted more customers, which was quickly followed by more complaints about their user experience:

APM includes getting a holistic view of the complete delivery chain and requires someone to be responsible for end user experience.

Questions that had no answers

In addition to “Who is responsible in case users complain?” the other questions that needed to be urgently addressed included:

  • How often is the service desk called before IT knows that there is a problem?
  • How much time is spent in searching for system errors versus building new features?
  • Do we have a process to find the root-cause when a customer reports a problem?
  • How do we visualize our services from the customer‘s point of view?
  • How much revenue, brand image and productivity are at risk or lost while IT is searching for the problem?
  • What to do when someone says ”it‘s slow“?

The 10 Requirements

These unanswered questions triggered the need to move away from traditional system monitoring and develop the requirements for new generation APM and user experience management.

#1: Support State-of-the-Art Architecture

Based on their current system architecture it was clear that Swarovski needed an approach that was able to work in their architecture, now and in the future. The rise of more interactive Web 2.0 and mobile applications had to be factored in to allow monitoring end users from many different devices and regardless of whether they used a web application or mobile native application as their access point.

Transactions need to be followed from the browser all the way back to the database, and it is important to support distributed transactions. This approach also helps to spot architectural and deployment problems immediately.

#2: 100% transactions and clicks – No Averages

Based on their experience, Swarovski knew that looking at average values or sampled data would not be helpful when customers complained about bad performance. Responding to a customer complaint with “Our average user has no problem right now – sorry for your inconvenience” is not what you want your helpdesk engineers to use as a standard phrase. Averages or sampling also hide the real problems you have in your system. Check out the blog post Why Averages Suck by Michael Kopp for more detail; a small numeric illustration also follows the screenshots below.

Measuring end user performance of every customer interaction allows for quick identification of regional problems with CDNs, 3rd Parties or Latency.

Having 100% user interactions and transactions available makes it easy to identify the root cause for individual users
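Here is the numeric illustration promised above (with invented numbers): if 1% of users wait around 10 seconds while everyone else gets ~300 ms, the mean barely moves, and only percentiles expose the suffering minority:

    # Averages vs. percentiles: 1% of users waiting ~10 s barely moves the
    # mean of an otherwise healthy ~300 ms page, but the 99th percentile
    # (and the unlucky users) see it clearly.
    import random
    from statistics import mean, quantiles

    random.seed(42)
    times = [random.gauss(300, 50) for _ in range(990)]        # healthy majority
    times += [random.gauss(10_000, 1_000) for _ in range(10)]  # 1% unlucky users

    p = quantiles(times, n=100)            # p[i] is the (i+1)-th percentile
    print(f"mean   : {mean(times):7.0f} ms")   # ~400 ms, still looks acceptable
    print(f"median : {p[49]:7.0f} ms")         # ~300 ms
    print(f"95th   : {p[94]:7.0f} ms")
    print(f"99th   : {p[98]:7.0f} ms")         # exposes the problem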

#3: Business Visibility

As the business had a growing interest in the success of the eCommerce platform, IT had to demonstrate to the business what it took to fulfill their requirements, and how business requirements are impacted by the availability of, or the lack of investment in, the application delivery chain.

Correlating the number of Visits with Performance on incoming Orders illustrates the measurable impact of performance on revenue and what it takes to support business requirements.

#4: Impact of 3rd Parties and CDNs

It was important to not only track transactions involving their own Data Center but ALL user interactions with their web site – even those delivered through CDNs or 3rd parties. All of these interactions make up the user experience and therefore ALL of it needs to be analyzed.

Seeing the actual load impact of 3rd party components or content delivered from CDNs enables IT to pinpoint user experience problems that originate outside their own data center.

#5: Across the lifecycle – supporting collaboration and tearing down silos

The APM initiative was started because Swarovski reacted to problems happening in production. Fixing these problems in production is only the first step. Their ultimate goal is to become proactive by finding and fixing problems in development or testing – before they spill over into production. Instead of relying on different sets of tools with different capabilities, the requirement is to use one single solution designed to be used across the application lifecycle (Developer Workstation, Continuous Integration, Testing, Staging and Production). This makes it easier to share application performance data between lifecycle stages, allowing individuals not only to look at data from other stages but also to compare data to verify the impact and behavior of code changes between version updates.

Continuously catching regressions in Development by analyzing unit and performance tests allows application teams to become more proactive.

Pinpointing integration and scalability issues, continuously, in acceptance and load testing makes testing more efficient and prevents problems from reaching production.

#6: Down to the source code

In order to speed up problem resolution, Swarovski’s operations and development teams require as much code-level insight as possible – not only for their own engineers who are extending the Intershop eCommerce platform, but also for Intershop to improve their product. Knowing which part of the application code is not performing well, with which input parameters, or under which specific load on the system eliminates tedious reproduction of the problem. The requirement is to lower the Mean Time To Repair (MTTR) from as much as several days down to only a couple of hours.

The SAP Connector turned out to have a performance problem. This method-level detailed information was captured without changing any code.

#7: Zero/Acceptable overhead

“Who are we kidding? There is nothing like zero overhead, especially when you need 100% coverage!” – those were René’s words when this requirement was discussed. And he is right: once you start collecting information from a production system, you add a certain amount of overhead. A better term would be “imperceptible overhead” – overhead so small that you don’t notice it.

What is the exact number? It depends on your business and your users. The number should be worked out from the impact on the end user experience, rather than additional CPU, memory or network bandwidth required in the data center. Swarovski knew they had to achieve less than 2% overhead on page load times in production, as anything more would have hurt their business; and that’s what they achieved.
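One way to approximate overhead measured the way Swarovski defines it – impact on page load time rather than data center resources – is to time the same page with the agent disabled and then enabled, and compare the medians. The URL and run count below are placeholders, and both series must of course be taken under comparable load:

    # Approximate agent overhead as the change in median page load time.
    # URL and run count are placeholders; both series need comparable load.
    import time
    import urllib.request
    from statistics import median

    def sample(url, n=30):
        times = []
        for _ in range(n):
            start = time.perf_counter()
            urllib.request.urlopen(url).read()
            times.append(time.perf_counter() - start)
        return median(times)

    base = sample("https://shop.example.com/home")     # measured with agent off
    # ... enable the agent on the server, warm it up, then measure again ...
    with_agent = sample("https://shop.example.com/home")
    print(f"overhead on median page load: {(with_agent - base) / base:.1%}")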

#8: Centralized data collection and administration

Running a distributed eCommerce application that gets potentially extended to additional geographical locations requires an APM system with a centralized data collection and administration option. It is not feasible to collect different types of performance information from different systems, servers or even data centers. It would either require multiple different analysis tools or data transformation to a single format to use it for proper analysis.

Instead of this approach, a single unified APM system was required by Swarovski. Central administration is equally important as they need to eliminate the need to rely on remote IT administrators to make changes to the monitored system, for example, simple tasks such as changing the level of captured data or upgrading to a new version.

Storing and accessing performance data in a single, centralized repository enables fast and powerful analytics and visualization. For example, system metrics such as CPU utilization can be correlated with end-user response time or database execution time – all displayed on one single dashboard.

#9: Auto-Adapting Instrumentation without digging through code

As the majority of the application code is not developed in-house but provided by Intershop, it is mandatory to get insight into the application without doing any manual code changes. The APM system must auto-adapt to changes so that no manual configuration change is necessary when a new version of the application is deployed.

This means Swarovski can focus on making their applications positively contribute to business outcomes, rather than spending time maintaining IT systems.

#10: Ability to extend

Their application lives in an always growing, ever-changing IT environment. What might be deployed on physical boxes today might be moved to virtualized environments or even into a public cloud tomorrow.

Whatever the extension may be, the APM solution must be able to adapt to these changes and also be extensible to consume new types of data sources – e.g., performance metrics from Amazon Cloud Services or VMware, Cassandra or other Big Data solutions, or even legacy mainframe applications – and then bring these metrics into the centralized data repository to provide new insights into the application’s performance.

Extending the application monitoring capabilities to Amazon EC2, Microsoft Windows Azure, a public or private cloud enables the analysis of the performance impact of these virtualized environments on end user experience.

The Solution and the Way Forward

Needless to say, Swarovski took the first step by implementing APM as a new process and mindset in their organization. They are now in the next phase: implementing a Performance Center of Excellence. This allows them to move from Reactive Performance Troubleshooting to Proactive Performance Prevention.

Stay tuned for more blog posts on the Performance Center of Excellence and how you can build one in your own organization. The key message is that it is not just about using a bunch of tools; it is about living and breathing performance throughout the organization. If you are interested in this, check out the blogs by Steve Wilson: Proactive vs Reactive: How to prevent problems instead of fixing them faster and Performance in Development is the Chief Cornerstone.


Quick video intro from Compuware APM: Unified network and application performance solution for today’s data centers

A new video introduction from the Compuware APM team, published on YouTube, titled ‘Unified Network & Application Performance Solution for Today’s Data Centers.’


Introducing the new web performance project: Speed of the web

From Alois Reitbauer at DynaTrace Compuware APM.

I am excited about the launch of a new project in the Web Performance space. With SpeedoftheWeb we provide a free benchmarking and optimization service that calculates key performance indicators (KPIs) for industry verticals like Retail, Health, Media or Travel.

The idea behind the project is that Web performance also depends on the type of service your site provides. A simple static page is different from a content-rich site with a lot of interactive parts. The main questions are: how am I doing compared to my competition, and where can I improve? SpeedoftheWeb answers exactly these questions.

You can get a free report showing how you do against the top sites in your industry across the whole Web application delivery chain. We start from the user’s perspective by showing how long it takes to see the page or fully load it. Then we dive into how individual components like JavaScript, content or server-side processing contribute to the user experience, explicitly pointing out where you have to optimize.

Performance across the Web App Delivery Chain and where to improve

For a total of 15 Web performance KPIs we answer not only how good you are but also what the range in your industry is. Often it is hard to specify performance KPIs because you do not know what the ideal site should look like. SpeedoftheWeb provides exactly this information. Below you see an example of how the JavaScript execution time of a page relates to equivalent pages in the industry.

JavaScript execution time compared to the competition
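To make “what is the range in my industry” concrete: given the JavaScript execution times of the top sites in a vertical, your percentile rank is simply the share of sites you beat. A tiny sketch with invented numbers:

    # Where does my site's JavaScript execution time sit in the industry range?
    from bisect import bisect_left

    industry_ms = sorted([180, 220, 250, 310, 340, 420, 480, 510, 640, 900])
    my_site_ms = 400                                   # invented numbers

    faster_than_me = bisect_left(industry_ms, my_site_ms)  # sites with lower JS time
    beaten = len(industry_ms) - faster_than_me             # sites we are faster than
    print(f"we beat {beaten} of {len(industry_ms)} top sites "
          f"({beaten / len(industry_ms):.0%}); "
          f"industry range: {industry_ms[0]}-{industry_ms[-1]} ms")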

Getting better is also about learning from the best. That is why we tell you how many of the top sites in the field are better than you and what the best sites for each KPI are. Get insight into what these sites are doing and learn what their secret sauce is.

SpeedoftheWeb will also help you justify investing in Web performance in a way that management will understand. You always wanted to get rid of that 2 MB Flash video on your start page? Show management that it makes you more than 1 second slower than the top pages in your industry.

SpeedoftheWeb provides several testing locations around the globe, enabling you to get data from where your users are. You can even use this data to compare performance across multiple locations.

All reports are persisted in our Cloud storage and can be accessed via a web browser, so you can easily share them with your colleagues. We did our best to polish them up visually, so you don’t have to put a lot of makeup on them before showing them to your boss – as is so often necessary with performance data.

There is even more that SpeedoftheWeb can do for you. Knowing in which area to improve is good, but knowing exactly what to do is even better. Therefore we automatically record an Ajax Edition session which can be downloaded for deep-dive analysis; if something is slow, you will figure it out. A nice bonus is that you can now also record Ajax Edition sessions from around the globe for free.

Detailed Diagnostics Data in Ajax Edition

I am very excited about this new service and I hope it provides a lot of value to the Web performance community. If you have ideas on how to improve it, just let me know. If you want to gain deeper insight into how performance differs across various industries, I recommend checking out this presentation.

Don’t forget to visit www.speedoftheweb.org now and see what it can do for you. Enjoy using SpeedoftheWeb and provide feedback to make this an even better service.
