From apmblog.dynatrace.com, February 11, 2015
If you are responsible for keeping your SharePoint deployment healthy, I assume that “traditional” system monitoring, whether via SCOM, the Performance Monitor or other tools, is at the top of your list. But if your first reaction to constant high CPU, exhausted memory or full disks is to ask for more hardware, then your actions are “too traditional”. Adding more hardware will certainly make your system healthier, but it comes with a price tag that might not be necessary.
In this first blog about SharePoint Sanity Checks, I show you that there are ways to figure out which sites, pages, views, custom or 3rd party Web Parts (from AvePoint, K2, Nintex, Metalogix …) in your SharePoint environment are wasteful with resources so that you can fix the root cause and not just fight the symptom.
Step #1: Server System Health Check
The first question must always be: How healthy are the Windows Servers that run your SharePoint Sites?
Not only must you look at Windows OS Metrics such as CPU, Memory, Disk and Network Utilization, you also need to monitor the individual SharePoint AppPool worker processes (w3wp.exe) to figure out whether you have individual sites that overload this server. The following is a screenshot that shows this information on a single dashboard.
Let me give you some recommendations on what to look out for and what to do in each case:
Bad AppPools: If some of your SharePoint AppPools consume too many resources on that machine, consider deploying them to a different server. You don’t want to cram too many heavily utilized SharePoint sites onto a single server and suffer from the cross impact of these sites.
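A minimal sketch of how such a check could look, assuming you already collect one CPU and memory sample per w3wp.exe worker process and know which AppPool each belongs to. The `WorkerSample` type, the `heavy_app_pools` function, the thresholds and the sample values are all hypothetical; how the samples are gathered (SCOM, Performance Monitor or another tool) is left open.

```python
from dataclasses import dataclass


@dataclass
class WorkerSample:
    app_pool: str       # AppPool the w3wp.exe process belongs to (assumed known)
    cpu_percent: float  # CPU used by the process, in percent of the machine
    memory_mb: float    # working set of the process, in MB


def heavy_app_pools(samples, cpu_share=0.5, memory_share=0.5):
    """Return AppPools that use more than the given share of the CPU or
    memory consumed by all sampled worker processes together."""
    total_cpu = sum(s.cpu_percent for s in samples)
    total_mem = sum(s.memory_mb for s in samples)
    flagged = []
    for s in samples:
        cpu_heavy = total_cpu > 0 and s.cpu_percent / total_cpu > cpu_share
        mem_heavy = total_mem > 0 and s.memory_mb / total_mem > memory_share
        if cpu_heavy or mem_heavy:
            flagged.append(s.app_pool)
    return flagged


# Hypothetical samples for three AppPools on one server.
samples = [
    WorkerSample("Intranet", 62.0, 3100.0),
    WorkerSample("TeamSites", 9.0, 900.0),
    WorkerSample("MySites", 4.0, 700.0),
]
print(heavy_app_pools(samples))
```

An AppPool flagged here is only a candidate for redeployment; it is still worth checking first whether configuration or code is the real cause.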
If you see high disk utilization it is important to check what is causing it. I typically look closer at:
- IIS: Is the web server busy with serving too much static content? If that’s the case make sure you have configured resource caching. That reduces static requests from users that use SharePoint often. Also check the log settings of IIS and the Modules loaded by IIS. Make sure you only log what you really need.
- SQL Server: Is SQL Server running on the same machine as SharePoint and maybe even hosting other databases? Talk with your DBA about checking that SQL Server is properly configured, and discuss a better deployment scenario such as putting the SharePoint Content Database on its own SQL Server.
- SharePoint: Check the generated log files. I often see people increasing log levels for different reasons but then forgetting to turn them back to the default, resulting in large amounts of data that nobody looks at anyway.
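The log checks above can be approximated with a small script that measures how much disk space each log folder takes. This is only a sketch: the function names and the size limit are assumptions, and the folders to scan (IIS logs, SharePoint logs) depend on how your farm is configured.

```python
import os


def directory_size_mb(path):
    """Sum the size of all files below path, in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file was rotated or removed while scanning
    return total / (1024 * 1024)


def oversized_log_dirs(paths, limit_mb):
    """Return {path: size_mb} for every log folder above limit_mb."""
    result = {}
    for path in paths:
        size = directory_size_mb(path)
        if size > limit_mb:
            result[path] = size
    return result
```

Running `oversized_log_dirs` on a schedule with your own folder list and limit gives an early hint that a log level was left turned up.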
When CPU utilization is high, the first thing I look at is whether the CPU is consumed by one of the SharePoint AppPools or by other services running on that same machine.
- SharePoint: This correlates to what I wrote under Bad AppPools. If the reason is too much load on an AppPool, consider deploying it on a different machine. Before you do this, please follow my additional recommendations later in this blog to verify whether fixable configuration or coding issues are to blame.
- SQL Server: Do you have SharePoint Sites or individual Pages that cause extra high utilization on the SQL Server? If that is the case, follow my recommendations on how to identify bad pages or Web Parts that have excessive access to the database. In general you should talk with the DBA to do a performance sanity check.
- Other processes? Do you have other services running on that box that spike CPU? Are there batch or reporting jobs that can be deployed on a different server?
High network utilization comes down to the same suspects as above:
- IIS: Analyze how “heavy” your SharePoint pages are. Follow the general best practices on Web Performance Optimization by making your sites “slimmer”. Make sure you have content compression features turned on and content caching properly configured.
- SharePoint: Besides talking to the database, what other services does your SharePoint instance interact with? Do you have Web Parts communicating with an external service? If that is the case, make sure that these remote service calls are optimized, e.g., cache already fetched data or only query data that you really need.
- SQL Server: Analyze which SharePoint Sites/Services request data as well as which other applications request data. Optimize data access or consider redeploying SQL Server to optimize the data transfer between the application and the database server.
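For the IIS point about compression and caching, a quick check is to look at the HTTP response headers of a few static resources. The sketch below assumes you have already captured the headers as a dictionary (for example from browser developer tools); the function name and the wording of the findings are hypothetical.

```python
def static_content_findings(headers):
    """Return a list of findings for one HTTP response, given its headers."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    findings = []

    encoding = h.get("content-encoding", "").lower()
    if not any(token in encoding for token in ("gzip", "deflate", "br")):
        findings.append("no content compression")

    cache_control = h.get("cache-control", "").lower()
    if not cache_control and "expires" not in h:
        findings.append("no caching headers")
    elif "no-store" in cache_control or "no-cache" in cache_control:
        findings.append("caching disabled")

    return findings


print(static_content_findings({"Content-Encoding": "gzip",
                               "Cache-Control": "public, max-age=86400"}))
print(static_content_findings({"Cache-Control": "no-cache"}))
```

Keep in mind that a server only compresses a response if the request advertised support via `Accept-Encoding`, so capture the headers from a real browser request.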
Step #2: IIS Health Check
I already covered some IIS metrics in Step 1, but I want you to have a closer look at IIS-specific metrics such as current load, available vs. used worker threads and bandwidth requirements:
These are the metrics I always check to validate how healthy the IIS deployment is:
- Average Page Response Size: If you have bloated websites your IIS is serving too much data. That not only clogs the network, but it also makes the end user wait longer for these pages to load. Keep an eye on the average page size. Especially after deploying an update make sure pages don’t get too big. I suggest performing constant Web Performance Sanity Checks on your top pages.
- Thread Utilization: Have you sized your IIS correctly in terms of worker threads? Are all the busy threads really busy or just waiting on slow performing SharePoint requests? Check out my sections on Top Web Server and Top App Server Metrics of my recent Load Testing Best Practices blog.
- Bandwidth Requirement: Is your outbound network pipe already a bottleneck? If that’s the case, do not blindly upgrade your infrastructure but first check if you can optimize your page sizes as explained earlier.
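If you do not have a dashboard for the average page response size, it can be estimated from IIS log files in W3C extended format. The sketch below assumes the `cs-uri-stem` and `sc-bytes` fields are enabled in your IIS logging configuration; the function name and the sample log lines are hypothetical.

```python
from collections import defaultdict


def average_response_size(log_lines):
    """Return {url: average sc-bytes} from W3C extended log lines."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # url -> [total bytes, request count]
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows
        row = dict(zip(fields, values))
        if "cs-uri-stem" not in row or not row.get("sc-bytes", "").isdigit():
            continue
        entry = totals[row["cs-uri-stem"]]
        entry[0] += int(row["sc-bytes"])
        entry[1] += 1
    return {url: total / count for url, (total, count) in totals.items()}


# Hypothetical log excerpt for illustration only.
sample_log = [
    "#Fields: date time cs-uri-stem sc-status sc-bytes time-taken",
    "2015-02-11 10:00:00 /sites/hr/default.aspx 200 400000 850",
    "2015-02-11 10:00:05 /sites/hr/default.aspx 200 600000 900",
    "2015-02-11 10:00:07 /_layouts/15/core.js 200 100000 20",
]
print(average_response_size(sample_log))
```

Comparing these averages before and after a deployment shows quickly whether pages have grown.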
Step #3: Component Health Check
What I mentioned in the first 2 steps actually falls into “traditional” system monitoring with some additional insight on metrics that go beyond normal resource utilization monitoring. If resources are maxed out I always want to find out which components are actually using these resources. Why? Because we should first try to optimize these components before we give them more resources. I look at the following dashboard for a quick sanity check:
A good SharePoint health metric is the response time of SharePoint pages. If I see spikes, I know we jeopardize user adoption of SharePoint and I know I need to treat this with high priority. I look at the following metrics and data points to figure out what causes these spikes, which most often correlate directly with higher resource consumption such as Memory, CPU, Disk and Network:
- Memory Usage and Garbage Collection Impact: High memory usage alone is not necessarily a problem. The problem arises when more memory is requested and the Garbage Collector needs to kick in and clear out a lot of old memory. That’s why I always keep an eye on overall memory usage patterns and the amount of time spent in Garbage Collection (GC). GC both hurts response time and consumes a lot of CPU.
- Which Pages are Slow? Trying to figure out why individual pages are slow is often easier than trying to figure out why on average the system is slower. I don’t waste time though focusing on a single slow page that is just used by a single user. Instead I focus on those pages that are slower than expected but also used by a lot of users. Optimizing them gives me more improvements for a larger audience.
- Problematic Web Parts? SharePoint is built on Web Parts, whether they come from Microsoft, well known 3rd party providers (AvePoint, K2, Nintex, Metalogix …), or your own development team. Knowing which Web Parts are used and how slow they are allows you to focus even better. Too many times I have seen “Web Parts Gone Wild” caused by bad configuration or bad implementation. Check out my Top 5 SharePoint Performance Mistakes and you will understand why that is a big problem.
Slow Web Parts and pages can be caused by bad deployments, wrong configuration or really just bad coding. This is what I am going to focus on in my next blog post!
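The idea of focusing on pages that are both slow and heavily used can be expressed as a simple ranking. The sketch below is one possible scoring, not a fixed rule: it multiplies the time above an expected response time by the number of requests. The function name, the 2000 ms expectation and the sample numbers are assumptions.

```python
def rank_pages_by_impact(page_stats, expected_ms=2000):
    """Rank pages by total excess wait time across all requests.

    page_stats maps url -> (request_count, avg_response_ms).
    Pages at or below expected_ms are left out.
    """
    impact = []
    for url, (count, avg_ms) in page_stats.items():
        excess = max(0.0, avg_ms - expected_ms)
        if excess > 0:
            impact.append((url, count * excess))
    impact.sort(key=lambda item: item[1], reverse=True)
    return impact


# Hypothetical numbers: one very slow page with a single hit,
# one moderately slow page with many hits, one fast page.
stats = {
    "/sites/archive/report.aspx": (1, 30000),
    "/sites/hr/default.aspx": (5000, 3500),
    "/sites/news/home.aspx": (10000, 800),
}
for url, score in rank_pages_by_impact(stats):
    print(url, score)
```

With these numbers the moderately slow but popular page ranks well above the very slow page that only one user requested, which matches the prioritization described above.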
Next Steps: Fix the Problem; Don’t Just Buy More Hardware
I am interested to hear what you think about these metrics, so please share the ones you use with me. In the next blog I will cover how to go deeper into SharePoint to identify the root cause of an unhealthy or slow system. Our first action should never be to just throw more hardware at the problem, but rather to understand the issue and optimize the situation.
If you want to see some of Andreas’ steps in action, watch his 15 Minute SharePoint System Performance Sanity Check Video.
When it comes to SharePoint end-user performance, there are several issues facing SharePoint teams:
- They have no visibility of what performance is being delivered to all users.
- They don’t know what percentage of users actually use SharePoint or what content they are accessing.
- They know performance is worse for overseas offices, but cannot quantify it.
- They are not able to establish what the impact on performance will be of code, configuration, upgrades and hardware changes before going live.
- They don’t know there is a performance issue unless users complain.
What WebTuna, the SharePoint end-user monitoring tool, provides…
- Each and every user’s actual performance at all times from all locations.
- Which content is being accessed, by whom and when.
- Real page load times of users hitting the SharePoint site in real-time and historically. Every page view from every user is captured.
- Geographical map showing usage by country, office and individual users, and highlights in real-time regions that experience poor load times.
- User performance broken down by country, browser type, operating system, page title, URL and many more entities.