Foliotek Developer Blog

Page Level Web Application Database Optimization

Do you know which queries have to run for a single page load in your application? If not, perhaps you should.

That is a great question, and one I wish I had known to ask myself a long time ago.

Over the past several years, I have enjoyed the challenge of database query optimization. I work for a small company, and no one else seems to have much of an interest in this sort of thing, so that means I get to do it. It is certainly interesting to see how a small rewrite of a query or a new index can achieve significant results. However, I recently began looking at optimization from a different perspective–that of the individual page.

How many queries are required to load a single page? How many should it take?

To identify expensive queries that needed optimization, I had been running SQL Profiler sessions that caught queries with more than 1,000, 5,000, or maybe 10,000 reads, or with more than 500 CPU. One day, however, I decided to see every query that ran on a given page load, so I set the Reads threshold all the way down to 1. I also pointed the web application running on my machine at a test database and restricted the Profiler session accordingly.

From SQL Server Management Studio:

  • Tools > SQL Server Profiler
  • Enter Database credentials, then click Connect
  • Event Selection tab
  • Check “Show all events” and “Show all columns”
  • Column Filters - DatabaseName: TestDBName
  • HostName: ComputerName (ws###)
  • Reads: 1
  • Run

After clicking “run” and loading the first page, I was literally stunned.

One page was running far more queries than I had ever imagined. But why? I started looking through the results. The column “TextData” shows the first 100 or so characters, and then the window on the bottom shows the full query. Suffice it to say that there were many opportunities for optimization.

What I Learned

Repetitive Queries

The first thing I noticed was that many page loads ran redundant queries. A page would query the database for the User or Company multiple times, sometimes one request right after another. None of this was intended; we had simply never realized how much load on the database certain ways of doing things were creating.

Some of the biggest things to optimize are side menus, top menus, and footers–anything that is displayed on every page. Any inefficiency in these is paid every time anyone loads any page, which also means that removing it can have a huge impact. If 100 pages are hit every second, and 20 queries can be eliminated from each page load, you have just eliminated 2,000 queries a second.

Many of these queries are small, to be sure. It takes very little if any CPU and only a few reads to retrieve a User from the User table. However, the cumulative increased traffic can be significant.

In many cases, I realized that I could just pass the user to a method that accessed a particular property on the row instead of getting the user inside the method itself. Or I could just pass in the property itself. Alternatively, I could create a page-level (or control-level [.ascx]) variable for User or Company or other tables used in many places.

private DataAccess.User _currentUser;
public DataAccess.User CurrentUser
{
    get
    {
        // Load the user from the database only on first access,
        // then reuse the cached copy for the rest of the page lifecycle.
        if (_currentUser == null)
            _currentUser = DataAccess.User.Get(LoggedInUserID);
        return _currentUser;
    }
}

Unused Queries

It took some time, but I matched every query to the line of code that generated it. I found a few places where a query was called, but then the result was never used. This, of course, is bound to happen in a large application as modifications are introduced. But again, removing these for every page load can certainly help.

Unnecessary Queries

I also encountered a few places where unnecessary queries were being run. For instance, upon login, the system should check the expiration date of the user’s registration. That was the intent of the code. However, at some point, the code was moved and ended up getting run on every page load. So, the code was necessary on login, but not on every page. I never would have known it ran on every page had I not seen it for myself in the SQL Profiler session.

Final Thoughts

It is, indeed, rewarding to identify a very expensive query and optimize it down to something very small. Yet, I’ve learned that there are not only gigantic weeds. Sometimes there are millions of smaller weeds. It is better to get rid of both.


Index Recommendations from the Database Engine Tuning Advisor Should Not Be Accepted Blindly

The Problem

I ran a SQL Profiler session on one of our SQL databases today to catch any expensive queries and see if they could be optimized. One query that showed up selected the top 30 rows from a table where a certain bit column was set to true, sorted oldest first. The table being queried had close to 100,000 rows in it, and each execution took about 6,000 reads.

The query looked something like this:

 SELECT TOP (30) *  
 FROM TheTable  
 WHERE BitColumn = 1  
 ORDER BY DateColumn  

The Suggested Solution

My inclination was that all I needed to do was add a very small index on the BIT column used in the WHERE clause. However, I went ahead and ran the query through the Database Engine Tuning Advisor, just to see what it came up with.

The index recommendation suggested an index on three columns, one of which was, in fact, the one in the WHERE clause. The Date column was also included along with the table’s Primary ID. The estimated improvement was 99%, which was great and what I had expected.

However, the suggested index also listed every other column of the table in its INCLUDE clause. As a result, the new index was projected to take up 68.5 MB of space. That's quite a bit of space for an index on a bit column.
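
To give a sense of the trade-off, the recommendation was shaped roughly like the sketch below. The key columns reflect what the Advisor suggested; the index name and the INCLUDE column names are illustrative placeholders, since the point is simply that every remaining column of the table appeared in the INCLUDE list.

 -- Roughly the shape of the Tuning Advisor recommendation (index and INCLUDE names are illustrative)
 CREATE NONCLUSTERED INDEX _dta_index_TheTable
 ON TheTable (BitColumn, DateColumn, TheTableID)
 INCLUDE (ColumnA, ColumnB, ColumnC /* ...and every other column in the table... */);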

The Employed Solution

Consequently, I went into SQL Server Management Studio and just added a simple index on the column used in the WHERE clause. Then I checked the size of the new index and found that it was only 1.75 MB–significantly less than 68.5 MB.
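
The statement itself was nothing fancier than a single-column nonclustered index, something along these lines (the index name is my own; TheTable and BitColumn match the simplified query above):

 CREATE NONCLUSTERED INDEX IX_TheTable_BitColumn
 ON TheTable (BitColumn);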

The query now requires only 14 reads, compared with the nearly 6,000 it used before, and I haven’t added 68 MB of index storage to the database.

Final Remark

So, there are certainly times when the recommendations from the Database Engine Tuning Advisor would be perfectly fine to accept. Each case is different, of course. In this case, adding 68 MB of index storage just was not worth it; a simple index added manually gave the optimization we needed without really impacting the size of the database. Of course, 68 MB is not a lot on its own, but adding that 10 or 20 or 50 times when it is not necessary would certainly begin to add up.

So, the Database Engine Tuning Advisor is definitely helpful, but it should probably be used as a guide rather than as something not to be questioned.


Identifying Unused Indexes in a SQL Server Database

One of our databases began timing out during our weekly index rebuild job, so I began to look into what indexes were getting rebuilt. After looking at the Index Usage Report, I noticed that some of them were not even being used. Some of them were very large indexes as well.

Obviously, an index that is never used doesn’t really need to be rebuilt to improve performance, so I removed a few of them. Then I began to look more closely at how I could identify other indexes that were not being used, aside from looking through every index on the SQL Report.

Unused Indexes with no reads in the Index Usage Stats Table

From research I did online, I saw that the system view sys.dm_db_index_usage_stats contains usage statistics for indexes. Two columns in this view are of particular interest, namely user_scans and user_seeks. If both of these values are zero, then the system probably isn't using the index.

It is probably a good idea to at least look at a few other columns as well, including user_lookups, system_seeks, system_scans, and system_lookups.

I created a query that identifies unused indexes based on these criteria and outputs them in a fairly readable format. I was unable to find any other query online that presents everything quite this nicely, so I hope you find it helpful.

 -- GET UNUSED INDEXES THAT APPEAR IN THE INDEX USAGE STATS TABLE
 DECLARE @MinimumPageCount int
 SET @MinimumPageCount = 500

SELECT Databases.name AS [Database],
 object_name(Indexes.object_id) AS [Table],
 Indexes.name AS [Index],
 PhysicalStats.page_count AS [Page_Count],
 CONVERT(decimal(18,2), PhysicalStats.page_count * 8 / 1024.0) AS [Total Size (MB)],
 CONVERT(decimal(18,2), PhysicalStats.avg_fragmentation_in_percent) AS [Frag %],
 PartitionStats.row_count AS [Row Count],
 CONVERT(decimal(18,2), (PhysicalStats.page_count * 8.0 * 1024)
 / PartitionStats.row_count) AS [Index Size/Row (Bytes)]
 FROM sys.dm_db_index_usage_stats UsageStats
 INNER JOIN sys.indexes Indexes
 ON Indexes.index_id = UsageStats.index_id
 AND Indexes.object_id = UsageStats.object_id
 INNER JOIN sys.databases Databases
 ON Databases.database_id = UsageStats.database_id
 INNER JOIN sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL)
 AS PhysicalStats
 ON PhysicalStats.index_id = UsageStats.index_id
 AND PhysicalStats.object_id = UsageStats.object_id
 INNER JOIN sys.dm_db_partition_stats PartitionStats
 ON PartitionStats.index_id = UsageStats.index_id
 AND PartitionStats.object_id = UsageStats.object_id
 WHERE UsageStats.user_scans = 0
 AND UsageStats.user_seeks = 0
 -- ignore indexes that take up less than a certain number of pages
 AND PhysicalStats.page_count > @MinimumPageCount
 -- exclude clustered indexes (typically the primary key), which should not be removed
 AND Indexes.type_desc != 'CLUSTERED'
 ORDER BY [Page_Count] DESC

This query includes the following helpful information on the unused indexes:

  • Table Name
  • Index Name
  • Page Count
  • Total Index Size (MB)
  • Fragmentation (%)
  • Row Count
  • Index Size/Row (Bytes)

Unused Indexes with no entry in the Index Usage Stats Table

While researching on other sites, I noticed that some indexes somehow never get listed in sys.dm_db_index_usage_stats at all, so running the first query by itself is not really enough. In fact, the SQL Index Usage Report doesn't list all the indexes either. So, I created another query to identify these indexes, again joining several tables to make the output more useful.

 -- GET UNUSED INDEXES THAT DO NOT APPEAR IN THE INDEX USAGE STATS TABLE
 DECLARE @dbid INT
 SELECT @dbid = DB_ID(DB_NAME())

SELECT Databases.name AS [Database],
 Objects.name AS [Table],
 Indexes.name AS [Index],
 Indexes.index_id,
 PhysicalStats.page_count AS [Page Count],
 CONVERT(decimal(18,2), PhysicalStats.page_count * 8 / 1024.0) AS [Total Index Size (MB)],
 CONVERT(decimal(18,2), PhysicalStats.avg_fragmentation_in_percent) AS [Fragmentation (%)]
 FROM sys.indexes Indexes
 INNER JOIN sys.objects Objects ON Indexes.object_id = Objects.object_id
 LEFT JOIN sys.dm_db_index_physical_stats(@dbid, NULL, NULL, NULL, NULL) PhysicalStats
 ON PhysicalStats.object_id = Indexes.object_id
 AND PhysicalStats.index_id = Indexes.index_id
 INNER JOIN sys.databases Databases
 ON Databases.database_id = PhysicalStats.database_id
 WHERE Objects.type = 'U' -- user tables only
 AND Indexes.is_primary_key = 0
 AND Indexes.index_id NOT IN (
 SELECT UsageStats.index_id
 FROM sys.dm_db_index_usage_stats UsageStats
 WHERE UsageStats.object_id = Indexes.object_id
 AND Indexes.index_id = UsageStats.index_id
 AND UsageStats.database_id = @dbid)
 ORDER BY PhysicalStats.page_count DESC,
 Objects.name,
 Indexes.index_id,
 Indexes.name ASC

When to Remove an Index

Just because an index is not being used does not necessarily mean it should be removed. The queries above ignore the clustered indexes on the tables, since there are very few times, if any, when a table should not have any index at all. There may be other reasons to keep an index around as well. Those decisions are best made by those that have a knowledge of how the data is used and will be used in the future.
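
When you do decide that an index can safely go, removing it is a single statement. The names below are hypothetical placeholders rather than anything from our databases:

 -- Hypothetical example: drop an index after confirming it is truly unused
 DROP INDEX IX_Orders_LegacyStatus ON dbo.Orders;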

Conclusion

I was able to use these two queries to get rid of a lot of dead weight in several of our databases. In one of them, in particular, I was able to remove almost 1 GB of space. That’s 1 GB less for every backup, so it adds up over time.

One last note… After removing indexes, it's probably a good idea to run SQL Profiler against the database, just to be safe, to catch any queries that have become expensive because they were relying on an index that should not have been deleted. See my other post, SQL Server Database Optimization – a Beginner's Guide.

For more tips on managing a SQL Server Database, see Managing a SQL Server Database: Tips for Beginners


SQL Server Database Optimization - a Beginner's Guide

Over the past month, I've been analyzing our SQL Server Database activity to identify and resolve any significantly inefficient queries. Through the process I've learned a few things that may help other people who are doing the same work. Let me know what you think.

Optimization Strategy


  • Focus on Reads and CPU.

If a query has a high number of reads, then it is likely that either a table is not indexed properly or a query is written in such a way that appropriate indexes cannot be used. If the CPU is particularly high, a variety of factors could be at issue.

  • Identify BOTH expensive individual queries AND expensive aggregate queries.

When I started, I just focused on identifying the queries with the highest read counts and CPU. That's certainly helpful when very expensive queries can be significantly optimized. However, an expensive query that rarely runs is probably not as problematic as a moderately expensive query that runs very often.

For instance, one query may run about once an hour and have 100,000 reads and 500 CPU. Yet, another query may only have 25,000 reads and 125 CPU, but runs 100 times an hour. The cumulative hourly expense would therefore be 2,500,000 reads and 12,500 CPU; a much higher expense than the query that only runs once an hour.

Optimization Process

  • Create a Baseline for your database

As you achieve performance improvements, you will want to compare them with how the system performed prior to the improvements being made. Having a baseline will allow you to quantify the improvements made in real numbers. Seeing significant improvements is very satisfying and well worth the time.

- Create a baseline copy of the database – This will allow you to compare how new indexes impact performance.
- Store a run of SQL Profiler as a baseline – Profiler allows you to write results to a database table. Name it “Baseline” or something like that. I write to a test database to isolate that work from development. It may be good to do two runs–one that perhaps catches all queries with more than 1,000 reads and one for all queries with more than 100 CPU–for a time period of 30 minutes. (These numbers certainly depend on the size and structure of your particular database. You may want to do some preliminary runs to figure out the best numbers for your database.)

1. From SQL Server Management Studio => Tools => SQL Server Profiler
2. Select the database
3. Click “Save to Table” on the General tab. Select the development database. Table name = “Baseline” (or BaselineReads5000 and BaselineCPU100)
4. Events Selection tab: TSQL – SQL:BatchCompleted, and Stored Procedures – RPC:Completed (if you use stored procedures)
5. Column Filters:
   - Run 1: Reads – Greater than or equal – 5000 (check “Exclude rows that do not contain values”)
   - Run 2: CPU – Greater than or equal – 100 (check “Exclude rows that do not contain values”)

  • Identify Expensive Queries

Query your baseline run of SQL Profiler. The following query could be of great help. It aggregates the total reads and total CPU, allowing you to sort by those columns. This allows you to see which queries are the most expensive from an aggregate perspective. Some queries that are rarely run still may show up in such sorts. This really emphasizes the importance of tackling such queries.

The key grouping is on SUBSTRING(Textdata, 1, 150). Each query that is run typically has variables appended to the end of the query. Comparing the first part of the query can allow identical queries with different parameters to be grouped together to provide aggregate results. You may need to play with the number of characters from time to time.

I also group by DatabaseName, since multiple databases may be stored on the same server.

 SELECT DatabaseName,  
 COUNT(*) AS [Query Count],  
 SUM(CPU) AS [CPU Query Total],  
 AVG(CPU) AS [CPU Per Query],  
 SUM(Reads) AS [Reads Query Total],  
 AVG(Reads) AS [Reads Per Query],  
 SUBSTRING(Textdata, 1, 150) AS Query  
 FROM My_Database  
 GROUP BY DatabaseName, SUBSTRING(Textdata, 1, 150)  
 ORDER BY [Reads Query Total] DESC

This is another query that is a little more complex, but it adds some interesting information. Specifically, it gets the totals across all queries (that met the read or CPU threshold) for query counts, Reads, and CPU, then computes the percentages for each query.

 WITH [Total All Queries] AS  
 (  
 SELECT COUNT(*) AS [Query Count],  
 SUM(CPU) AS [Total CPU],  
 SUM(Reads) AS [Total Reads]  
 FROM My_Database  
 )

SELECT DatabaseName,  
 COUNT(*) AS [Query Count],  
 CAST(COUNT(*) * 100.0 / (SELECT TOP 1 [Query Count] FROM [Total All Queries]) AS DECIMAL(4,2)) AS [Query Count Percentage],  
 SUM(CPU) AS [CPU Query Total],  
 AVG(CPU) AS [CPU Per Query],  
 CAST(SUM(CPU) * 100.0 / (SELECT TOP 1 [Total CPU] FROM [Total All Queries]) AS DECIMAL(4,2)) AS [CPU Query Percentage],  
 SUM(Reads) AS [Reads Query Total],  
 AVG(Reads) AS [Reads Per Query],  
 CAST(SUM(Reads) * 100.0 / (SELECT TOP 1 [Total Reads] FROM [Total All Queries]) AS DECIMAL(4,2)) AS [Reads Query Percentage],  
 SUBSTRING(Textdata, 1, 150) AS Query  
 FROM My_Database  
 GROUP BY DatabaseName, SUBSTRING(Textdata, 1, 150)  
 ORDER BY [Reads Query Total] DESC  

After you have run this and identified a query for optimization, you'll need to query the baseline table again so you can get the entire query at issue. Identify something unique in the query and use that as the comparison. (In my experience, I have had to compare on individual terms.)

 SELECT CPU, Reads, DatabaseName, TextData AS Query, *
 FROM Baseline_Reads_5000
 WHERE Baseline_Reads_5000.Textdata LIKE '%part of query%'
 ORDER BY Baseline_Reads_5000.CPU DESC

  • Analyze Expensive Queries in the Database Engine Tuning Advisor

You can copy and paste queries from your Baseline database into a new query window, select the query, right-click, and choose “Analyze Query in Database Engine Tuning Advisor”. This will show any recommendations for new indexes or statistics that would help optimize the query. It also gives you a projected percentage of improvement that can be achieved.

If it recommends a fairly small new index and a few statistics that are projected to yield a high improvement, then you're probably in good shape. If it recommends very large indexes to achieve small results, or if it doesn't offer any recommendations at all, then you have more work to do.

Sometimes a particular column is indexed on one table, but your query is looking for it on a different table. Changing the query to use an index that already exists on the column in a different table can give you substantial improvements without needing to create any new indexes.

Sometimes the query is just not well written. In those cases, you may need to break up the query or think of a different method to obtain the same results.

  • Profile the Results of the Optimizations

Sometimes you can see a very significant difference by just running a query in SQL Server Management Studio before and after, especially for very expensive queries. With moderately expensive queries, though, it may be helpful to obtain more precise numbers.

It is possible to run SQL Profiler and restrict the results to your machine. Alternatively, you can isolate a test database and run the query there. If you're using Linq, then you will need to get the new generated SQL to compare or just run it in a development environment that you can isolate.
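
If you are using LINQ to SQL specifically, one way to capture the generated SQL is sketched below. It assumes the generated ModelDataContext and Answers table referenced later in this post; DataContext.Log and DataContext.GetCommand are standard System.Data.Linq members.

using System;
using System.Data.Linq;
using System.Linq;

class GeneratedSqlDemo
{
    static void Main()
    {
        // ModelDataContext, Answers, and UserID are placeholders for your own LINQ to SQL model.
        using (var db = new ModelDataContext())
        {
            // Option 1: log every command the DataContext sends to SQL Server.
            db.Log = Console.Out;

            var query = db.Answers.Where(a => a.UserID == 42);

            // Option 2: grab the command text for a specific query without executing it.
            Console.WriteLine(db.GetCommand(query).CommandText);
        }
    }
}

Either way you get the exact SQL text, which you can then paste into Management Studio or the Tuning Advisor like any hand-written query.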

For this run of SQL Profiler, you don't have to write your results to a database table, since the rows should be minimal enough to view onscreen. It can still be nice to have for later reference, though, so you can name the table something like "Optimize Registrations".

Still, you can see the difference your changes make right in the Profiler window. Sometimes the first run will show up, but the changes are so significant that the second run (with the new index or whatever) will be below the threshold you set. So, for these runs, it's good to set a lower threshold of reads and/or CPU.

Additionally, watching the impact of a specific event on the database can identify redundant queries–queries that are run multiple times to get the same data. Storing the query results the first time can provide very significant decreases in database usage. Also, every access of the database requires some overhead, so accessing the database a hundred times to get a small piece of information each time may be more expensive than accessing it once to get a hundred small pieces of information.

Final Notes

  • Read and CPU Thresholds

After you've optimized your most expensive queries, you may need to lower these to get more results. However, at some point the gains you achieve will not be worth the amount of effort it takes. At that point, set a reminder to come back in a few months and run another profile to see if any new queries have been introduced that are not properly optimized.

  • Complex Queries

Not all expensive queries can be resolved by adding indexes. Sometimes a query is just too complex and needs to be broken up into smaller queries. For instance, a long OR condition like (@value1 = -1 AND IndexedColumn2 = @value2) OR (@value1 > -1 AND IndexedColumn1 = @value1) can render the indexes on those columns useless.

  • Random Activity Profiling and Planned Activity Profiling

Random Activity Profiling: When you first start out, it's probably good to just profile the system as people are using it. This measures the random, everyday activity of your users. It may not be best to start profiling during your highest-volume time, since you don't want to overload the system, so pick an average-load period at first. Eventually, though, it is good to profile high-activity times as well.

Planned Activity Profiling: During a less active time, or in a development environment, it is a good idea to do some profiling of specific actions that users may not happen to perform during a given profiling window. They may, however, occur occasionally during your peak hours, and if they involve very expensive queries, they can really slow down everyone's experience. So, profiling a period of time in which rare but potentially expensive operations are explicitly exercised can be very valuable.

Good luck!

P.S. You may also find helpful my post describing Highly Inefficient Linq Queries that Break Database Indexing.

For more tips on managing a SQL Server Database, see Managing a SQL Server Database: Tips for Beginners


Highly Inefficient Linq Queries that Break Database Indexing

I have really appreciated using Linq for the past few years to query databases. The old access methods were far more tedious and time consuming. Linq is really great.

Yet, I recently spent some time analyzing our database usage with SQL Server Profiler, and some of the queries showed an alarming number of reads. My first assumption was that adding a simple index would solve the issue. However, the Database Engine Tuning Advisor offered no help.

I then began to really look at the query that had been generated, and that’s when I noticed the real problem.

Original Linq Statement (Pre-compiled query)

 private static Func<ModelDataContext, GetAnswerParams, IQueryable<Answer>>
 GetAnswersListQuery = System.Data.Linq.CompiledQuery.Compile(
 (ModelDataContext DataContext, GetAnswerParams ps) =>
 (from a in DataContext.Answers
 where a.UserID == ps.UserID
 && (ps.PublishedFormID == -1 || a.PublishedFormID == ps.PublishedFormID)
 select a)
 );

Take note of this particular part of the query:

    && (ps.PublishedFormID == -1 || a.PublishedFormID == ps.PublishedFormID)

Basically, an answer may or may not be associated with a particular Published Form. There is an index covering the User ID and Published Form ID columns of the Answers table. However, combining a comparison of the input value (ps.PublishedFormID) against the constant -1 with a comparison on the column in the database prevents SQL Server from using that index at all, leaving the query relegated to only one index.

In this case, if the input Published Form ID is -1, then -1 is also what was saved in the Published Form ID column of the Answer table, so this comparison needlessly breaks the indexing possibilities. That is not always the case, of course.

Changing the query slightly can solve the issue:

 private static Func<ModelDataContext, GetAnswerParams, IQueryable<Answer>>
 GetAnswersListQuery = System.Data.Linq.CompiledQuery.Compile(
 (ModelDataContext DataContext, GetAnswerParams ps) =>
 (from a in DataContext.Answers
 where a.UserID == ps.UserID
 && (a.PublishedFormID == -1 || a.PublishedFormID == ps.PublishedFormID)
 select a)
 );

This small change allows the Published Form ID index on the Answer table to be used, even though it may be -1. In other cases like this, it may be necessary to simply break the query into multiple queries. Take this as an example:

&& ((ps.PublishedFormID == -1 && a.ParentID == ps.ParentID) || (ps.PublishedFormID > 0 && a.PublishedFormID == ps.PublishedFormID))

This also breaks the indexing, of course. In this case, it may actually be better to have two distinct queries: one that is run when the passed-in Published Form ID is -1, and one that is run when it is not, as sketched below.
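
Here is a rough sketch of what that split might look like, following the naming from the examples above (Answer, GetAnswerParams, ParentID, and the GetAnswers wrapper are assumptions for illustration, not code from our application):

 private static Func<ModelDataContext, GetAnswerParams, IQueryable<Answer>>
 GetAnswersByParentQuery = System.Data.Linq.CompiledQuery.Compile(
 (ModelDataContext DataContext, GetAnswerParams ps) =>
 from a in DataContext.Answers
 where a.UserID == ps.UserID && a.ParentID == ps.ParentID
 select a);

 private static Func<ModelDataContext, GetAnswerParams, IQueryable<Answer>>
 GetAnswersByPublishedFormQuery = System.Data.Linq.CompiledQuery.Compile(
 (ModelDataContext DataContext, GetAnswerParams ps) =>
 from a in DataContext.Answers
 where a.UserID == ps.UserID && a.PublishedFormID == ps.PublishedFormID
 select a);

 public static IQueryable<Answer> GetAnswers(ModelDataContext db, GetAnswerParams ps)
 {
     // Each branch now contains only simple equality predicates, so SQL Server
     // is free to use the appropriate index for whichever case applies.
     return ps.PublishedFormID == -1
         ? GetAnswersByParentQuery(db, ps)
         : GetAnswersByPublishedFormQuery(db, ps);
 }

The cost is a little duplication, but each compiled query now produces SQL that the optimizer can match directly to an index.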

Realized Improvements

For one particular page load that called this query multiple times, I was catching any query that had more than 1,000 reads. My results showed that the new approach resulted in only 0.7% as many queries going over the threshold, using 0.8% of the CPU and performing 0.6% of the reads. This was a pretty intense page, but that made the optimization all the more important. Here are the raw stats:

Approach    Total Queries Run    Total CPU Used    Total Reads
Old         687                  33,994            3,478,705
New         5                    281               21,907

So, Linq is awesome, and writing pre-compiled queries is great, but this doesn’t automatically create the most optimized SQL possible. To achieve that, careful attention must be given to the way the Linq query is written, and then you just have to test the outcomes to know for sure.