Foliotek Developer Blog

Highly Inefficient Linq Queries that Break Database Indexing

I have really appreciated using Linq for the past few years to query databases. The old access methods were far more tedious and time consuming. Linq is really great.

Yet, I recently spent some time analyzing our database usage with SQL Server Profiler. Some of the queries showed an alarming number of reads. My first assumption was that adding a simple index could solve the issue. However, the Database Engine Tuning Adviser offered no help.

I then began to really look at the query that had been generated, and that’s when I noticed the real problem.

Original Linq Statement (Pre-compiled query)

 private static Func<modeldatacontext getanswerparams="" iorderedqueryable="">>  
 GetAnswersListQuery = System.Data.Linq.CompiledQuery.Compile(  
 (ModelDataContext DataContext, GetAnswerParams ps) =></modeldatacontext>

 (from a in DataContext.Answers  
 where a.UserID == ps.UserID  
 && (ps.PublishedFormID == -1 || a.PublishedFormID == ps.PublishedFormID)  
 select a)  
 );  
 }  

Take note of this particular part of the query:

    && (ps.PublishedFormID -1 || a.PublishedFormID ps.PublishedFormID)

Basically, an answer may or may not be associated with a particular Published Form. There is an index on the User ID and Published Form ID columns of the Answers table. However, combining a comparison of the input value (ps.PublishedFormID) to the vaule of -1 along with a comparison on the column in the database prevents SQL Server from using that index at all. Thus, it is relegated to only one index.

In this case, if the input Published Form ID is -1, then that will also be what was saved for the Published Form ID of the Answer table as well, so this comparison needlessly breaks the index possibilities. This is not always the case, of course.

Changing the query slightly can solve the issue:

 private static Func<modeldatacontext getanswerparams="" iorderedqueryable="">>  
 GetAnswersListQuery = System.Data.Linq.CompiledQuery.Compile(  
 (ModelDataContext DataContext, GetAnswerParams ps) =></modeldatacontext>

 (from a in DataContext.Answers  
 where a.UserID == ps.UserID  
 && (a.PublishedFormID == -1 || a.PublishedFormID == ps.PublishedFormID)  
 select a)  
 );  
 }  

This small change allows the Published Form ID index on the Answer table to be used, even though it may be -1. In other cases like this, it may be necessary to simply break the query into multiple queries. Take this as an example:

&& ((ps.PublishedFormID == -1 && a.ParentID = ps.ParentID) || (ps.PublishedFormID > 0 && a.PublishedFormID == ps.PublishedFormID)

This also breaks the indexing, of course. In this case, it may be actually better to have two distinct queries, one of which is run if the passed in Published Form ID is -1 and one of which is run if the Published Form ID is not -1.

Realized Improvements

For one particular page load that called this query multiple times, I was catching any query that had more than 1,000 reads. My results showed that the new way resulted in* 0.7% as many queries* going over the threshold, utilizing 0.8% of the CPU and executing 0.6% of the reads. This was a pretty intense page, but that made the optimization all the more important. Here are the raw stats:

Approach      Total Queries Run      Total CPU Used      Total Reads      
Old68733,9943,478,705
New528121,907
So, Linq is awesome, and writing pre-compiled queries is great, but this doesn’t automatically create the most optimized SQL possible. To achieve that, careful attention must be given to the way the Linq query is written, and then you just have to test the outcomes to know for sure.


Using the Web.Config connection string with LINQ to SQL

When updating a project to use LINQ to SQL, I found an issue with deploying to multiple environments. Each environment (development, staging, live) had it's own database associated with this. Since I had the .dbml in another assembly, it was only reading from the app.config in the assembly it resided in. I was storing the database connection string in the web.config of the project so I thought it would be nice to just use that instead of the app.config.

The first thing I needed to do was to keep the .dbml file from reading from the app.config. After opening up the .dbml file, I opened the properties window for the file. In the properties window, there is a setting for "Connection". In the "Connection" dropdown I selected the "(None)" selection. That keeps the .dbml file from accessing the app.config for the database connection string.

Now I needed to get my MainDataContext to use the Web.Config connection string. For this I created a partial class for my MainDataContext and created a constructor that passed the connection string from the Web.Config.

public partial class MainDataContext  
{
    public MainDataContext()
    : base(System.Configuration.ConfigurationManager.ConnectionStrings["Database.connection.string.from.web.config"].ToString(), mappingSource)
    {
        OnCreated();
    }
}

Now when I deploy to different environments the .dbml file is accessing the correct database instead of the same one from the app.config.


Unexpected benefits of Precompilation of LINQ

I once had a manager who told me – I can solve any maintenance problem by adding a layer of abstraction.? I can solve any performance problem by removing a layer of abstraction.

I think LINQ to SQL is a wonderful way to abstract the persistence layer elegant, easy to use, easy to manipulate, and easy to maintain lines of code.? Instead of writing SQL which amounts to “how to retrieve” the data – you manipulate an expression tree that gets closer to specifying “what data I want”.? The upside of this is huge – you can change the expression tree at any level of your code, and let .NET decide how to best write the SQL at the last possible moment – which effectively gits rid of dealing with intermediate results and inefficiently written SQL.? Unfortunately, this abstraction does indeed cause a performance hit – the translation/compilation of the tree to SQL – and it’s probably much bigger than you would think.? See http://peterkellner.net/2009/05/06/linq-to-sql-slow-performance-compilequery-critical/ to see what I mean.? In my analysis (using ANTS Profiler), when using uncompiled LINQ – the performance hit is usually about 80% compilation and only 20% retrieving the data!? Thankfully, .NET does allow you to precompile a LINQ query and save the compilation to use over and over again.

Your natural tendency when hearing those kind of numbers might be to precompile every single LINQ query you write.? There’s a big downside to doing that, though – you lose the ability to manipulate the compiled query in other parts of your code.? Another downside is that the precompilation code itself is fairly ugly and hard to read/maintain.

I’m a big believer in avoiding “premature optimization”.? What happens if you precompile everything, and in a version or two Microsoft resolves the issue and caches compilations for you behind the scenes?? You have written a ton of ugly code that breaks a major benefit of LINQ to SQL and is totally unnecessary.

Instead, I recommend you go after the low hanging fruit first – precompile the most frequently accessed queries in your application and the ones that gain no benefit from manipulating the expression tree.? In the applications I work on – there is a perfect case that fits both of these – the “get” method that returns the LINQ object representation of a single row in the database.? These are hit quite often – and there is absolutely no case where the expression tree is further refined.

The old way it was written:

[sourcecode lang="csharp"]
public static Item Get(int itemid) {
return (from i in DataContext.Items
where i.ItemID itemid
select i).First();
}

[/sourcecode]

The new way with Precompiled LINQ:

[sourcecode lang="csharp"]
private static Func
GetQuery = CompiledQuery.Compile(
(ModelDataContext DataContext,int itemid) =>

(from i in DataContext.Items
where i.ItemID itemid
select i).First()

);

public static Item Get(int itemid) {
return GetQuery.Invoke(DataContext,itemid);
}

[/sourcecode]

Applying this fairly simple change to the application, I’d estimate we got 80%+ of the benefits of compiled LINQ, at the expense of a few extra lines of code per object/table and absolutely no loss of the tree manipulation.,>


Linq Expressions on an Interface

One thing I found myself doing quite a bit an often causing logic errors was comparing
dates in Linq queries. For example, if I want to get games that occur between a date range, I might be doing something like this:

public static IQueryable<Competition> GetBetweenDates(DateTime startDate, DateTime endDate)  
{
    return (from comp in DataContext.Competitions
        where comp.Date >= startDate && comp.Date <= endDate
        select comp);
}

I then want to get a list of tournaments that occur between a date range, and I might
essentially rewrite this code using a couple different variables. Rather, I can implement a common interface on these classes and have an expression applied to all classes that implement this interface.

public interface IDate  
{
    DateTime Date { get; set; }
}

public partial class Competition : DataBase, IDate { ... }  
public static class Expressions  
{
    public static Expression<Func<T, bool>> GetBetweenDates<T>(DateTime start, DateTime end) where T : IDate
    {
        return (t => t.Date >= start && t.Date <= end);
    }
}

Then, rather than rewriting this for each class with a date, I can just apply this expression to my queries.

public static IQueryable<Competition> GetBetweenDates(DateTime startDate, DateTime endDate)  
{
    return (from comp in DataContext.Competitions
        select comp).Where(Expressions.GetBetweenDates<Competition>(startDate, endDate));
}

Avoid static variables in ASP.NET

Occasionally, I like to write static methods on classes. They are useful whenever the method loosely relates to to the class – but it doesn’t involve a specific instance of the class.

A good example of a static method in the .NET library is the int.Parse(string val) method. This method basically reads a string and tries to return an integer representation. The method logically fits with the int Class (I suppose a string could have a ToInt() instance method…but that would be overwhelming as you added ToFloat(), ToDecimal(), To…()) – but when you run it, there’s no reason to have an instance of int already. You run it to create a new instance of int.

Commonly, I follow similar logic in the business layer of my webapps. I write a static Add(param1,….,paramx) method that returns an instance of the class after it is persisted to the database. I could accomplish this other ways, but I think the resulting application code reads better like:
[sourcecode lang="csharp"]
User.Add(username,password);
[/sourcecode]
Than:
[sourcecode lang="csharp"]
new User(username,password).Add();
[/sourcecode]
or, worse:
[sourcecode lang="csharp"]
DataContext.Users.InsertOnSubmit(new User(username,password));
DataContext.SubmitChanges();
[/sourcecode]

The issue with static methods in your business logic is that you often need a common object to talk to the database through: a SqlConnection,a TableAdapter, A LINQ DataContext, etc. I could certainly create those objects locally inside each static method, but that’s time consuming and hard to maintain. I want instead to define a common property (in the business class or a parent class of it) that lazy-initializes and returns an instance of the object when I need it. The problem is that the property must also be static for a static method to have access to it.

The easiest way to accomplish this is something like:
[sourcecode lang="csharp"]
private static ModelDataContext dataContext=null;
protected static ModelDataContext DataContext
{
get
{
if(dataContextnull)
dataContext = new ModelDataContext();
return dataContext;
}
}
[/sourcecode]

The tricky thing is that this will probably work in development and testing, until you get some load on your website. Then, you’ll start seeing all kinds of weird issues that may not even point to this code as the problem.

Why is this an issue? It’s all about how static variables are scoped inside a ASP.NET application. Most web programmers think of each page in their application as its own program. You have to manually share data between pages when you need to. So, they incorrectly assume that static variables are somehow tied to a web request or page in their application. This is totally wrong.

Really, your whole website is a single application, which spawns threads to deal with requests, and requests are dealt with by the code on the appropriate page. Think of a page in your webapp as a method inside of one big application, and not an application of its own – a method which is called by the url your visitor requests.

Why does this matter? Because static variables are not tied to any specific instance of any specific class, they must be created in the entire application’s scope. Effectively, ASP.NET static variables are the same as the global variables that all your programming teachers warned you about.

That means that, for the property above, every single request/page/user of your website will reuse the first created instance of DataContext created. That’s bad for several reasons. LINQ DataContexts cache some of the data and changes you make – you can quickly eat up memory if each instance isn’t disposed fairly quickly. TableAdapters hold open SQLConnections for reuse – so if you use enough classes of TableAdapters, you can have enough different static vars to tie up all of your db connections. Because requests can happen simultaneously, you can also end up with lots of locking/competing accesses to the variable. Etc.

What should you do about it? In my case, I take advantage of static properties that reference collections that are scoped to some appropriately short-lived object for my storage. For instance, System.Web.HttpContext.Current.Items:
[sourcecode lang="csharp"]
protected static ModelDataContext DataContext
{
get
{
if(System.Web.HttpContext.Current.Items["ModelDataContext"]null)
System.Web.HttpContext.Current.Items["ModelDataContext"] = new ModelDataContext();
return (ModelDataContext)System.Web.HttpContext.Current.Items["ModelDataContext"];
}
}
[/sourcecode]

In this case, each instance of DataContext will automatically be disposed for each hit on the site – and DataContexts will never be shared between two users accessing the site simultaneously. You could also use collections like ViewState, Session, Cache, etc (and I’ve tried several of these). For my purposes, the HttpContext.Items collection scopes my objects for exactly where I want them to be accessible and exactly how long I want them to be alive.