Real-World Refactoring: Complex LINQ-ing

March 18th, 2022

LINQ is so powerful, you can create a one-liner LINQ statement to create a report. In today's post, we put LINQ to the test by refactoring a user's piece of code

In a recent post, I mentioned a number of ways to speed up an Entity Framework LINQ query using different techniques.

These techniques provide us with a way to refactor an existing LINQ query to make it faster.

But what if you don't even have a LINQ query yet? Or you want to take an existing routine and replace it with a LINQ statement?

"Real-World" Example

I did find an example on Reddit a while ago and wanted to refactor this using LINQ.

While this isn't considered a "real-world" example (based on the Exercise21 project name), I have seen code like this before in the wild.

I thought this would be a good code-kata so we could demonstrate a refactoring.

After looking it over, I observed a couple of things:

Keeping these things in mind, I wanted to create a table structure similar to what was given in the example.

This reminded me of the Northwind database from my Collection: SQL Server Sample Databases post

It contains a similar Customer/Orders structure so I downloaded and ran the DB script to create the database.

Again, we'll use LINQPad for our walkthrough like before.

One Caveat

Before we begin, I wanted to keep myself in check with this exercise.

I've met a number of developers in the past who wrote whole programs (> 30 lines) inside of one LINQ statement. While you can create very complex LINQ statements, there should be a question as to whether it's necessary or not.

As we walk through the Reddit example, the result is for demonstration purposes as to how flexible LINQ is when it comes to using arrays and database queries.

So let's dig into this.

Finding The Right Entry Point

First off, we need to find the right table.

In their example, they started with the Customer table. I only saw one more table used (Orders) so they have it easy.

Would it make more sense if they started with the Orders table and then join to a Customer table? When it's just two tables, personally, it doesn't matter. A join is a join regardless.

In our schema, things are a little more complicated.

If you remember, one of the techniques for optimizing Entity Framework was to start from a different position in the entity hierarchy.

If we are starting with the Customer table, we have multiple "hops" (3 joins) to other tables.

Since we're calculating totals, it makes sense to start with the OrderDetails table instead. So if we move around the tables, you'll see it a bit differently.

This cuts down on an "inner join" making it 2 joins instead of 3 giving us quicker performance.

For our initial query, let's start with our OrderDetails table.

You'll notice I have a simple filter so we can test and see the results of our query immediately.

If you look a little closer at the filter, we're accessing the CustomerID in the Order table. 

As a general guideline, you want to make sure you aren't seeing too many spots (periods in your filter). Too many periods translates to more inner joins.

Grouping By Year and Month

Looking at the example, they are grouping by Customer, Month, and Year.

This is achieved by using the LINQ GroupBy() method. Since this is a custom grouping with a number of properties, we create an anonymous object.

Simple enough!

But Where's The Totals?

With our grouped results, let's display what we have so far.

It seems we forgot our totals.

If you remember, we started in the OrderDetails entity and we need to get the Order and OrderDetails. By Customer.

Yikes!

Let's keep moving forward and .Sum() the Total field.

Oh wait! Since we're working off of a grouped result, we don't even have a total field.

How do we get the totals?

We can use the Customer entity to get the Orders entities to pull the order totals and discounts.

Ok, so how do we get the Customer entity? We can add it to the Grouping object and then Sum the totals.

After adding our Total field, let's run the query and see our totals.

Ta-dah!

Oh wait...That's not good. Quite a coincidence that all totals are the same.

Hmm.

Ah! Need a filter for all of the totals plus we need to calculate the discount which, by the way, is on the Order entity.

Finally, we sort the results using a multiple sort order.

While we're at it, let's add an OrderBy as well.

At first glance of the data, I thought I wrote the LINQ query wrong.

If you're wondering why the month goes 8, then 10 then 1, 3, and 4, it's because I sorted by Company, then Year, then Month.

Should We Stop Here?

We have everything we need for a report, right?

We could remove the company name in the filter and see if we get all of our companies.

After running the query with all companies, it finishes at 1.5s. That's fast enough for our purposes.

At this point, I feel we need to snazzy it up with some formatting and additional grouping based on the person's expected output.

What we'll do is create formatted properties, group them by customer, and then display them in a grouped format.

Here's the results.

This is the output grouping requested in the Reddit post.

Was this really worth it?

Here comes the consulting answer: It Depends™

While this was a fun exercise as to how flexible (and fluent!) LINQ truly is, I always question the before and after of code refactorings.

Everyone has their reasons for choosing whether to refactor code or not. Sometimes it makes sense to revert back to the original if you're not satisfied with the "after" results.

The Final LINQPad Code

Here is our final refactored LINQPad code.  

var queryResult = OrderDetails
    .AsEnumerable()
    // Filter out a single company for testing purposes.	
    .Where(ord => ord.Order.CustomerID == "ALFKI")
    // Group requires custom grouping
    .GroupBy(e => new
    {
       Customer = e.Order.Customer,
       e.Order.Customer.CompanyName,
       OrderYear = e.Order.OrderDate.Value.Year,
       OrderMonth = e.Order.OrderDate.Value.Month
    })
    .Select(group =>
    {
       return new
       {
          Company = group.Key.CompanyName,
          Year = group.Key.OrderYear,
          Month = group.Key.OrderMonth,
          Total = group.Key.Customer.Orders
                 .Where(order =>
                     order.OrderDate.Value.Year.Equals(group.Key.OrderYear)
                     && order.OrderDate.Value.Month.Equals(group.Key.OrderMonth))
                 .Sum(order =>
                     order.OrderDetails.Sum(detail =>
                     {
                        var price = (detail.UnitPrice * detail.Quantity);
                        if (detail.Discount > 0)
                        {
                           var discount = price * (decimal)detail.Discount;
                           return price - discount;
                        }
                        return price;
                     })
                 )
       };
    })
    .OrderBy(y => y.Company)
      .ThenBy(o => o.Year)
          .ThenBy(u => u.Month)
    // Strictly for output purposes from this point on
    .Select(output => new
    {
       Customer = output.Company,
       Year = output.Year,
       MonthAndYear = output.Month.ToString() + "/" + output.Year.ToString(),
       Total = output.Total.ToString("C")
    })
    .GroupBy(grp => grp.Customer)
    .Select(output => new
    {
       output.Key,
       Details = output.GroupBy(group => group.Year)
         .Select(item => new 
         { 
            Year = item.Key, 
            item 
         })
    });

queryResult.Dump();

Conclusion

In this post, we took a simple piece of code which reported on totals for the month and year and we refactored it into a single LINQ statement.

Every approach in LINQ is chainable (or fluent) and passes it into the next method until we project the results into a variable or loop.

If you want to examine more LINQ syntax, I would recommend the following posts:

Did you find this post helpful? Was it unnecessary to refactor? How would you refactor the code? Would you? Post your comments below and let's discuss.