This morning I was working on a query for a customer report that took me longer than I expected and produced more problems than I expected.
I actually had 3 tables named customers, orders, and payments.
Customers can have multiple orders, and have a one to many relationship.
Similarly, orders can have multiple payments, and have a one to many relationship.
My seemingly simple task was to get a report row with:
- Customer Info
- Number of total orders
- Number of total payments
- Sum of total payments
- Average of total payments
- Latest payment date
- Latest payment amount
- Latest payment ID
- Latest order number
- Latest order date
This led me to build an inefficient monster query. Here’s why:
- The latest payment and latest order items are in different tables. And they are date dependent. I can’t assume that latest payment has the highest payment ID or even is related to the latest order. MOST of the time that’s the case, but sometimes orders get created by employees from a list and are not always in chronological order.
- The latest payment could only be determined by the date_payment_made field which does not contain a timestamp. If a customer made multiple payments on the same day, which was also rare, but happened, especially if they were making payments on multiple orders.
- I had to create multiple subqueries to address the most recent payment for a customer.
- Query 1 – at the lowest level I needed to have the customer ID and the latest payment date for that customer. In and of itself, that means I have to join 3 tables.
- SELECT cust_num, MAX(date_payment_made) FROM customers INNER JOIN orders on orders.cust_num = orders.cust_num INNER JOIN payments ON payments.order_id = orders.order_id
- Technically, I probably could have omitted the customers table and just pulled the orders cust_num, but I wanted to make sure I was actually using existing customers. Some have been deleted.
- Query 2 – So then you have to subselect Query 1 in another query in order to get the max payment ID for that date and customer number.
- So now in Query 2 I have a Customer Number, a most recent payment date, and the max payment ID for that date and customer.
- A similar process was done to get the most recent order dates by customer which also created some self-joining queries.
- Finally, another subquery was generated to just get order counts by customer number. Although it seems like I could have gotten that data point easily when getting the order dates, I really couldn’t because of the self joining I was doing in case there were multiple orders for the customer on the same date. The count ended up having to be in a separate subquery because trying to count them in the main query putting all this together was just counting all the payments for all the orders instead of the order numbers.
- Anyway, after all the self joining and multiple queries embedded in each other, it takes about 3 seconds to run for over 100,000 customers. Not too bad, but it is a rather mind-bending process to create these queries. I kept thinking I could make it faster, too and kept ending up re-writing it and ending up with the same query.
Geesh. You know, sometimes it’s better to go the simpler route instead of making all these queries and just loop through things with code to get the information you need. If speed isn’t an issue, that might be a better route to take!