Development Blog
SQL Optimisation
More often than not the initial database design works fine at the beginning, but as time goes on, changes are made and the data grows. Bottlenecks appear, you pick up a bunch of slow queries or you run out of storage. It's important to perform regular checks and fix the actual issues (as opposed to caching everything to mask the problem).
There are a few techniques and tips which can be useful!
Only select what you need
It's almost always faster to list only the columns you need rather than pulling all of them.
-- Don't select everything
SELECT * FROM users;
-- Only select what you need
SELECT forename, surname, address FROM users;
Avoid SELECT DISTINCT
SELECT DISTINCT is a way to remove duplicates from your query. To produce a distinct result set the database has to compare every selected field across all rows, which can be quite costly and requires a large amount of processing power. It can also collapse rows that weren't true duplicates in the first place.
Select more fields and try to avoid using DISTINCT. Also, establish why the duplicates exist in the first place!
-- Using DISTINCT can be costly
SELECT DISTINCT forename, surname, address FROM users;
-- Select more fields and use other means to remove duplicate entries
-- Also, implement some strategy to remove the root cause of the duplication
SELECT forename, surname, address, postcode, date_of_birth FROM users;
The power of INNER JOIN
Some SQL developers prefer to make joins with WHERE clauses, such as the following:
SELECT u.id, u.forename, u.surname
FROM users u, user_logs ul
WHERE ul.user_id = u.user_id;
This is known as a Cartesian Join or CROSS JOIN. Conceptually, every possible combination of rows is created: if you had 10 users with 10 logs, the query would first generate 100 results and then apply a filter where the ids correctly match. It's an inefficient use of database resources; far more work is done than is required. Use an INNER JOIN instead:
SELECT u.id, u.forename, u.surname
FROM users u
INNER JOIN user_logs ul
ON ul.user_id = u.user_id;
Use WHERE instead of HAVING
HAVING statements are calculated after WHERE statements. If the intent is to filter a query based on conditions, a WHERE statement is more efficient.
For example, let's assume the Sales table holds 1,000 records, 200 of which were made in the year 2016, and we want to query for the number of sales per customer in 2016.
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
GROUP BY Customers.CustomerID, Customers.Name
HAVING Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31'
This query would pull all 1,000 sales records from the Sales table, then filter for the 200 records generated in the year 2016, and finally count the records in the dataset.
In comparison, WHERE clauses limit the number of records pulled:
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY Customers.CustomerID, Customers.Name
This query would pull only the 200 records from the year 2016 and then count them; the wasteful first step from the HAVING version has been eliminated entirely.
HAVING should only be used when filtering on an aggregated field. In the query above, we could additionally filter for customers with greater than 5 sales using a HAVING statement.
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)
FROM Customers
INNER JOIN Sales
ON Customers.CustomerID = Sales.CustomerID
WHERE Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY Customers.CustomerID, Customers.Name
HAVING Count(Sales.SalesID) > 5
Use wildcards carefully
Wildcards should be treated with care. A wildcard (%) can be placed before and after a search term. Used this way, the database has to scan every record for a match anywhere in the selected field, which can be inefficient.
SELECT name FROM countries WHERE name LIKE '%ug%'
This would return Uganda, Uruguay and Portugal.
Where possible, consider using a wildcard only after the phrase. It can lead to more accurate results and puts less strain on the database.
SELECT name FROM countries WHERE name LIKE 'ug%'
This would return Uganda.
Use LIMIT to sample query results
Say you've got a large table full of data. Querying it could return thousands and thousands of rows (which can be slow), so consider using a LIMIT. With a LIMIT (and maybe an OFFSET) you can chunk the result set and avoid taxing the database with one huge query.
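A minimal sketch (the table, columns and ordering are just examples; OFFSET syntax can vary slightly between database engines):
-- Pull the first 50 users
SELECT id, forename, surname FROM users ORDER BY id LIMIT 50;
-- Pull the next 50 users
SELECT id, forename, surname FROM users ORDER BY id LIMIT 50 OFFSET 50;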
Indexing
An index is meant to speed up the performance of a database by letting it find rows without scanning the whole table. As the database becomes larger, you're more likely to see the benefits of using an index.
Indexes aren't perfect though. When data is written to the database, the original table is updated first and then all of the table's indexes (if you've got loads of them, inserting into or updating that table is going to be slow).
A balance between inserting/updating and selecting needs to be struck. If the bulk of your updating is carried out off-peak (i.e. overnight) then indexing is the way to go, but if you've got a table which is constantly being updated, then you need to be careful.
- If used correctly, indexing can vastly reduce the time of queries
- Every table with a primary key has one clustered index
- Every table can have many non-clustered indexes to aid in querying
- Not every database will benefit from indexing
- Not every index will increase the query speed for the database
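If indexing does make sense, a minimal sketch looks like this (the index names and columns are just examples; index the columns your slow queries actually filter or join on):
-- Speed up lookups of logs by user
CREATE INDEX idx_user_logs_user_id ON user_logs (user_id);
-- A composite index for queries that filter on surname then forename
CREATE INDEX idx_users_surname_forename ON users (surname, forename);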
Primary and foreign keys
A primary key is a column (or set of columns) used as a unique identifier for each row in a table. Its values must be unique and cannot be empty or NULL. A foreign key is different: it links two tables together.
Why would you use a foreign key?
You can easily set up an ID which matches across two tables without using a foreign key, but this may lead to data inconsistencies. If a column is assigned a foreign key, each row of that column must contain a value that exists in the 'foreign' column it references.
The real purpose of foreign keys is that they add a restriction: entries in the table with a foreign key must have a value that corresponds with the 'foreign' table column. It is considered best practice to use them as much as possible.
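As a minimal sketch, assuming the users/user_logs tables from earlier (the exact column types are just examples):
CREATE TABLE users (
    id INT PRIMARY KEY,
    forename VARCHAR(100),
    surname VARCHAR(100)
);
CREATE TABLE user_logs (
    id INT PRIMARY KEY,
    user_id INT NOT NULL,
    logs TEXT,
    -- Rows can only reference users that actually exist
    FOREIGN KEY (user_id) REFERENCES users (id)
);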
The N+1 problem
Developers often introduce N+1 problems without realising it. Say you've got this query:
-- Get users
SELECT id, forename, surname, address, postcode, date_of_birth FROM users;
Then say we need to pull each user's log activity, so we loop through that result set (via code) and execute this:
-- Get user logs using the id from the row found by the above query
SELECT logs FROM user_logs WHERE user_id = [USER_ID];
For every user, we're going to execute an additional query. This is the N+1 problem. If you pull thousands of users, you're going to be executing thousands of additional 'user_logs' database queries, which is going to impact speed.
The Solution
Try to pull what you need in the fewest queries possible. You could JOIN on the 'user_logs' table and process the data, or alternatively gather a list of user ids from the first query and execute something like this:
-- Get user logs using the ids from the rows found by the first query
SELECT logs FROM user_logs WHERE user_id IN ([COMMA_SEPARATED_LIST_OF_USER_IDS]);
You'd then use code to access and assign the data accordingly.
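The JOIN route is a single query along these lines (a sketch reusing the example tables; the exact column names are assumptions):
-- One query: users and their logs together
SELECT u.id, u.forename, u.surname, ul.logs
FROM users u
INNER JOIN user_logs ul ON ul.user_id = u.id;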
Improving the database structure
Sometimes there's no other option but to change the structure of the database to ensure performance expectations are met. Not every change needs to be major; sometimes a minor change will do the trick! And remember: ALWAYS consider the database design before any coding takes place, and if you're given a chance to improve the database structure (such as migrating to a new platform), take it.
Here are some suggestions:
Don't duplicate
If teams don't communicate or document what they're working on, you may end up creating several solutions to the same problem, which results in duplicated database fields, e.g.:
- order_reference
- order_ref
- ref_order
It may get to the point where this cannot be changed, as removing the redundant fields could result in platform bugs.
Pre-calculating fields
If there are operations which constantly need to compute a value, it's better to compute it when the data is being written into the database and store the result in a new column. This technique can save time when retrieving the data, especially on large datasets.
Side note: try to do this in real time (as the data is being updated or inserted) rather than in a scheduled background task, as the computed value may get out of date.
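As a rough sketch, using a hypothetical orders table and MySQL-style generated column syntax (other engines differ slightly, or you can simply populate an ordinary column at write time):
-- Store the computed total when the row is written, rather than recalculating it on every read
CREATE TABLE orders (
    id INT PRIMARY KEY,
    unit_price DECIMAL(10, 2) NOT NULL,
    quantity INT NOT NULL,
    total_price DECIMAL(10, 2) AS (unit_price * quantity) STORED
);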
Database denormalisation
Normalisation of a database ensures that data redundancy is reduced and data integrity increases. Normalisation, however, has a major drawback: many levels of tables are created, and pulling the data you need can result in poorly performing queries.
Normalisation comes in various forms (denormalisation is the process of rolling back one or more stages of normalisation), and you need to decide which level works for your given situation:
1st Normal Form (1NF)
The goal here is to ensure that a single field doesn't contain multiple values. If a table contains a multi-valued attribute then it violates 1NF. In the example below, there are two telephone numbers in one field:
ID | Forename | Surname | Phone Number |
1 | Stuart | Todd | 07540154040, 07540154041 |
A duplicate row (apart from the telephone number) would be created to satisfy 1NF:
ID | Forename | Surname | Phone Number |
1 | Stuart | Todd | 07540154040 |
2 | Stuart | Todd | 07540154041 |
2nd Normal Form (2NF)
The table first needs to be in 1NF. The best way to describe 2NF is with an example: say we have a set of goals, each attached to an employee, and we also note down the employee's age, so we end up with something like this:
ID | Goal | Employee ID | Age |
1 | Go for a run! | 1 | 38 |
2 | Read a book! | 2 | 45 |
Each field holds a single piece of data, so it definitely fulfils 1NF. However, we have a non-key attribute (age) which this 'goals' table doesn't need to know about; it depends on the Employee ID rather than on the goal itself.
The solution is to split the table and attach the age to the employee record:
Table 1
ID | Goal | Employee ID |
1 | Go for a run! | 1 |
2 | Read a book! | 2 |
Table 2
Employee ID | Age |
1 | 38 |
2 | 45 |
3rd Normal Form (3NF)
The table first needs to be in 2NF. The best way to describe 3NF is with an example: say we have an employee who has a department ID, and that department has a name, so we end up with something like this:
ID | Department ID | Department Name | Forename | Surname | Phone Number |
1 | 1 | Development | Stuart | Todd | 07540154040 |
2 | 1 | Marketing | Bob | Smith | 07540154010 |
A department has an ID and a name, and the two go hand in hand. Here, nothing stops you adding another row with the same department ID but a different name, which can lead to inconsistencies (as above).
The solution is to split the table and create a relationship between them:
Table 1
ID | Department ID | Forename | Surname | Phone Number |
1 | 1 | Stuart | Todd | 07540154040 |
2 | 2 | Bob | Smith | 07540154010 |
Table 2
ID | Name |
1 | Development |
2 | Marketing |
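To pull an employee together with their department name you'd then join the two tables (a sketch, assuming they're named employees and departments):
SELECT e.forename, e.surname, d.name AS department
FROM employees e
INNER JOIN departments d ON d.id = e.department_id;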
Split tables
Data such as images, videos or large pieces of text can impact query speed. The read time depends on the data size, so if the stored data is huge, it's going to impact the performance of the query.
In these instances, the larger fields can be split out into a separate table, so the primary table being queried escapes that overhead.
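A minimal sketch with a hypothetical user_photos table (BLOB is MySQL-style; other engines have their own large-object types):
-- Large binary data kept out of the frequently queried users table
CREATE TABLE user_photos (
    user_id INT PRIMARY KEY,
    photo BLOB,
    FOREIGN KEY (user_id) REFERENCES users (id)
);
-- Only fetch the photo when it's actually needed
SELECT photo FROM user_photos WHERE user_id = 1;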
Understand the requirements
Perhaps one of the more obvious ones. Listen to the business requirements and ask the right questions: what is the goal? Why does it need to happen that way? Can more than one be attached to it? Keep asking questions until you fully understand what needs to happen.
This comes with experience, but be curious and firmly establish the requirements; otherwise, in the near future you'll be hacking changes into your database structure to accommodate them, and before you know it, it turns into an unmanageable mess! The root of bugs and poor performance usually stems from a poor database structure.
There are hints and tips on this page, but ultimately optimising your database (and the code around it) is something you're going to have to experiment with.
Log slow queries, set up a decent profiler, and put a strategy in place which allows time to investigate and fix root issues (don't be the guy who solves everything by caching it). And do the basics: avoid the N+1 problem, use indexes, foreign keys and joins, only pull what you need, and consider which normalisation form works best.
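If you're on MySQL, for example, slow query logging can be switched on with something like this (the one-second threshold is just a starting point):
-- Log any query that takes longer than one second
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
Good luck!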