SQL Antipatterns

Written — Updated
  • Author: Bill Karwin
  • Logical Database Design Antipatterns
    • Formatting Lists as Comma-Separated Strings
      • I think this happens in databases that don't support array types. Apparently it comes up most often when changing an existing one-to-one relation into a many-to-one relationship as a way to avoid altering the database schema. I guess this is for DBMS that don't support array types?
      • The obvious problems with this are that querying and updating the data becomes a hassle, and foreign key constraints and indexes don't work correctly.
      • An array field works better, but it shares a lot of the downsides with the comma-separated string method: complicated indexes, harder to query and join against.
      • The best way is to use an intersection table
        • CREATE TABLE PaperAuthors (
            paper_id BIGITN NOT NULL,
            author_id BIGINT NOT NULL,
            PRIMARY KEY(paper_id, author_id),
            FOREIGN KEY(paper_id) REFERENCES Papers(paper_id),
            FOREIGN KEY(author_id) REFERENCES Authors(author_id)
          );
          
        • This also makes it easy to add metadata on a relationship, such as the date it was created or who added the entry.
    • Naive Trees
      • Representing a tree structure in a SQL database can be tricky. The obvious way is for each row to have a reference to its parent row, called the Adjacency List method, though this doesn't scale well as the tree gets deeper.
      • Deleting a node from the tree is also tricky as you have to fetch the entire subtree and then delete or reparent the elements in the right order to satisfy the foreign key constraints between each node and its children. ON DELETE CASCADE can help here, so long as you actually want to delete the elements in the subtree and not relocate them.
      • Signs you have done this wrong
        • "How many tree levels do we need to support?"
        • You hate touching the code that manages the tree. This usually means you haven't used the worst solution, but you're not using the best one either.
        • You have to periodically clean up orphaned nodes. Proper tree management should never leave the database in an invalid state like this.
      • That said, for simple applications that don't need extensive traversal or flexible updates, the Adjacency List can work just fine.
      • Recursive CTEs make querying such structures a lot easier as well.
      • Better Tree Structures
        • In general, the Adjacency List with recursive queries or the Closure Table seem best, depending on the needs.
        • Path Enumeration
          • Each node stores its entire path. This makes it easy to find all ancestors or all descendents of a node. You do have to know the entire path of the node first, but generally this is readily available.
          • Inserting a new node is as simple as adding it with the path of the parent plus the id of the new node.
          • This is best for trees that don't get too deep though, since maintaining the entire path for every node becomes costly and requires supporting application logic to keep the path in sync. It also shares problems with the Comma-Separated Lists antipattern.
        • Nested Sets
          • Each node has a number (distinct from the normal node id), and nsleft and nsright fields, which are the bounds of the range of the descendants' node numbers.
            • The numbers can be assigned by traversing the tree depth-first and assigning nsleft as you go down, incrementing as you go, and nsright as you go back up.
          • Deleting nodes becomes really easy. To remove a node, you just remove the node and the parent of it automatically takes ownership.
          • Querying ancestors or children then just involves looking for nodes where one node's nsleft is between the nsleft and nsright of the children or parents.
          • Disadvantages
            • Fetching immediate parents and children is complex. You need to use some complex outer joins to make it work, which are basically looking for nodes that fit conditions only the direct parent or children would satisfy.
            • Inserting or moving nodes requires updating node numbers, although it doesn't require recalculating everything.
          • This works best when you want to query against subtrees more than you want to work with individual nodes or update the tree structure.
        • Closure Table
          • This method involves storing the tree structure not in the same table as the nodes, but in a separate table that does nothing but store paths through the tree.
          • With two fields, ancestor and descendant, the table has one row for each relationship between two nodes, even when there are other nodes in between. Each node also references itself in this model.
          • This makes querying for ancestors or txdescendants of a node trivial.
          • Adding and removing nodes takes logn\log n rows since it involves adding or removing all ancestor and descendant rows in the table, but the queries to do so are straightforward.
          • Separating the tree structure from the nodes themselves also makes it easy to remove a node from the tree without removing the node itself, or to create multiple overlapping trees (in multiple tables) around the same set of nodes.
          • To better support querying for immediate children and parents, you can add a column to the table that contains the length of the path between the two nodes. This makes updates more complex though.
          • The main downside is increased use of space since it requires nlognn \log{n} rows. But overall this seems like the best method when heavy use of trees is required.
    • Bad Primary Keys
      • The primary key should make sense for the table.
      • Types of primary keys
        • Pseudokey is a serial ID column or similar that has no real meaning to the data except to be a unique id for the row.
        • A natural key is a column that appears naturally in the data and also happens to uniquely identify a row.
          • Be sure that it is actually unique since this can cause trouble down the line if it turns out that duplicates are ok in some cases.
        • A compound key is two or more natural keys that combine to be unique.
      • In an intersection table, the two columns for the intersection make a great compound primary key and reduce chance of bugs. No need for a separate ID.
      • The book says:
        • The primary key should have a more descriptive name, not just id.
          • e.g. account_id in an Account table.
          • Less need to rename fields in query results when joining tables.
          • This also allows more queries to use USING in joins.
        • Using an ORM can make it hard to use primary keys optimally, so that’s ok.
          • Some ORMs let you change the primary key from the default.
      • I generally disagree with the advice on primary key naming nowadays.
        • It makes the SQL easier to write but the SQL is a minor part of most applications now and tends to change less than the rest of the application.
        • Having a constant name for the primary key such as id makes it easier to write other code, particularly in the front end, that abstracts across multiple object types.
    • Missing Foreign Keys
      • Using foreign keys can appear to make some things slightly more difficult at first, but beyond the most simple cases it actually makes some things a lot easier since the database can automatically handle some types of updates, and it also reduces bug potential a lot.
      • These can be created as part of the column or listed separately:
        • column text REFERENCES other_table(other_column)
        • FOREIGN KEY (c1, c2) REFERENCES other_table(a, b)
      • Foreign keys can also define special behavior when the value in the other table updates.
      • A big thing about foreign keys is that they not only prevent deletions of data, but can help with updates too.
        • These can be set with ON UPDATE <action> and ON DELETE <action>
        • NO ACTION prevents updates and deletes that would violate the constraint. This check can be deferred until the end of a transaction.
        • RESTRICT is like NO ACTION except it can't be deferred.
        • CASCADE deletes any rows that reference a deleted foreign row, or update the values in this table to match an updated foreign row's key.
        • SET NULL sets this column to NULL when the reference is broken.
        • SET DEFAULT sets it to the default value when the reference is broken. If the default is not null then the default value must exist in the foreign table.
    • Entity-Attribute-Value
      • Some graph databases work like this, but it's not well-suited for SQL databases that don't natively support it.
      • Solutions
        • Modern SQL databases can use JSON types for flexible schema needs while retaining strong typing, easy joins and relations, and all the other niceties of RDBMS for the basic data.
        • Define one table for each type and union them together.
        • You can set up child tables and have each object reference a row in that other table when needed for the subtype fields.
          • Working with this can also be made easier with a view that does the LEFT JOIN on the child tables for you.
        • If you're stuck with an EAV schema, your best bet is to just do the reconstruction in the code that reads the database query results.
    • Polymorphic Associations
      • Sometimes you need a particular object to be linked to two different types of things. A Comment might reference an Issue or a Pull Request, for example.
      • A bad solution is to have a single foreign key that can reference either type of object, with another column that indicates which type this prticular one is referencing.
        • This makes it difficult to write joining queries
        • It's impossible to use foreign key constraints with this model.
      • Better solutions
        • Use an intersection table for each of the tables.
          • This does have the downside that it's harder to enforce a one-to-one foreign key setup, and the select queries can be complicated depending on what you really want to do.
        • Create a common parent table
          • This doesn't work for everything but is nice when it makes sense for the data model.
          • Use a single ID space shared by all the types of objects that need to be referenced. (e.g. Issues and Pull Requests share the same ID space)
          • Here you have a single parent table that just contains IDs and nothing else. Creating an object involves first adding a row to this table to generate the ID, then creating the actual object that will use the ID.
          • All of the tables then have foreign key constraints to that parent table.
            • Issues and Pull Requests have their primary key reference the parent table.
            • Comments have their parent_id reference the parent table.
          • Select queries can be done normally, and don't even have to reference the master ID table.

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.