Foreign keys – don’t go there.


You’re probably a medium-to-large-sized company, you run a medium-to-large application. You apply all the industry best practices, using micro-/plain-old-service architecture, compartmentalise your system into nice tidy chunks. As they usually do, your DB admin doesn’t like to maintain multiple databases and so you’ve been restricted to run everything under one fairly large database.

One fine day, you decided your data is important, it needs integrity. You decided the best thing to do is to introduce foreign keys.

 

You may have just made a big architectural mistake

Foreign keys are symptoms of a larger problem architecturally. In an ideal world, system dependencies should be designed in a vertical sense, and message exchange should happen at the interface layer. Inevitably, what a foreign key suggests is that there is some interface relationship between various data in your lowest level structure of your various systems. Now you’ve got at least two interface points in your system and you’ve got a future scaling issue.

deathbyoracle

There are several immediate impacts of this design decision. There’s a hint that your system is not as stateless as it should be. It also means that it’ll be fairly difficult exponentially to test and deploy your service in isolation (see continuous delivery). It also makes it impossible to maintain and mould as you lose understanding of the implicit coupling, which the ‘state’ of your service can be modified by external factors.

 

Isolate everything

My proposal is that each service should control and access all interactions with its database entirely and exclusively. I’d go as far as saying that having two systems share the same database on different is a big mistake. It’d be even better, if your database instance can be deployed within the same container of your application. The mental shift here is that the database, is served as part of an application instead of a separate integration piece.

deathbyrabbitmq

tl; dr;

In recent years, there is a fundamental shift in philosophy about how we should maintain and access our data. As we learn and adopt smaller (but more) services and smaller (but faster) delivery, we’ve come to realise in time, we need smaller, more isolated and databases (which are part of a piece of a whole of a micro-service, not as a part of a piece itself).

It is not to say foreign keys are bad. There are small sites which will never need to scale, and in themselves qualify as singular atomic pieces which benefit from constraints to maintain data integrity. But, if you’re thinking about scale, then I would advise that you try to keep each database isolated completely.

And do not use Oracle*. Yes, you database gurus can now hate me.

* if you hadn’t caught on, just a jest … 🙂

 

You may also be interested in reading the counter-argument: I’m a Database Developer, this is what I think about data.

11 thoughts on “Foreign keys – don’t go there.

  1. I’m still digesting this, but my gut reaction is that even small databases should maintain FK relationships for data that belongs together. This is how the db engine creates query plans and you’d be taking a performance hit by not using them. Scale bad data access performance… I do agree that closely related data should be in its own db. If you’ve got an app with user configs, you should have a user configs database for that app. If you have customers, that’s a customers database. Orders? That’s an order database. At some point though you would have to aggregate. Do you have an aggregation service that calls those micro-services to gather the bits separately? Or do you create an aggregate DB that’s duplicated from each DB (customers and orders)?

    I’ve encountered some painful experiences regarding data integrity when certain foreign keys were missing. It seems more like applying them in the right ways and maintaining a clean SOC in terms of the data is best.

    • Hi Philn5d,

      Thank you for your thoughtful response. About foreign key and database performance, I suppose the question is “depends on your db engine”. Certainly the ones I’ve worked with (say, SQL server) do not rely on a “logical” foreign key to optimise queries (unless we’re talking conceptual semantics here).

      Even primary keys on its own have no impact on performance, it is the clustered index implicitly created with your primary key (which in turn creates a B tree structure) that optimises your query execution.

      Of course, I’m not against foreign keys as per say. I’m merely indicating that the presence of an overload of foreign keys MAY indicate a larger problem (or you be thinking about a non relational data structure) .

      • Ah, right of course – FK would not have a direct impact on performance unless it implicitly created an index. It is the non-clustered index that would. I noticed in MSSql Server that a clustered index is created when a PK is created. Still is the ordering of the clustered index, fill factor, and overall design that will impact performance most since that disk IO will generally be the most costly (other than a function on each row etc).

        I’d be interested to know your position on how to best serve the aggregation of related data which is stored in separate data stores (across servers even).

        • ah, my view – it depends. what context are we talking about? what is the size of the data stores? are we talking about polygot data stores? is there a real time requirement for data aggregation? is there a requirement for persisted aggregation (and real time in that, or near-real-time?).

          what is the purpose for this aggregation? (data science, analysis, operational?). do you want to lean on existing technologies for data replication (if relevant) ?

          alas, you’ll need context to figure out what’s best for the problem at hand.

      • Just a little bit of clarity on SQL Server (I don’t know if this is applicable in Oracle or other engines).

        Foreign keys, primary keys and unique constraints actually all directly affect database performance, completely independently of indexes (although, in the case of primary keys and unique constraints, you can’t escape the indexes). The query optimizer looks at these database objects and uses them as part of the calculations for it’s row targets. Those row targets, how many rows it can expect based on the query and the object, determine the choices the optimizer makes. So, in the case of foreign keys, assuming they’re enforced (using the WITH CHECK option), it can know about matching rows in a table. That allows it to do things like I outline in this blog post: (http://www.scarydba.com/2010/11/22/do-foreign-key-constraints-help-performance/ ). Same thing goes for primary keys. That unique value means that it knows for a given value, only one possible row can be returned. It changes the choices it makes because of that.

        While there are also costs associated with primary keys and foreign keys, they really are a lot more than just indexes in terms of performance and performance tuning.

        I hope this is helpful.

  2. “Inevitably, what a foreign key suggests is that there is some interface relationship between various data in your lowest level structure of your various systems.”

    Well, yes. And if you have data that doesn’t have some interface relationship, then why would you regard it as part of a single corpus at all? Of course data relates!

    “In an ideal world, system dependencies should be designed in a vertical sense, and message exchange should happen at the interface layer.”

    In an ideal world, I agree that system dependencies should be vertical. But why should “message exchange” happen at the interface layer (by implication, in an ad-hoc manner) rather than at any other architectural construct you care to imagine, including the client? What’s wrong with integrating and rectifying in the persistence layer? You are asserting, without any logical or epidemiological basis.

    Personally, I don’t believe your thesis. Throughout any system, data fizzes: it materialises, moves, and disappears in barely-orchestrated pyrotechnics. It’s only at the persistence layer where it comes to rest – and its only then, when you have all your cards simultaneously spread out on the table, that you can see what you’ve actually got.

    It seems to me, in a philosophical sense, that data-in-motion and data-at-rest are very different animals. But to presume from that that data-at-rest is moribund and passive (which your thesis seems to) is to fundamentally misunderstand what data is. The persistence layer is your repository of truth – there can be no other – and that being so the ability to protect its integrity, to turn it to see it from different perspectives, to abstract it and let it tell its story – that’s the whole point of collecting all that data. You simply can’t do that at the superficial layers of the system where the data is still transient.

    Data simply isn’t just packets, no matter how fervently coders wish it were.

  3. I struggle with this as a general idea. It appears to me that you are merely moving the problem, and potentially causing other ones here. Apart from Grant’s response on performance, which is very true, there are other issues with microservices and separate databases for each service.

    Let’s say we are doing something trivial, like a simple e-comm site. I could have a separate database for products, one for orders, one for shipping actions one for customers, etc. However as I scale, I’m not using intra-database queries or FKs/joins/other techniques to verify information for an order. Now I’m sending messages. At small scale, these are trivial. At large scale, now my network traffic and messaging is vastly growing.

    Also, since there is a “products” database, and an “orders” database and a “customers” database, I still can’t necessarily scale out these microservices beyond their own db. Perhaps I can scale them higher in terms of each being as large a machine as I can now with a single Oracle/SQL Server/MySQL/etc box, but I’m still having a scaling issue. I also now have a new messaging problem I need to manage and architect. If I lose one database, how gracefully can my application degrade or does it start to fail with unforeseen interactions? Do I create more frustration when customers cannot place an order because we can’t find the customer db or because the site is down?

    Certainly there are domains of problems that would work find here. Spotify is probably a good example of this. There may be ecomm systems that allow this, perhaps being willing to notify customers after some time (hopefully minutes, perhaps hours) that a product is out of stock because the message from the orders db didn’t get to the message from the inventory db. There are certainly ways to copy and replicate some data, assuming that “fresh enough” works for your application. However that’s a business decision more than a coding decision. I also think the more you work with singleton rows and single objects in an application context, the less you need RDBMS capabilities.

    Ultimately I think FKs and strong RDBMS systems work very well in many situations. They work less well in many situations and your domain may qualify for those areas where a NoSQL, a multi-db, or other architecture works. We should certainly investigate and write about where things work and don’t work, but I wouldn’t assume the RDBMS with it’s PKs and FKs isn’t a valid, well thought out and incredibly useful architecture.

  4. Databases without primary keys inevitably get duplicates in at least some of their tables, and databases without foreign keys inevitably get orphans. Processes without transactions inevitably have partial commits. Performance is vital, but it doesn’t trump integrity. I’ve seen too many times these basic rules violated, only to produce a headache later. And as has been pointed out, at least where SQL Server is concerned, these objects are used by the optimizer to develop efficient query plans.

  5. if normalization is a sound objective – to not have customer name in each order record – then normalization needs to be enforced at the lowest possible level.Which is the db, Which is either (a) foreign keys for implicit enforcement, or (b) exclusive use of stored procedures for explicit enforcement. To suggest that an external executing app somewhere should do this, is simplistic and ill-considered. What happens when the app is retired for something bigger, better, shinier? All the enforcement logic needs to be recreated? What about other apps coming online, that need to modify the data? Is the answer really “sorry we can only ever have one app that modifies the data and we can never replace it” ? If so, then is one more reason to not have enforcement by an app but instead directly in the db. And as for multiple databases per application ‘domain’ (customers, orders), try using schemas.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s