Wednesday 4 February 2009

Changed beyond recognition and the EAV data model

One of the most maligned performance drags an application can have is its abuse of the so-called Entity-Attribute-Value (EAV) model for data storage. However, as with many catastrophic events, the performance problem is the result of failures at multiple points in the decision chain. Let's examine the motivations behind EAV for each of the parties involved.

The buyer's (or designer's) dilemma


You're the person in charge of selecting a standard package. Or you're in charge of the design or development of a bespoke package in your medium to large business. At some point, you'll be collecting the so-called "requirements" from future "customers", or application users.

No matter how hard you try, at some point in this process it will become clear to you that you cannot define the data model with enough precision to guarantee that the database design is not going to change over the life of the application.

Usually, what is especially troublesome is defining the information structures that represent data for entities external to your business: your consumers and customers. This is because the amount of data that the business holds at any point in time depends heavily on the number of initiatives addressing them. The more initiatives, the more disparate the information being collected from the outside.

On top of that, in extreme cases it's not even clear what the purpose of this information is. Somehow everybody agrees that having customer age information is good, but they are not sure how to cluster it. And if they do know how to group ages, this grouping is likely to change in the next six months. Worse yet, they start talking about targeting promotions according to age.

So you design, or request from the potential suppliers, some provision for storing additional entity data in your application. You don't know exactly what information you'll store, how much of it or how often. But you surely know that no matter how complete the data model definition appears, someone in the near future is going to ask to store something that was not in the data model design.

The marketing dilemma


Whether you're an internal developer, a consultant or a purchaser, you need to be able to answer with a big smile and a resounding YES to the question: does your application support additional data fields? This helps a lot, as any concern about the suitability of the solution can be addressed: "hey, we can always add this bit later" is the answer to all the problems.

No matter how hard you argue about whether this is the right solution or not, nothing beats the convenience of "unlimited" additional "fields" that can be added to the application.

The implementor's dilemma


Now it's time to deliver. How do you comply with the requirement?

Imagine that you're creating a standard, packaged application. To support additional attributes, you start exploring and arrive at pretty much three options.
  1. Provide some interface from your application that allows users to change the data model. Of course this interface must be portable across as many database flavors as you're going to support, adapting to the specifics of each one. Or just make everything the most generic "string" data type you're able to find. In any case, your database API needs to support data definition statements. Last time I checked, the most popular ones did not.
  2. For each entity, add additional columns such as "ADDED_ATTR_1", "ADDED_ATTR_2", "ADDED_ATTR_3", and so on. The trouble with this approach is that you're not meeting the marketing requirements: the number of fields is not "infinite". On top of that, the names are not exactly descriptive and you have the same database-specific data type problems as in option 1. Note that regardless of your choice, your app will also need some metadata tables to store what those additional attributes mean in the particular customer context.
  3. In this way, you arrive at the EAV design. You simply define a table
    like this:
    ADDITIONAL_INFO(
    ENTITY_NAME
    ENTITY_ID
    ATTRIBUTE_ID
    ATTRIBUTE_VALUE
    )
    That's it: you have designed a generic facility that is able to store customer age. Simple: if the customer with ID 35687 is 25 years old, you just store ('CUSTOMER',35687,'AGE','25'). Repeat that for all customers. GPS coordinates? No problem: ('CUSTOMER',35678,'LOCATION','1.123 2.45').

    And there you have it: the origin of the EAV model in relational databases.
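As a minimal sketch of the table above, assuming SQLite driven from Python's standard sqlite3 module: the column types are an assumption (the definition above gives none), and every value is stored as text, as in the examples.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")

# Column types are assumed; the article's definition lists names only.
conn.execute(
    """
    CREATE TABLE ADDITIONAL_INFO (
        ENTITY_NAME     TEXT    NOT NULL,
        ENTITY_ID       INTEGER NOT NULL,
        ATTRIBUTE_ID    TEXT    NOT NULL,
        ATTRIBUTE_VALUE TEXT
    )
    """
)

# The two example rows from the text.
rows = [
    ("CUSTOMER", 35687, "AGE", "25"),
    ("CUSTOMER", 35678, "LOCATION", "1.123 2.45"),
]
conn.executemany("INSERT INTO ADDITIONAL_INFO VALUES (?, ?, ?, ?)", rows)

# Reading one attribute back means naming entity, id and attribute.
age = conn.execute(
    "SELECT ATTRIBUTE_VALUE FROM ADDITIONAL_INFO "
    "WHERE ENTITY_NAME = ? AND ENTITY_ID = ? AND ATTRIBUTE_ID = ?",
    ("CUSTOMER", 35687, "AGE"),
).fetchone()[0]

print(age)
```

Note that the age comes back as a string: any typing of the value is left to the application.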

Changed beyond recognition


By itself, this technique (should I call it a "pattern"?) is neither good nor bad. As stated, it is a solution to a problem and seems to be a good compromise between flexibility, customization and usability.

Those big applications (Siebel & SAP are the examples closest to me) ship with at least rudimentary facilities to customize their behavior. These facilities are no match for a full-featured IDE like Eclipse or Visual Studio, since they are intended to handle small changes in functionality and so don't need the bells and whistles of your regular Java or C# IDE. But using them you can easily add to the views the additional data items coming from this table. It's all solved, right?

Yes, up to this point everything fits with the original plan. You have a generic mechanism for extending entities that is flexible enough to accommodate all future needs. However, three things stand in the way of data model nirvana.

First, you'll have to face the problem of exploding data volumes in this table. Let's suppose you have X customers and wish to add four additional data items to the customer entity. You then have 4*X records in this table. Worse yet, each time you display your customer data you are accessing this table four times, making it a point of contention far worse than the original customer table. If you use this table to add detail to transactions the volume explodes even more.

Second, you essentially give up on the database providing any kind of integrity for those additional data items. In general, you cannot enforce integrity constraints, nullability or uniqueness, unless you immerse yourself in writing some complicated triggers. You were doing this because you wanted to store arbitrary data, so it seems illogical to try to enforce rules on what is essentially arbitrary data.
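To make the integrity point concrete, here is a small sketch, again assuming SQLite through Python's sqlite3 module. The CUSTOMER_TYPED table and its 0 to 150 age range are hypothetical, invented only for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The generic EAV table: every value is just text.
conn.execute(
    "CREATE TABLE ADDITIONAL_INFO ("
    "ENTITY_NAME TEXT, ENTITY_ID INTEGER, "
    "ATTRIBUTE_ID TEXT, ATTRIBUTE_VALUE TEXT)"
)

# Hypothetical conventional table where AGE is a real, constrained column.
conn.execute(
    "CREATE TABLE CUSTOMER_TYPED ("
    "ID INTEGER PRIMARY KEY, "
    "AGE INTEGER NOT NULL CHECK (AGE BETWEEN 0 AND 150))"
)

# The EAV table happily accepts nonsense ages.
conn.execute("INSERT INTO ADDITIONAL_INFO VALUES ('CUSTOMER', 35687, 'AGE', 'banana')")
conn.execute("INSERT INTO ADDITIONAL_INFO VALUES ('CUSTOMER', 35678, 'AGE', '-5')")
eav_rows = conn.execute("SELECT COUNT(*) FROM ADDITIONAL_INFO").fetchone()[0]

# The typed column lets the database reject the bad value for us.
try:
    conn.execute("INSERT INTO CUSTOMER_TYPED VALUES (35678, -5)")
    typed_rejected = False
except sqlite3.IntegrityError:
    typed_rejected = True

print(eav_rows, typed_rejected)
```

Getting the same guarantee out of the EAV table would take per-attribute trigger logic, which is exactly the complication described above.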

Third, and this is the worst problem, this can easily get out of control. Especially with packaged applications, one must always resist the temptation to rewrite them. As often happens, deploying such packages entails some choices when there is a mismatch between the business process as supported by the package and the same process as run by the company. Guess who usually wins? The company, of course. After all, the process has been there for years and may even be regarded as a competitive advantage, so why should they bend to the standard packaged way?

Very few companies take the brave step of changing their processes to fit the package. Instead, they customize the application to fit their purposes. But when taken to the extreme, especially in business-driven projects with no technical expert on the business side, this leaves the original package unrecognizable. In some cases, I've seen EAV entries used as foreign keys to other tables!

Aside from the problems that this creates at version upgrade time (some packaged applications have not been upgraded in 12 years just because of the cost of redoing all the customization), this also invalidates any assumptions made by the application's original developers. Assumptions about performance, usage scenarios, volumetrics: everything can be invalidated.

In fact, my completely unscientific research indicates that there is a new trend against packaged applications as the universal solution to all in-house development problems. More specifically, big IT analysis groups are discouraging businesses from going the packaged application route, and advising them to write their own from scratch, if they anticipate having to change or customize more than 20% of the functionality they are going to use. The argument is that it's cheaper in the long term, as completely overriding application functionality often takes more work than creating that functionality in the first place. On top of that, the development environments for those packaged applications are usually orders of magnitude less productive than standard IDEs.

The performance tuning dilemma


Whereas classic data modeling techniques translate "entities" into tables and attributes into columns, the EAV model tries to store different entities in the same table and also allows for entities to have a variable number of attributes. This works under some circumstances, but almost never works with relational databases.

As for performance, the overall consensus around EAV is that it's bad. In some cases it's bad in the way a rainy day is bad when you intended to have that nice bike ride; in others, bad in the way extensive dental surgery is. Some of the problems you may encounter when trying to optimize an EAV design:

As hard as they try, RDBMS optimizers still have lots of problems dealing with a high number of table joins in the same statement. The access plans will be dominated by the need to access the EAV tables over and over.

Yes, I know, relational databases were supposed to be built to solve the problem of handling joins. But never on the scale required for the EAV explosion. In a well-designed data model, very few operations, if any, need to join 12 tables. But remember, EAV will require a join for each attribute accessed.
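The join-per-attribute effect can be sketched as follows, assuming SQLite through Python's sqlite3 module. The CUSTOMER table, the customer name, the PHONE_2 and SEGMENT attributes and the build_query helper are all made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CUSTOMER (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE ADDITIONAL_INFO (
        ENTITY_NAME TEXT, ENTITY_ID INTEGER,
        ATTRIBUTE_ID TEXT, ATTRIBUTE_VALUE TEXT,
        PRIMARY KEY (ENTITY_NAME, ENTITY_ID, ATTRIBUTE_ID)
    );
    INSERT INTO CUSTOMER VALUES (35687, 'Alice');
    INSERT INTO ADDITIONAL_INFO VALUES
        ('CUSTOMER', 35687, 'AGE', '25'),
        ('CUSTOMER', 35687, 'LOCATION', '1.123 2.45'),
        ('CUSTOMER', 35687, 'PHONE_2', '555-0100'),
        ('CUSTOMER', 35687, 'SEGMENT', 'GOLD');
    """
)

attributes = ["AGE", "LOCATION", "PHONE_2", "SEGMENT"]


def build_query(attrs):
    """Build a query that shows a customer with one column per EAV attribute."""
    select = ["c.ID", "c.NAME"]
    joins = []
    for i, _ in enumerate(attrs):
        alias = f"a{i}"
        select.append(f"{alias}.ATTRIBUTE_VALUE")
        # One extra join against the same table for every attribute.
        joins.append(
            f"LEFT JOIN ADDITIONAL_INFO {alias} "
            f"ON {alias}.ENTITY_NAME = 'CUSTOMER' "
            f"AND {alias}.ENTITY_ID = c.ID "
            f"AND {alias}.ATTRIBUTE_ID = ?"
        )
    return (
        f"SELECT {', '.join(select)} FROM CUSTOMER c "
        f"{' '.join(joins)} WHERE c.ID = ?"
    )


sql = build_query(attributes)
row = conn.execute(sql, attributes + [35687]).fetchone()
join_count = sql.count("LEFT JOIN")

print(join_count, row)
```

Four attributes cost four joins against the same table. With ordinary columns, the same row would come from a single access to CUSTOMER.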

CPU utilization will be very high, since you'll want to access EAV attributes using an index lookup, to avoid scanning a large table when you're only ever interested in a small subset of it. What a "sane" data design retrieves as part of the entity now requires an extra lookup and table access per attribute.

Contention on data and index pages will be very high, as those attributes are accessed all the time by everybody, even in the simplest of cases. Databases without row-level locking will suffer the most, as without partitioning you cannot ensure that the attributes for one entity and another are in different pages or tables.

What can be done to improve the performance of EAV applications? The problem when tuning such designs is that, at least in my case, I always try to avoid changing the data model as part of the exercise. Putting all those attributes in the tables where they belong means changing the data model, and by definition that also means changing the controller logic. That lands us in very expensive territory, if it is possible at all (do you have the source code?).
  • The best I've been able to do with those applications without touching the data model has always been around storage layout: partitioning the table using DBMS features. Slicing the big table based on ENTITY_NAME will reduce overall contention and confuse the optimizer a bit less, because you're now more or less localizing access according to entity.
  • Making sure that the EAV tables are cached (if appropriate) and do not fragment with frequent updates is another way of at least keeping performance from deteriorating even more.
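  • Apply the two previous points to indexes as well, if possible. Needless to say, ensure that
    (ENTITY_NAME,ENTITY_ID)
    leads the primary key. The full key also needs ATTRIBUTE_ID, since each entity has one row per attribute.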
  • Check SQL statements and verify that they always access the table using both ENTITY_NAME and ENTITY_ID values. Yes, someone knows that there is not a single object other than customer 24 with ID=24 in the whole system, but without ENTITY_NAME the RDBMS cannot use the primary key to access it.
  • Avoid using EAV values as foreign keys, if possible.
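The point about always supplying ENTITY_NAME together with ENTITY_ID can be checked with a query plan. This is a sketch assuming SQLite through Python's sqlite3 module. The primary key here is assumed to include ATTRIBUTE_ID, since each entity has one row per attribute. Other databases word their plans differently, and the plan helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ADDITIONAL_INFO (
        ENTITY_NAME     TEXT    NOT NULL,
        ENTITY_ID       INTEGER NOT NULL,
        ATTRIBUTE_ID    TEXT    NOT NULL,
        ATTRIBUTE_VALUE TEXT,
        PRIMARY KEY (ENTITY_NAME, ENTITY_ID, ATTRIBUTE_ID)
    )
    """
)


def plan(sql, params):
    """Return the textual steps of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each row.
    return [r[-1] for r in rows]


# Leading key columns supplied: the primary key index can be used.
with_name = plan(
    "SELECT ATTRIBUTE_VALUE FROM ADDITIONAL_INFO "
    "WHERE ENTITY_NAME = ? AND ENTITY_ID = ?",
    ("CUSTOMER", 24),
)

# ENTITY_NAME omitted: the leading key column is missing.
without_name = plan(
    "SELECT ATTRIBUTE_VALUE FROM ADDITIONAL_INFO WHERE ENTITY_ID = ?",
    (24,),
)

print(with_name)
print(without_name)
```

In SQLite the first plan is reported as a SEARCH using the primary key index and the second as a SCAN of the table, which is the difference the last-but-one point above warns about.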
None of those techniques will produce any spectacular performance improvements, so your reputation may suffer. But you can always point your clients to this article so they understand the situation a bit better. And someone might think again before making this mistake in the future.

Does this mean that EAV is essentially bad? No, that is not what I was trying to say. What I was trying to say is that this EAV idea is of very limited applicability in relational databases. It will not scale or support anything beyond the simplest "let's store the second telephone number here".

For those who really need this kind of flexibility, my advice would be to look at hash-table-based object storage or document indexing. Of course, you'll miss the ability of the relational database to join things together, so perhaps a mixed approach would be better.
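One possible reading of such a mixed approach, offered only as a sketch: keep the well-understood attributes as ordinary columns and put the open-ended ones in a single document per entity. The EXTRA column, the JSON encoding and the customer name are assumptions made for the illustration, not something any particular package does.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Fixed, well-understood attributes stay relational; the rest goes in EXTRA.
conn.execute(
    "CREATE TABLE CUSTOMER ("
    "ID INTEGER PRIMARY KEY, "
    "NAME TEXT NOT NULL, "
    "EXTRA TEXT NOT NULL DEFAULT '{}')"
)

# All additional attributes for one customer travel as one document.
extra_in = {"AGE": 25, "LOCATION": [1.123, 2.45]}
conn.execute(
    "INSERT INTO CUSTOMER VALUES (?, ?, ?)",
    (35687, "Alice", json.dumps(extra_in)),
)

# One row access returns the entity and every additional attribute.
name, extra_json = conn.execute(
    "SELECT NAME, EXTRA FROM CUSTOMER WHERE ID = ?", (35687,)
).fetchone()
extra = json.loads(extra_json)

print(name, extra["AGE"], extra["LOCATION"])
```

The trade-off is the one mentioned above: the database can still join on the relational columns, but it knows nothing about what is inside the document.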

And by all means, follow the expert advice: if you need to change more than 20% of your purchased package (make that 15%), throw it away and write your own.