DaveWentzel.com            All Things Data

EAV Models

 

Entity Attribute Values Models
What is an EAV and what are the problems and patterns with them?  Sometimes EAV is referred to as named value pair design or key-value design.  It is sometimes also called object oriented data storage or object/relational design.  The base concepts are the same. 
 
To understand EAVs you need to understand row modeling.  Often times it may "seem" as though a modeling concept is transitory in nature and when a modeler sees this they sometimes build an EAV model. 
 
Assume you have a supermarket database.  There will obviously be tables for products, sales, inventory, etc.  Let's focus on products, which may be viewed as transitory.  A generic product table may have name, description, etc, but the "columns" to describe a generic "product" may number in the thousands.  Think of the columns that would describe "canned goods" vs the columns needed to describe "fresh fruit"...very different. 
 
In a relational model we might have separate tables for canned good products and fresh fruit products.  An EAV model says instead let's focus on describing a single "fact", and make that fact a single row in a table. 
 
So, your EAV table will likely have 3 columns (forgetting about LastChngDtTime and auditing columns).  You will have a column for the entity ("canned good"), the attribute ("can size"), and the value ("8 oz").  Entity Attribute Value...EAV. 
 
Here, one row in our table shows a single fact (an 8 oz can).  In a relational model this would likely be a single column in a table. 
 
So what is the problem with this model and when do we use it?  It could be argued that the EAV model is worthwhile when the number of attributes that POTENTIALLY apply to an entity is vastly more than those that ACTUALLY apply to an individual entity. 
 
So thinking of the supermarket example, potentially a "product" could have vastly more possible attributes than it actually has, if "product" is allowed to be a generic entity.  In the relational world we would instead model canned good separately from fresh fruit to overcome this. 
 
EAV modelers would dispute this because I am tying myself into a model less flexible.  They would say the EAV model is flexible because a new attribute becomes a row in a table instead of a column...which is much less flexible.  A relational modeler would say that the canned good concept doesn't change that much so the flexibility shouldn't be an issue given EAV's other shortcomings (read on). 
 
I like to take a common ground.  EAV has its places, but you need to understand the economic tradeoffs (no such thing as a free lunch).  Think of a stick shift car.  Drivers of manual gearboxes love the control they have over their drive train.  But using a stick is tricky...even the best stick driver will stall once a year, usually on a steep hill with a red light on an icy surface.  Further, most stick drivers think they can drive better than automatic drivers, but that's probably not true.  I'll take my 1975 Bricklin SV-1 automatic against a manual Ford Mustang any day.  Well, that might not actually be fair for other reasons. 
 
What are the general design patterns of EAV? 
1)EAV design is less efficient when retrieving bulk data on numerous entities with numerous attributes at a time.  Think a supermarket report that shows which can size is the best seller for beans.  However, for "one attribute at a time" retrieval this is an efficient model.  Think a web based browsing functionality that is pulling small amounts of data. 
 
2)Attribute intensive ad hoc queries are technically difficult and less efficient than a relational counterpart.  Again think reporting.  With EAV a "query generator" is a must.  The query generator must be written to understand the hierarchical relationships and apply the self joins appropriately.  There are no COTS EAV query tools that I am aware of.  This must be built up front.  Query generators are usually desirable for other database design patterns that are non-relational and used for reporting.  It's not uncommon to see a "data warehouse browse tool" that manipulates star schemas with "Slowly Changing Dimensions". 
 
What are some benefits?
1)Flexibility.  As attributes are created and changed we don't need a schema redesign since everything is just another row in a table.  A relational modeler would argue how often new attributes are created and would opt for a hybrid model. 
 
2)More efficient storage.  If data values are sparse then we don't need to store NULL for attributes that are not applicable to a given entity.  Relational people dispute this argument.  Modern RDBMSs store nulls effectively in varying character columns.  In earlier RDBMSs a null utilized the same width as a fully populated column. 
 
Relational people dispute this further in noting that all current RDBMSs are "row-oriented" storage.  There are currently no "column-oriented" storage systems (I think).  I won't go into details but suffice it to say that row-oriented systems are better to model relational data.  Column-oriented storage is good to store data warehouse data.  This is because an "data extent" in a column store can store as many items for a particular "column" of data, regardless of number of rows.  Data warehouse queries tend to go after specific columns (a small number of them) but A LOT of rows.  Cache makes an object relational storage system that is row based but handles EAV storage a little better.  But the query language is not straight SQL. 
 
In any case, why does EAV take more space in current RDBMS's?  Think about it...each column of data requires its own row in EAV to represent its "fact".  So for a canned good we may have 50 attributes in one row in relational, 50 rows in EAV.  Each row in EAV will of course need the overhead of a primary key and auditing information (LastChangeUser/Time/etc).  That's some overhead.  But it's worse if you consider that you are also carrying the "attribute_id" (the text "can size" for instance) on every row.  In a relational model this is NEVER carried, it's definitional.  Now here's the topper...more storage is needed for each and every index for all of that redundancy. 
 
I guess "efficient storage" should be in the "problems" section instead of under "benefits". 
 
3)The design maps easily to data interchange formats.  EAV is easy to map to XML or simple, flexible 3GL languages.  Relational modelers would counter that the queries to extract the data are too complex and the model is not visually self-documenting like a relational ERD.  A relational modeler would also say we are sacrificing end user ease of use for application programmer ease of use.  Are we more concerned with our customers' experience with our data, or our programmers'? 
 
4)You get all of the benefits of the object-oriented design
  • hierarchies are easier to model (class instances become object values). 
  • inheritance of properties between classes. 
 
What are some problems?
1)You can't support data types.  The "value" col in our Products table must be flexible so the best we could ask for is a primitive like varying character.  So, date and arithmetic calculations would need to be casted from varchar to a more appropriate DBMS or programming language type, then calculated.  EAV modelers counter this by creating value columns like "ValueVarchar", "ValueDateTime","ValueInt", etc.  Then the given data value is placed into the correct column and another column is set to let us know which "typed column" contains the real value.  In EAV parlance this is called "metadata".  It becomes critical to the reporting of the system.  Note the complexity that is starting to rear its ugly head.  A data type isn't inherent, it must be inferred from the metadata or casted from a primitive. 
 
2)Data constraints cannot be applied easily.  Again, this is a generic varchar "value" col, we can't constrain to say all fresh fruit shelf lives must be less than 30 days if we store canned goods in the same structure.  EAV modelers counter and say that the programming language should have a data validation class for this and all data would flow through the application.  Consider this akin to keeping the database "dumb".  I counter by saying we can't always assume our data will come through our app or given class.  Often data comes from flat files using data feeds that must be developed quickly.  BCP or sql*loader can do this quickly.  Building a wrapper application around the class may not be so easy. 
 
An EAV'er will also want to model an additional series of tables that will handle the constraining.  A structure such as this will be given. 
 
 
3)Considerable up front programming costs.  There are no EAV modeling tools.  In a relational model the logical model is generally pretty close to the physical model.  Not so with EAV.  In EAV the model is more similar to what an object-oriented programmer would understand...objects as instances of classes.  So, to write the query using EAV on an RDBMS requires you to translate the logical semantics to a physical access pattern.  You need a "query generator" (see above). 
 
4)CRUD is slower.  Generally when a design pattern is faster on queries it is slower on CRUD and vice versa.  Interestingly this is NOT the case with EAV.  Not only do larger queries perform worse, but even the smallest INSERTs do as well.  Why?  Each "insert" is not just another row in a table but multiple rows in a table and will need to be locked appropriately causing long-running transactions with more chance of a blocking/deadlocking scenario.  UPDATES will be worse too since the updates will have to span multiple rows, most likely, and those rows will be scattered across a very large "entity" table. 
 
5)Performance Accessing Data:  This should be its own topic frankly and is the crux of the problem.  Boolean queries are especially bad. 
 
If I were to say "show me canned good and fresh fruit product names" in the relational world this would be a simple query of a few tables and a WHERE clause like:
 
SELECT productname FROM product WHERE producttype = 'Canned' OR producttype = 'FreshFruit'
 
In EAV we can't do Booleans easily, instead we must rely on set-based relational algebra.  The query becomes something like
 
SELECT value FROM product WHERE entity='canned good' AND attribute = 'productname'
UNION
SELECT value FROM product WHERE entity = 'fresh fruit' AND attribute = 'productname' 
 
Note the "complexity". 
 
So, a design pattern looks like the following table: 
 
Operation
Relational Logic
EAV Logic
AND
AND
INTERSECTION
OR
OR
UNION
NOT
NOT
DIFFERENCE
 
Note that SQL Server does not implement the set based operators for DIFFERENCE and INTERSECTION (yet...this is ANSI standard so it should be just a matter of time).  You need to devise a subquery solution.  In Oracle we have INTERSECT and EXCEPT that handle the cases as ANSI specifies. 
 
6)Foreign Keys.  How do you declare referential integrity in this model?  Again, an EAV modeler will point to "metadata" tables to handle this. 
 
Querying Problems
I mentioned this many times above.  Let's look at more examples that speak to performance.  As mentioned earlier, EAV designs have one large physical table that supports many logical tables.  To write the queries requires self JOINs of the tables back to themselves to form the hierarchy.  Any cost based RDBMS optimizer (SQL Server and newer versions of Oracle) will try to find the fastest way to do these JOINs.  If an EAV query needs 8 self joins then considerable time will be spent on query optimization phases.  The number of combinations a cost based optimizer may have to come up with is 8! (8 factorial or 8x7x6x5x4x3x2x1= 40320).  Ask any DBA and he'll say that this is true...the more JOINs the worse the plans tend to be.  Reduce the JOINs and the plans become more performant.  Further, an optimizer can't really optimize an EAV query if you think about it. 
 
EAV query developers come up with other nifty solutions to avoid all of the self JOINs.  They process a few tables at a time and dump the results to a worktable for further processing.  Performance will not be good.  In this case the developer is basically saying he can JOIN the data better than the database manager.  Another solution I've seen is the use of lots of views...similar problems as with temp tables. 
 
Cross-Tabulation
Say I wanted the following report for whatever reason. 
 
ProductName
ProductDescription
ProductManufacturer
Bananas
A Tasty Fruit
Chiquita
Green Beans
Canned vegetable, not at all tasty
DelMonte
 
In a relational model the data for this report would likely be stored in a table modeled similarly to this report.  Might be a little more normalized, maybe a separate manufacturer table.  But the query is something like
 
select productname, productdescription, productmanufacturer from product where producttype in ('cannedgoods','freshfruit')
 
But the data will be stored as such in an EAV model:
 
entity
attribute
value
canned good
productname
green beans
canned good
productdescription
Canned vegetable, not at all tasty
canned good
productmanufacturer
DelMonte
freshfruit
productname
bananas
freshfruit
productdescription
A Tasty Fruit
freshfruit productmanufacturer Chiquita
 
Now let's write the query. 
 
SELECT t1.value, t2.value, t3.value
FROM Object t1
JOIN Object t2
ON ...
JOIN Object t3
ON...
WHERE (t1.entity in ('cannedgood','freshfruit') and t1.attribute = 'productname')
AND (t2.entity in ('cannedgood','freshfruit') and t2.attribute = 'productdescription')
AND (t3.entity in ('cannedgood','freshfruit') and t3.attribute = 'productmanufacturer')
 
Notice in the EAV model that each "column" in the report is another JOIN back to the table with a different WHERE clause with 2 conditions (entity and attribute).  The SELECT will always have a pattern of t1.value, t2.value, t3.value, etc.  It won't change.  But the risk of incorrectly defining the WHERE clause is high due to complexity. 
 
In essence we are taking a "row" and converting it to a "column"...this is cross-tabulation and is tough in relational SQL.  ANSI is considering adding a CROSS TAB function to the SQL spec, but who knows when that will be adopted by MS and Oracle...and whether it will solve the EAV problem. 
 
The "Laziness" Factor
I see the EAV solution arising due to laziness frankly.  It seems like all procedural developers like to find an easy way to do things.  Relational SQL is not easy to a OO programmer.  An OO programmer likes the concepts of classes, objects, and attributes and this is EAV.  So they come up with a simple schema that takes some initial up front time to develop, but subsequent data manipulation becomes easier...for them.  They are knowingly sacrificing performance for this flexibility.  The end user, the customer, who needs to get at this data and feed it into SAP, or an enterprise data warehouse, suffer.  It's unwieldy and performance poorly. 
 
What is modeled with EAV today?
The windows registry, web cookies, and XML.  XML has an open/close attribute model that supports the nesting of other attributes.  web.config is really just named value pairs. 
 
Summary
Obviously I think EAV is poor under almost any circumstance I can think of.  But I'm no one.  But my hero, Joe Celko, has some thoughts that tend to be similar to my own.  http://www.celko.com .  This guy helped write the ANSI SQL standards.  Even he says this is something only a newbie would do
 
References
AttachmentSize
Image icon eav1.jpg32.89 KB

3 comments

Well said

with horizontal partitionning, 6nf + dknf, master data store, sql-variant data type (I know, I know...), conditionnal check constraints and and a transparent layer of indexed queries, you really can overcome many of these problems and querying becomes fun again.....

I think I wrote this at least 8 years ago.  A lot has changed.  I still wouldn't do this model in a RDBMS.  Better suited for a document store.  Probably.  

Add new comment