Elephant and Rucksack - Comparison of two CL Open Source Prevalence packages
Sunday, June 4, 2006
I've written in the past about different persistence alternatives (see here, here, here, here). The CL world is fortunate in having a number of different alternatives in this space. Franz's AllegroCache is an excellent commercial option; however, there are also a number of Open Source options available. One of the most popular of these is Elephant, which uses either Berkeley DB, PostgreSQL or SQLite as a back-end. A recent addition, though, is Arthur Lemmens' Rucksack package (which he presented at ECLM 2006). Arthur will be making it available under an Open Source license and he is sending people copies; however, it is not yet generally available (Update: Rucksack is publicly available! See note below). Therefore, I was excited when Ian Eslick (who has done most of the enhancements to the current version of Elephant) wrote me and indicated that he was planning to compare Rucksack and Elephant and also look at the feasibility of using Rucksack as an alternative back-end for Elephant! Following is Ian's review (edited slightly and reproduced with Ian's permission):
"I distracted myself this afternoon by writing a cached binary file and buffer library with a serializer as a potential step towards a native backend for Elephant. As I was contemplating some design decisions, I was curious how Arthur Lemmens made similar trade-offs in Rucksack, which motivated me to give his code a good read. That experience prompted the following comparison.
(Update 2006-06-05: Arthur Lemmens pointed out (via email) that he had created a common-lisp.net project for Rucksack about two weeks ago (http://common-lisp.net/project/rucksack/). It contains a CVS repository, mailing lists and a link to his ECLM talk.)
(Rucksack is described in detail here: http://weitz.de/eclm2006/rucksack-eclm2006.txt)
At present Elephant is fully functional and has been tested and used extensively in several demanding applications. Rucksack is not yet operational, but has a critical mass of code written for all functionality and has some architectural features worth keeping an eye on. The most exciting feature, of course, is that Rucksack is written entirely in mostly portable Common Lisp!
Serialization: Both systems take a similar approach to binary serialization and should perform similarly.
Persistent object storage: Rucksack and Elephant handle persistent objects very differently. In Elephant, every slot has a serialized descriptor (oid:class:slotname) that is used as a key to store all slot values in one large BDB BTree. The object oid is stored in class instances and used, along with the class and slot names, to index into the on-disk BTree to retrieve or overwrite a value.
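The slot-level scheme can be pictured with a small model. This is only a sketch in Python, not Elephant's code: a plain dictionary stands in for the BDB BTree, pickle stands in for the serializer, and the names SlotStore, write_slot and read_slot are made up for illustration.

```python
import pickle

class SlotStore:
    """Toy model of slot-level storage (hypothetical, not Elephant's API)."""

    def __init__(self):
        # (oid, class_name, slot_name) -> serialized slot value.
        # A dict stands in for the single large on-disk BTree.
        self.btree = {}
        self.next_oid = 0

    def new_oid(self):
        self.next_oid += 1
        return self.next_oid

    def write_slot(self, oid, class_name, slot_name, value):
        # Only this one record is overwritten; other slots are untouched.
        self.btree[(oid, class_name, slot_name)] = pickle.dumps(value)

    def read_slot(self, oid, class_name, slot_name):
        return pickle.loads(self.btree[(oid, class_name, slot_name)])


store = SlotStore()
oid = store.new_oid()
store.write_slot(oid, "person", "name", "Ian")
store.write_slot(oid, "person", "age", 30)
store.write_slot(oid, "person", "age", 31)   # overwrite in place
```

After the three writes the map holds two records for the object, one per slot, and updating the age never touched the name record.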
In Rucksack, object OIDs index a large vector which contains the current on-disk location of the serialized objects. On slot-writes, a new instance of the object is written to disk. On transaction commit, the vector pointer is updated. This requires Rucksack to commit to garbage collection in order to reclaim stored objects (something Elephant doesn't do as BDB handles transaction logging differently and does writes in place). However, the Rucksack choice provides a convenient way to handle transaction logging and rollbacks without a separate logging mechanism.
This means that Rucksack has to serialize all dirty objects when it commits a transaction. This involves more writing to disk and more total disk access than Elephant, which only writes changed slot values. Within a transaction Rucksack provides an in-memory object cache of dirty objects and maintains a cache of committed objects as well so that future transactions don't need to re-serialize objects.
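A comparable sketch of the object-table scheme follows. Again this is a simplified Python model under stated assumptions, not Rucksack's implementation: a list stands in for the on-disk heap, whole objects are written only at commit time, and the names ObjectTableStore, commit and garbage are hypothetical.

```python
import pickle

class ObjectTableStore:
    """Toy model: OIDs index a table of heap positions (hypothetical names)."""

    def __init__(self):
        self.heap = []          # append-only list standing in for the disk heap
        self.object_table = []  # object_table[oid] -> heap position of current version

    def create(self, state):
        self.heap.append(pickle.dumps(state))
        self.object_table.append(len(self.heap) - 1)
        return len(self.object_table) - 1   # the new object's OID

    def load(self, oid):
        return pickle.loads(self.heap[self.object_table[oid]])

    def commit(self, dirty):
        # dirty maps oid -> full new state. Every dirty object is
        # re-serialized as a whole, even if only one slot changed.
        new_positions = {}
        for oid, state in dirty.items():
            self.heap.append(pickle.dumps(state))
            new_positions[oid] = len(self.heap) - 1
        # Only now do the table entries switch to the new versions.
        for oid, position in new_positions.items():
            self.object_table[oid] = position

    def garbage(self):
        # Heap positions no longer referenced by the table: old versions
        # that a garbage collector would have to reclaim.
        live = set(self.object_table)
        return [i for i in range(len(self.heap)) if i not in live]


store = ObjectTableStore()
oid = store.create({"name": "a", "big": [0] * 10})
store.commit({oid: {"name": "b", "big": [0] * 10}})
```

Changing only the name still wrote a second copy of the whole object, including the unchanged list, and left the first copy behind as garbage. Dropping the new heap entries before the table is updated is all a rollback needs in this model.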
MOP: The metaobject protocol support for persistent objects is similar, although Rucksack's is simpler, in part because it makes more of a commitment to object-level storage instead of slot-level storage. Both Elephant and Rucksack support schema evolution: the ability to redefine classes at runtime and have the persistent instances updated, as in UPDATE-INSTANCE-FOR-REDEFINED-CLASS. Rucksack saves prior schemas so old instances can be loaded and then updated. Elephant effectively does the same by storing slot names so that the new schema can pick up old values stored under the same name, then run the loaded instance through the update function. There are some potential pitfalls here in Elephant and I was intending to fix them in a similar way to Rucksack as part of a serializer enhancement to avoid writing slot names all the time.
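The name-matching approach to schema evolution can be sketched as follows. This is an illustrative Python model, not code from either library: load_with_new_schema is a made-up helper, and the update function's arguments are loosely modelled on the added-slots and discarded-slots arguments of UPDATE-INSTANCE-FOR-REDEFINED-CLASS.

```python
def load_with_new_schema(stored_slots, new_slot_names, update_fn):
    """Rebuild an instance from values stored by slot name (hypothetical helper)."""
    # Values whose slot name still exists in the new class carry over.
    instance = {n: stored_slots[n] for n in new_slot_names if n in stored_slots}
    # Slots the new class has but the stored instance lacks.
    added = [n for n in new_slot_names if n not in stored_slots]
    # Stored values whose slot no longer exists in the new class.
    discarded = {n: v for n, v in stored_slots.items() if n not in new_slot_names}
    update_fn(instance, added, discarded)
    return instance


def update(instance, added, discarded):
    # User-supplied migration: initialise new slots, then rescue a renamed one.
    for name in added:
        instance[name] = None
    if "zip" in discarded:
        instance["postal_code"] = discarded["zip"]


old = {"name": "Ian", "zip": "02139"}
new = load_with_new_schema(old, ["name", "postal_code"], update)
```

Here the stored instance was written when the class had a zip slot; the redefined class renames it to postal_code, and the update function carries the old value across.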
Garbage collection: Rucksack has a full incremental mark-and-sweep collector. Elephant only has a poor-man's stop-and-copy via the repository migration interface (support for doing this automatically is not built in and it's expensive). Enough said.
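For readers unfamiliar with the technique, a minimal mark-and-sweep pass looks roughly like this. The sketch is in Python and is deliberately non-incremental, so it shows the basic algorithm only and says nothing about how Rucksack spreads the work across transactions; the heap layout and function name are assumptions.

```python
def mark_and_sweep(heap, roots):
    """heap maps oid -> list of oids it references; roots is an iterable of oids."""
    # Mark phase: everything reachable from the roots is live.
    marked = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in marked or oid not in heap:
            continue
        marked.add(oid)
        stack.extend(heap[oid])
    # Sweep phase: reclaim every object that was not marked.
    for oid in list(heap):
        if oid not in marked:
            del heap[oid]
    return marked


# Objects 3 and 4 reference each other but are unreachable from the root.
heap = {1: [2], 2: [], 3: [4], 4: [3]}
live = mark_and_sweep(heap, roots=[1])
```

Because reachability is traced from the roots, the unreachable cycle between objects 3 and 4 is reclaimed as well.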
ACID: Rucksack has an elegant solution to ACID properties by copy-on-write for persistent objects so that each parallel transaction has its own set of live objects. This avoids conflicts but also delays rollbacks. When a transaction has to abort because of a conflict, it just throws away the live objects in memory and restarts. This does mean that rollbacks are caused by object level write conflicts instead of slot conflicts.
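The effect of object-level conflicts can be illustrated with a small model. This Python sketch is not Rucksack's mechanism: it assumes a simple version counter per object and a check at commit time, and the names Store, Transaction and ConflictError are hypothetical.

```python
import copy

class ConflictError(Exception):
    pass

class Store:
    def __init__(self):
        self.committed = {}   # oid -> (version, state dict)

class Transaction:
    def __init__(self, store):
        self.store = store
        self.private = {}        # oid -> this transaction's own copy
        self.seen_version = {}   # oid -> version observed at first touch
        self.dirty = set()

    def _copy(self, oid):
        # Copy on first touch, so each transaction works on its own objects.
        if oid not in self.private:
            version, state = self.store.committed[oid]
            self.seen_version[oid] = version
            self.private[oid] = copy.deepcopy(state)
        return self.private[oid]

    def read(self, oid, slot):
        return self._copy(oid)[slot]

    def write(self, oid, slot, value):
        self._copy(oid)[slot] = value
        self.dirty.add(oid)

    def abort(self):
        # Rollback is just discarding the in-memory copies.
        self.private.clear()
        self.seen_version.clear()
        self.dirty.clear()

    def commit(self):
        # Conflicts are detected per object, not per slot.
        for oid in self.dirty:
            if self.store.committed[oid][0] != self.seen_version[oid]:
                self.abort()
                raise ConflictError(oid)
        for oid in self.dirty:
            self.store.committed[oid] = (self.seen_version[oid] + 1, self.private[oid])


store = Store()
store.committed[1] = (0, {"a": 1, "b": 2})
t1, t2 = Transaction(store), Transaction(store)
t1.write(1, "a", 10)
t2.write(1, "b", 20)   # a different slot of the same object
t1.commit()
try:
    t2.commit()
    conflicted = False
except ConflictError:
    conflicted = True
```

The second transaction is rolled back even though it wrote a different slot, because both transactions modified the same object. A slot-level scheme would have let both commits through.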
Summary: Rucksack is an elegant approach to persisting objects in Common Lisp. Its interface and Elephant's are very similar but they take a number of different and incompatible approaches to handling persistent slots, transactions, locking, etc. I don't foresee significant performance advantages on either side, but the serializer in Rucksack seems more efficient for standard objects at the cost of some robustness on class redefinition. I imagine I will be surprised by real-world benchmarks later. For example, I suspect that transaction performance will vary greatly based on workload. Typical website models should work the same on either as there are far fewer possible transaction collisions.
Unfortunately Rucksack isn't easily re-targeted as a native lisp backend for Elephant because of the greatly differing assumptions behind persistent objects. There may be a bit of code and design ideas that can be lifted however - such as the heap and btree implementation. There are some smart ideas in the serializer and in schema evolution that I've considered already so it's nice to have a reference implementation to refer to.
Notable differences:
- Rucksack is a reasonably compact, easy-to-understand system written entirely in Common Lisp. Elephant has complex dependencies between Lisp, C and the architectural commitments of BDB. Elephant performs poorly on SQL today, so BDB is the high-performance backend. BDB has license issues for even small-scale commercial deployment.
- Rucksack has full support for garbage collection; Elephant has minimal off-line support for storage reclamation.
- Elephant will allow multiple Lisp processes to use the same persistent store concurrently; a Rucksack store is locked to a single Lisp instance. Elephant can be configured with BDB replication, allowing for larger-scale deployment.
- Elephant is much more mature and its disk storage is much more likely to be reliable, so it will be some time until Rucksack is sufficiently mature for prime time.
- Rucksack performs object-level collision detection; Elephant performs record-based collision detection in a paged storage system. This has different implications for how classes should be designed (slot values with large arrays, for instance, should be wrapped in their own persistent class so that writes to other slots do not result in multiple copies of that array).
This review has been somewhat rambling, but I hope it makes people look forward to playing with Rucksack, produces some good ideas for Elephant and emphasizes that Elephant is ready for real-world (although probably non-critical) applications today."

