Tuesday, October 19, 2010

lucene-bytebuffer

What is lucene-bytebuffer?

lucene-bytebuffer is a Lucene Directory implementation built on direct ByteBuffers. A Directory is the backing storage of a Lucene index: Lucene reads and writes all index contents through it. Lucene already ships RAMDirectory, FSDirectory, MMapDirectory and NIOFSDirectory, each with different trade-offs. lucene-bytebuffer lets an in-memory index grow up to several gigabytes without incurring garbage collection cost.

Most indexes are read-heavy, roughly 90 to 95% reads and 2 to 5% writes, i.e. the index hardly changes. If the index is huge and lives on the heap, it costs a lot of garbage collection CPU cycles. RAMDirectory holds its data in byte arrays of 1024 bytes each, so a 1 GB index means about 1 million array objects. As the index grows, in-memory performance degrades because of garbage collection.
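
To make the "1 million array objects" figure concrete, here is the arithmetic as a tiny Java sketch. The class and method names are mine, purely for illustration; only the 1024-byte buffer size and the 1 GB index size come from the paragraph above.

```java
/** Illustrative only: counts how many fixed-size byte arrays an index of a given size needs. */
final class RamBufferCount {

    /** Number of buffers of bufferSize bytes needed to hold indexBytes bytes (rounded up). */
    static long bufferCount(long indexBytes, int bufferSize) {
        return (indexBytes + bufferSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        long oneGigabyte = 1L << 30;
        // 2^30 / 2^10 = 2^20 arrays, each a separate object the collector has to track
        System.out.println(bufferCount(oneGigabyte, 1024)); // prints 1048576
    }
}
```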

What if you want to index, say, 5 GB of data? Use a directory backed by off-heap ByteBuffers.
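
A rough sketch of what "off-heap ByteBuffer backed" storage could look like follows. This is not the lucene-bytebuffer code and it does not implement Lucene's Directory API. OffHeapStore and its methods are hypothetical names I made up for illustration. A single ByteBuffer is limited to an int capacity (about 2 GB), so I assume here that a larger store is split over several direct buffers and addressed with a long position.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: bytes stored off-heap across several direct ByteBuffers. */
final class OffHeapStore {
    private final ByteBuffer[] chunks;
    private final int chunkSize;
    private final long capacity;

    OffHeapStore(long capacity, int chunkSize) {
        if (capacity <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("capacity and chunkSize must be positive");
        }
        this.capacity = capacity;
        this.chunkSize = chunkSize;
        int count = (int) ((capacity + chunkSize - 1) / chunkSize);
        chunks = new ByteBuffer[count];
        long remaining = capacity;
        for (int i = 0; i < count; i++) {
            int size = (int) Math.min(chunkSize, remaining);
            chunks[i] = ByteBuffer.allocateDirect(size); // memory outside the Java heap
            remaining -= size;
        }
    }

    /** Copies src into the store starting at position; may span chunk boundaries. */
    void write(long position, byte[] src) {
        checkRange(position, src.length);
        for (int i = 0; i < src.length; i++) {
            long p = position + i;
            chunks[(int) (p / chunkSize)].put((int) (p % chunkSize), src[i]);
        }
    }

    /** Copies length bytes starting at position back onto the heap. */
    byte[] read(long position, int length) {
        checkRange(position, length);
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            long p = position + i;
            out[i] = chunks[(int) (p / chunkSize)].get((int) (p % chunkSize));
        }
        return out;
    }

    private void checkRange(long position, int length) {
        if (position < 0 || length < 0 || position + length > capacity) {
            throw new IndexOutOfBoundsException("range outside store: " + position + "+" + length);
        }
    }

    public static void main(String[] args) {
        // Small sizes so the demo runs anywhere; a real store would use much larger chunks.
        OffHeapStore store = new OffHeapStore(10_000, 4096);
        byte[] data = "hello off-heap".getBytes(StandardCharsets.UTF_8);
        store.write(4090, data); // deliberately crosses the first chunk boundary
        System.out.println(new String(store.read(4090, data.length), StandardCharsets.UTF_8));
    }
}
```

The byte-at-a-time copy keeps the sketch short; bulk copies per chunk would be faster. On HotSpot the total amount of direct memory is capped by -XX:MaxDirectMemorySize, so a multi-gigabyte store will probably need that limit raised.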

Another question is why you would want to use Lucene for in-memory indexing at all. Maybe as a cache that can be queried on more than one property of the indexed object?

jmalloc: Manual Memory Management in Java

One of the definite advantages of Java over C++ is automatic memory management. Automatic memory management has its cost, though. This cost was high in earlier versions of the JDK (serial, stop-the-world garbage collection), and it has improved a lot since with the parallel and concurrent collectors. In some cases the cost is still so high that manual memory management is the answer. jmalloc is one such simple effort.

GC is a pain point in Java and a limiting factor in many cases. Garbage collection tuning is considered a black art and is very difficult to get right.

Caching provides a performance boost for a lot of applications. Caching large amounts of data is restrictive, though: the cache is usually a very small part of the application logic, yet it costs relatively more in terms of GC impact. JVMs do not perform predictably with heap sizes beyond 4 GB. A cache is a special case: it holds objects with a predictable life cycle. Some objects in fact live throughout the application's life, such as "reference data" which does not change and stays cached. Such objects are also problematic for the GC: they get promoted to the old generation and are scanned in every full garbage collection, wasting CPU. Terracotta has addressed a similar problem using direct ByteBuffers. jmalloc does the same thing for ehcache.

The BigMemory benchmark claims to have scaled up to 350 GB of cache on a beefy server. BigMemory has shown that the garbage collection Java offers is not sufficient for some use cases, such as caching. A cache touches an object's memory only twice: on put and on eviction. This is a typical case for manual memory management. jmalloc is manual memory management of direct buffers with two simple routines: malloc and free. The contents of a direct buffer are not visible to the Java garbage collector, so an object stored in a direct ByteBuffer lives as long as the ByteBuffer itself is referenced and not collected. jmalloc allocates a single ByteBuffer and divides it into many variable-size chunks in which objects are serialized and stored.
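
To illustrate the malloc/free idea, here is a minimal sketch of a first-fit allocator over a single direct ByteBuffer. It is not the actual jmalloc source. The class name DirectBufferAllocator, the int offsets used as "pointers", and the on-heap maps used for bookkeeping are all my assumptions for this example. Plain byte arrays stand in for serialized objects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: first-fit malloc/free over one direct ByteBuffer. */
final class DirectBufferAllocator {
    private final ByteBuffer buffer;
    private final TreeMap<Integer, Integer> freeBlocks = new TreeMap<>(); // offset -> length
    private final Map<Integer, Integer> usedBlocks = new HashMap<>();     // offset -> length

    DirectBufferAllocator(int capacity) {
        buffer = ByteBuffer.allocateDirect(capacity);
        freeBlocks.put(0, capacity); // initially one big free block
    }

    /** Reserves size bytes; returns the chunk's offset, or -1 if no free block is big enough. */
    int malloc(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        Integer found = null;
        for (Map.Entry<Integer, Integer> e : freeBlocks.entrySet()) {
            if (e.getValue() >= size) { found = e.getKey(); break; } // first fit
        }
        if (found == null) return -1;
        int length = freeBlocks.remove(found);
        if (length > size) freeBlocks.put(found + size, length - size); // keep the remainder free
        usedBlocks.put(found, size);
        return found;
    }

    /** Releases a chunk returned by malloc and merges it with adjacent free blocks. */
    void free(int offset) {
        Integer size = usedBlocks.remove(offset);
        if (size == null) throw new IllegalArgumentException("not an allocated offset: " + offset);
        int start = offset;
        int length = size;
        Map.Entry<Integer, Integer> prev = freeBlocks.lowerEntry(offset);
        if (prev != null && prev.getKey() + prev.getValue() == offset) {
            freeBlocks.remove(prev.getKey());
            start = prev.getKey();
            length += prev.getValue();
        }
        Integer nextLength = freeBlocks.remove(offset + size);
        if (nextLength != null) length += nextLength;
        freeBlocks.put(start, length);
    }

    /** Stores (already serialized) bytes in a chunk. */
    void write(int offset, byte[] data) {
        Integer size = usedBlocks.get(offset);
        if (size == null || data.length > size) throw new IllegalArgumentException("bad chunk or data too large");
        ByteBuffer view = buffer.duplicate(); // independent position, same memory
        view.position(offset);
        view.put(data);
    }

    /** Copies length bytes of a chunk back onto the heap. */
    byte[] read(int offset, int length) {
        Integer size = usedBlocks.get(offset);
        if (size == null || length < 0 || length > size) throw new IllegalArgumentException("bad chunk or length");
        byte[] out = new byte[length];
        ByteBuffer view = buffer.duplicate();
        view.position(offset);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        DirectBufferAllocator allocator = new DirectBufferAllocator(1024);
        byte[] value = "cached value".getBytes(StandardCharsets.UTF_8);
        int chunk = allocator.malloc(value.length); // put on cache
        allocator.write(chunk, value);
        System.out.println(new String(allocator.read(chunk, value.length), StandardCharsets.UTF_8));
        allocator.free(chunk);                      // eviction from cache
    }
}
```

The sketch is not thread-safe and keeps its free list on the heap. A real allocator would need locking and would probably keep the bookkeeping compact, but the two routines map directly onto the put and evict steps of a cache.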

This is just the start. Apart from the generic malloc/free methods, I am planning to write a helper class that wraps ehcache, so that benefits like eviction and disk-based overflow remain available but the object is stored in

If you like the idea let me know.
..
Tushar