I am currently working on a project at work that also has to deal with a huge amount of data whose structure (its schema) can change over time. I also know that I will have to run queries on that data and, ideally, I would like to use indexes to speed up searches.
Basically, what I am looking for already exists: a document-oriented database.
MongoDB is the best-known document database. Looking at its features, Mongo seemed to me the best candidate for storing my data but, after some extensive research, I found some flaws that were unacceptable for my project:
- MongoDB's locking system is based on a global write lock.
- MongoDB can lose data in a distributed setup because of its sharding implementation.
- MongoDB's default settings (until recently) did not guarantee that writes were committed to disk before acknowledging the client.
I was looking for a server that offers all of MongoDB's features but also gives some assurance on persistence and has no global write lock, and I found a promising one: RethinkDB.
RethinkDB is a recent project, and it seems to me to be a rethinking of MongoDB that removes the most unpleasant flaws I listed above. In fact:
RethinkDB implements block-level multiversion concurrency control. When multiple writes are performed on documents, RethinkDB does take exclusive block-level locks, but reads can still proceed.
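To make the idea of "writers lock, readers do not" more concrete, here is a toy sketch in plain Python. It is only an illustration of the general multiversion principle, not RethinkDB's actual implementation: the class `VersionedBlock` and its methods are hypothetical names I made up for this example.

```python
import threading


class VersionedBlock:
    """Toy block: writers are serialized by a lock, readers never take it."""

    def __init__(self, data):
        self._write_lock = threading.Lock()
        # (version, snapshot); a published snapshot is treated as read-only.
        self._current = (0, dict(data))

    def read(self):
        # No lock here: a reader picks up whichever snapshot is current.
        return self._current

    def write(self, key, value):
        with self._write_lock:  # exclusive among writers only
            version, snapshot = self._current
            new_snapshot = dict(snapshot)  # copy instead of mutating in place
            new_snapshot[key] = value
            # Publish the new version by swapping a single reference.
            self._current = (version + 1, new_snapshot)


block = VersionedBlock({"title": "draft"})
before = block.read()

# Simulate a writer that is in the middle of an update: the write lock is
# held, yet a reader still gets an answer immediately.
with block._write_lock:
    during = block.read()

block.write("title", "final")
after = block.read()
```

A reader that grabbed `before` keeps seeing the old version of the document even after the write has completed, which is the behaviour I was missing with a global write lock.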
RethinkDB comes with strict write durability out of the box, inspired by the BTRFS inline journal, and is identical to traditional database systems in this respect. No write is ever acknowledged until it is safely committed to disk.
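The "acknowledge only after commit" rule can be sketched with nothing but the Python standard library. This is a simplified illustration under my own assumptions, not how RethinkDB's storage engine works: `durable_append` and the `journal.log` file are hypothetical names used only here.

```python
import json
import os
import tempfile


def durable_append(path, document):
    """Append one JSON document and return only after it has been synced."""
    with open(path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(document) + "\n")
        journal.flush()             # push Python's buffer to the OS
        os.fsync(journal.fileno())  # ask the OS to flush its cache to the device
    return True  # the "acknowledgement" happens only at this point


journal_path = os.path.join(tempfile.mkdtemp(), "journal.log")
acked = durable_append(journal_path, {"id": 1, "title": "draft"})

# Read the journal back to check what was stored.
with open(journal_path, encoding="utf-8") as journal:
    stored = [json.loads(line) for line in journal]
```

The price of this guarantee is latency: every acknowledged write waits for the sync, which is the trade-off that relaxed default settings avoid.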
I chose to try RethinkDB using Python, and my first tests were very encouraging. From server installation to driver connection, everything was smooth. The documentation for this project is well organized and almost complete.
I know RethinkDB is a relatively young project with a lot of work still to be done, such as automatic failover and scalability across huge datacenters, but it is also a very promising tool that investors believe in.
So I have decided to try RethinkDB in the field and use it as the persistent storage for the documents of our project. A follow-up post will show whether I made the right choice.
- Why MongoDB is a bad choice for storing our scraped data | Scrapinghub Blog
- Broken by Design: MongoDB Fault Tolerance
- Write Concern Reference -- MongoDB Manual
- Technical comparison: RethinkDB and MongoDB