Wednesday, October 12, 2016

Solving Neo4j's space reclaim issue with OrientDB graph database

Neo4j Community Edition is affected by a serious disk space reclaim issue and requires a restart to reclaim disk space after a massive delete-and-recreate operation. By contrast, OrientDB automatically reuses the freed space internally, transparently to end users, while the database server remains online. As a result, OrientDB's disk space reclaim efficiency is much higher.


There are some peculiar situations, or specific graph use-cases, that are write-intensive in nature and where a very high number of nodes and relationships are deleted and then recreated multiple times.

Assuming that the total number of nodes and relationships remains constant (or almost constant) in the database, one would expect the space used by the database on disk to remain constant as well.

If this were not the case, the more users delete and recreate nodes and relationships, the more space the database server would use, even though the total number of nodes and relationships does not change. In the worst case the disk becomes full, which, if not handled correctly, may cause additional issues or outages. We will refer to such a situation as a "disk space reclaim issue".

In practice, things do not always work the way we expect, and in some cases the disk space reclaim issue does occur.

In this article we will compare how Neo4j and OrientDB behave in the write-intensive scenario described above, to understand whether they are affected by the disk space reclaim issue.


As has been publicly discussed in several places (among them the following resources: "Neo4J database size / shrinking", "neo4j-database-size-grows", "Clearing nodes"), Neo4j is affected by the disk space reclaim issue.

Possible workarounds to solve this issue in Neo4j are: 
  1. Perform a Neo4j restart.
  2. Use the store-utils tool.
  3. Use the Neo4j Enterprise Edition (versions 3.0.4 and higher introduce the parameter dbms.ids.reuse.types.override, which should allow relationship ids to be reused).
The above workarounds, though, present some limitations, including the following:
  • Workaround #1 requires a database server restart. This means that your application / service will need to have a planned outage. Frequency of such planned outages depends on factors like the following:
    • How frequently you are deleting and recreating nodes and relationships.
    • How many nodes and relationships are being deleted each time.
    • How much disk space you can use for Neo4j.
  • Workaround #2 requires the use of an external tool that is not included in the official Neo4j distribution (although it is maintained by a Neo4j employee). Integrating store-utils into your production solution may therefore add complexity. In addition, store-utils:
    • Requires extra time and effort to perform a full database backup before the compact operation (although not formally required, common sense suggests making a full database backup to reduce the risk of issues that may arise when using a compaction tool on a production database store).
    • Requires a database server restart, hence the limitations described with Workaround #1 also apply to Workaround #2.
  • Workaround #3 requires moving from the Community Edition to the costly(*) Enterprise Edition, with possible implications not only for cost but also for licensing. While the Neo4j Community Edition uses the GPL v3 license, Neo4j Enterprise requires a different license. Depending on your specific situation you may need a Commercial License (if your software is closed-source), an AGPL v3 license (if you are building an open source project), an Evaluation License, or an Educational License (**).


OrientDB is not affected by the disk space reclaim issue: it internally reuses the space. No specific workarounds are needed, and there is no need to restart the server.

What happens from a technical point of view is that when nodes and relationships are deleted, OrientDB records their space in a special data structure that tracks disk space usage. When new nodes and relationships are created, they are automatically placed in the unused space.
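The idea of tracking freed space and reusing it for new records can be illustrated with a toy model. This is only a sketch of the general technique (a free list of record slots); the class name ToyRecordFile and everything inside it are hypothetical and do not reflect OrientDB's actual storage code.

```python
class ToyRecordFile:
    """Toy model of a record file that reuses freed slots (NOT OrientDB's real code)."""

    def __init__(self):
        self.slots = []        # slot index -> record payload, or None if the slot is free
        self.free_slots = []   # indices of slots whose record was deleted

    def create(self, payload):
        # Prefer a previously freed slot; grow the file only when none is available.
        if self.free_slots:
            index = self.free_slots.pop()
            self.slots[index] = payload
        else:
            index = len(self.slots)
            self.slots.append(payload)
        return index

    def delete(self, index):
        if self.slots[index] is None:
            raise KeyError(f"slot {index} is already free")
        self.slots[index] = None
        self.free_slots.append(index)

    def size(self):
        # Number of slots allocated "on disk", whether in use or free.
        return len(self.slots)


store = ToyRecordFile()
ids = [store.create(f"node-{i}") for i in range(8)]
size_before = store.size()

for _ in range(3):  # three delete-and-recreate cycles
    for index in ids[:4]:
        store.delete(index)
    ids = [store.create("new-node") for _ in range(4)] + ids[4:]

size_after = store.size()
print(size_before, size_after)
```

In this model the allocated size stays the same across cycles because every new record lands in a freed slot. A store without the free list would append each time and keep growing.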

Obtaining 100% efficiency is hard, however, so there is a very small increase in used disk space even in OrientDB, but it is far smaller than the increase that happens in Neo4j.

This small disk space increase in OrientDB can be explained in the following way: when a record is deleted, the page index and record position are set to -1, and the record pointer is transformed into a "tombstone". These tombstones use some space, which is therefore "lost". However, this effect can be mitigated online, by dropping and recreating the database, or with an infrequent "offline compaction", by performing a database export/import (during the export process tombstones are ignored, and during the import the cluster positions are changed and the lost space is recovered).

Test Case

To reproduce the disk space reclaim issue and check the impact on your specific situation you may follow different approaches. One could be to create a simple program, in Java or another language of your preference. You may also use a list of queries (SQL for OrientDB, Cypher for Neo4j) and execute them through a console or web application.

The following are possible steps:
  1. Let's start from a fresh installation and let's suppose we create N nodes and M relationships.
  2. Let's suppose the disk space used by the database after the create operations is X.
  3. Now let's suppose we delete P nodes and Q relationships, where P is between 0 and N, and Q is between 0 and M.
  4. Let's suppose we create other P nodes and Q relationships. Note that at this time the total number of nodes is again N and we have again M relationships (the same value we had before the delete and recreate operations). Calculate the disk space used by the database and compare this value with X.
  5. Repeat steps 3 and 4 a certain number of times, e.g. three more times.
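The steps above can be sketched as a small driver loop. This is a minimal sketch, not the code used for the results in this article: the callbacks create, delete and used_space are hypothetical and would have to be implemented against your own database, and AppendOnlyFake is a stand-in store that never reuses space, included only so the example runs on its own.

```python
def run_reclaim_test(create, delete, used_space, n, m, p, q, cycles):
    """Run the delete-and-recreate cycles and return the measured sizes.

    create(nodes, rels), delete(nodes, rels) and used_space() are hypothetical
    callbacks to be implemented against the database under test.
    """
    create(n, m)                  # step 1: create N nodes and M relationships
    sizes = [used_space()]        # step 2: baseline disk usage X
    for _ in range(cycles):       # step 5: repeat steps 3 and 4
        delete(p, q)              # step 3: delete P nodes and Q relationships
        create(p, q)              # step 4: recreate them, back to N and M
        sizes.append(used_space())
    return sizes


def growth_percent(sizes):
    """Percentage increase of each measurement relative to the baseline X."""
    baseline = sizes[0]
    return [100.0 * (s - baseline) / baseline for s in sizes]


class AppendOnlyFake:
    """Stand-in store that never reuses freed space (for demonstration only)."""

    def __init__(self):
        self.allocated = 0

    def create(self, nodes, rels):
        self.allocated += nodes + rels

    def delete(self, nodes, rels):
        pass  # freed records are never reclaimed in this fake

    def used_space(self):
        return self.allocated


fake = AppendOnlyFake()
sizes = run_reclaim_test(fake.create, fake.delete, fake.used_space,
                         n=4, m=2, p=1, q=1, cycles=3)
growth = growth_percent(sizes)
print(sizes, growth)
```

With a real database, used_space() should be called only after the server has had time to flush its files, otherwise the measurement may lag behind the actual state.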

Servers Versions and Configurations

In the test results presented below, Neo4j 3.0.6 Community Edition and OrientDB 2.2.11 Community Edition were used, with the following configurations:

  • Neo4j: default configuration except for the parameter dbms.tx_log.rotation.retention_policy=false.
  • OrientDB: default configuration except for the parameter storage.useWAL = false.

In other words, we are disabling Transaction Logs in Neo4j, and Write Ahead Log in OrientDB.

How to Calculate the Used Disk Space

To calculate the used disk space, you can use the following approach:

  • Neo4j: sum the size of the files neostore.*.db under the active_database directory (typically graph.db).
  • OrientDB: calculate the total size of the directory of the database used for the test (e.g. databases\test-space-reclaim).
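Both measurements can be scripted. The sketch below is one possible way to do it: the function names are hypothetical, and the Neo4j helper applies the pattern neostore.*.db literally, so files such as neostore.nodestore.db.id are not counted. The demonstration runs on a throwaway directory with dummy files, not on real database files.

```python
import tempfile
from pathlib import Path


def neo4j_store_size(graph_db_dir):
    """Sum the sizes (in bytes) of the neostore.*.db files directly under graph_db_dir."""
    return sum(f.stat().st_size
               for f in Path(graph_db_dir).glob("neostore.*.db")
               if f.is_file())


def directory_size(db_dir):
    """Total size (in bytes) of all files under db_dir, including subdirectories."""
    return sum(f.stat().st_size
               for f in Path(db_dir).rglob("*")
               if f.is_file())


# Demonstration on a temporary directory containing dummy files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "neostore.nodestore.db").write_bytes(b"\0" * 10)
    (root / "neostore.relationshipstore.db").write_bytes(b"\0" * 20)
    (root / "neostore.nodestore.db.id").write_bytes(b"\0" * 5)  # not matched by neostore.*.db
    store_bytes = neo4j_store_size(root)
    total_bytes = directory_size(root)

print(store_bytes, total_bytes)
```

For a real test you would pass the path of the graph.db directory to neo4j_store_size and the path of the OrientDB database directory to directory_size. The exact locations depend on your installation.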

Something to Consider

Note that in Neo4j the nodestore does not seem to be updated instantly, so it is important to allow some time between node recreation and disk space calculation.

Test Results

The following test results have been obtained on my laptop (a Windows 10 machine with 4 GB of RAM). I have used:

  • N=400000
  • M=200000
  • P=Q=100000 

The test has been repeated three times for Neo4j and three times for OrientDB. The results presented in the charts below are the average of the three runs.

Used Disk Space - Increase from Initial Value - OrientDB vs Neo4j

The following chart shows the increase in used disk space (both servers always contained the same number of nodes and relationships when the space was calculated).

As you can see, at the end of the sixth delete-and-recreate cycle, the used disk space in Neo4j has doubled. In OrientDB, the increase at the end of the sixth cycle is quite low: only 11.7%.

Disk Space Reclaim Efficiency - OrientDB vs Neo4j

It's useful to calculate a disk space reclaim efficiency. I have calculated it as (100 - disk_use_increase_from_initial_value)%.
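As a worked example of this formula, the sketch below applies it to the two figures quoted above: disk usage that has doubled (a 100% increase) and an 11.7% increase. The function name and the sizes of 100.0, 200.0 and 111.7 are illustrative placeholders chosen to produce those percentages. They are not measured values.

```python
def reclaim_efficiency(initial_size, current_size):
    """Efficiency in percent: 100 minus the percentage increase from the initial size."""
    increase_percent = 100.0 * (current_size - initial_size) / initial_size
    return 100.0 - increase_percent


# Placeholder sizes chosen so that the increases are 100% and 11.7%.
doubled_efficiency = round(reclaim_efficiency(100.0, 200.0), 1)
small_increase_efficiency = round(reclaim_efficiency(100.0, 111.7), 1)
print(doubled_efficiency, small_increase_efficiency)
```

A doubling of the used space therefore corresponds to an efficiency of 0%, while an 11.7% increase corresponds to 88.3%.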

The following chart shows how the OrientDB and Neo4j Disk Space Reclaim Efficiency changes during the six cycles of the executed test. Higher efficiency means better performance from a disk space reclaim point of view:

Raw data

The following tables include the raw data of the test:

Neo4j Data

OrientDB Data

Additional Comments on the Obtained Results

If you are interested in more accurate results for your specific use case, I suggest that you run a test in your own environment, using your own application.

How to Reproduce this Benchmark

A general, high-level description of how to reproduce this benchmark can be found in the section "Test Case". If you are looking for the exact queries and code used, they can be found below:


Disk space reclaim is an important aspect of any DBMS. 

The disk space reclaim issue can cause frustration for database users and service outages (although planned) to work around the increased disk space utilization.

Neo4j Community Edition is affected by a serious disk space reclaim issue and requires a restart to reclaim disk space after a massive delete-and-recreate operation. By contrast, OrientDB automatically reuses the freed space internally, transparently to end users, while the database server remains online. As a result, OrientDB's disk space reclaim efficiency is much higher.

(*) According to Gartner, Neo Technology "received among the lowest scores in the reference customer survey for value and for pricing model; cost was a key reason cited for not choosing this vendor when it was under consideration." - Magic Quadrant for Operational Database Management Systems, 05 October 2016

(**) Information about Neo4j licenses has been taken from this public page: Neo4j Licensing

All trademarks are the property of their respective owners.
