Wednesday, October 12, 2016

Solving Neo4j's space reclaim issue with OrientDB graph database

Neo4j Community Edition is affected by a serious disk space reclaim issue and requires a restart to reclaim disk space after a massive delete-and-recreate operation. In contrast, OrientDB automatically reuses the freed space internally, in a way that is transparent to end users, while the database server remains online. As a result, OrientDB's disk space reclaim efficiency is much higher.

Introduction


There are some peculiar situations, or specific graph use-cases, that are write-intensive in nature and where a very high number of nodes and relationships are deleted and then recreated multiple times.

Assuming that the total number of nodes and relationships in the database remains constant (or almost constant), one would expect the space used by the database on disk to remain constant as well.

If this were not the case, the more the users delete and recreate nodes and relationships, the more space the database server would use, even if the total number of nodes and relationships does not change - up to unpleasant situations where the disk becomes full, which, if not handled correctly, may cause additional issues or outages. We will refer to this situation as a "disk space reclaim issue".

In practice, systems do not always behave the way we expect, and in some cases the disk space reclaim issue does occur.

In this article we will compare how Neo4j and OrientDB behave in the write-intensive scenario described above, to understand whether they are affected by the disk space reclaim issue.

Neo4j


As has been publicly discussed in several places (among them the following resources: "Neo4J database size / shrinking", "neo4j-database-size-grows", and "Clearing nodes"), Neo4j is affected by the disk space reclaim issue.

Possible workarounds to solve this issue in Neo4j are: 
  1. Perform a Neo4j restart.
  2. Use the store-utils tool.
  3. Use the Neo4j Enterprise Edition (versions 3.0.4 and higher introduce the parameter dbms.ids.reuse.types.override, which should allow relationship ids to be reused).
The above workarounds, though, present some limitations, including the following:
  • Workaround #1 requires a database server restart. This means that your application / service will need to have a planned outage. Frequency of such planned outages depends on factors like the following:
    • How frequently you are deleting and recreating nodes and relationships.
    • How many nodes and relationships are being deleted each time.
    • How much disk space you can use for Neo4j.
  • Workaround #2 requires the use of an external tool that is not included in the official Neo4j distribution (although it is maintained by a Neo4j employee). Integrating store-utils into your production solution may therefore increase complexity. In addition, store-utils:
    • Requires extra time and effort to perform a full database backup before the compact operation (although not formally required, common sense suggests making a full database backup to reduce the risk of any issues that may arise when using a compact tool on a production database store).
    • Requires a database server restart, hence the limitations described with Workaround #1 also apply to Workaround #2.
  • Workaround #3 requires moving from the Community Edition to the costly(*) Enterprise Edition, with possible implications not only for the cost but also for the license. While the Neo4j Community Edition uses the GPL v3 license, Neo4j Enterprise requires a different license. Depending on your specific situation you may need a Commercial License (if your software is closed-source), an AGPL v3 license (if you are building an open source project), an Evaluation License, or an Educational License (**).

OrientDB


OrientDB is not affected by the disk space reclaim issue: it internally reuses the space. No specific workarounds are needed, and there is no need to restart the server.

From a technical point of view, when nodes and relationships are deleted, OrientDB records their space in a special data structure that tracks disk space usage. When new nodes and relationships are created, they are automatically placed in the unused space.
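The idea of tracking freed space and reusing it can be illustrated with a toy model. The sketch below is not OrientDB's storage code: the class ToyRecordStore, its fixed-size slots, and the free list are all assumptions made for illustration only.

```python
class ToyRecordStore:
    """Toy fixed-slot store with a free list (illustrative, not OrientDB code)."""

    def __init__(self):
        self.slots = []       # one entry per slot ever allocated on "disk"
        self.free_slots = []  # positions of deleted records, available for reuse

    def create(self, record):
        """Store a record, reusing a freed slot when one is available."""
        if self.free_slots:
            position = self.free_slots.pop()
            self.slots[position] = record
        else:
            position = len(self.slots)
            self.slots.append(record)
        return position

    def delete(self, position):
        """Mark a slot as free instead of leaving it permanently unused."""
        self.slots[position] = None
        self.free_slots.append(position)

    def allocated_slots(self):
        """Number of slots allocated so far (a stand-in for used disk space)."""
        return len(self.slots)


store = ToyRecordStore()
positions = [store.create(f"node-{i}") for i in range(1000)]
size_before = store.allocated_slots()

# Six delete-and-recreate cycles over a quarter of the records.
for _ in range(6):
    for position in positions[:250]:
        store.delete(position)
    positions[:250] = [store.create("new-node") for _ in range(250)]

size_after = store.allocated_slots()
print(size_before, size_after)
```

In this toy model the number of allocated slots does not grow across cycles, because every recreated record lands in a slot freed by a previous delete.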

Obtaining 100% efficiency is hard, so even in OrientDB there is a very small increase in the used disk space, but it is far smaller than the increase that happens in Neo4j.

This small disk space increase in OrientDB can be explained as follows: when a record is deleted, its page index and record position are set to -1, and the record pointer is transformed into a "tombstone". These tombstones use some space, which is therefore "lost". However, this effect can be mitigated online, by dropping and recreating the database, or with an infrequent "offline compaction", performed through a database export/import (during the export, tombstones are ignored, and during the import the cluster positions are changed and the lost space is recovered).

Test Case


To reproduce the disk space reclaim issue and check its impact on your specific situation, you may follow different approaches. One is to write a simple program, in Java or another language of your preference. You may also use a list of queries (SQL for OrientDB, Cypher for Neo4j) and execute them through a console or web application.

The following are possible steps:
  1. Start from a fresh installation and create N nodes and M relationships.
  2. Measure the disk space used by the database after the create operations; call this value X.
  3. Delete P nodes and Q relationships, where P is between 0 and N, and Q is between 0 and M.
  4. Create another P nodes and Q relationships. At this point the database again contains N nodes and M relationships (the same values as before the delete and recreate operations). Calculate the disk space used by the database and compare this value with X.
  5. Repeat steps 3 and 4 a certain number of times, e.g. three more times.
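The steps above can be sketched as a small driver loop. This is only an outline under stated assumptions: run_reclaim_test and its callback parameters are hypothetical names, and AppendOnlyFake is an in-memory stand-in (it talks to no real database) that models a store which never reuses freed space. For a real test, the three callbacks would wrap your own Cypher or SQL statements and your disk space measurement.

```python
def run_reclaim_test(create, delete, used_space, n, m, p, q, cycles):
    """Run the delete-and-recreate cycles and return every measured size.

    The first element of the returned list is X, the size after the
    initial load; each following element is the size after one cycle.
    """
    create(n, m)                    # step 1: initial load
    sizes = [used_space()]          # step 2: measure X
    for _ in range(cycles):         # step 5: repeat steps 3 and 4
        delete(p, q)                # step 3
        create(p, q)                # step 4
        sizes.append(used_space())  # compare with X afterwards
    return sizes


class AppendOnlyFake:
    """Hypothetical in-memory stand-in that never reuses freed space."""

    def __init__(self):
        self.live = 0       # nodes + relationships currently stored
        self.allocated = 0  # units of "disk" ever allocated

    def create(self, nodes, relationships):
        self.live += nodes + relationships
        self.allocated += nodes + relationships

    def delete(self, nodes, relationships):
        self.live -= nodes + relationships

    def used_space(self):
        return self.allocated


fake = AppendOnlyFake()
sizes = run_reclaim_test(fake.create, fake.delete, fake.used_space,
                         n=400, m=200, p=100, q=100, cycles=3)
print(sizes)
```

With the fake backend the live count returns to its initial value after every cycle while the allocated space keeps growing, which is exactly the pattern the test is designed to detect.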

Server Versions and Configurations


In the test results presented below, Neo4j 3.0.6 Community Edition and OrientDB 2.2.11 Community Edition, with the following configurations, have been used:

Neo4j:
  • default configuration except for the parameter dbms.tx_log.rotation.retention_policy=false.
OrientDB:
  • default configuration except for the parameter storage.useWAL = false.

In other words, we are disabling the retention of transaction logs in Neo4j, and the Write Ahead Log in OrientDB.

How to Calculate the Used Disk Space


To calculate the used disk space, you can use the following approach:

Neo4j:
  • Sum the size of the files neostore.*.db, under the active_database directory (typically graph.db).  
OrientDB:
  • Calculate the total size of the directory of the database used for the test (e.g. databases\test-space-reclaim).
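The two measurements can be scripted, for example as in the sketch below. The function names neo4j_store_size and directory_size are illustrative, and the paths you pass in depend on your installation; the demo at the end uses a temporary directory with dummy files rather than a real database.

```python
import tempfile
from pathlib import Path


def neo4j_store_size(graph_db_dir):
    """Sum the sizes, in bytes, of the neostore.*.db files in one directory."""
    return sum(f.stat().st_size
               for f in Path(graph_db_dir).glob("neostore.*.db")
               if f.is_file())


def directory_size(db_dir):
    """Total size, in bytes, of all files under a database directory."""
    return sum(f.stat().st_size
               for f in Path(db_dir).rglob("*")
               if f.is_file())


# Demo on dummy files (not a real database store).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "neostore.nodestore.db").write_bytes(b"x" * 100)
    (root / "neostore.nodestore.db.id").write_bytes(b"x" * 10)
    demo_neo4j = neo4j_store_size(root)
    demo_total = directory_size(root)

print(demo_neo4j, demo_total)
```

Note that the pattern neostore.*.db only matches file names ending in .db, so the companion .id file in the demo is counted by directory_size but not by neo4j_store_size.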

Something to Consider


Note that in Neo4j the nodestore does not seem to be updated instantly, so it's important to allow some time between node recreation and the disk space calculation.


Test Results


The following test results have been obtained on my laptop (a Windows 10 machine with 4 GB of RAM). I have used:

  • N=400000
  • M=200000
  • P=Q=100000 

The test has been run three times for Neo4j and three times for OrientDB. The results presented in the charts below are the average of the three runs.


Used Disk Space - Increase from Initial Value - OrientDB vs Neo4j


The following chart shows the increase of the used disk space (the servers always contained the same number of nodes and relationships before each space calculation).

As you can see, at the end of the sixth delete-and-recreate cycle, the used disk space in Neo4j has doubled. In OrientDB, the increase at the end of the sixth cycle is quite low: only 11.7%.





Disk Space Reclaim Efficiency - OrientDB vs Neo4j


It's useful to calculate a disk space reclaim efficiency. I have calculated it as (100 - disk_use_increase_from_initial_value)%.
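As a worked example of this formula, the sketch below computes the efficiency from an initial and a current size. The function names and the byte counts are hypothetical; the sizes were chosen only to reproduce the two increases reported above (11.7% and a doubling, i.e. 100%).

```python
def disk_use_increase_percent(initial_size, current_size):
    """Increase of the used disk space, as a percentage of the initial value."""
    return (current_size - initial_size) / initial_size * 100.0


def reclaim_efficiency_percent(initial_size, current_size):
    """Disk space reclaim efficiency: 100 minus the percentage increase."""
    return 100.0 - disk_use_increase_percent(initial_size, current_size)


# Hypothetical sizes matching an 11.7% increase and a doubling.
small_increase = reclaim_efficiency_percent(1000, 1117)
doubled = reclaim_efficiency_percent(1000, 2000)
print(round(small_increase, 1), doubled)
```

With this definition, an 11.7% increase corresponds to an efficiency of about 88.3%, and a doubling of the used space corresponds to an efficiency of 0%.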

The following chart shows how the OrientDB and Neo4j Disk Space Reclaim Efficiency changes during the six cycles of the executed test. Higher efficiency means better performance from a disk space reclaim point of view:




Raw data


The following tables include the raw data of the test:


Neo4j Data

OrientDB Data


Additional Comments on the Obtained Results


If you are interested in more accurate results for your specific use case, my suggestion is to run a test in your own environment, using your own application.

How to Reproduce this Benchmark


A general, high-level description of how to reproduce this benchmark can be found in the section "Test Case". If you are looking for the exact queries and code used, they can be found below:

Conclusion


Disk space reclaim is an important aspect of any DBMS. 

The disk space reclaim issue can cause frustration for database users, as well as service outages (although planned) to work around the increased utilization of disk space.

Neo4j Community Edition is affected by a serious disk space reclaim issue and requires a restart to reclaim disk space after a massive delete-and-recreate operation. In contrast, OrientDB automatically reuses the freed space internally, in a way that is transparent to end users, while the database server remains online. As a result, OrientDB's disk space reclaim efficiency is much higher.


(*) According to Gartner, Neo Technology "received among the lowest scores in the reference customer survey for value and for pricing model; cost was a key reason cited for not choosing this vendor when it was under consideration." - Magic Quadrant for Operational Database Management Systems, 05 October 2016

(**) Information about Neo4j licenses has been taken from this public page: Neo4j Licensing

All trademarks are the property of their respective owners.

Thursday, October 6, 2016

Solving lack of Neo4j's multitenancy with OrientDB graph database

If you have used Neo4j, you may have noticed that it does not support multi-tenant deployments. This limitation may be one of the first things you notice when starting to learn and test Neo4j, and it has been publicly discussed in several places, among them the following Stack Overflow questions: "Neo4j Multi-tenancy" and "How to achieve Multi-Tenancy in Neo4j".

Lack of multitenancy basically means that you can't create different databases inside Neo4j. You can surely store multiple graphs in Neo4j, but they are all part of a single "database". This has some serious consequences, including lack of isolation.

If you are using Neo4j in a client/server mode and you have multiple clients, you cannot have each client (or set of clients) storing and accessing their own isolated data, unless you use some workarounds.

One workaround to achieve multitenancy in Neo4j is to deploy different Neo4j instances on the same machine (running on different ports) and store different graphs in these different Neo4j instances. This can have of course several implications, including:
  • resource utilization: different servers are running on the same machine;
  • increased complexity: you will have to configure each of the instances. If you want to change a configuration option or tune a parameter you will need to do it on all instances.
Another workaround is to have your application take care of isolation and access control. But again, this workaround has some limitations. You should prevent your clients from connecting directly to the database; otherwise they may query and edit everything, including data they are not supposed to have access to.

OrientDB, on the other hand, does support multitenancy, in a way similar to traditional databases and in the way you may expect it to work.

Once you deploy OrientDB you can create different databases. Each database can store different data / graphs and your clients can have access to one or more databases. As a result your data is kept secure and isolated.

OrientDB Studio is the web application that you can use to interact with OrientDB.

When you launch Studio, you can select the database to connect to:



You can also create a new database:



or import a public database to make some tests and get some familiarity with OrientDB:



Once you log in, you will see the database you are connected to in the top-right corner (in the image below, we are connected to the movie database):



You can also export a single database, using Studio's export feature:



To manage who can connect to a specific database and the user roles, from Studio, you can click on the Security menu:



Admin, reader and writer are three standard users, with roles admin, reader and writer respectively (you can remove these users if you like). In the image above you can see that I have created an additional user my_movie_user with role admin. This user will have access only to the movie database (unless you create a similar user for other databases).

If you prefer to use the Console to connect to OrientDB, you can specify the database you want to connect to, using the CONNECT syntax. The following command will connect to a remote database movie, using the user my_movie_user:

orientdb> CONNECT REMOTE:192.168.1.1/movie my_movie_user my_password

If the database does not exist, you can create it with the CREATE DATABASE syntax. You can create users with the CREATE USER syntax.

The above are just a few examples: OrientDB supports a full set of SQL commands to manage databases and users, among them:

All trademarks are the property of their respective owners.