There’s a long introduction. For the interesting juice, click here
Whenever a new technology comes along, it is accompanied by a “great deal.” I don’t just mean above-normal coverage; I mean a big promise that you’re getting a great deal by using it, switching to it, adopting it, and so on.
In neo4j’s case, it was no different.
During our Information Management course, we had to dig into graph databases, and neo4j was right there waiting for us. In the process of digging, we were sucked right in. Everything we read pointed in just one direction: neo4j is awesome.
And maybe it was. But there was a problem: why wasn’t there “another side of the story” anywhere? Every strong comparison we found was posted to prove that neo4j was better at certain tasks than the most common RDBMS out there, MySQL.
After spending 6 weeks digging into neo4j and believing all the resources we read (and we read a lot! Online resources, literature, etc.) we showcased to our class that neo4j was amazing. However, our Professor was not convinced. Instead, she asked us: could there be a neglected side to all this?
We set out to at least try to reproduce these results and see if there was any truth to them. We came across this blog post by Jörg Baach, who had exactly the same goal as us: seeing if it was the entire truth. Baach had posted the code he used for his tests, on which we based our work. (His original work can be found here.)
The results we wanted to reproduce came from Baach’s post, and from this paper as well.
Our code is on GitHub: https://github.com/kamasheto/neo4j-exps
The work we did consisted of four main tasks:
1. Installing the appropriate software (and tuning as necessary)
2. Generating the appropriate data
3. Importing the generated data in both database engines
4. Running and reporting the tests
Installing database engines
We installed neo4j as normal, and used the properties provided by Baach to further tune the setup. This set of properties added indexing features and increased the buffer size of neo4j. Refer to Baach’s blog post for information on how to tune both MySQL and neo4j further.
The requirements are:
- Python (with MySQLdb module)
- Neo4j and MySQL (obviously)
Generating the appropriate data
To generate the data, we ran `gen.py`:
$ python gen.py 1000
This created a `data/pickle.1000` file that held the information for 1000 friends, with 50 outgoing relationships each. We stored it so that both databases would be loaded with identical data. (You can change the 1000 to any number you desire, but keep in mind that larger numbers take longer to generate, import and test.)
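To make the generation step concrete, here is a minimal sketch of how such a file could be produced. It is not the actual `gen.py` from the repository: the function names, the dictionary layout and the fixed seed are assumptions made for illustration, and it is written for Python 3.

```python
import os
import pickle
import random


def generate_friends(n_users, n_friends=50, seed=0):
    """Return {user_id: [friend_id, ...]} with distinct outgoing edges per user.

    Hypothetical helper: the layout of the real pickle file may differ.
    """
    rng = random.Random(seed)  # fixed seed so a rerun produces the same graph
    graph = {}
    for user in range(n_users):
        k = min(n_friends, n_users - 1)
        # Sample from n_users - 1 slots, then shift past `user` so that
        # nobody ends up being their own friend.
        picks = rng.sample(range(n_users - 1), k)
        graph[user] = [p if p < user else p + 1 for p in picks]
    return graph


def save_graph(graph, path):
    """Pickle the graph so that both importers read identical data."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(graph, fh)
```

Calling `save_graph(generate_friends(1000), "data/pickle.1000")` would write a file with the same name as the one above. Pickling once and importing from that file is what guarantees both engines see the same relationships.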
Importing the data
To import the data, we ran `import.py`:
$ python import.py mysql 1000
$ python import.py neo4j 1000
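On the relational side, the import boils down to writing one row per relationship into an edge table. The sketch below shows that idea with Python's built-in `sqlite3` module standing in for MySQL so that it runs without a server. The table name, column names and index are assumptions, not the schema used by `import.py`.

```python
import sqlite3


def import_relational(graph, conn):
    """Load {user_id: [friend_id, ...]} into a two-column edge table.

    Returns the number of relationship rows inserted.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE friends ("
        "user_id INTEGER NOT NULL, friend_id INTEGER NOT NULL)"
    )
    rows = [(user, friend) for user, friends in graph.items() for friend in friends]
    cur.executemany(
        "INSERT INTO friends (user_id, friend_id) VALUES (?, ?)", rows
    )
    # Index the column each hop joins on, so a hop is an index lookup
    # instead of a full table scan.
    cur.execute("CREATE INDEX idx_friends_user ON friends (user_id)")
    conn.commit()
    return len(rows)
```

With 1000 friends and 50 outgoing relationships each, this table would hold 50,000 rows.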
Running and reporting
After warming the cache, we ran the tests and reported times with the `tests.py` script:
$ python tests.py mysql 1000 3
$ python tests.py neo4j 1000 3
The 3 in this case is the number of hops the script tests against. The query is run 10 times with different starting points, and the average time is reported. The sample query is printed out on the first run for reference.
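As an illustration of what "a query over N hops, averaged over several starting points" can look like on the relational side, here is a rough sketch. It is not the query that `tests.py` prints: the `friends(user_id, friend_id)` table, the self-join formulation and the function names are all assumptions, and `sqlite3` is used only as a stand-in for MySQL.

```python
import sqlite3
import time


def hop_query(hops):
    """Build a self-join that walks `hops` friendship edges from one start user."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    # One extra JOIN per additional hop: f1 follows f0, f2 follows f1, ...
    joins = "".join(
        f" JOIN friends f{i} ON f{i}.user_id = f{i - 1}.friend_id"
        for i in range(1, hops)
    )
    return (
        f"SELECT COUNT(DISTINCT f{hops - 1}.friend_id) "
        f"FROM friends f0{joins} WHERE f0.user_id = ?"
    )


def average_time(conn, hops, start_ids):
    """Run the query once per start id and return the mean wall-clock seconds."""
    sql = hop_query(hops)
    elapsed = []
    for start in start_ids:
        t0 = time.perf_counter()
        conn.execute(sql, (start,)).fetchall()
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)
```

Each extra hop adds another join against the full edge table, so with 50 relationships per friend the number of intermediate rows can grow by a factor of up to 50 per hop.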
The reported results were as follows — the paper reported:
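| Dataset | Query | MySQL | neo4j |
|---|---|---|---|
| 100 | S0 | 19.56 | 8 |
| 100 | S1 | 33 | 12.65 |
| 100 | S2 | 111.334 | 19.57 |
| 500 | S0 | 281.38 | 10 |
| 500 | S1 | 333.96 | 17 |
| 500 | S2 | 620.56 | 21 |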
This set of results clearly showed neo4j outperforming MySQL on every criterion tested. Baach, however, reported the exact opposite:
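Baach:

| Dataset | Query | MySQL | neo4j |
|---|---|---|---|
| 100k | S0 | 0.000 | 0.010 |
| 100k | S1 | 0.001 | 0.018 |
| 100k | S2 | 0.072 | 0.376 |
| 1M | S0 | 0.000 | 0.010 |
| 1M | S1 | 0.002 | 0.017 |
| 1M | S2 | 0.082 | 0.484 |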
His results showed that MySQL in fact outperformed neo4j (he even reports that his Python implementation was faster than both!). Our own results were quite different from both:
Our Results:

| Dataset | Query | MySQL | neo4j |
|---|---|---|---|
| 100k | S0 | 0.0050 | 0.0417 |
| 100k | S1 | 0.0357 | 0.1503 |
| 100k | S2 | 1.3740 | 1.1313 |
| 100k | S3 | 64.547 | ∞ |
The following remarks can be made about the results we obtained:
- H0: MySQL has great indexing capabilities, allowing it to outperform neo4j for up to two levels of relationship hops
- H1: neo4j’s graph traversal strengths come into play when querying more than three levels of relationship hops
- neo4j is extremely resource intensive. This might be why it did not report results for four levels of relationship traversal. We speculate that it would outperform MySQL if it did complete.
- Needless to say, tuning plays a great role in the execution times. Further tuning would be needed to produce more robust results
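To put rough numbers on these remarks, the ratio of neo4j time to MySQL time can be computed from the figures we measured. This is only a small sketch over the values in our results; S3 is left out because neo4j did not complete that query, and the helper name is made up for illustration.

```python
# (MySQL time, neo4j time) copied from our 100k results above.
our_results = {
    "S0": (0.0050, 0.0417),
    "S1": (0.0357, 0.1503),
    "S2": (1.3740, 1.1313),
}


def neo4j_over_mysql(results):
    """Ratio above 1 means MySQL was faster; below 1 means neo4j was faster."""
    return {query: neo4j / mysql for query, (mysql, neo4j) in results.items()}


ratios = neo4j_over_mysql(our_results)
for query, ratio in ratios.items():
    faster = "MySQL" if ratio > 1 else "neo4j"
    print(f"{query}: neo4j/MySQL = {ratio:.2f} ({faster} faster)")
```

In our run, neo4j took roughly 8 times as long as MySQL on S0 and roughly 4 times as long on S1, while on S2 it needed a little over 80% of MySQL's time.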