timwilliate's comments

timwilliate · on Feb 15, 2016

If the paper is only 6 months old, why would the authors use an ancient version of Neo4j?

espeed · on Feb 15, 2016

Titan 0.4 [1] and Neo4j 1.9.4 [2] were the versions released in the Fall of 2013 so maybe that's when they did the work.

[1] https://github.com/thinkaurelius/titan/releases/tag/0.4.0

[2] https://github.com/neo4j/neo4j/releases/tag/1.9.4

timwilliate · on Dec 22, 2015

Checkpoint only lets you specify that all packages be downloaded as they existed on CRAN on a particular date and used within a project. This is limiting because an R developer may want to use very specific versions of packages that span multiple epochs.

I would really like to see R develop functionality akin to Maven or SBT, such that an R developer can explicitly specify the exact versions of all dependencies, which will then be installed at the first run.

hadley · on Dec 22, 2015

That's the goal of packrat. (But it's still a work in progress)

timwilliate · on Oct 23, 2015

I disagree with you on the statement that Neo4j does not work well in life sciences. I am a data scientist building large scale systems for mining genomic data, and we built a fairly critical piece of that infrastructure around Neo4j. I actually presented an overview of that work at GraphConnect this week:

http://speakerdeck.com/timwilliate/graphs-are-feeding-the-wo...

Many meaningful lineages in life sciences can be hundreds to thousands of levels deep (our datasets are great examples). Neo4j is the only graph database I have evaluated that handles traversals across lineages of this depth while still delivering the performance and scalability promised by index-free adjacency, whichever node in the cluster a traversal is sent to.

a_bonobo · on Oct 23, 2015

The recent "huge open tree of life" paper uses a Neo4j database as well: http://www.pnas.org/content/112/41/12764.full

jerven · on Oct 23, 2015

I am just going to point to our work at sparql.uniprot.org: a graph database with 17 billion edges and 3 billion+ nodes, containing the entire NCBI taxonomy and GO trees, that you can access for free over HTTP using standard SPARQL 1.1. It does not run on a cluster but on single nodes with Virtuoso 7.2.1.

I am not saying that Neo4J is a bad choice. I am just saying that, due to its lack of federation support, it is an expensive choice for the life sciences, i.e. an economic argument rather than a technical one, and looking not at one project at a time but at the community in general. Neo4J and Cypher will never support federation in the way that SPARQL allows. This is because all this URI business in RDF is annoying when modelling your data but critical when merging datasets on demand between separate databases, e.g. joining ChEMBL & UniProt & MeSH & PubChem etc.
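
To illustrate why shared URIs matter for merging on demand, here is a minimal in-memory sketch in Python. It is only an analogy, not SPARQL federation: the two "databases", their records, and the example.org URIs are all invented for illustration. The join works only because both sides use the same global identifier for a protein.

```python
# Hypothetical records from two independently maintained databases.
# Both refer to proteins by the same global URI (example.org is a placeholder).
protein_db = {
    "http://example.org/protein/A": {"name": "kinase A"},
    "http://example.org/protein/B": {"name": "transporter B"},
}
compound_db = [
    {"compound": "compound-1", "target": "http://example.org/protein/A"},
    {"compound": "compound-2", "target": "http://example.org/protein/C"},
]


def join_on_uri(proteins, compounds):
    """Pair each compound with its target protein when both sides know the URI."""
    return [
        (row["compound"], proteins[row["target"]]["name"])
        for row in compounds
        if row["target"] in proteins
    ]


print(join_on_uri(protein_db, compound_db))
```

With database-local integer IDs instead of URIs, the same join would first need a mapping table between the two ID spaces.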

We in the life sciences rarely do graph traversals for graph traversal's sake, but tend to join trees, e.g. intersect a branch of a taxonomic tree with a branch of the GO tree. There are cases where real graph traversals are being done (assembly and variation graphs).
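
A rough sketch of what "joining trees" means here, in Python. The toy taxonomy, the GO-like term tree, and the protein annotations are all hypothetical; a real query would run against the actual NCBI taxonomy and GO data. The idea is to collect everything under one branch of each tree and keep the entries that fall in both.

```python
from collections import defaultdict


def descendants(parent_of, root):
    """Return root plus every node below it, given a child -> parent map."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    seen, stack = set(), [root]
    while stack:  # iterative walk, so very deep branches do not hit recursion limits
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen


# Hypothetical toy trees, stored as child -> parent.
taxonomy = {"mammals": "vertebrates", "human": "mammals",
            "mouse": "mammals", "zebrafish": "vertebrates"}
go_terms = {"binding": "function", "dna_binding": "binding",
            "catalysis": "function"}

# Hypothetical annotations: protein -> (taxon, term).
proteins = {
    "P1": ("human", "dna_binding"),
    "P2": ("mouse", "catalysis"),
    "P3": ("zebrafish", "dna_binding"),
    "P4": ("mouse", "binding"),
}


def join_branches(tax_root, term_root):
    """Proteins whose taxon is under tax_root and whose term is under term_root."""
    taxa = descendants(taxonomy, tax_root)
    terms = descendants(go_terms, term_root)
    return sorted(p for p, (taxon, term) in proteins.items()
                  if taxon in taxa and term in terms)


print(join_branches("mammals", "binding"))  # ['P1', 'P4']
```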

OpenCypher is a great step forward. Now Neo4J needs an open public standard for serializing graphs to disk that can be imported into Neo4J and other databases. RDF being supported by so many different databases allows us to support many more of our users (at UniProt) even if they don't use SPARQL or our choice of graph database themselves.