Hacker News | adsharma's comments

Thank you for the shout out! I looked into your benchmark setup a bit. Two things going on:

- Ladybug by default allocates 80% of the physical memory to the buffer pool. You can limit it. This wasn't the main reason.

- Much of the RSS is in Ladybug native memory tied to the Python connection object. I noticed that you keep the connection open between benchmark runs, and for whatever reason Python is not able to garbage collect the memory.

We ran into similar lifetime issues with the Golang and Node.js bindings as well: many race conditions where the garbage collector releases memory while another thread still holds a reference to native memory. We now require that the connection be closed for the memory to be released.

  https://github.com/LadybugDB/ladybug/issues/320
  https://github.com/LadybugDB/go-ladybug/issues/7
  https://github.com/LadybugDB/ladybug-nodejs/pull/1
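For anyone curious why explicit close is the only reliable pattern, here's a toy sketch in plain Python (not the actual Ladybug API; the `Connection` class and the byte counter are made up for illustration) of why finalizers can't be trusted with native memory:

```python
import gc

NATIVE_BYTES = {"allocated": 0}  # stands in for RSS held by native code

class Connection:
    """Toy stand-in for a binding that owns native memory."""

    def __init__(self, nbytes):
        self._nbytes = nbytes
        self._closed = False
        NATIVE_BYTES["allocated"] += nbytes

    def close(self):
        # The reliable path: release native memory explicitly.
        if not self._closed:
            NATIVE_BYTES["allocated"] -= self._nbytes
            self._closed = True

    def __del__(self):
        # Finalizer as a safety net -- but Python gives no guarantee
        # about when (or on which thread) this runs.
        self.close()

# A reference cycle keeps the object alive past its last visible use:
conn = Connection(1 << 20)
conn.self_ref = conn              # cycle: refcounting alone can't free it
del conn
leaked_before_gc = NATIVE_BYTES["allocated"]   # still "resident"

gc.collect()                      # only the cycle collector reclaims it
leaked_after_gc = NATIVE_BYTES["allocated"]

# The pattern we now require: close explicitly instead of relying on GC.
conn2 = Connection(1 << 20)
conn2.close()
```

The same hazard exists in Go and Node.js finalizers, which is why all three bindings now require an explicit close.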

This is not just a random idea.

AlexNet -> Transformers -> ChatGPT -> Claude Code -> Small LMs serving KBs

Large LLMs could have a role in efficiently producing such KBs.


So this thing is based on Kiwix, which is based on the ZIM file format.

Meanwhile, Wikipedia ships Wikidata, which uses RDF dumps (probably 8x less compressed than they could be).

https://www.wikidata.org/wiki/Wikidata:Database_download

There is room for a third option leveraging commercial columnar database research.

https://adsharma.github.io/duckdb-wikidata-compression/


And for those who are only vaguely familiar, this ZIM file format is not the same as the https://zim-wiki.org one.

I am actually only vaguely familiar, and I wondered about that every time I saw the format referenced but never bothered to check; your comment is informative!

Yeah, I'm a long time user/disciple of https://zim-wiki.org ; it was basically Obsidian but 15-20 years early. To do some of the things that are now trivially easy with Obsidian I learned scripting and such, so I'm familiar with this very weird coincidence/name collision.

> and probably 8x less compressed than it should be

ZIM uses zstd, so it is pretty well compressed. The thing that takes a lot of room is actually the full-text search index built into each ZIM file.

Unfortunately the UI of kiwix-serve search doesn't take full advantage of this and the search experience kinda sucks...

Have you done anything useful with RDF? It seems like one of those things universities spend money on that doesn't really do anything.


I'm really curious about what the world of archival formats is like: is there consensus? Are the most-used formats actually any good, well-supported, and self-documenting?

The Library of Congress has some well-considered recommendations for archival formats: https://www.loc.gov/preservation/resources/rfs/TOC.html

For web content they recommend gzipped WARC. This is great for retaining the content, but it isn't easy to search or render.

I do WARC dumps then convert those to ZIM for easier access.


This is the same topic I had an intense argument about with my coworkers at the company formerly called FB, a decade ago. There is a belief that most joins are 1-2 hops deep, and that many-hop queries with reasoning are rare or non-existent.

I wonder how you reconcile the demand for multi-hop reasoning in LLMs with the statement above.

I think a lot of what is stated here reflects how things work today and where established companies operate.

The contradictions in their positions are plain and simple.


There are worst-case optimal algorithms for multi-way and multi-hop joins. This does not require giving up the relational model.

I maintain LadybugDB, which implements WCOJ (inherited from the KuzuDB days). So I don't disagree with the idea. It's just that it's a graph database with relational internals and some internal warts that make it hard to compose queries. I'm working on fixing them.

https://github.com/LadybugDB/ladybug/discussions/204#discuss...
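For context, the classic motivating example for WCOJ is the triangle query: any plan built from binary joins can materialize a quadratically large intermediate result, while a generic (worst-case optimal) join extends one variable at a time by intersecting candidate sets. A toy sketch of the idea (illustrative only, not Ladybug's actual planner):

```python
from collections import defaultdict

def triangles(edges):
    """Enumerate directed triangles (a)->(b)->(c)->(a) generic-join style.

    Instead of materializing all 2-paths (the binary-join plan),
    bind one variable at a time and intersect the candidate sets
    contributed by each remaining atom.
    """
    fwd, rev = defaultdict(set), defaultdict(set)
    for a, b in edges:
        fwd[a].add(b)
        rev[b].add(a)
    out = []
    for a in list(fwd):              # bind variable a
        for b in fwd[a]:             # bind b consistent with E(a, b)
            # bind c: intersect candidates from E(b, c) and E(c, a)
            for c in fwd[b] & rev[a]:
                out.append((a, b, c))
    return out

# One triangle (1,2,3) plus a dangling edge; each triangle appears
# once per rotation of its starting vertex.
tris = triangles([(1, 2), (2, 3), (3, 1), (2, 4)])
```

The intersection `fwd[b] & rev[a]` is the step that keeps the work bounded by the AGM worst-case output size rather than by the size of an intermediate join.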


An important test is also whether it's WCOJ on top of relational storage, or whether the compressed sparse row (CSR) structure is actually persisted to disk. The PGQ implementations don't persist it.

There are second-order optimizations that LLMs logically implement that CSR-implementing DBs don't. With sufficient funding, we'll be able to pursue those as well.


CSR is an array-based trie and hence very costly to update. It can serve as an index for parts of the graph that will almost never change, but not otherwise.

That makes it a good match for columnar databases, which already operate on the read-only/read-mostly end of the spectrum.

Perhaps people can invent LSM-like structures on top of them.

But at least establish that CSR on disk is a basic requirement before you claim that you're a legit graph database.
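A minimal sketch of why updates are costly (toy Python, not any database's actual layout): CSR is just an offsets array plus a targets array, so a neighbor scan is a contiguous slice, but inserting a single edge shifts the tail of the targets array and rewrites every later offset:

```python
import bisect

class CSRGraph:
    """Toy compressed sparse row adjacency over n vertices."""

    def __init__(self, n, edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
        self.offsets = [0]
        self.targets = []
        for lst in adj:
            self.targets.extend(sorted(lst))
            self.offsets.append(len(self.targets))

    def neighbors(self, u):
        # O(1) to locate, then a contiguous scan -- ideal for
        # columnar, read-mostly workloads.
        return self.targets[self.offsets[u]:self.offsets[u + 1]]

    def add_edge(self, u, v):
        # The costly part: every target after u's run shifts, and
        # every later offset is rewritten -- O(E) per insert. This is
        # why updates want to be batched (e.g. an LSM-style delta
        # merged in periodically) rather than applied in place.
        pos = bisect.bisect_left(
            self.targets, v, self.offsets[u], self.offsets[u + 1])
        self.targets.insert(pos, v)
        for i in range(u + 1, len(self.offsets)):
            self.offsets[i] += 1

g = CSRGraph(3, [(0, 1), (0, 2), (1, 2)])
g.add_edge(2, 0)
```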


I maintain a fork of pgserver (pglite with native code) called pgembed. It comes with many vector and BM25 extensions.

Just in case folks here were wondering if I'm some type of a graphdb bigot.


For starters, LLMs themselves are a graph database with probabilistic edge traversal.

Some apps want it to be deterministic.

I'm surprised this question comes up so often.

It's mainly from the vector embedding camp, who rightfully observe that vector + keyword search gets you to 70-80% on evals. What is all this hype about graphs for the last 20-30%?
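To unpack the analogy (a toy sketch only; this is about the analogy, not transformer internals): think of tokens as nodes and next-token probabilities as weighted out-edges. Sampling is a probabilistic walk; greedy (argmax) decoding makes the traversal deterministic, which is what "some apps want":

```python
import random

# Hypothetical next-token "graph": node -> {neighbor: edge weight}.
GRAPH = {
    "paris": {"is": 0.7, "has": 0.3},
    "is":    {"the": 0.9, "a": 0.1},
    "the":   {"capital": 1.0},
}

def step(token, greedy=True, rng=None):
    edges = GRAPH[token]
    if greedy:
        # Deterministic traversal: always follow the heaviest edge.
        return max(edges, key=edges.get)
    # Probabilistic traversal: sample an out-edge by weight.
    rng = rng or random.Random()
    return rng.choices(list(edges), weights=list(edges.values()))[0]

def walk(start, n, greedy=True):
    out = [start]
    for _ in range(n):
        out.append(step(out[-1], greedy=greedy))
    return out

path = walk("paris", 3)
```

With `greedy=True` the same query always yields the same path; with sampling you get the probabilistic behavior that makes LLM "traversals" hard to use where determinism matters.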


"LLMs themselves are a graph database with probabilistic edge traversal" whaat?

Do you have any good demos showcasing where graph DBs clearly have an advantage? It's mostly just toy demos.

Vector embeddings, on the other hand, however limited, have clearly proven themselves useful beyond YouTube/LinkedIn thought-leader demos.


It comes from people who develop LLMs. Anthropic and Google. References below.

My other favorite quote: transformers are GNNs that won the hardware lottery.

Longer form at blog.ladybugmem.ai

You want to believe that everything probabilistic has more value and determinism doesn't? Or that the world is made up of tabular data? You have a lot of company.

The other side of the argument, I believe, has a lot of money behind it.

https://www.anthropic.com/research/mapping-mind-language-mod...

https://research.google/blog/patchscopes-a-unifying-framewor...


Not sure how that was the takeaway from both of the posts above.

I read the blog post and your website, but unfortunately they didn't change my perspective.

Thanks for the share


That importing is expensive and prevents you from handling billion-scale graphs.

It's possible to run Cypher against DuckDB (soon Postgres as well, via DuckDB's postgres extension) without having to import anything. That's a game changer when everything is in the same process.


What is open source and what is a graph database are both hotly debated topics.

Author of ArcadeDB critiques many nominally open source licenses here:

https://www.linkedin.com/posts/garulli_why-arcadedb-will-nev...

What is a graph database is also relevant:

  - Does it need index free adjacency?
  - Does it need to implement compressed sparse rows?
  - Does it need to implement ACID?
  - Does translating Cypher to SQL count as a graph database?

What people perceive as the "Facebook production graph" is not just TAO. There is an ecosystem around it, and I wrote one piece of it.

Full history here: https://www.linkedin.com/pulse/brief-history-graphs-facebook...


Are you talking about the query plan for scanning the rel table? Kuzu used a hash index and a join.

I'm trying to make it optional.

Try:

  explain match (a)-[b]->(c) return a.rowid, b.rowid, c.rowid;

