Hacker News | adsharma's comments

Thank you for the shout out! I looked into your benchmark setup a bit. Two things going on:

- Ladybug by default allocates 80% of the physical memory to the buffer pool. You can limit it. This wasn't the main reason.

- Much of the RSS is in Ladybug native memory tied to the Python connection object. I noticed that you keep the connection open between benchmark runs, and for whatever reason Python is not able to garbage collect the memory.

We ran into similar lifetime issues with the Golang and Node.js bindings as well: many race conditions where the garbage collector releases memory while another thread still holds a reference to native memory. We now require that the connection be closed for the memory to be released.

  https://github.com/LadybugDB/ladybug/issues/320
  https://github.com/LadybugDB/go-ladybug/issues/7
  https://github.com/LadybugDB/ladybug-nodejs/pull/1
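For anyone curious why explicit close is the only reliable pattern, here's a toy sketch in plain Python (not the actual Ladybug API; the `Connection` class and the byte counter are made up for illustration) of why finalizers can't be trusted with native memory:

```python
import gc

NATIVE_BYTES = {"allocated": 0}  # stands in for RSS held by native code

class Connection:
    """Toy stand-in for a binding that owns native memory."""

    def __init__(self, nbytes):
        self._nbytes = nbytes
        self._closed = False
        NATIVE_BYTES["allocated"] += nbytes

    def close(self):
        # The reliable path: release native memory explicitly.
        if not self._closed:
            NATIVE_BYTES["allocated"] -= self._nbytes
            self._closed = True

    def __del__(self):
        # Finalizer as a safety net -- but Python gives no guarantee
        # about when (or on which thread) this runs.
        self.close()

# A reference cycle keeps the object alive past its last visible use:
conn = Connection(1 << 20)
conn.self_ref = conn              # cycle: refcounting alone can't free it
del conn
leaked_before_gc = NATIVE_BYTES["allocated"]   # still "resident"

gc.collect()                      # only the cycle collector reclaims it
leaked_after_gc = NATIVE_BYTES["allocated"]

# The pattern we now require: close explicitly instead of relying on GC.
conn2 = Connection(1 << 20)
conn2.close()
```

The same hazard exists in Go and Node.js finalizers, which is why all three bindings now require an explicit close.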

This is not just a random idea.

AlexNet -> Transformers -> ChatGPT -> Claude Code -> Small LMs serving KBs

Large LLMs could have a role in efficiently producing such KBs.


So this thing is based on Kiwix, which is based on the ZIM file format.

Meanwhile, Wikipedia ships Wikidata, which uses RDF dumps (probably 8x less compressed than they could be).

https://www.wikidata.org/wiki/Wikidata:Database_download

There is room for a third option leveraging commercial columnar database research.

https://adsharma.github.io/duckdb-wikidata-compression/


And for those who are only vaguely familiar, this ZIM file format is not the same as the https://zim-wiki.org one.

I am actually only vaguely familiar, and I wondered about that every time I saw the format referenced but never bothered to check; your comment is informative!

Yeah, I'm a long time user/disciple of https://zim-wiki.org ; it was basically Obsidian but 15-20 years early. To do some of the things that are now trivially easy with Obsidian I learned scripting and such, so I'm familiar with this very weird coincidence/name collision.

> and probably 8x less compressed than it should be

ZIM uses zstd, so it is pretty well compressed. The thing that takes a lot of room is actually the full-text search index built into each ZIM file.

Unfortunately the UI of kiwix-serve search doesn't take full advantage of this and the search experience kinda sucks...

Have you done anything useful with RDF? It seems like one of those things universities spend money on that doesn't really do anything.


I'm really curious about what the world of archival formats is like: is there consensus? Are the most-used formats actually any good, well-supported, and self-documenting?

The Library of Congress has some well-considered recommendations for archival formats: https://www.loc.gov/preservation/resources/rfs/TOC.html

For web content they recommend gzipped WARC. This is great for retaining the content, but it isn't easy to search or render.

I do WARC dumps then convert those to ZIM for easier access.


This is the same topic I had an intense argument about with my coworkers at the company formerly called FB, a decade ago. There is a belief that most joins are 1-2 hops deep, and that many-hop queries with reasoning are rare or non-existent.

I wonder how you reconcile the demand for multi-hop reasoning in LLMs with the statement above.

I think a lot of what is stated here reflects how things work today and where established companies operate.

The contradictions in their positions are plain and simple.


There are worst-case optimal algorithms for multi-way and multi-hop joins. This does not require giving up the relational model.

I maintain LadybugDB, which implements WCOJ (inherited from the KuzuDB days). So I don't disagree with the idea. It's just that it's a graph database with relational internals and some internal warts that make it hard to compose queries. I'm working on fixing them.

https://github.com/LadybugDB/ladybug/discussions/204#discuss...
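For context, the classic motivating example for WCOJ is the triangle query: any plan built from binary joins can materialize a quadratically large intermediate result, while a generic (worst-case optimal) join extends one variable at a time by intersecting candidate sets. A toy sketch of the idea (illustrative only, not Ladybug's actual planner):

```python
from collections import defaultdict

def triangles(edges):
    """Enumerate directed triangles (a)->(b)->(c)->(a) generic-join style.

    Instead of materializing all 2-paths (the binary-join plan),
    bind one variable at a time and intersect the candidate sets
    contributed by each remaining atom.
    """
    fwd, rev = defaultdict(set), defaultdict(set)
    for a, b in edges:
        fwd[a].add(b)
        rev[b].add(a)
    out = []
    for a in list(fwd):              # bind variable a
        for b in fwd[a]:             # bind b consistent with E(a, b)
            # bind c: intersect candidates from E(b, c) and E(c, a)
            for c in fwd[b] & rev[a]:
                out.append((a, b, c))
    return out

# One triangle (1,2,3) plus a dangling edge; each triangle appears
# once per rotation of its starting vertex.
tris = triangles([(1, 2), (2, 3), (3, 1), (2, 4)])
```

The intersection `fwd[b] & rev[a]` is the step that keeps the work bounded by the AGM worst-case output size rather than by the size of an intermediate join.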


An important test is also whether it's WCOJ on top of relational storage, or whether the compressed sparse row (CSR) structure is actually persisted to disk. The PGQ implementations don't persist it.

There are second-order optimizations that LLMs logically implement that CSR-implementing DBs don't. With sufficient funding, we'll be able to pursue those as well.


CSR is an array-based trie and hence very costly to update. It can serve as an index for parts of the graph that will almost never change, but not otherwise.

That makes it a good match for columnar databases, which already operate on the read-only/read-mostly end of the spectrum.

Perhaps people can invent LSM-like structures on top of them.

But at least establish that CSR on disk is a basic requirement before you claim that you're a legit graph database.
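A minimal sketch of why updates are costly (toy Python, not any database's actual layout): CSR is just an offsets array plus a targets array, so a neighbor scan is a contiguous slice, but inserting a single edge shifts the tail of the targets array and rewrites every later offset:

```python
import bisect

class CSRGraph:
    """Toy compressed sparse row adjacency over n vertices."""

    def __init__(self, n, edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
        self.offsets = [0]
        self.targets = []
        for lst in adj:
            self.targets.extend(sorted(lst))
            self.offsets.append(len(self.targets))

    def neighbors(self, u):
        # O(1) to locate, then a contiguous scan -- ideal for
        # columnar, read-mostly workloads.
        return self.targets[self.offsets[u]:self.offsets[u + 1]]

    def add_edge(self, u, v):
        # The costly part: every target after u's run shifts, and
        # every later offset is rewritten -- O(E) per insert. This is
        # why updates want to be batched (e.g. an LSM-style delta
        # merged in periodically) rather than applied in place.
        pos = bisect.bisect_left(
            self.targets, v, self.offsets[u], self.offsets[u + 1])
        self.targets.insert(pos, v)
        for i in range(u + 1, len(self.offsets)):
            self.offsets[i] += 1

g = CSRGraph(3, [(0, 1), (0, 2), (1, 2)])
g.add_edge(2, 0)
```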


I maintain a fork of pgserver (pglite with native code) called pgembed. It comes with many vector and BM25 extensions.

Just in case folks here were wondering if I'm some type of a graphdb bigot.


For starters, LLMs themselves are a graph database with probabilistic edge traversal.

Some apps want it to be deterministic.

I'm surprised this question comes up so often.

It's mainly from the vector embedding camp, who rightfully observe that vector + keyword search gets you to 70-80% on evals. What is all this hype about graphs for the last 20-30%?
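To unpack the analogy (a toy sketch only; this is about the analogy, not transformer internals): think of tokens as nodes and next-token probabilities as weighted out-edges. Sampling is a probabilistic walk; greedy (argmax) decoding makes the traversal deterministic, which is what "some apps want":

```python
import random

# Hypothetical next-token "graph": node -> {neighbor: edge weight}.
GRAPH = {
    "paris": {"is": 0.7, "has": 0.3},
    "is":    {"the": 0.9, "a": 0.1},
    "the":   {"capital": 1.0},
}

def step(token, greedy=True, rng=None):
    edges = GRAPH[token]
    if greedy:
        # Deterministic traversal: always follow the heaviest edge.
        return max(edges, key=edges.get)
    # Probabilistic traversal: sample an out-edge by weight.
    rng = rng or random.Random()
    return rng.choices(list(edges), weights=list(edges.values()))[0]

def walk(start, n, greedy=True):
    out = [start]
    for _ in range(n):
        out.append(step(out[-1], greedy=greedy))
    return out

path = walk("paris", 3)
```

With `greedy=True` the same query always yields the same path; with sampling you get the probabilistic behavior that makes LLM "traversals" hard to use where determinism matters.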


"LLMs themselves are a graph database with probabilistic edge traversal" whaat?

Do you have any good demos showcasing where graph DBs clearly have an advantage? It's mostly just toy demos.

Vector embeddings, on the other hand, however limited, have clearly proven themselves useful beyond YouTube/LinkedIn thought-leader demos.


It comes from people who develop LLMs. Anthropic and Google. References below.

My other favorite quote: transformers are GNNs that won the hardware lottery.

Longer form at blog.ladybugmem.ai

You want to believe that everything probabilistic has more value and determinism doesn't? Or that the world is made up of tabular data? You have a lot of company.

The other side of the argument, I believe, has a lot of money behind it.

https://www.anthropic.com/research/mapping-mind-language-mod...

https://research.google/blog/patchscopes-a-unifying-framewor...


Not sure how that was the takeaway from both of the posts above.

I read the blog post and your website, but unfortunately they didn't change my perspective.

Thanks for the share


That importing is expensive and prevents you from handling billion-scale graphs.

It's possible to run Cypher against DuckDB (soon Postgres as well, via DuckDB's postgres extension) without having to import anything. That's a game changer when everything is in the same process.


What is open source and what is a graph database are both hotly debated topics.

Author of ArcadeDB critiques many nominally open source licenses here:

https://www.linkedin.com/posts/garulli_why-arcadedb-will-nev...

What is a graph database is also relevant:

  - Does it need index free adjacency?
  - Does it need to implement compressed sparse rows?
  - Does it need to implement ACID?
  - Does translating Cypher to SQL count as a graph database?

What people perceive as the "Facebook production graph" is not just TAO. There is an ecosystem around it, and I wrote one piece of it.

Full history here: https://www.linkedin.com/pulse/brief-history-graphs-facebook...


Are you talking about the query plan for scanning the rel table? Kuzu used a hash index and a join.

I'm trying to make it optional.

Try:

  explain match (a)-[b]->(c) return a.rowid, b.rowid, c.rowid;

