Hacker News: adeptima's comments



Meilisearch is great, used it for a quick demo

However, if you need full-text search similar to Apache Lucene, my go-to options are based on Tantivy:

Tantivy https://github.com/quickwit-oss/tantivy

Asian language support, BM25 scoring, a natural query language, and JSON field indexing are all must-have features for me
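For context on the BM25 scoring mentioned above, here is a minimal self-contained sketch of the Okapi BM25 formula (toy tokenized corpus, default k1/b parameters; real engines like Tantivy implement this internally over an inverted index):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)            # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)   # rare terms weigh more
        tf = doc.count(term)
        # term frequency saturation plus document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["rust", "search", "engine"],
    ["full", "text", "search", "library"],
    ["vector", "database"],
]
scores = [bm25_score(["search"], d, corpus) for d in corpus]
```

The shorter matching document scores highest (length normalization), and a document without the term scores zero.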

Quickwit - https://github.com/quickwit-oss/quickwit - https://quickwit.io/docs/get-started/quickstart

ParadeDB - https://github.com/paradedb/paradedb

I'm still looking for a systematic approach to make a hybrid search (combined full-text with embedding vectors).

Any thoughts on up-to-date hybrid search experience are greatly appreciated


Quickwit was bought by Datadog, so I feel there's some risk quickwit-oss becomes unmaintained if Datadog's corporate priority shifts in the future, or OSS maintenance stops providing return on investment. Based on the Quickwit blog post, they are relicensing to Apache2 and releasing some enterprise features, so it seems very possible the original maintainers will move to other things, and it's unclear if enough community would coalesce to keep the project moving forward.

https://quickwit.io/blog/quickwit-joins-datadog#the-journey-...


I have an implementation of Quickwit, so I've thought about this.

The latest version is stable and fast enough that I think this won't be an issue for a while. It's the kind of thing that does what it needs to do, at least for me.

But I totally agree that the project is at risk, given the acquisition.


As far as combining full-text search with embedding vectors goes, Typesense has been building features around that - https://typesense.org/docs/28.0/api/vector-search.html

I haven't tried those features, but I did try Meilisearch a while back, and I found Typesense to index much faster (which was a bottleneck for my particular use case) and also to have many more features for controlling search/ranking. Just to say, though: my use case was not typical for search, and I'm sure Meilisearch has come a long way since then, so this is not to speak poorly of Meilisearch, just that Typesense is another great option.


Meilisearch just improved indexing speed and simplified the update path: v1.12 brings a significant indexing-speed improvement [1], and the new dumpless upgrade feature smooths the upgrade path [2].

The main advantage of Meilisearch is that the content is written to disk. Rebooting an instance is instant, and that's quite useful when booting from a snapshot or upgrading to a smaller or larger machine. We think disk-first is a great approach as the user doesn't fear reindexing when restarting the program.

That's where Meilisearch's dumpless upgrade is excellent: all the content you've previously indexed is still written to disk and slightly modified to be compatible with the latest engine version. This differs from Typesense, where upgrades necessitate reindexing the documents in memory. I don't know about embeddings. Do you have to query OpenAI again when upgrading? Meilisearch keeps the embeddings on disk to avoid costs and remove the indexing time.

[1]: https://github.com/meilisearch/meilisearch/releases/tag/v1.1... [2]: https://github.com/meilisearch/meilisearch/releases/tag/v1.1...


Thank you for the response here. Not being able to upgrade the machine without completely re-indexing has actually become a huge issue for me. My use case is that I need to upgrade the machine to perform a big indexing operation that happens all at once and then after that reduce the machine resources. Typesense has future plans to persist the index to disk but it's not on the road map yet. And with the indexing improvements, Meilisearch may be a viable option for my use case now. I'll be checking this out!


I hate the way Typesense does its "hybrid search". It's called fusion search, and the idea is that, since you have no idea how well the semantic and full-text searches are doing, you just mix them together without looking at all at the results both searches are returning.

I tried to explain to them in an issue that, in this state, it was pretty much useless, because you would always have one search strategy or the other giving you awful results, but they basically said "some other engines are doing that as well, so we won't try to improve it", plus a ton of justification, instead of just admitting that this strategy is bad.


We generally tend to engage in in-depth conversations with our users.

But in this case, when you opened the GitHub issue, we noticed that you’re part of the Meilisearch team, so we didn’t want to spend too much time explaining something in-depth to someone who was just doing competitive research, when we could have instead spent that time helping other Typesense users. Which is why the response to you might have seemed brief.

For what it’s worth, the approach used in Typesense is called Reciprocal Rank Fusion (RRF) and it’s a well researched topic that has a bunch of academic papers published on it. So it’s best to read those papers to understand the tradeoffs involved.


> But in this case, when you opened the GitHub issue, we noticed that you’re part of the Meilisearch team, so we didn’t want to spend too much time explaining something in-depth to someone who was just doing competitive research, when we could have instead spent that time helping other Typesense users. Which is why the response to you might have seemed brief.

Well, in this case I was just trying to be a normal user who wants the best relevancy possible and couldn't find a solution. But the reason I couldn't find one was not that you didn't want to spend more time on my case; it was that Typesense provides no solution to this problem.

> it’s a well researched topic that has a bunch of academic papers published on it. So it’s best to read those papers to understand the tradeoffs involved.

Yeah, cool. In other words: "it's bad, we know it, and we can't help you, but it's the state of the art, so go educate yourself". But guess what: Meilisearch may need some fine-tuning around your model, etc., but in the end it gives you the tools to make a proper hybrid search that knows the quality of the results before mixing them.

If other people want to see the original issue: https://github.com/typesense/typesense/issues/1964


I think this is a good example of why people should disclose their background when commenting on competing products/projects. Even if the intentions were sound, which seems to be the case here, upfront disclosure would have given the conversation more weight and meaning.


+1, Typesense is really fast. The only drawback is that startup is slow once the index gets larger. The good thing is that full-text search (excluding vectors) is a relatively stable feature set, so if your use case is just FTS, you won't need to restart very often for version upgrades.


>I'm still looking for a systematic approach to make a hybrid search (combined full-text with embedding vectors).

Start off with ES or Vespa, probably. ES is not hard at all to get started with, IMO.

Try RRF - see how far that gets you for your use case. If it's not where you want to be, time to get thinking about what you're trying to do. Maybe a score multiplication gets you where you want to be - you can do it in Vespa I think, but you have to hack around the inability to express exactly that in ES.
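The RRF suggested above is simple enough to sketch outside any engine. This is a minimal, engine-agnostic Reciprocal Rank Fusion over two hypothetical ranked result lists (the doc IDs and the default k=60 constant are illustrative; ES and Vespa implement the same formula natively):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]     # keyword search ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]   # embedding search ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

Note that RRF only looks at ranks, never at the underlying scores — which is exactly the property the thread argues about: it is robust to incomparable score scales, but blind to how good either result set actually is.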


I’m using Typesense hybrid search, it does the job, well priced and is low-effort to implement. Feel free to ask any specific questions


You should try Meilisearch then, you'll be astonished by the quality of the results and the ease of setup.


https://news.ycombinator.com/user?id=Kerollmops

> Meilisearch Co-Founder and Tech Lead.

You really should disclose your affiliation.


Try LanceDB https://github.com/lancedb/lancedb

It's based on the DataFusion engine, has vector indexing and BM25 indexing, and has Python and Rust bindings


> I'm still looking for a systematic approach to make a hybrid search (combined full-text with embedding vectors).

You know that Meilisearch is the way to go, right? Tantivy, much as I love the product, doesn't support vector search. Meilisearch's hybrid search is stunningly good. You can try it on our demo [1].

[1]: https://wheretowatch.meilisearch.com/
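For reference, a Meilisearch hybrid query is a regular search request with an extra `hybrid` object. This sketch only builds the request body (index name and values are hypothetical; the `semanticRatio` and `embedder` parameter names are from the Meilisearch search API docs, where 0.0 is pure keyword and 1.0 is pure semantic, and the embedder must be configured in the index settings beforehand):

```python
import json

# Hypothetical body for POST /indexes/movies/search
payload = {
    "q": "space opera with a sarcastic robot",
    "hybrid": {
        "semanticRatio": 0.7,   # 70% semantic, 30% keyword
        "embedder": "default",  # name of an embedder configured on the index
    },
    "limit": 10,
}
body = json.dumps(payload)
```

The single `semanticRatio` knob is the "tool to mix them" discussed earlier in the thread, as opposed to rank-only fusion.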


why couldn't it be possible to just embed Meilisearch/Tantivy/Quickwit inside Postgres as a plugin to simplify the setup?


> [..] to simplify the setup?

It would be simpler to keep Meilisearch and its key-value store out of Postgres' WAL and such, and instead offer a good SQL exporter (in the plan).


Perhaps on a technical level, but for a dev, if I just need to install Postgres and some plugins and, boom, I have a fully searchable index, it's even easier.


I look at Superset's chart implementations and component choices all the time.


same experience


Happy ECharts user. Added a tiny React wrapper on top and ditched all the D3 libraries. Never looked back. Easy to inline and embed into Slate.js-based documents. Usable on mobile and responsive enough for my use cases.


If the author covers cgroups, io_uring, namespaces, and eBPF and updates the book repo, I would buy it just to show support.

https://github.com/stewartweiss/intro-linux-sys-prog

IMHO, having an up-to-date, broad range of simple examples is worth 100 bucks


Accurate word timestamps seem to be extra overhead and to require post-processing like forced alignment (a speech technique that automatically aligns audio files with transcripts).

I had a recent dive into forced alignment and discovered that most new models don't operate on word boundaries, phonemes, etc., but rather chunk audio with overlap and do word/context matching. Older HMM-style models have shorter strides (10 ms vs 20 ms).

Tried to search the Kaldi/Sherpa ecosystem, and found that most info leads nowhere or to very small and inaccurate models.

Appreciate any tips on the subject
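The chunk-with-overlap approach described above can be illustrated with a toy stitcher: transcripts of overlapping audio chunks are merged by finding the longest common word run at each seam (difflib here is a deliberately simple stand-in for the context matching real pipelines do; the sentences are made up):

```python
import difflib

def merge_overlapping_chunks(chunks):
    """Stitch transcripts of overlapping audio chunks by locating the
    longest shared word run between the running transcript and the
    next chunk, then splicing at that match."""
    merged = chunks[0].split()
    for chunk in chunks[1:]:
        words = chunk.split()
        sm = difflib.SequenceMatcher(a=merged, b=words, autojunk=False)
        m = sm.find_longest_match(0, len(merged), 0, len(words))
        if m.size:
            # keep text before the overlap, then the new chunk from the overlap on
            merged = merged[:m.a] + words[m.b:]
        else:
            merged += words  # no overlap found: just concatenate
    return " ".join(merged)

chunks = [
    "the quick brown fox jumps over",
    "fox jumps over the lazy dog",
]
text = merge_overlapping_chunks(chunks)
```

A real forced-alignment setup would additionally carry per-chunk timestamps and resolve the seam using both text and time offsets, but the splice-at-the-overlap idea is the same.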


And by the way, has anyone researched GNAP (published 20 March 2024)?

> GNAP (Grant Negotiation and Authorization Protocol) is an in-progress effort to develop a next-generation authorization protocol

From spec https://oauth.net/gnap/

> GNAP is not an extension of OAuth 2.0 and is not intended to be directly compatible with OAuth 2.0. GNAP seeks to provide functionality and solve use cases that OAuth 2.0 cannot easily or cleanly address.

> GNAP and OAuth 2.0 will likely exist in parallel for many deployments, and considerations have been taken to facilitate the mapping and transition from existing OAuth 2.0 systems to GNAP

It doesn't look like GNAP will fly any time soon; however, there is a very interesting part: the Security Considerations section. It looks like it was written by people who are familiar with all varieties of cyberops and with the usability issues in the OAuth2/OIDC specs.

Security Considerations section

https://datatracker.ietf.org/doc/html/draft-ietf-gnap-core-p...

If any cyberops or pentest pros are reading this, please advise how to research this further. Thanks in advance.


> Some specs should be mandatory

100% agree

The OpenID Foundation seems to have taken the path of making "profiles" like FAPI rather than consolidating, enforcing the best practices, and deprecating the bad ones.

FAPI (Financial-grade API Security Profile 1.0) https://openid.net/specs/openid-financial-api-part-1-1_0.htm...

I hope the community will combine it all at some point and add specifications for proper policy and resources management too by looking at the full lifecycle of modern applications.


Hopefully, when OAuth 2.1 is released, OpenID Connect will be updated to be based on OAuth 2.1. This would make some of the useful advice in FAPI (like PKCE) mandatory. A lot of the FAPI stuff that is not included in OAuth 2.1 or the OAuth BCP is just over-engineering by wannabe cryptographers, bad advice, or at least useless advice.

Knowing the OpenID foundation, this could be yet another undocumented errata set released, but we can still dream of a better world, can't we? In a better world, instead of "Use 2048 bit RSA keys" the spec will say "Don't use RSA ever."

The advanced FAPI has even more directly bad advice, like requiring PS256 and ES256. Now, these are not as bad as the common RS256 (RSA with PKCS#1 v1.5 padding), but they are still bad algorithms. The only good asymmetric algorithm defined in JWS is EdDSA, which, just like that, is forbidden by OIDC FAPI. So I'm quite happy FAPI is just a profile that will mostly be ignored.


It looks like FAPI 2.0 has finally been released in December, and thankfully it killed off most of the excesses of FAPI 1.0 and is better aligned with OAuth 2.0.

At this point the main differences are:

1. PAR: A good idea that should become a part of the OAuth standard, even if it costs an extra RTT. It prevents a pretty large class of attacks.

2. "iss" response parameter becomes mandatory (it is a core part of OAuth 2.1, but considered optional). This is a useful measure against mix-up attacks in certain conditions, but the conditions that enable it are less common than the ones

3. Requires either Mutual TLS or DPoP. I am less sold on that.

Mutual TLS is great for high-security contexts, since it prevents man-in-the-middle attacks and generally comes with guaranteed key rotation. But mTLS is still quite troublesome to implement. DPoP is easier, but of more questionable value: it doesn't fully protect against MitM, keys are rarely rotated, it is generally susceptible to replay attacks unless you take costly measures, and it relies on JWT being implemented securely, by a developer who understands how not to shoot themselves in the foot with their brand new JWT Mark II shotgun.

4. Which brings us to cryptographic algorithm usage guidelines. These are not part of OAuth 2.1, since OAuth does not mandate or rely on any cryptography, with the sole exception of the SHA-256 hash used for PKCE.

This is good design. When there is an alternative that doesn't require cryptography (such as stateful tokens or the authorization code flow), it is generally more secure. You have one less algorithm to worry about being broken (e.g. by advances in quantum computing).

For what it's worth, the guidelines are okay, but not good enough. RSA is still allowed. Yes, it requires PSS and 2048-bit keys, but there are knobs left that you can use to generate valid but insecure RSA keys (e.g. a weak exponent). With EdDSA there is no such option: weak keys are impossible to generate. Considering EdDSA is also faster, has smaller signatures, and offers better security, there is no good reason to use RSA (and, to a lesser degree, ECDSA) anymore.

In short, in an ideal world I think I would just want OAuth 2.1 to incorporate PAR and make the "iss" response parameter mandatory. The cryptographic (JOSE) parts of the specification seem to me to add too much complexity, for too little gain, with too little in the way of making cryptography safe.
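The "sole exception" mentioned above — the SHA-256 hash used for PKCE — really is this small. A minimal sketch of the S256 code_challenge derivation per RFC 7636, using only the standard library:

```python
import base64
import hashlib
import secrets

def make_pkce_pair():
    """Generate a PKCE code_verifier and its S256 code_challenge (RFC 7636).

    The verifier is 32 random bytes, base64url-encoded without padding;
    the challenge is the base64url-encoded SHA-256 digest of the verifier.
    """
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

verifier, challenge = make_pkce_pair()
```

The client sends the challenge in the authorization request and the verifier in the token request; the server only needs to recompute one hash, which is part of why PKCE is such cheap, robust advice compared to the JOSE machinery.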


My guess is that the idea and intention behind .well-known were good, so that generic end-user libraries could be implemented ... the reality is ugly and generates a lot of man-hours for cyberops consultancies.
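The generic-client idea behind .well-known boils down to this: given only an issuer URL, a library can construct the OIDC Discovery metadata location itself (the issuer below is hypothetical; per OIDC Discovery, the path segment is appended to the issuer with no trailing slash):

```python
def discovery_url(issuer: str) -> str:
    """Build the OIDC Discovery metadata URL for an issuer.

    OpenID Connect Discovery publishes provider metadata (endpoints,
    supported algorithms, JWKS location) at a fixed well-known path.
    """
    return issuer.rstrip("/") + "/.well-known/openid-configuration"

url = discovery_url("https://auth.example.com")
```

In practice, the man-hours come from providers whose published metadata diverges from their actual behavior, which is exactly what this simple convention was supposed to prevent.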

