Hacker News | stefano's comments

Those are some very big claims with respect to performance. Has anyone outside of the author been able to reproduce the claims, considering you need to pay 100k/month just to do it?

I also wonder if the commercial version has anti-benchmark clauses like some database vendors. I've always seen claims that K is much faster than anything else out there, but I've never seen an actual independent benchmark with numbers.

Edit: according to https://mlochbaum.github.io/BQN/implementation/kclaims.html, commercial licenses do indeed come with anti-benchmark clauses, which makes it very hard to take the one in this post at face value.


I used to use K professionally inside a hedge fund a few years back. Aside from the terrible user experience (if your code isn’t correct you will often just get ‘error’ or ‘not implemented’ with no further detail), if the performance really was as stellar as claimed, then there wouldn’t need to be a no benchmark clause in the license. It can be fast, if your data is in the right formats, but not crazy fast. And easy to beat if you can run your code on the GPU.


The last point is spot on... Pandas on GPUs (cudf) gets you both the performance and the usability, without having to deal with issues common to stack/array languages (k) and lazy languages (dask, polars). My flow is pandas -> cudf -> dask-cudf, spark, etc.

More recently, we have been working on GFQL (a graph dataframe query language) with users at places like banks, where we translate down to tools like pandas & cudf. A big "aha" is that columnar operations are great -- not far from what array languages focus on -- and having a static/dynamic query planner helps with optimizations around them once you hit memory limits. E.g., dask does dynamic DFS reuse of partitions as part of its work stealing, and more SQL-y tools like Spark may make plans like that ahead of time. In contrast, that work lands on the user if they stick with pandas or k, e.g., manual tiling.
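For anyone unfamiliar with the "manual tiling" mentioned above, here's a minimal plain-Python sketch of the idea (the `tiled_sum` helper is hypothetical, just for illustration): you process fixed-size chunks so only one tile has to be resident at a time, which is the kind of partitioning a planner like dask's would otherwise handle for you.

```python
def tiled_sum(values, tile=1024):
    """Fold over fixed-size chunks so only one tile is 'live' at a
    time -- the manual version of what a chunked query planner does."""
    total = 0
    for i in range(0, len(values), tile):
        total += sum(values[i:i + tile])
    return total

print(tiled_sum(list(range(100)), tile=7))  # same answer as sum(range(100))
```

In pandas-land this corresponds to iterating over chunks (e.g. reading a file in pieces) yourself instead of letting dask partition for you.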


I've been using kdb/q since 2010. Started at a big bank and have used it ever since.

Kdb/q is like minimalist footwear. But you can run longer and faster with it on. There's a tipping point where you just "get it". It's a fantastic language and platform.

The problem is very few people will pay 100k/month for shakti. I'm not saying people won't pay, or that it won't be a good business. But if you want widespread adoption you need to create an ecosystem. Open sourcing it is a start. Creating libraries and packages comes after. The mongodb model is the right approach IMO


Can you elaborate on what mongo did right? My understanding is that AWS is stealing their business by creating a compatible api


MongoDB's biggest feat is marketing. Recovering from people comparing your database with a black hole is... something.


Would you recommend K?

Is something else better (if so what)?


I think the main reason to use any of these array languages (for work) is job security. Since it's so hard to find expert programmers, if you can get your employer to sign off on an array language for all the mission-critical stuff, then you can lock in your job for life! How can they possibly replace you if they can't find anyone else who understands the code?

Otherwise, I don't see anything you can do in an array language that you couldn't do in any other language, albeit less verbosely. But I believe in this case a certain amount of verbosity is a feature if you want people to be able to read and understand the code. Array languages and their symbol salad programs are like the modern day equivalent of medieval alchemists writing all their lab notes in a bespoke substitution cipher. Not unbreakable (like modern cryptography) but a significant enough barrier to dissuade all but the most determined investigators.

As an aside, I think the main reason these languages took off among quants is that investing as an industry tends toward the exultation of extremely talented geniuses. Perhaps unintelligible "secret sauce" code has an added benefit of making industrial espionage more challenging (and of course if a rival firm steals all your code they can arbitrage all of your trades into oblivion).


I'm sorry, but you really sound like you judge APLs from an outsider's POV. For sure, it's not job security that's keeping APLs afloat, because APLs are very easy to learn. A pro programmer would never use K or any APL, but pro mathematicians or scientists needing some array programming for their job will.


I have been a programmer, scientist, etc, at various times in my life and I have programmed in J. I don't think there is any compelling reason to use J over Matlab, R, or Python and very many reasons not to. Vector languages are mind expanding, for sure, but they have done a very poor job keeping up with the user experience and network effects of newer languages.

A few years ago I wrote a pipeline in J and then re-implemented it in R. The J code was exactingly crafted and the R code was naive, but the R code was still faster, easier to read and maintain and, frankly, easier to write. J gives a certain perverse frisson, but beyond that I don't really see the use case.


I disagree. For the stuff I'm doing, I've been rewriting the same 1-screen program in different ways many times. J made that easy, R would make that much more tedious since I'd have >250 lines instead of 30. Of course if I'm doing something for which R has libs, I'll do it in R. Additionally, I'll very probably end up rewriting my final program in another language for speed and libs, because by then I'll know precisely what I want to write. IMO, the strength of array languages lies in easy experimentation.


I think there is an element of truth to this, but the context where this ends up being an advantage is personal and idiosyncratic.


Are you a quant, or doing very specialized numerical things that aren't in libs on 2D datasets? Then 10X yes. If not, no. Everything else will be better


I'm not a quant but having used "data science" languages, ocaml, J, R, etc, I strongly doubt that an array language offers any substantial advantages at this date. I could be wrong, of course, but it seems unlikely.


J is a data science language only in the sense that its core is suited to dataset manipulation; it's severely lacking in modelling libraries. If you're doing exotic stuff and you know exactly what you're doing, I can see the rationale, but otherwise it's R/Python any day. It makes plenty of sense for custom mathematical models, such as simulations, though.


J and OPFS


no benchmark clause sounds like webgpu


Quick for building a website? Probably not.

Quick for evaluating some idea you just had if you are a quant? Yes absolutely!

So imagine you have a massive dataset, and an idea.

For testing out your idea you want that data to be in an “online analytical processing” (OLAP) kind of database. These typically store the data by column not row and other tricks to speed up crunching through reads, trading off write performance etc.
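To make the column-vs-row trade-off concrete, here's a tiny plain-Python sketch (toy data, no real database): the same records stored both ways, where an analytical query like "average price" scans one contiguous column instead of hopping across every record object.

```python
# Row layout: one dict per record. Column layout: one list per field.
rows = [{"sym": "A", "px": float(i)} for i in range(1000)]
cols = {"sym": ["A"] * 1000, "px": [float(i) for i in range(1000)]}

# "Average px" in the row layout touches every record...
avg_row = sum(r["px"] for r in rows) / len(rows)

# ...while in the column layout it scans a single dense list,
# which is what OLAP stores (and array languages) optimize for.
avg_col = sum(cols["px"]) / len(cols["px"])

assert avg_row == avg_col
```

Real column stores add compression, vectorized kernels, and cache-friendly memory layout on top of this, but the access-pattern difference is the core of it.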

There are several big tech choices you could make. Mainstream king is SQL.

Something that was trendy a few years ago in the nosql revolution was to write some scala at the repl.

It is these that K competes with, and outperforms.


I would probably use Matlab for that sort of stuff tbh. Is K faster than Matlab?


For using a heap of libs to compute something that has been done a thousand times? No. For writing a completely new data processing pipeline from scratch? Much, much faster! Array langs have decent performance, but that's not why they are used. They are used because you develop faster for the kind of problem they're good at. People always conflate those two aspects. I use J for scientific research.


Seems weird to switch to develop faster and complain about people conflating the two aspects when this thread is clearly talking about runtime performance, triggered by the benchmark claims:

> real-sql(k) is consistently 100 times faster (or more) than redshift, bigquery, snowflake, spark, mongodb, postgres, ..

> same data. same queries. same hardware. anyone can run the scripts.


GP here :)

When talking about speed I was rolling together the time taken to write the query and get it running with the run time itself. The total speed depends on both.

In another thread someone pointed out that being a good quant is also about having the best ideas in the first place.

This is just a general comment on the whole comments section in general that I leave here:

We have this weird situation where the general programming population looks at a small group of the most profitable programmers, who are thinking about their domain problems in languages that are a mapping of mathematical symbols onto a qwerty keyboard (in the old days there were APL keyboards). And the mainstream programmers say that is so weird that it must be wrong, must be a lie, and so on.

Occam’s razor says that those profitable programmers wouldn’t be buying K unless it gave them the same results at a lower TCO, or better results for the higher price.

In broader data engineering there has been tech like “I use scala!” that are used to gatekeep and recognise the in crowd. But that is in the faceless corporate end of enterprise data engineering where people are not measured in bottom lines.

Sorry for venting :)


> most profitable programmers

This more than anything demonstrates the hothouse-flower mentality of K stans. Quants have long since stopped being the best-paid or most value-generating engineers, and since K has zero application outside of quant, it's no longer even a particularly lucrative skill to acquire.

It's interesting though that the opacity of the "I make more money than you" argument fits so snugly with other unverifiable and outdated claims of K supremacy, like performance, job security, or expressiveness.


Besides, it has been my experience that the more the programming part of a given quant's job contributes to their profitability, the less they enjoy using K. K is a neat language for research, but I don't know of many who still like it as a language to write code you intend to reuse or maintain.

That said, I would personally rather do research in python, especially now that the performance situation is reversed.


> Seems weird to switch to develop faster and complain about people conflating the two aspects when this thread is clearly talking about runtime performance, triggered by the benchmark claims

It doesn't look to me like GP switched topics to development speed and then complained of conflation. The switch happened higher up the thread, by wood_spirit; GP just continued the conversation (and called out the tendency to conflate, without calling out a specific person).

On a meta note, I wish this trend of saying "it seems weird" and then calling out some fallacy or error would die. Fallacies are extremely common and not "weird", and it comes off as extremely condescending.

It happens quite frequently on HN (and surely other places, though I don't regularly patronize those). So to be clear, this isn't criticism levelled at you exclusively. (I even include myself as a target for this criticism, as I've used the expression previously on HN as well, before I thought more about it.)

Firstly, in this case and in most cases where that expression is used, it's actually weird to call it weird[1]. Fallacies, logic errors, and other mistakes are extremely natural to humans. Even with significant training and effort, we still make those mistakes routinely.

Secondly, it seems like it's often used as a veiled ad hominem or insult. It's entirely superfluous to add. In this case you could have just said "you complained about people conflating the two aspects and then conflated them yourself." (It still wouldn't have been correct as GP didn't conflate them, but it would have been more direct and clear).

Thirdly, it comes off as condescending[2]. It's sort of like saying, "whoa dude, we're all normal and don't make mistakes, but you're weird and do make mistakes." In reality, we all do it so it's not weird at all.

[1]: https://www.merriam-webster.com/dictionary/weird

  1: of strange or extraordinary character : ODD, FANTASTIC
  2: of, relating to, or caused by witchcraft or the supernatural : MAGICAL
[2]: The irony of this is not lost on me. I can definitely see how this comment might also come off as condescending. I don't intend it to be, but it is a ridiculously long comment for such a simple point. It also includes a dictionary check, which is frequently a characteristic of condescending comments. I don't intend it to be condescending, merely self-reflective, but as is my theme here, we all make mistakes :-)


You can understand something fully and still call it weird. I’ve used perl for decades, some of the things it does are still just weird. As for fallacies, one of the first things I was taught in logic class was that using an argument’s formal structure (or lack thereof) to determine truth is itself a fallacy. Unsound != Untrue, and throwing around Latin names of fallacies doesn’t actually support an argument.


I'm not switching anything. Just trying to add to the conversation. BTW, for simpler queries I have no doubt the benchmarks are correct. I anticipate it would not hold for more beefy queries.


You came by it honestly! The initial conflation (and therefore context switch) happened further up-thread.


Yeah, to be clear I meant performance-wise. In terms of development speed it looks like it just goes to an insane extreme on the "fast at the beginning / fast in the middle" trade-off. You know how dynamic typing can be faster to develop with when you're writing a one-man 100-line script, but once you get beyond that, the extra effort to add static types pays off and static typing overtakes it.


Performance-wise I'll admit I don't really know. But I do know that performance is at least decent, in that it lets you experiment easily with pretty much any array problem. Your dynamic/static typing analogy is pretty much on point.


I think some things will be faster and the real difference is more in things that are easy to express in the language. Eg solving differential equations may be easier in matlab and as-of joins in k.
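For anyone who hasn't met an as-of join: for each trade you pick the most recent quote at or before its timestamp. kdb has this as a built-in primitive; here's a plain-Python sketch of the semantics (the `asof_join` helper is illustrative, not kdb's implementation):

```python
from bisect import bisect_right

def asof_join(trade_times, quote_times, quote_prices):
    """For each trade time, take the latest quote at or before it.
    quote_times must be sorted ascending."""
    out = []
    for t in trade_times:
        i = bisect_right(quote_times, t) - 1  # rightmost quote <= t
        out.append(quote_prices[i] if i >= 0 else None)
    return out

# Trades at t=5 and t=10 match the quotes posted at t=4 and t=9.
print(asof_join([5, 10], [1, 4, 9], [100.0, 101.0, 102.0]))
```

Expressing this in standard SQL takes a correlated subquery or window-function gymnastics, which is part of why time-series shops reach for k/q.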


Array languages are very fast for code that fits sensibly into arrays and databases spend a lot of compute time getting correctness right. 100x faster than postgres on arbitrary data sounds unlikely-to-impossible but on specific problems might be doable.


Yes, when I took a look at shakti's database benchmarks before, they seemed entirely reasonable with typical array language implementation methods. I even compared shakti's sum-by benchmarks to BQN group followed by sum-each, which I'd expect to be much slower than a dedicated implementation, and it was around the same order of magnitude (like 3x slower or something) when accounting for multiple cores. I was surprised that something like Polars would do it so slowly, but that's what the h2o benchmark said... I guess they just don't have a dedicated sum-by implementation for each type. I think shakti may have less of an advantage with more complicated queries, and doing the easy stuff blazing fast is nice but they're probably only a small fraction of the typical database workload anyway.
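The "group followed by sum-each" shape mentioned above looks roughly like this in plain Python (a sketch of the query's semantics, not of shakti's or BQN's actual implementation):

```python
from collections import defaultdict

def sum_by(keys, vals):
    """Group vals by key and sum each group -- the shape of a
    'sum-by' query (SELECT k, sum(v) ... GROUP BY k)."""
    acc = defaultdict(float)
    for k, v in zip(keys, vals):
        acc[k] += v
    return dict(acc)

print(sum_by(["a", "b", "a"], [1.0, 2.0, 3.0]))
```

A dedicated columnar engine does the same thing with type-specialized kernels over dense arrays, which is where the order-of-magnitude differences in the benchmarks come from.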


The comparison posted may be true, but on some level it doesn't make sense. It's like this old awk-vs-hadoop post https://adamdrake.com/command-line-tools-can-be-235x-faster-...

Yes, a very specific implementation will be faster than a generic system which includes network delays and ensures you can handle things larger than your memory in a multi-client system. But the result is meaningless - perl or awk will also be faster here.

If you need a database system, you're not going to replace it with K, if you need super fast in-memory calculations, you're not going to use a database system. Apples and oranges.


They at least used to have free evaluation licenses that were good for a month. Our license was even unlocked for unlimited cores.

I doubt they'd give them out to a random individual or small startup, but maybe still possible for a serious potential customer.


> three to seven transactions per second, but with a centralized trusted entity

The byzantine generals problem doesn't apply when you have trusted entities.


> when you have trusted entities.

Except for the minor detail that you never do.


No, from a legal/state POV you do have enough trust.

Perfect solutions do not matter.

The only thing which matters is good enough solutions.

For a state (i.e. most states in the current world) federated validators are good enough.

And systems which go beyond that (wrt. byzantine generals problem) have properties states tend to not want.

(I'm not judging whether it's ethically good or bad.)


Depending on your age and how much you can deduct, the switch over can come as "low" as 90k. At 100k you should always be ahead in any case. And that's without counting the social security contributions. Depending on how you value them, it might be more convenient even under <90k.


Taxes at that income level are around 50%, including ~20k worth of social security contributions. You're probably mixing up VAT, advance payments and final balance payments to get to that 76%, which is NOT what you're really paying on your income, unless your accountant made some major mistake.

BTW, if you invoiced 65k instead of 80k, you could use the "forfettario" tax regime, which would result in a higher net income (yes, it's that crazy: you earn more if you earn less).


Fourth option: don't introduce this change at all. It's a debatable stylistic improvement in exchange for a big breaking change that will force your users to go through, update and re-test every single query. Not worth it.


How do you guarantee it doesn't copy a GPL-ed function line-by-line?


Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-...

Edit: Github mentions the issue here: https://docs.github.com/en/github/copilot/research-recitatio... and here: https://copilot.github.com/#faq-does-github-copilot-recite-c... though they neatly ignore the issue of licensing :)


That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.


It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.


> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.


What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?


If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.


Matching on the abstract syntax tree might be sufficient, but might be complex to implement.


You can probably tokenize the names so they become irrelevant, and ignore non-functional whitespace, so that a canonical form C remains. Then hash all the training data D and check whether hash(C) is in hash(D). Some sort of Bloom filter...
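A rough sketch of that normalization idea, for Python source (the `normalized_hash` helper is hypothetical, and this is only the naive exact-match part, not a Bloom filter or anything fuzzy): collapse identifier names, drop layout and comments, and hash what's left, so a renamed-variable copy hashes identically.

```python
import hashlib
import io
import keyword
import tokenize

def normalized_hash(src):
    """Hash Python source with identifiers collapsed and layout
    ignored, so renamed-variable copies hash identically."""
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # non-functional whitespace and comments
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            parts.append("NAME")  # identifier names become irrelevant
        else:
            parts.append(tok.string)
    return hashlib.sha256(" ".join(parts).encode()).hexdigest()

# Renaming variables no longer changes the hash:
print(normalized_hash("total = a + b\n") == normalized_hash("result = x + y\n"))
```

Of course this only catches near-verbatim copies; it says nothing about partial snippets or genuinely restructured derivative code, which is where the legal gray area lives.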


Surprised not to see more mention of this. It would make sense for an AI to "copy" existing solutions. In the real world, we use clean room to avoid this.

In the AI world, unless all GPL (etc.) code is excluded from the training data, it's inevitable that some will be "copied" into other code.

Where lawyers decide what "copy" means.


It's not just about copying verbatim. They clearly use GPL code during training to create a derivative work.

Then you have the issue of attribution with more permissive licenses.


How do you know that when you write a simplish function, for example, it is not identical to some GPL code somewhere? "Line by line" code does not exist anywhere in the neural network. It doesn't store or reference data in that way. Every character of code is in some sense "synthesized".

If anything, this exposes the fragility of our concept of "copyright" in the realm of computer programs and source code. It has always been ridiculous. GPL is just another license that leverages the copyright framework (the enforcement of GPL cannot exist outside such a copyright framework, after all), so in such weird "edge cases" GPL is bound to look stupid just like any other scheme.

Remember that GPL also forbids "derivative" works to be relicensed (with a less "permissive" one). It is safe to say that you are writing code close enough to be considered "derivative" of some GPL code somewhere pretty much every day, and you can't possibly prove that you didn't cheat. So the whole framework collapses in the end anyway.


> How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere?

I don't, but then I didn't go first look at the GPL code, memorize it completely, do some brain math, and then write it out character by character.


I truly don't think they can guarantee that. Which is a massive concern.


Maybe if it comes with clauses for a substantial transfer of technology to EU partners. But I doubt Intel would agree to that.


As an alternative to SPID you can also use your national electronic ID card, which is issued by the state without the involvement of private companies.


If a service uses SPID, you need SPID. The SPID login screen doesn’t offer “National ID card” as an option.

Some services may offer alternative logins, but my guess is that they’ll be phased out like INPS is phasing out PIN-based logins.


Same. The only issue I had with virtualenv was when I copied one to a different directory and it didn't work. It turns out you can't do that. Everything else has always worked fine, and I've been using it professionally for 10 years.
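You can see why the copied virtualenv broke with a quick experiment (a sketch using the stdlib `venv` module, which shares this behavior with the `virtualenv` package): the environment's metadata records absolute paths from creation time, so moving the directory leaves them stale.

```python
import pathlib
import tempfile
import venv

# Build a throwaway env and inspect its config. pyvenv.cfg records
# the absolute location of the base interpreter ("home = ..."), and
# the activation scripts typically pin the creation path too --
# which is why a copied/moved virtualenv stops working.
env_dir = tempfile.mkdtemp()
venv.EnvBuilder(with_pip=False).create(env_dir)

cfg = pathlib.Path(env_dir, "pyvenv.cfg").read_text()
print(cfg)
```

The usual fix is to not move environments at all: delete and recreate them from a requirements file in the new location.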


Sanofi will be producing the Pfizer-BioNTech vaccine, but it takes a lot of time to set up a new production line. The last estimate I saw said July.

