Hacker News | marginalia_nu's comments

Gamer lean is when it gets really serious.

Funny you say that, as medicine is one of the epicenters of the replication crisis[1].

[1] https://en.wikipedia.org/wiki/Replication_crisis#In_medicine


You get a replication crisis on the bleeding edge between replication being possible and impossible. There's never going to be a replication crisis in linear algebra, there's never going to be a replication crisis in theology, there definitely was a replication crisis in psych, and a replication crisis in nutrition science is distinctly plausible and would be extremely good news for the field as it moves through the edge.

Leslie Lamport came up with a structured method to find errors in proofs. Testing it on a batch, he found most of them had errors. Peter Gutmann's paper on formal verification likewise showed many "proven" or "verified" works had errors that were spotted quickly upon informal review or testing. We've also seen important theories in math and physics change over time with new information.

With the above, I think we've empirically proven that we can't trust mathematicians more than any other humans. We should still rigorously verify their work with diverse, logical, and empirical methods. Also, build from the ground up on solid ideas that are highly vetted. (Which linear algebra actually does.)

The other approach people are taking is foundational, machine-checked proof assistants. These use a vetted logic, and the assistant produces a series of steps that can be checked by a tiny, highly-verified checker. They'll also often use a reliable formalism to check other formalisms. The people doing this have been making everything from proof checkers to compilers to assembly languages to code extraction in those tools, so they are highly trustworthy.

But we still need people to look at the specs of all that to see if there are spec errors. There are fewer people who can vet the specs than can check the original English and code combos. So, are they more trustworthy? (Who knows, except when tested empirically on many programs or proofs, like CompCert was.)


It's arguably one of the central principles of Christianity. Let he who is without sin cast the first stone and so on.

Yeah, that's exactly what I was thinking of, just didn't want to start a flame war.

The woke movement in many ways has taken core Christian principles, cut out the supernatural elements, and formed a new quasi-religious movement. It has its dogmas and priests, it focuses on the poor & disadvantaged, etc. That's not a criticism of woke; I see it more as a response to the failures of Christianity in practice to live out or embrace those values.

Yes, sounds right. Because you can't hit a killer with a stone if you envy your rich neighbor.

There's a reason Nietzsche labeled it slave morality. It undermines people's confidence to act and judge others appropriately, revalues weakness as a virtue and strength as evil, and demands that people stop trying to change the world for the better and focus instead on their own supposed guilt. It's a morality developed for people who are structurally unable to act (because they are commoner serfs with no power) to make them feel justified and satisfied with inaction.

FWIW it seems Russia's trolling activities took a pretty significant hit after Prigozhin fell out of a window in 2023, as the "Internet Research Agency" was one of his ventures.

Probably just caused outsourcing to India and China.

It's probably advisable to have a lawyer eye through such a document even if that document is in English if there is the slightest question about what it says.

Pacta sunt servanda can be a real bitch sometimes.


Intel was resting on their laurels throughout most of the 2010s, while AMD floundered and couldn't catch up. By the time AMD got their shit together with Ryzen, Intel had all but been defeated by their own complacency.

There's not really much more room for Microsoft's consumer software to grow, but the next quarterly report must show black numbers, so the only way to stay profitable is to produce software more cheaply than the month before.

Incidentally, neither a rigorous quality control process, nor a team of experienced engineers is particularly cheap.


Growth mindset can be such a cancer. Many mature businesses don't need to grow and are perfectly fine as they are. You could continue running them forever, making steady cash. Or you could enshittify them, make slightly more money for 3 years, and get overtaken by competitors, all the while pissing everyone off and wasting billions of dollars.

Short term greed. Maximize immediate profits at the cost of future profits.

Or you could invest some of the business's profits into growth attempts without letting them stifle the reliable existing business.

Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file. This is possible even via mmap, over HTTP range requests, etc.

You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though ZIP has built-in compression, you can simply not use it and rely on external compression instead, especially over the wire.
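
A minimal sketch of the idea with Python's zipfile module (the file names here are invented), writing entries with ZIP_STORED and then reading a single member back without touching the rest of the archive:

    import zipfile

    # Pack files without compression; each member is stored verbatim,
    # so its bytes can later be addressed directly (mmap, HTTP ranges, ...).
    with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("data/readme.txt", b"hello")
        zf.writestr("data/blob.bin", b"\x00" * 1024)

    # Random access: only the central directory and the requested member
    # are read, not the whole archive.
    with zipfile.ZipFile("bundle.zip") as zf:
        with zf.open("data/blob.bin") as member:
            print(len(member.read()))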

It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

I think the article's complaints about lacking Unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.


This is how Haiku packages are managed: from the outside it's a single zstd file, internally all dependencies and files are included in a read-only file. Reduces I/O, reduces file clutter, instant install/uninstall, zero chance for the user to corrupt files or dependencies, and easy to switch between versions. The Haiku file system also supports virtual dir mapping, so the stubborn Linux port thinks it's talking to /usr/local/lib, but in reality it's part of the zstd file in /system/packages.

Isn't this what is already common in the Python community?

> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

I'm the opposite: when I pack and unpack something, I want the files to be identical, including attributes. Why should I throw away all the timestamps, just because the files were temporarily in an archive?


There is some confusion here.

ZIP retains timestamps. This makes sense because timestamps are a global concept. Consider them an attribute that depends only on the file itself, similar to the file's name.

Owners and permissions are dependent also on the computer the files are stored on. User "john" might have a different user ID on another computer, or not exist there at all, or be a different John. So there isn't one obvious way to handle this, while there is with timestamps. Archiving tools will have to pick a particular way of handling it, so you need to pick the tool that implements the specific way you want.


> ZIP retains timestamps.

It does, but unless the 'zip' archive creator being used makes use of the extensions for high-resolution timestamps, the basic ZIP format retains only old MS-DOS style timestamps (rounded to the nearest two seconds). So one may lose some precision in one's timestamps when passing files through a zip archive.
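
To make the precision loss concrete, a small sketch with Python's zipfile module (the file name and timestamp are invented); an odd number of seconds comes back rounded down after a round trip through the DOS timestamp field:

    import io, zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # 13:37:07 -- note the odd number of seconds
        info = zipfile.ZipInfo("example.txt", date_time=(2024, 1, 2, 13, 37, 7))
        zf.writestr(info, b"hello")

    with zipfile.ZipFile(buf) as zf:
        print(zf.getinfo("example.txt").date_time)
        # (2024, 1, 2, 13, 37, 6): the DOS field only has 2-second resolution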


> Why should I throw away all the timestamps, just because the files were temporarily in an archive?

In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds in local time).

I am a big last modified timestamp fan and am often discouraged that scp, git, and even many zip utilities are not (at least by default).


git updates timestamps partly out of necessity, for compatibility with build systems. If it applied the timestamp of when the file was last modified on checkout, then most build systems would break if you checked out an older commit.

git blame is more useful than the file timestamp in any case.

> Isn't this what is already common in the Python community?

I'm not aware of standards language mandating it, but build tools generally do compress wheels and sdists.

If you're thinking of zipapps, those are not actually common.


Yes, it's a lossy process.

If your archive drops it you can't get it back.

If you don't want it you can just chmod -R u=rwX,go=rX (the capital X keeps directories traversable).


> If your archive drops it you can't get it back.

Hence, the common archive format is tar not zip.


Gzip will make most line protocols efficient enough that you can do away with needing to write a cryptic one that will just end up being friction every time someone has to triage a production issue. Zstd will do even better.

The real one-two punch is make your parser faster and then spend the CPU cycles on better compression.
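
A rough sketch of the point, assuming a newline-delimited JSON protocol with made-up field names; gzip already shrinks the verbose text considerably, and zstd would do better still:

    import gzip, json

    # A plain, human-readable line protocol: one JSON record per line.
    lines = "\n".join(
        json.dumps({"ts": 1700000000 + i, "sensor": "temp-01", "value": 20.5 + i % 3})
        for i in range(10_000)
    ).encode()

    compressed = gzip.compress(lines, compresslevel=6)
    print(len(lines), len(compressed))  # uncompressed vs. gzipped size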


DNA researchers developed a parallel format for gzip they call "bgzip" ( https://learngenomics.dev/docs/genomic-file-formats/compress... ) that makes data seem less trapped behind a decompression perf wall. Zstd is still a bit faster (but < ~2X) and also gets better compression ratios (https://forum.nim-lang.org/t/5103#32269)

> It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

JAR files generally do/did use compression, though. I imagine you could forgo it, but I didn't see it being done. (But maybe that was specific to the J2ME world where it was more necessary?)


> Zip with no compression is a nice contender for a container format that shouldn't be slept on

SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.


Wouldn't you still have a lot of syscalls?

Yes, but with much lower latency. The squashfs file ensures the files are close together and you benefit from fs cache a lot.

You then use io_uring

Doesn’t ZIP have all the metadata at the end of the file, requiring some seeking still?

It has an index at the end of the file, yeah, but once you've read that bit, you learn where the contents are located and if compression is disabled, you can e.g. memory map them.

With tar you need to scan the entire file start-to-finish before you know where the data is located, as it's literally a tape archiving format, designed for a storage medium with no random access reads.
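
For anyone curious what "that bit" looks like: the end-of-central-directory record is a small fixed structure at the tail of the archive. A minimal sketch in Python (whole file in memory, no zip64 or archive-comment edge cases) that finds where the central directory starts:

    import struct

    def locate_central_directory(buf: bytes):
        # The EOCD record starts with the signature PK\x05\x06 and sits at
        # the very end of the file (possibly followed by a short comment),
        # so scanning backwards for it is good enough for a sketch. Over
        # HTTP you'd only range-request the last ~64 KiB to find it.
        pos = buf.rfind(b"PK\x05\x06")
        if pos < 0:
            raise ValueError("not a ZIP file")
        (_sig, _disk, _cd_disk, _entries_here, entries_total,
         cd_size, cd_offset, _comment_len) = struct.unpack(
            "<IHHHHIIH", buf[pos:pos + 22])
        return cd_offset, cd_size, entries_total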


Yes, but it's an O(1) random access seek rather than O(n) scanning seek

> I wouldn't expect this to be something that transfers between machines

Maybe non-UNIX machines I suppose.

But I 100% need executable files to be executable.


Honestly, sometimes I just want to mark all files on a Linux system as executable and see what would even break and why. Seriously, why is there a whole bit for something that's essentially a 'read permission, but you can also directly execute it from the shell'?

Do you also want the setuid bit I added?

I thought Tar had an extension to add an index, but I can't find it in the Wikipedia article. Maybe I dreamt it.

You might be thinking of ar, the classic Unix ARchive that is used for static libraries?

The format used by `ar` is quite simple, somewhat like tar: files glued together with a short header in between, and no index.

Early Unix eventually introduced a program called `ranlib` that generates and appends an index for libraries (also containing extracted symbols) to speed up linking. The index is simply embedded as a file with a special name.

The GNU version of `ar` as well as some later Unix descendants support doing that directly instead.
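
A minimal sketch of walking that format in Python (it ignores the GNU long-name and symbol-index members, which just show up as entries with special names):

    def ar_members(path):
        # Classic ar layout: an 8-byte magic, then for each member a fixed
        # 60-byte ASCII header followed by the data, padded to an even offset.
        with open(path, "rb") as f:
            assert f.read(8) == b"!<arch>\n"
            while True:
                header = f.read(60)
                if len(header) < 60:
                    break
                name = header[0:16].decode().rstrip()
                size = int(header[48:58])
                yield name, size
                # Skip the data, plus one pad byte if the size is odd.
                f.seek(size + (size & 1), 1)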


Besides `ar`, as a sibling comment observed, you might also be thinking of pixz - https://github.com/vasi/pixz , but really any archive format (cpio, etc.) can, in principle, just put a stake in the ground to have its last file be some kind of binary index / file directory, like Zip does. Or it could hog a special name like .__META_INF__ instead.

Cloud is probably the better comparison, since crypto never had the sort of mainstream management buy-in that the other two got. Microsoft's handling of OneDrive in particular foreshadows how AI is being pushed out.

The difference is OneDrive is moderately useful.

I don't like OneDrive very much. I get that it's useful as a pigeonhole; what I really don't like is how it is used. It's the thing that moves files to OneDrive and destroys local copies that I hate, and OneDrive is something that enables that. So I don't hate OneDrive, I just don't like it.

LLMs are also moderately useful.

the comparison is pretty good actually

"AI" agents randomly delete your files

and so does OneDrive


Anchor text information is arguably a better source for relevance ranking in my experience.

I publish exports of the ones Marginalia is aware of[1] if you want to play with integrating them.

[1] https://downloads.marginalia.nu/exports/ grab 'atags-25-04-20.parquet'
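
If anyone wants a quick look at the export, something along these lines should do (assuming pandas with a parquet engine such as pyarrow installed; inspect the schema rather than assuming any particular column names):

    import pandas as pd

    # Local copy of the file linked above.
    df = pd.read_parquet("atags-25-04-20.parquet")
    print(df.dtypes)
    print(df.head())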


Though I'd think that you'd want to weight unaffiliated sites' anchor text to a given URL much higher than an affiliated site's.

"Affiliation" is a tricky term itself. Content farms were popular in the aughts (though they seem to have largely subsided), firms such as Claria and Gator. There are chumboxes (Outbrain, Taboola), and of course affiliate links (e.g., to Amazon or other shopping sites). SEO manipulation is its own whole universe.

(I'm sure you know far more about this than I do, I'm mostly talking at other readers, and maybe hoping to glean some more wisdom from you ;-)


Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than page rank, but in isolation, it's not sufficient.

I've also seen some benefit from fingerprinting the network traffic websites generate, using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.

e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...

The full data set of DOM samples + recorded network traffic is in an enormous SQLite file (400GB+), and I haven't yet worked out any way of distributing the data. Though it's in the back of my mind as something I'd like to solve.
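
The traffic-fingerprinting idea above, roughly sketched with Playwright (not the actual crawler code, and the ad-host list here is only an illustration):

    from urllib.parse import urlparse
    from playwright.sync_api import sync_playwright

    AD_HOSTS = {"doubleclick.net", "taboola.com", "outbrain.com"}

    def ad_hosts_loaded(url: str) -> set[str]:
        # Load the page headlessly and record every host it requests,
        # then keep the ones that look like known ad/tracking networks.
        hosts: set[str] = set()
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.on("request", lambda req: hosts.add(urlparse(req.url).hostname or ""))
            page.goto(url, wait_until="networkidle")
            browser.close()
        return {h for h in hosts if any(h == a or h.endswith("." + a) for a in AD_HOSTS)}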


Oh, that is clever!

I'd also suspect that there are networks / links which are more likely signs of low-value content than others. Off the top of my head, crypto, MLM, known scam/fraud sites, and perhaps share links to certain social networks might be negative indicators.


You can actually identify clusters of websites based on the cosine similarity of their outbound links. Pretty useful for identifying content farms spanning multiple websites.

Have a lil' data explorer for this: https://explore2.marginalia.nu/

Quite a lot of dead links in the dataset, but it's still useful.
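
The similarity measure itself is simple to sketch: treating each outbound domain as a binary feature, cosine similarity reduces to the size of the overlap normalized by the set sizes (the site names below are made up):

    from math import sqrt

    def cosine(a: set[str], b: set[str]) -> float:
        # Cosine similarity over binary outbound-link vectors.
        if not a or not b:
            return 0.0
        return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))

    outbound = {
        "farm-a.example": {"casino.example", "pills.example", "seo.example"},
        "farm-b.example": {"casino.example", "pills.example", "loans.example"},
        "blog.example":   {"wikipedia.org", "github.com"},
    }
    print(cosine(outbound["farm-a.example"], outbound["farm-b.example"]))  # ~0.67
    print(cosine(outbound["farm-a.example"], outbound["blog.example"]))    # 0.0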


Very interesting, and it is very kind of you to share your data like that. Will review!
