andrejserafim's comments

I've switched from PhotoPrism to Immich. Immich is a much more active project: bugs get fixed, face recognition is an order of magnitude better, and it's just an overall more solid experience. If you are choosing, I wouldn't hesitate for a second to go with Immich.


Proemion | Senior Backend Engineer | Full Time | REMOTE (Europe)

Proemion provides an all-inclusive remote telemetry service for industrial machinery. We develop hardware, firmware, and software to collect remote machine telemetry, analyze it, and present it in a way that empowers manufacturers and their customers to use their machines better.

The position is for the web services driving the analysis, reporting and our web portal.

Our tech stack is mostly in Java.

https://career.proemion.com/en/jobs/it-software-development/...


I appreciate that you say this with humour. But if, voluntarily or due to circumstances out of your control, you cannot distract your brain... at least mine starts thinking. The energy cost is high, but I always end up with a very profound experience and some great outcomes.

So I wouldn't write off "staring at a wall" as inarguably worthless. It's about what's going on in your head when you do it. Same as when reading that book.


This, and also being able to satisfy an itch at any moment. Be it YouTube or Facebook, whatever your favourite drug is, it's infinitely available and much easier than any other form of activity.

But in and of itself it doesn't create anything. Mindless content consumption sometimes doesn't even leave memories, let alone ideas.


I realize it is too late to get rid of mobile devices. But I have long advocated for a "Techno-Lent": a period of forty days and nights where you turn your mobile devices off, put them in a drawer, and turn off your WiFi.

Before you claim it is impossible because of your work, is it really? Did you try and think of a way that was possible?


Many moons ago a friend of mine came up with an idea called "Experimonth", where you would pick something you want to do and do it for an entire month: something like waking up at 4am or not eating any meat.

I've done many of these and they were great, including giving up the internet and computers/phones, with the exception of work. This was back when I went into the office and I limited myself to only work related tasks there. Wifi and computers were turned off and phones put away at home for a month.

It's actually somewhat difficult to break the habit and very boring. But after about the third week you start to get used to it. By the end of the month when I could go back to "normal" I realized there was very little I actually wanted back. Mostly it was a few games I wanted to play and texting with friends on a regular basis. I lost all interest in social media after it (and that's still true) and I became very annoyed by the "ding" of the phone so I turned off all notifications and set it on DND full time (and that is also still true).

Another thing I got out of it is noticing how much everyone else is distracted by their phones. It's really crazy to see when you're not one of them.


Shaking up your normal schedule is a good way to get out of a depression loop, agreed. I at one point developed an obsession with finding parrot shops in my area, and drove far and wide to catalogue every one and learn more about the birds. After owning two cockatoos I developed a new life-long passion for rescuing talking birds kept as pets. I also make music, and have begun working on dating and reaching out to old friends for life updates. All of it has really grounded me in the past 5 years, and I'm pretty sure it's what's gotten me through the pandemic years without depression.


I'm in my mid-30s, and your advice is really what I've come to realize. When I took my last promotion into engineering management, I had ambitions to become the next CTO in this company. But having been in the job for a couple of years, I have come to realize that the CTO role in this specific company would involve more of the things I enjoy least in my current role.

At the same time, my current role has a neat balance of technical and softer aspects. So why chase the "career" if I'm already in a good place? Focusing on my family and out-of-work priorities just makes so much more sense.

It's probably the first time I've rejected a path to a promotion.


Our anecdata: we store telemetry per thing. After loading a month's worth of data, TimescaleDB as hosted by their cloud ran a difference aggregation in seconds. ClickHouse routinely did it in 20 milliseconds.

Simple averages etc. were better, but ClickHouse was consistently an order of magnitude faster than Timescale. We didn't invest much into optimization other than trying some indexing strategies in TimescaleDB.

So for our use case the choice is clear.
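For the curious, a "difference aggregation" in the sense used above can be sketched like this. The exact query shape of the workload wasn't described, so this assumes the common meaning: per-device deltas between consecutive telemetry readings.

```python
# Sketch of a per-device "difference aggregation" over telemetry rows.
# The row shape (device_id, timestamp, value) is an assumption for
# illustration, not the actual schema from the comment above.
from itertools import groupby
from operator import itemgetter

def difference_aggregation(rows):
    """rows: iterable of (device_id, timestamp, value) tuples."""
    out = {}
    # sort so that rows group by device and order by timestamp
    for device, group in groupby(sorted(rows), key=itemgetter(0)):
        values = [v for _, _, v in group]
        out[device] = [b - a for a, b in zip(values, values[1:])]
    return out

readings = [
    ("pump-1", 1, 10.0), ("pump-1", 2, 12.5), ("pump-1", 3, 12.0),
    ("pump-2", 1, 100.0), ("pump-2", 2, 103.0),
]
print(difference_aggregation(readings))
# → {'pump-1': [2.5, -0.5], 'pump-2': [3.0]}
```

This is exactly the kind of query that scans many rows but returns few, which is where the columnar-vs-row-store gap tends to show up.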


(N.B. post author)

Thanks for the feedback. Without knowing your situation, one of the things we show in the blog post is that TimescaleDB compression often changes the game on those kinds of queries (data is transformed to columnar storage when you compress). You don't mention if you did that or not, but it's something we've seen/noticed in every other benchmark at this point - that folks don't enable it for the benchmark.

And the second point of the article is that you have lots of options for whatever works in your specific situation. But make sure you're using the chosen database's features before counting it out. :-)


I wonder if it's worth taking a page out of the MongoDB book and enabling these kinds of benchmark-altering settings by default. We certainly selected ClickHouse over TimescaleDB internally because of major performance differences in our internal testing that might have gone the other way had we "known better".


Indeed. Lots of discussion over this in the last few months. There are nuances, but I think you'll see some progress in this area over the next year.


From my experience benchmarking these databases on scientific data (highly regular timeseries) and looking at the internals of both, these kinds of numbers happen when answering the query requires crunching through many rows but the output has few, i.e. the queries are filtering and/or aggregating a ton of input rows that can't be excluded by indexes or answered from preaggregations.

From what I can tell it comes down to execution engine differences. Timescale, even with compressed tables, uses a row-by-row execution engine architecturally resembling IE6-era JS engines. ClickHouse uses a batched and vectorized execution engine utilizing SIMD. The difference is one to two orders of magnitude of throughput in terms of the raw number of rows per core pushed through the execution engine.

Postgres/TimeScale could certainly also implement a similar model of execution, but to call it an undertaking would be an understatement considering the breadth and extensibility of features that the execution engine would need to support. To my knowledge no one is seriously working on this outside of limited capability hacks like vops or PG-Strom extensions.
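The row-at-a-time vs batched distinction above can be illustrated with a toy sketch. This is purely conceptual (neither database is implemented this way): the point is that a tree-walking interpreter pays per-row dispatch overhead, while a batched engine amortizes dispatch over whole chunks and can hand the inner loop to an optimized kernel.

```python
# Toy illustration of the execution-model difference: one engine makes an
# interpreter round-trip per row, the other dispatches once per batch and
# lets an optimized inner loop (here, C-level sum(); in a real engine, a
# SIMD kernel) do the work.

def sum_row_at_a_time(rows):
    total = 0.0
    for row in rows:            # per-row dispatch, like a tree interpreter
        total += row
    return total

def sum_batched(rows, batch_size=1024):
    total = 0.0
    for i in range(0, len(rows), batch_size):
        # one dispatch per batch; the inner loop runs in optimized code
        total += sum(rows[i:i + batch_size])
    return total

data = [float(i) for i in range(10_000)]
assert sum_row_at_a_time(data) == sum_batched(data) == 49995000.0
```

Both produce the same answer; the throughput difference in real engines comes from where the per-element work happens.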


(post author)

You do a great job summarizing some of the benefits of ClickHouse we mentioned in the post, including the vectorized engine!

That said, I'm not sure I'd refer to the PostgreSQL/TimescaleDB engine architecture as resembling IE6 JS support. Obviously YMMV, but every release of PG and TimescaleDB brings new advancements in query optimization for the architecture they are designed for, which was the focus of the post.

I'm personally still impressed, after 20+ years of working with SQL and relational databases, when an optimization engine can use statistics to find the "best" plan among (potentially) thousands in a few ms. Maybe I'm too easily impressed. :-D


The optimization engine is of course great (despite occasionally missing hard), but I am not referring to it. I am referring to the way PostgreSQL executes query plans: the way rows are pulled up the execution tree is very similar to first-generation JavaScript engines, a tree-based interpreter. Picking out columns from rows and evaluating expressions used to work the same way until PG 11, where we got a bytecode-based interpreter and a JIT for those. But so far rows still work the same way, and it hurts pretty badly when row lookup is cheap and the rows end up either thrown away or aggregated together with basic math.


With TimescaleDB compression, 1000 rows of uncompressed data are compressed into column segments, moved to external TOAST pages, and then pointers to these column segments are stored in the table's "row" (along with other statistics, including some common aggregates).

So while the query processor might still be "row-by-row", each "row" it processes actually corresponds to a column segment for which parallelization/vectorization is possible. And because these column segments are TOASTed, the row itself is just pointers, and you only need to read in those compressed column segments you are actually SELECTing.

Anyway, you might have known this; just wanted to clarify. Thanks for the discussion!
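The segment layout described above can be sketched in a few lines. Names and the choice of statistics here are illustrative, not Timescale's actual on-disk format: the idea is just that every 1000 input rows collapse into one "segment row" holding per-column arrays plus cheap inline stats for pruning.

```python
# Minimal sketch of regrouping row-oriented data into 1000-row column
# segments, as described in the comment above. Field names and the min/max
# statistics are assumptions for illustration.

SEGMENT_SIZE = 1000

def build_segments(rows):
    """rows: list of (ts, value) tuples -> list of segment dicts."""
    segments = []
    for i in range(0, len(rows), SEGMENT_SIZE):
        chunk = rows[i:i + SEGMENT_SIZE]
        ts_col = [t for t, _ in chunk]
        val_col = [v for _, v in chunk]
        segments.append({
            "ts": ts_col,           # would be TOASTed as one compressed blob
            "value": val_col,       # likewise, a separate blob per column
            "min_ts": min(ts_col),  # stats kept inline in the segment row
            "max_ts": max(ts_col),  # so segments can be skipped cheaply
        })
    return segments

segs = build_segments([(t, t * 0.5) for t in range(2500)])
print(len(segs), segs[0]["min_ts"], segs[0]["max_ts"])
# → 3 0 999
```

A query filtering on time can then skip whole segments via `min_ts`/`max_ts` and decompress only the columns it SELECTs.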


Yeah, very interesting. I was wondering how Timescale pushed Postgres more towards columnar without rewriting a bunch of Postgres itself.

My understanding of TOAST is that it is itself just a bunch of rows in a TOAST table that split the compressed "row", or in this case "1000 rows of 1 column", across as many rows as required to store the data while remaining within the Postgres page size limit (normally 8 KB).

With the often-quoted Postgres per-row overhead of ~23 bytes, which you would have to pay for each TOAST row as well, does this not add up and eat into your storage efficiencies? Or does compression work so well that the 23 bytes × N rows (1 row pointing to TOAST + N TOAST rows) required to store the "row" aren't important?


The compressed column segment is stored in a single row in TOAST.

More info: https://blog.timescale.com/blog/building-columnar-compressio...
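Quick back-of-envelope arithmetic for the overhead question, under the simplified model in this thread (a ~23-byte tuple header, and one heap row standing in for 1000 logical rows once compressed). The byte counts are the often-quoted figures from the question above, not measured numbers.

```python
# Per-tuple header overhead: 1000 uncompressed rows each pay it, while a
# 1000-row compressed segment pays it once (plus once per TOAST chunk row,
# ignored here for the simplified one-row model under discussion).

ROW_HEADER = 23        # often-quoted Postgres per-tuple overhead, in bytes
SEGMENT_SIZE = 1000    # logical rows folded into one compressed segment

uncompressed_overhead = ROW_HEADER * SEGMENT_SIZE  # one header per row
compressed_overhead = ROW_HEADER * 1               # one header per segment

print(uncompressed_overhead, compressed_overhead)
# → 23000 23
print(f"header overhead shrinks {uncompressed_overhead // compressed_overhead}x")
```

So even before the column data itself compresses, the fixed per-tuple overhead drops by roughly the segment size, which is why the N extra TOAST rows don't eat the savings.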


Does Timescale do its own compression algorithm too? I see that in PG 14 TOAST column compression can be LZ4 instead of the out-of-the-box pglz, which apparently has a few problems; I see mentions on the mailing list of significant possible optimizations. When dealing with EBS-style storage, where read latencies can be multiple milliseconds, compression is always going to be a win, but it's an easy optimization either way, I'd think.


Timescale implements its own compression algorithms. It includes several, and automatically applies the choice of algorithm based on the data types of columns:

- Gorilla compression for floats

- Delta-of-delta + Simple-8b with run-length encoding compression for timestamps and other integer-like types

- Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)

- LZ-based array compression for all other types

This means within even the same table, different columns will be compressed using different algorithms based on their type (or inferred entropy).
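To make the timestamp case concrete, here is a sketch of the delta-of-delta idea named above. The real implementation layers Simple-8b bit packing and run-length encoding on top; this shows only the core transform, which turns regular timestamps into long runs of zeros.

```python
# Delta-of-delta encoding sketch: store the first timestamp, then the
# second difference of the series. Regular intervals become zeros, which
# downstream bit packing / RLE (not shown) compresses to almost nothing.

def dod_encode(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dods

def dod_decode(first, dods):
    out, delta = [first], 0
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out

ts = [1000, 1010, 1020, 1030, 1041, 1051]   # near-regular 10s interval
first, dods = dod_encode(ts)
print(dods)
# → [10, 0, 0, 1, -1]
assert dod_decode(first, dods) == ts        # lossless round trip
```

The same intuition explains the float path: Gorilla XORs consecutive values so that slowly changing readings produce mostly zero bits.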

More information for those interested:

- General TimescaleDB compression post: https://blog.timescale.com/blog/building-columnar-compressio...

- Deep dive on compression algorithms it employs: https://blog.timescale.com/blog/time-series-compression-algo...


Ah, so it only costs one row for the pointer and one row for TOAST? Well, that's much more deterministic.


Was this for your primary source-of-truth, or more of a downstream data warehouse, or something else?

I'm struggling to imagine a case where these are the two things being considered; Timescale is the obvious choice for a primary database, Clickhouse the obvious choice for a warehouse. I wouldn't let my user-facing app write to Clickhouse, and while I could potentially get away with a read-only Timescale replica for internal-facing reports I would expect to eventually outgrow that and reach for Clickhouse/Snowflake/Redshift.


> I wouldn't let my user-facing app write to Clickhouse

I’ve been thinking of doing exactly that. What are your concerns?


https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

has some good thoughts. The main thing you'll likely need is some sort of a buffer layer so you can do bulk inserts. Do not write a high-volume of single-row inserts into Clickhouse.
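The buffer-layer advice above amounts to something like the following sketch. `bulk_insert` here is a stand-in for whatever client call you use (it is not a real ClickHouse API), and the threshold is arbitrary.

```python
# Minimal insert-batching sketch: accumulate rows in memory and hand them
# to the database in bulk, instead of one INSERT per row. The bulk_insert
# callable is a hypothetical stand-in for your client library's batch call.

class InsertBuffer:
    def __init__(self, bulk_insert, max_rows=10_000):
        self.bulk_insert = bulk_insert
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.bulk_insert(self.rows)   # one round-trip for many rows
            self.rows = []

# demo: collect the flushed batches instead of hitting a real database
batches = []
buf = InsertBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add((i, i * i))
buf.flush()                               # drain the remainder on shutdown
print([len(b) for b in batches])
# → [3, 3, 1]
```

In production you would also flush on a timer so a slow trickle of events doesn't sit in memory indefinitely, which is essentially what ClickHouse's buffer tables (discussed below in the thread) do for you server-side.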


Chproxy is designed to handle this

https://github.com/Vertamedia/chproxy


Thanks for sharing the link! I've heard the bulk insert thing before and, to be honest, I've always thought that RDBMSs don't love single-row inserts either. Seems ClickHouse takes that to a new level.

In our case we are using SQS and usually insert 20-100 rows into the db at a time, so I'm going to benchmark how that does in ClickHouse.


With Clickhouse you can use a "buffer table", which uses just RAM and sits on top of a normal table: https://clickhouse.com/docs/en/engines/table-engines/special...

Rows inserted into the buffer table are then flushed to the normal/base table when one of the limits (defined when the buffer table is created) is reached (limits are max rows, max bytes, max time since the last flush), or when you drop the buffer table.

I'm using it and it works (the performance difference can be huge compared to performing single inserts directly into a real/normal table), but be careful: the flushed rows give no guarantee about the sequence in which rows are flushed, so a buffer table is a very bad idea if your base table relies on receiving rows in the correct sequence.
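The flush policy described above (flush when any limit on rows, bytes, or elapsed time is hit) can be modeled in miniature. Parameter names here are illustrative, not ClickHouse's actual Buffer engine settings, and a real buffer table holds data in RAM server-side.

```python
# Toy model of a buffer table's flush policy: rows are held in memory and
# flushed to the base table when ANY limit is exceeded - max rows, max
# bytes, or max seconds since the last flush. A manual clock ("now") stands
# in for wall time so the behavior is deterministic.
import sys

class BufferTable:
    def __init__(self, base, max_rows=5, max_bytes=10_000, max_seconds=10):
        self.base, self.rows = base, []
        self.max_rows, self.max_bytes = max_rows, max_bytes
        self.max_seconds = max_seconds
        self.bytes = 0
        self.last_flush = 0.0

    def insert(self, row, now):
        self.rows.append(row)
        self.bytes += sys.getsizeof(row)
        if (len(self.rows) >= self.max_rows
                or self.bytes >= self.max_bytes
                or now - self.last_flush >= self.max_seconds):
            self.flush(now)

    def flush(self, now):
        # note: no ordering guarantee across flushes, as warned above
        self.base.extend(self.rows)
        self.rows, self.bytes, self.last_flush = [], 0, now

base = []
buf = BufferTable(base, max_rows=3, max_seconds=10)
for t, row in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    buf.insert(row, now=t)
print(len(base))   # the third insert trips max_rows and flushes a, b, c
# → 3
```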


On a project I worked on we found the sweet spot to be 20k-60k rows per insert.


I suppose it depends what you're going to let your user do, but OLAPs in general and Clickhouse in particular don't do well under row-oriented workloads, as described in the post here. I'm imagining users primarily operating on small numbers of rows and sometimes making updates to or deleting them, a worst-case scenario for Clickhouse but best-case for an OLTP like Postgres.


Ah totally. Thanks for sharing your thoughts! In my case I’m evaluating clickhouse as a source of truth for customer telemetry data. Totally agree about the OLTP limitations.


(Remember that clickhouse is not reliable. It doesn’t pretend to be.

Clickhouse is great for lots of common query workloads, but if losing your data would be a big deal then it makes a lot of sense to have your data in a reliable and backed up place (eg timescale or just s3 files or whatever) too.

Of course lots of times people chuck stuff into clickhouse and it’s fine if they lose a bit sometimes. YMMV.)


I have not found this to be the case. Like any system you need to take precautions (replicas and sharding) to ensure no data loss, but I didn't find that to be challenging. In what way have you found ClickHouse particularly risky in this way?


It’s basic computer science. Clickhouse doesn’t fsync etc.

Clickhouse (and other systems with the same basic architecture, like elastic search and, shudder, mongodb) work very well on happy path. They are not advertising themselves as ACID.


You can enable fsync in ClickHouse. And it will not decrease bandwidth.


MongoDB has ACID support.


> Clickhouse the obvious choice for a warehouse

> Clickhouse/Snowflake/Redshift

But ClickHouse is very unlike the other two. When I think of a warehouse I think star schema, data modeling, etc., not something that hates joins.


Agreed, I wouldn't use Clickhouse for usual warehouse stuff either, mostly because I can't imagine it plays well with dbt which is a non-starter these days.

I'd still argue Clickhouse is closer to Snowflake/Redshift than anything OLTP, and their name is intentionally chosen to evoke warehouse-like scenarios.


What makes you think CH doesn’t like joins?

Having used Redshift, Snowflake and CH for similar workloads, I’d much prefer ClickHouse to the other 2.

Snowflake is hideously expensive for the subpar perf it offers in my experience and Redshift is mediocre at best in general.


ClickHouse is nothing like Snowflake. A Postgres user can be productive in ClickHouse in minutes; not even close for Snowflake.


Is your comment on ClickHouse and DBT based on using the DBT ClickHouse plugin? [0] If so I would be very interested in understanding what you or others see as deficiencies.

[0] https://github.com/silentsokolov/dbt-clickhouse


How many data points were those aggregations being computed over? How much memory does your Postgres server have, and are you using SSD storage (with associated postgres config tweaks)?


(Post author)

Howdy! We provided all of those details in the post and you're welcome to join us next week when we live-stream our setup and test!

https://blog.timescale.com/blog/what-is-clickhouse-how-does-...


I was responding to @andrejserafim, asking about their scenario, not the article.


Gotcha! My apologies for not seeing the thread nature. HN threads get me sometimes. :-)


Works fine for me ;)


Then it's no longer just your data. Someone else now also has a copy. How do you know they don't leak it or provide it to someone? There's value in having your data only local, with some off-site arrangement.


Backblaze's personal backup has a feature to use your own private key to encrypt your backup data before transmission.


I trust Google and Apple to secure my data more than I trust myself.


I trust Google to randomly lock me out because their stupid AI determined that I'm a suspicious geek instead of a normal person. It's happened before, it will happen again.

Very secure but not in my hands. No thanks.


If the government wants my data, they can just raid me and take my home server. I trust that google can secure it from random hackers better than I can.


Arq Backup will encrypt your data (supports a bunch of different backends Google, AWS, etc, including your own)


Syncthing is great! Are you using ZeroTier for transport security, or does it improve speed as well? Off-site syncs are far too slow for me. Luckily, it's very rare that I have to do them.


I'm using ZeroTier so that my devices can stay in sync with my Raspberry Pi at home when I'm outside my local network. I found this a simpler solution than dynamic DNS.


Disclaimer: used to work at GS. Left to work remotely in another region.

GS is a big company with lots of people; some 8k of them are in the tech division. And the competition for employees is not just with other financial firms, but also with the rest of the tech sector.

So as other commenters say, if other places offer similar conditions but also remote work, it may be a differentiator. In practice, though, working from home some days a week has been totally normal in tech for years, so I expect little actual impact on the company.

