
A lot of excellent information in that blog post and linked from it... but if you're wondering where to start:

1. Write good logs... not too noisy when everything is running well, meaningful enough to tell you the key state or branch of code when things deviate from the good path. Don't worry about structured vs unstructured too much; just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug (rough sketch of both points after no. 2).

2. Instrument metrics using Prometheus; there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counters get you started, but you probably want to think in aggregations and ask about the rate of things and percentiles. Use histograms for this: https://prometheus.io/docs/practices/histograms/ . Use labels to build a richer picture, e.g. a histogram of HTTP request times with a label for the HTTP method means you can look at all requests, just the POSTs, or the HEADs and GETs together, and then compute rates over time, percentiles, etc. Do think about the cardinality of label values: HTTP method is a good label, but request identifiers are bad in high-traffic environments... labels should group, not identify.
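
A rough sketch of both points in Python, assuming the official prometheus_client library (the format string, metric name, port, and label values are just illustrative):

    import logging
    import time

    from prometheus_client import Histogram, start_http_server

    # 1. Logs with a timestamp, level, file:line, and function name baked in.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(funcName)s - %(message)s",
    )
    log = logging.getLogger(__name__)

    # 2. A latency histogram labelled by HTTP method (low cardinality: it
    #    groups requests, it doesn't identify individual ones).
    REQUEST_SECONDS = Histogram(
        "http_request_duration_seconds",
        "Time spent handling HTTP requests",
        ["method"],
    )

    def handle_request(method: str) -> None:
        with REQUEST_SECONDS.labels(method=method).time():
            log.info("handling %s request", method)
            time.sleep(0.05)  # stand-in for real work

    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for Prometheus to scrape
        while True:
            handle_request("GET")

rate() and histogram_quantile() over the resulting _bucket series then give you request rates and percentiles, overall or per method.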

Start with those two things. Tracing follows good logging and metrics, as it takes a little more effort to instrument an entire system, whereas logging and metrics are valuable even when only small parts of a system are instrumented.

Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.

If it's a big project, you have a lot of options and I assume you know them already; this is when you start looking at Cortex and Thanos, Datadog and Loki, and tracing with Jaeger.



I’d add to no. 1 by saying include a correlation id or request id so you have a way to filter the logs into a single linear stream related to the same action.
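
For example (a Python sketch using contextvars and a logging filter; the names and format are made up, the point is just that every line carries the id):

    import contextvars
    import logging
    import uuid

    request_id_var = contextvars.ContextVar("request_id", default="-")

    class RequestIdFilter(logging.Filter):
        """Stamp the current request id onto every log record."""
        def filter(self, record: logging.LogRecord) -> bool:
            record.request_id = request_id_var.get()
            return True

    handler = logging.StreamHandler()
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
    )
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    def handle_request() -> None:
        # Set once at the edge, or reuse an incoming X-Request-ID header.
        request_id_var.set(str(uuid.uuid4()))
        logging.info("request started")
        logging.info("request finished")

    handle_request()

Grepping (or filtering in your log aggregator) for one id then gives you that single linear stream.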


Absolutely. Constantly being able to get a single linear stream is when tracing becomes super powerful.


Also, adding a context tag can be very powerful, especially in the world of microservices and event-driven flows like monthly payments.

Imagine: a user registers, the POST gets an id and the context "registration". Then they add a credit card (a new id, context "credit payments"). After 14 days the bill goes out: same id, same context.
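
In data terms, something like this (field names are just an example):

    # The credit-card flow and the bill 14 days later share the same id and
    # context, so one filter pulls out the whole payment story for this user.
    events = [
        {"id": "req-001", "context": "registration",    "event": "user_registered"},
        {"id": "req-002", "context": "credit_payments", "event": "card_added"},
        {"id": "req-002", "context": "credit_payments", "event": "invoice_sent"},  # 14 days later
    ]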


> Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki)

I haven't looked at their pricing before, but for small-ish environments, their standard plan looks really good and simple. None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).


> None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).

I was thinking "God, this is exactly why I hate Datadog" as I was reading your description and got a great laugh when I reached the end. Their billing is absolutely byzantine.

I don't know that I've ever seen a company that had such a stark difference between great engineering/product and awful business/sales practices. Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill. Their sales teams are some of the worst I've dealt with, and I deal with a lot of vendors. They're starting to get a really bad reputation as well.


I'm a customer that uses most of their tools (no network performance monitoring since it's less useful than a service mesh and no logging because we need longer history than most and cost would be prohibitive).

Is it really that expensive when compared to other vendors? I thought their newer logging tool was a lot cheaper than Splunk, and their APM tool for distributed tracing is also pretty cheap compared with something like New Relic. Sure, it's more expensive than free tools that you need to set up yourself, but the velocity it gives your teams is so much better than having to use something like Grafana with tools like Prometheus. Again, sure, it can be done for cheap, but the time it takes to manage those tools, and the velocity you lose doing it, don't seem worth it for smaller companies; I can see it making more sense as you scale a company.


It's not the cost per se, though I do think they're pretty high for some features. It's the pricing models and the associated patterns.

For instance, you have to pay Datadog per host you install the agent on. In addition to the per-host cost, you have to pay per container you run on that host (past a very small baseline per host), and the per-container cost turns out to be nearly as high as the per-host cost if you have reasonable density. Why am I paying Datadog per container I run? Aside from a not particularly useful dashboard, why does a process namespace and some cgroup metrics nearly double my bill? They are literally just processes on a server. Because Datadog wants you to run more hosts, so you install more agents.

Every feature they add also seems to be charged separately, but is not behind any sort of feature gate. This means new features just show up for my developers, and they have no clue if it costs money to use them. I can't just disable or cap, for example, their custom metrics per user, per project, or at all. So when my developers see a useful feature and start using it, all of a sudden I have an extra $10k on my monthly bill. Even more fun are features that show up and are initially free but then start charging.

This is such a pain that we've had to tell dev teams not to use Datadog features outside of a curated list. Every product has some rough edges, but with Datadog the patterns are all set up such that you end up paying them thousands of extra dollars. Again, great product, but not a business I would be interested in associating with again given the choice.


It's not so much the total cost, but the fact that there's so much nickel and diming. When Trace Analytics came out they tried getting us to turn it on, and it's like... we're already paying for APM and you want to charge us more? At least tell us how much more, and they couldn't. I think it probably ended up not being a ton of money, but just the question was enough for us to not do it. From working with other providers, it's also much easier working with our finance team if we can say 'it costs at most this' instead of 'it costs at least this'.


> Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill.

You might want to check out New Relic One, especially with the new pricing model. I think they even added a Prometheus integration recently?


For APM and Front-end monitoring, you can try https://www.atatus.com/


Twice I fell into the trap of datadog, having paid more than $300 for a single month in each case.

The simplicity of it, dashboards, notebooks, logs etc, is what makes it so appealing though.


If it saves you more than an hour or two, $300 seems super worth it for a system making you money...


It depends where you are in the world. When I was working in Switzerland, most SaaS pricing was a no-brainer for us. But since I work in Latin America for small companies with local customers, all the different services and tools you might want to use, with prices targeted at "western" customers, much more quickly add up to the equivalent of having multiple people on staff full time.

Still, it is often not worth it to roll your own, so it is nice to have alternatives for different price points and company scales.


> It depends where you are in the world.

Exactly this. We operate in Eastern Europe with local clients, offering on-prem SaaS. If I added all my clients' servers to Datadog it would very easily eat through our profit margins.

> Still, it is often not worth it to roll your own

I tried hard not to, but in the end, after spending a week trying to set up Netdata and failing, I decided not to spend another week trying to set up Grafana/InfluxDB/Prometheus (lots of docs to go through), and just have some bash scripts send metrics to a small Node service on a $10 DigitalOcean box that sends me emails/SMS when something "looks bad" (e.g. high CPU temperature, stopped Docker containers, etc).

I gave up on aggregated logging for the time being, since I can just ssh into each server and check the journal and Docker logs if I need to (as long as the hard drives don't crash).
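
The checks themselves are nothing fancy; roughly this shape (a simplified Python stand-in for the actual bash, with made-up thresholds and endpoint):

    #!/usr/bin/env python3
    # Read a couple of local signals and POST an alert to a tiny receiver
    # when something looks bad. Endpoint and thresholds are illustrative.
    import json
    import subprocess
    import urllib.request

    ALERT_URL = "https://alerts.example.com/notify"
    CPU_TEMP_LIMIT_C = 85.0

    def cpu_temp_c() -> float:
        # e.g. /sys/class/thermal on Linux; adjust per host
        with open("/sys/class/thermal/thermal_zone0/temp") as f:
            return int(f.read().strip()) / 1000.0

    def stopped_containers():
        out = subprocess.run(
            ["docker", "ps", "--filter", "status=exited", "--format", "{{.Names}}"],
            capture_output=True, text=True, check=True,
        )
        return [name for name in out.stdout.splitlines() if name]

    def alert(message: str) -> None:
        req = urllib.request.Request(
            ALERT_URL,
            data=json.dumps({"message": message}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()

    if __name__ == "__main__":
        temp = cpu_temp_c()
        if temp > CPU_TEMP_LIMIT_C:
            alert(f"high CPU temperature: {temp:.1f}C")
        for name in stopped_containers():
            alert(f"container stopped: {name}")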


The trick with the Netdata agent is to try the install script if you run into trouble.


Yeah, having looked at what the script does I decided to 'containerize' the agent, and that led to other issues like configuring email alerts etc.

I was already a week deep into looking at various options and had to deliver on basic metrics and alerting, so I figured a couple of bash scripts that log to local files (with log rotation and systemd), plus a dumb, memory-only receiving end running on Node.js for the alerts, would be much faster and easier to maintain.

So far so good.


Tangent: what's the difference between on-prem SaaS and just "software"?


I guess it refers to them servicing the software (updating, reacting to downtime, etc.) while it is deployed on premises? In contrast to the client's IT department doing so.


Yes that's right.


We provide both the hardware and the software. Our clients are pretty small, with no IT staff, and no technical expertise.


It used to be called "private cloud", didn't it?


If you're in an established company making money - yes. If you're bootstrapping a service and counting on $50 total monthly cost while initial users are signing up - no.


From the post: "The key, he says, is using the right transaction identifiers so that calls can be traced across components, services, and queues".

I think this is a key feature not many people implement, especially in today's world of overblown microservices. Having a transaction id from the time the request hits the reverse proxy until the database write is so helpful in debugging; it saves a ton of time.
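
For illustration, the pattern is just: reuse the id if the proxy or upstream already set one, mint one if not, log it, and forward it on every outgoing call (the header name and URL below are made up; use whatever your standard says):

    import logging
    import uuid
    import urllib.request

    REQUEST_ID_HEADER = "X-Request-ID"
    logging.basicConfig(level=logging.INFO)

    def get_or_create_request_id(incoming_headers: dict) -> str:
        # Reuse the id the reverse proxy / upstream service set, or mint one.
        return incoming_headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())

    def call_downstream(url: str, request_id: str) -> bytes:
        # Forward the same id so the next hop logs it too.
        req = urllib.request.Request(url, headers={REQUEST_ID_HEADER: request_id})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.read()

    def handle(incoming_headers: dict) -> None:
        request_id = get_or_create_request_id(incoming_headers)
        logging.info("[%s] handling request", request_id)
        call_downstream("http://billing.internal/charge", request_id)
        logging.info("[%s] done", request_id)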


100%. If you manage to get OpenTracing to work, it is a brilliant debugging tool.


You can also just propagate a UUID throughout your system. I've used both UUIDs and OpenTracing. Just don't get hung up if OpenTracing seems too complex.


I agree with this wholeheartedly. You can even define a standard and let downstream services opt in over time. Simple wins like this should not be put off because "someday" we're going to implement a complex distributed tracing solution.


So, in terms of full implementation, is the constraint the dev time to implement, or getting various factions to agree to something?

I guess, is it political, or technical?

Asking for a friend.

Thanks!


Often it's political, but political friction can feed into technical friction if there are also a half dozen different half wrappers around half-baked HTTP libraries. Also, if someone has included OpenTracing as a shadow library in a company-wide library (JVM territory), but you want it as a top-level dependency, you have to write translators.


I agree with 2. I have a presentation at https://www.polibyte.com/2019/02/10/pytennessee-presentation... which goes into how to get started with Prometheus. Not as in "how to set it up", but more about what to instrument and why, how to name things, etc. Despite the title, there's very little in it that's specific to Python.


> Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

If you include all this information but the logs are not structured, you won't get much out of them.
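
e.g. a hand-rolled JSON formatter in Python (just for illustration; in practice you'd likely use a library) makes the same fields trivially queryable:

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        def format(self, record: logging.LogRecord) -> str:
            return json.dumps({
                "ts": self.formatTime(record),
                "level": record.levelname,
                "file": record.filename,
                "line": record.lineno,
                "func": record.funcName,
                "msg": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.info("payment failed for order")
    # -> one JSON object per line: {"ts": "...", "level": "INFO", "file": "...", ...}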


Why should you have to use Prometheus? There are plenty of options, and good reasons why you might want to push data rather than pull. Measurements should minimize perturbation of the system being measured, and the (computer) system generating the data is likely best placed to determine when and how to emit it, when that matters -- e.g. in HPC, where jitter is important.


I would add that if you add a traceId to #1 and use something like https://www.jaegertracing.io/, you get even more.
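
A minimal sketch with OpenTelemetry in Python (console exporter here just to show the shape; in practice you'd point it at Jaeger via its exporter, and put the trace id into your log lines from #1):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Wire up the SDK; swap ConsoleSpanExporter for a Jaeger/OTLP exporter in practice.
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("handle_request") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        print(f"[trace_id={trace_id}] doing work")  # same id goes into your logs
        with tracer.start_as_current_span("db_query"):
            pass  # child span; shows up nested under handle_request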


I don't think Loki is production ready. Needs more work. It's going to be great with Grafana though.


What else do you think it needs? I’ve been playing around with v1.5 the past few days in Grafana and I like its simplicity. Grafana now has a Loki datasource, which is nice.


Gavin from Zebrium here. Completely concur with #1. We are big advocates of writing good logs and not having to worry about structured vs unstructured (and even if you structure your logs, you'll still probably have to deal with unstructured logs in third party components).

Our approach to dealing with logs is to use ML to structure them after the fact (and we can deal with changing log structures). You can read about it in a couple of our blog posts: https://www.zebrium.com/blog/using-ml-to-auto-learn-changing... and https://www.zebrium.com/blog/please-dont-make-me-structure-l....



