Vector is fantastic software. We're currently running a multi-GB/s log pipeline with it: Vector agents run as DaemonSets collecting pod and journald logs, then forward via Vector's protobuf protocol to a central Vector aggregator Deployment with various sinks - s3, gcs/bigquery, loki, prom.
The documentation is great but it can be hard to find examples of common patterns, although it's getting better with time and a growing audience.
My pro-tip has been to prefix your searches with "vector dev <query>" for best results on google. I think "vector" is/was just too generic.
I feel like the ecosystem is very, very close to ready for what I would consider to be a really nice medium-to-long-term queryable log storage system. In my mind, it works like this:
1. Logs get processed (by a tool like vector) and stored to a sink that consists of widely-understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn’t work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance and friendly tools can read them. Think ClickHouse, DuckDB, Spark, etc. (see the sketch after this list). Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
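Step 3 arguably works already with stock DuckDB. A rough Python sketch (the bucket, partition layout, and column names here are all made up, and in the full vision a Delta/Iceberg catalog would replace the raw path glob):

    # Sketch only: bucket, prefix, and column names are hypothetical.
    import duckdb

    con = duckdb.connect()
    # httpfs lets DuckDB read s3:// paths directly; credentials come from
    # the usual AWS environment variables.
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    con.execute("SET s3_region = 'us-east-1';")

    # Hive-style partitioning (dt=YYYY-MM-DD/...) lets DuckDB prune objects.
    rows = con.execute("""
        SELECT ts, pod, message
        FROM read_parquet('s3://my-log-bucket/logs/dt=2024-01-*/*.parquet',
                          hive_partitioning = true)
        WHERE level = 'error' AND message ILIKE '%timeout%'
        ORDER BY ts DESC
        LIMIT 100
    """).fetchall()

    for ts, pod, message in rows:
        print(ts, pod, message)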
Has anyone actually put the pieces together to make something like this work?
I work on something where we use Vector similar to this.
The application writes directly to a local Vector instance running as a daemon set, using the TCP protocol. That instance buffers locally in case of upstream downtime. It also augments each payload with some metadata about the origin.
The local one then sends to a remote Vector using Vector's internal Protobuf-based framing protocol. That Vector has two sinks, one which writes the raw data in immutable chunks to an object store for archival, and another that ingests in real time into ClickHouse.
This all works pretty great. The point of having a local Vector is so applications can be thin clients that just "firehose" out their data without needing a lot of complex buffering, retrying, etc. and without a lot of overhead, so we can emit very fine-grained custom telemetry data.
There is a tiny bit of retrying logic with a tiny bit of in-memory buffering (Vector can go down or be restarted and the client must handle that), but it's very simple, and designed to sacrifice messages to preserve availability.
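For the curious, here's a rough Python sketch of that thin-client pattern (not the actual implementation). It assumes the local Vector runs a socket source in TCP mode on 127.0.0.1:9000 accepting newline-delimited JSON; the port, buffer size, and event fields are all made up:

    import collections
    import json
    import socket
    import time

    VECTOR_ADDR = ("127.0.0.1", 9000)  # hypothetical local Vector TCP source

    # Tiny in-memory buffer: when it fills up, the oldest events are dropped,
    # i.e. sacrifice messages to preserve availability.
    buffer = collections.deque(maxlen=1000)
    conn = None

    def emit(event: dict) -> None:
        """Queue an event; never block or raise toward the application."""
        buffer.append((json.dumps(event) + "\n").encode())
        flush()

    def flush() -> None:
        global conn
        while buffer:
            try:
                if conn is None:
                    conn = socket.create_connection(VECTOR_ADDR, timeout=0.1)
                conn.sendall(buffer[0])
                buffer.popleft()
            except OSError:
                # Vector is down or restarting: keep the bounded buffer
                # and retry on the next emit.
                if conn is not None:
                    conn.close()
                conn = None
                return

    emit({"ts": time.time(), "kind": "request", "route": "/checkout", "ms": 12.3})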
Grafana is a nice way to use ClickHouse. ClickHouse is a bit more low level than I'd like (it often feels more like a "database construction kit" than a database), but the design is fantastic.
Depending on your use case, and if you can tolerate missing a few logs, sending log data via UDP is helpful so you don’t interrupt the app. I have done this to good effect, though not with vector. Our stuff was custom and aggregated many things into 1s chunks.
Dropping messages occasionally can be fine, but the problem with UDP is that the packet loss is silent.
I believe UDP can be lossy even on localhost when there's technically no network, so you'd have to track the message count on both the sender and recipient sides. It's also more sensitive to minor glitches, whereas TCP + a very small buffer would allow you to smooth over those cases.
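A rough sketch of that counting idea (port and payload format are made up): run receive() in one process and send() in another, then compare the two counts. Even on localhost, gaps usually show up once the receiver can't drain the kernel buffer fast enough.

    import json
    import socket

    ADDR = ("127.0.0.1", 9999)  # hypothetical port

    def send(n: int = 100_000) -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for seq in range(n):
            sock.sendto(json.dumps({"seq": seq, "msg": "hello"}).encode(), ADDR)
        print(f"sender: sent {n} datagrams")

    def receive() -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(ADDR)
        sock.settimeout(2.0)
        received = 0
        last_seq = -1
        try:
            while True:
                payload, _ = sock.recvfrom(65535)
                seq = json.loads(payload)["seq"]
                if seq != last_seq + 1:
                    print(f"receiver: gap between seq {last_seq} and {seq}")
                last_seq = seq
                received += 1
        except socket.timeout:
            pass
        print(f"receiver: got {received} datagrams, highest seq seen {last_seq}")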
I use NATS for a similar kind of firehose system (core NATS is fire-and-forget, much like UDP), and the amount of loss can sometimes reach 6-7%.
Quickwit is very similar to what is described here.
Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.
We did not pick Parquet because we want to actually be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.
We are planning to eventually add ways to tap into the files and get data in the Apache Arrow format.
From a quick skim through the docs, it wasn’t clear to me: can I run a stateless Quickwit instance or even a library to run queries, such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?
Would this fit your medium-to-long-term need? It's a weekend of work to automate: JSON logs go to Kafka, a Logstash consumer stores batches as hive-partitioned data in S3 with gzip compression, Athena tables sit over those S3 prefixes, and the Presto SQL dialect is used to query/cast/aggregate the data.
Much more reliable than beats and vendor-specific forwarders (Chronicle forwarder and FDR) in our experience. VRL is also pretty useful for "preparsing" massive logs, e.g. AWS CloudTrail and Imperva ABP.
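For the query end of that Kafka -> S3 -> Athena pipeline, a rough boto3 sketch (database, table, result bucket, partition column, and fields are all hypothetical):

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    QUERY = """
    SELECT date_trunc('hour', from_iso8601_timestamp(ts)) AS hour,
           count(*) AS events
    FROM logs.app_json
    WHERE dt = '2024-01-15'   -- hive partition column, prunes S3 prefixes
      AND level = 'error'
    GROUP BY 1
    ORDER BY 1
    """

    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "logs"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes, then print the result rows.
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        result = athena.get_query_results(QueryExecutionId=qid)
        for row in result["ResultSet"]["Rows"][1:]:  # first row is the header
            print([col.get("VarCharValue") for col in row["Data"]])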
Timber definitely intended to just rock out & demolish everything else out there with their agent/forwarder/aggregator tech. But it wasn't a competitive play against OTel, in my humble opinion. Timber's whole shtick is that it integrates with everything, with really flexible/good glue logic in-between. A competent multi-system (logging, metrics, eventually traces) fluentd++. OTel - I want to believe - would have been part of that original vision.
I’ve used this before, to great success. Nice and straightforward to configure, the vrl language is just powerful enough for its needs, the cli’s handy “check” feature helps you catch a bunch of config issues. Performance wise it’s never missed a beat and it’s resource efficient, strongly recommend.
We had to push metrics we scrape via Prometheus into DataDog (coincidence that they acquired this) and do a custom transform to map to a set of custom metrics.
Very straightforward in how it runs and the helm chart had all the right things in there
OTel support in Vector is an often-requested feature, across multiple threads. There are good noises & the occasional "we'll get to it", but so far there's just OTel log ingest support, which has been there for a while now. https://github.com/vectordotdev/vector/issues/17307
I'm excited for these front-end telemetry routers to keep going. Really hoping Vector can co-evolve with and grow with the rest of the telemetry ecosystem. Otel itself has really started in on the next front with OpAMP, Open Agent Management Protocol, to allow online reconfiguration. I'd love to know more about Vector's online management... quick scan seems to say it's rewriting your JSON config & doing a SIGHUP. https://opentelemetry.io/docs/specs/opamp/
Vector's configurability & fast-and-slim promise look amazing. Everyone would be so much better off if it can grow to interop well with OTel. Really hoping here.
In a sense, OTel is a big threat to Datadog, so I can imagine slow-rolling support is one way to manage that without looking actively hostile to it, similarly to how Datadog has other OTel "support" that doesn't play nicely with a lot of their more valuable tools/features.
Datadog has had a number of "come to jesus" moments in the past couple years where they've had to embrace OTel, but yeah, I confess to an inner fear telling me that the Vector acquisition is as much to hold the tech back as it is to develop it.
I’m personally still waiting for OTel stuff to just…evolve a bit more? There are some sharp edges, and a bunch of “bits” in that ecosystem where it isn’t clear how we’re supposed to hold them, and things don’t _quite_ work well enough yet.
Don’t get me wrong, I want to use OTEL, but it’s a struggle. In the meantime, I’ve still got normal apps and libraries outputting normal logs and normal prom metrics, so I’ve got to stick with that.
What aspects were you missing from OTEL? We swapped the agent side out from New Relic to OTEL in NodeJS, .NET and Python - I’ve not found any major missing feature?
Or are you thinking more on the UI/analysis/collection side?
Oh you need a collector? But maybe you don’t - because some libs will push it? Ok so we got that setup, but now half the traces don’t turn up? Or they do, but they’re missing the ids to link them together? I’ve got 30m traces from an AWS lib we use, but none of ours? Oh also our logs don’t come across? Because logs require some different handling or something and some intermediary didn’t support them yet? Grafana seemed to support some things, and not others. To say nothing of the absolute plethora of config options available on the collectors and exporters: there’s like 3 or 4 different ways to define sampling and filtering, in a different layer each and they all appear to cross interact, so you can accidentally choose configs that prevent you from getting data with no indication of where it’s gone missing.
I’m keen for it to all shake down a little bit, because I’d love to be able to just bang #[instrument] on all our functions and derive logs and metrics from trace data, but it seems things are a while off that yet.
Vector to me is more than just "high-performance" - it's a true swiss army knife for metrics and logging. We regularly use it to transform logs into metrics, convert metrics into different formats, push them to different datastores, filter them, etc. It's wild how flexible this program is. It has become my first choice for anything regarding gathering/aggregating/filtering/preprocessing observability data.
Is there a way to temporarily connect to Vector and select either sources or sinks to be duplicated into that stream (say, stdout or a TCP socket)? I'd love to find a use case for Logdy where I can just stream whatever is landing in Vector to Logdy[1] and literally see everything through a web UI. The use case would be debugging a complex observability pipeline, as Logdy serves a UI for everything that lands on it and allows you to parse and filter easily.
I'm just getting to know Vector. I have noticed that most Vector examples and discussions are targeted towards databases or complex multi-tenant applications. And it looks really cool!
Has anyone tried Vector in the context of autonomous vehicles, essentially a distributed system, where Vector would serve the purpose of aggregating the op-logs, system state, and input and output of every application at every instance?
I only learned about Vector after I had set up a new fluent-bit pipeline, and I have to say there's a lot of stuff in Vector that looks interesting; I wish I had time to play with it earlier. Might still do it when I have some downtime - it looks very interesting and capable, and could be fun to try on a new project.
Vector is great. We use it for log shipping and it has always performed wonderfully, and it replaced a logstash setup that was not really doing what we needed it to. I also feel like I'm only scratching the surface of Vector and would love to use it more.
What are some use cases people have had with it besides log shipping?
Without having thought much about this, surely datadog only want to store your data and have you pay for the storage/indexing/querying? I guess your worry is something like datadog making themselves the only possible backend? I don’t feel like that’s a very big risk – I think trying it would just lead to a fork of vector. Perhaps a more realistic risk is that vector would implicitly assume datadog’s constraints, eg (making these up without knowing much about datadog) field types or required information or the expected number of unique fields across all messages.
Yeah the trick is if they can lock you into a stack that sits everywhere in your apps, it’s very expensive to switch vendors, letting them extract high rents. This is what happened with the Datadog agents.
In that context, OTEL is an existential threat, because it makes them a commodity. Then it becomes relatively clear why they wouldn’t put OTEL support in the Vector roadmap.
I guess I’m surprised that your claim is basically that datadog’s advantage is in ingestion. I would have assumed they would be focusing on trying to make a product so good that people wouldn’t want to switch from it. Vector supporting multiple backends would be good for datadog if it can get more people in the door, so long as their product is compelling enough for people to stay.
I don’t know what exactly you mean about otel but elsewhere in this comments section someone linked to upscale, which uses vector to collect otel logs. Is that a counterexample?
My experience with using Datadog at (some) scale was that they focused on making it really, really easy to integrate their agent with your apps, and then once they had a large base of users with high switching costs they started rapidly raising prices.
In other words: My claim isn’t that they were better at ingestion but at onboarding and at creating switching costs.
Since that was how their leadership acted last time I used their code, I expect the same leadership to act the same way again with this other piece of code they own.
Given my experience with Datadog's pricing lock-in-and-switch, yeah, 100% I’d rather run agents that allow me to pick the collection backend than another tool from Datadog.
Not directly. Vector is a tool to build pipelines that receive, transform, and send data. But it doesn't index data to make it searchable. You can use it to ingest into Splunk, however.
Vector could be used to build something Splunk-like. For example, you can use it to ship logs into Kafka, then let it ingest that data into ClickHouse, and then use a frontend like Grafana to search logs using ClickHouse.
> you could ship logs into Splork, then let it ingest that data into FlorbHut, and then use a frontend like Glorply to search logs
Seriously though, is there a single OSS product that does all of this? Like, for a small multitenant app (i.e. not "web-scale"), one that doesn't force you to get a degree in observability just to get stuff done.
It's more like a fluentd alternative, with a lot of improvements: much better throughput if you're doing anything nontrivial, a nicer overall architecture and language, and a more 'batteries included' integrations ecosystem.
We're in the process of switching to this at work for some very high volume logs and I'm quite hopeful - other teams saw pretty decent improvements.
I don’t think hosted (SaaS?) Vector makes much sense. It’s meant to sit next to your deployed workloads, ingesting data and sending it elsewhere for storage/analysis/whatever (and it does make sense for that target to be hosted - think Datadog, Betterstack, etc). So you’d deploy it as a sidecar in a pod, or as another container in your deployment, or as a plain old executable on your server/VM, or whatever.
Also a UI is relatively pointless since you only mess with the config occasionally and otherwise just leave Vector running doing its thing. Want to see how it’s doing? Have your Prometheus scrape Vector for its own metrics and set up alerts/analysis using Prometheus itself or Grafana.
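If you want to eyeball that self-telemetry before wiring up Prometheus proper, something like this works (the address, path, and metric names are assumptions based on an internal_metrics source feeding a prometheus_exporter sink; check your own config):

    import urllib.request

    EXPORTER_URL = "http://127.0.0.1:9598/metrics"  # hypothetical address

    with urllib.request.urlopen(EXPORTER_URL, timeout=2) as resp:
        body = resp.read().decode()

    # Print a few throughput/error series; in production let Prometheus
    # scrape this endpoint and alert on it instead.
    for line in body.splitlines():
        if line.startswith(("vector_component_received_events_total",
                            "vector_component_sent_events_total",
                            "vector_component_errors_total")):
            print(line)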
We're building something similar at Tenzir, but more for operational security workloads. https://docs.tenzir.com
Differences to Vector:
- An agent has optional indexed storage, so you can store your data there and pick it up later. The storage is based on Apache Feather, Parquet's little brother.
- Pipeline operators work with both data frames (Arrow record batches) and chunks of bytes.
- Structured pipelines are multi-schema, i.e., a single pipeline can process streams of record batches with different schemas.
A nice recent contribution added an alternative to prometheus pushgateway that handles counters better: https://github.com/vectordotdev/vector/issues/10304#issuecom...