Vector is fantastic software. We're currently running a multi-GB/s log pipeline with it: Vector agents run as DaemonSets collecting pod and journald logs, then forward via Vector's protobuf protocol to a central Vector aggregator Deployment with various sinks - s3, gcs/bigquery, loki, prom.
The documentation is great but it can be hard to find examples of common patterns, although it's getting better with time and a growing audience.
My pro-tip has been to prefix your searches with "vector dev <query>" for best results on google. I think "vector" is/was just too generic.
I feel like the ecosystem is very, very close to ready for what I would consider to be a really nice medium-to-long-term queryable log storage system. In my mind, it works like this:
1. Logs get processed (by a tool like vector) and stored to a sink that consists of widely-understood files in an object store. Parquet format would be a decent start. (Yscope has what sounds like a nifty compression scheme that could layer in here.)
2. Those log objects are (transactionally!) enrolled into a metadata store so things can find them. Delta Lake or Iceberg seem credible. Sure, these tools are meant for Really Big Data, but I see no reason they couldn’t work at any scale. And because the transaction layer exists as a standalone entity, one could run multiple log processing pipelines all committing into the same store.
3. High-performance and friendly tools can read them. Think ClickHouse, DuckDB, Spark, etc. (see the sketch after this list). Maybe everything starts to support this as a source for queries.
4. If you want to switch tools, no problem — the formats are standard. You can even run more than one at once.
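Step 3 arguably works already with stock DuckDB. A rough Python sketch (the bucket, partition layout, and column names here are all made up, and in the full vision a Delta/Iceberg catalog would replace the raw path glob):

    # Sketch only: bucket, prefix, and column names are hypothetical.
    import duckdb

    con = duckdb.connect()
    # httpfs lets DuckDB read s3:// paths directly; credentials come from
    # the usual AWS environment variables.
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    con.execute("SET s3_region = 'us-east-1';")

    # Hive-style partitioning (dt=YYYY-MM-DD/...) lets DuckDB prune objects.
    rows = con.execute("""
        SELECT ts, pod, message
        FROM read_parquet('s3://my-log-bucket/logs/dt=2024-01-*/*.parquet',
                          hive_partitioning = true)
        WHERE level = 'error' AND message ILIKE '%timeout%'
        ORDER BY ts DESC
        LIMIT 100
    """).fetchall()

    for ts, pod, message in rows:
        print(ts, pod, message)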
Has anyone actually put the pieces together to make something like this work?
I work on something where we use Vector similar to this.
The application writes directly to a local Vector instance running as a daemon set, using the TCP protocol. That instance buffers locally in case of upstream downtime. It also augments each payload with some metadata about the origin.
The local one then sends to a remote Vector using Vector's internal Protobuf-based framing protocol. That Vector has two sinks, one which writes the raw data in immutable chunks to an object store for archival, and another that ingests in real time into ClickHouse.
This all works pretty great. The point of having a local Vector is so applications can be thin clients that just "firehose" out their data without needing a lot of complex buffering, retrying, etc. and without a lot of overhead, so we can emit very fine-grained custom telemetry data.
There is a tiny bit of retrying logic with a tiny bit of in-memory buffering (Vector can go down or be restarted and the client must handle that), but it's very simple, and designed to sacrifice messages to preserve availability.
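For the curious, here's a rough Python sketch of that thin-client pattern (not the actual implementation). It assumes the local Vector runs a socket source in TCP mode on 127.0.0.1:9000 accepting newline-delimited JSON; the port, buffer size, and event fields are all made up:

    import collections
    import json
    import socket
    import time

    VECTOR_ADDR = ("127.0.0.1", 9000)  # hypothetical local Vector TCP source

    # Tiny in-memory buffer: when it fills up, the oldest events are dropped,
    # i.e. sacrifice messages to preserve availability.
    buffer = collections.deque(maxlen=1000)
    conn = None

    def emit(event: dict) -> None:
        """Queue an event; never block or raise toward the application."""
        buffer.append((json.dumps(event) + "\n").encode())
        flush()

    def flush() -> None:
        global conn
        while buffer:
            try:
                if conn is None:
                    conn = socket.create_connection(VECTOR_ADDR, timeout=0.1)
                conn.sendall(buffer[0])
                buffer.popleft()
            except OSError:
                # Vector is down or restarting: keep the bounded buffer
                # and retry on the next emit.
                if conn is not None:
                    conn.close()
                conn = None
                return

    emit({"ts": time.time(), "kind": "request", "route": "/checkout", "ms": 12.3})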
Grafana is a nice way to use ClickHouse. ClickHouse is a bit more low level than I'd like (it often feels more like a "database construction kit" than a database), but the design is fantastic.
Depending on your use case, and if you can tolerate missing a few logs, sending log data via UDP is helpful so you don’t interrupt the app. I have done this to good effect, though not with vector. Our stuff was custom and aggregated many things into 1s chunks.
Dropping messages occasionally can be fine, but the problem with UDP is that the packet loss is silent.
I believe UDP can be lossy even on localhost when there's technically no network, so you'd have to track the message count on both the sender and recipient sides. It's also more sensitive to minor glitches, whereas TCP + a very small buffer would allow you to smooth over those cases.
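A rough sketch of that counting idea (port and payload format are made up): run receive() in one process and send() in another, then compare the two counts. Even on localhost, gaps usually show up once the receiver can't drain the kernel buffer fast enough.

    import json
    import socket

    ADDR = ("127.0.0.1", 9999)  # hypothetical port

    def send(n: int = 100_000) -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for seq in range(n):
            sock.sendto(json.dumps({"seq": seq, "msg": "hello"}).encode(), ADDR)
        print(f"sender: sent {n} datagrams")

    def receive() -> None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(ADDR)
        sock.settimeout(2.0)
        received = 0
        last_seq = -1
        try:
            while True:
                payload, _ = sock.recvfrom(65535)
                seq = json.loads(payload)["seq"]
                if seq != last_seq + 1:
                    print(f"receiver: gap between seq {last_seq} and {seq}")
                last_seq = seq
                received += 1
        except socket.timeout:
            pass
        print(f"receiver: got {received} datagrams, highest seq seen {last_seq}")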
I use NATS for a similar kind of firehose system (core NATS is fire-and-forget, much like UDP), and the amount of loss can sometimes reach 6-7%.
Quickwit is very similar to what is described here.
Unfortunately, the files are not in Parquet, so even though Quickwit is open source, it is difficult to tap into the file format.
We did not pick Parquet because we want to actually be able to search and do analysis efficiently, so we ship an inverted index, a row-oriented store, and a columnar format that allows for random access.
We are planning to eventually add ways to tap into the files and get data in the Apache Arrow format.
From a quick skim through the docs, it wasn’t clear to me: can I run a stateless Quickwit instance or even a library to run queries, such that the only data accessed is in the underlying object store? Or do I need a long-running search instance or cluster?
Would this fit your medium-to-long-term need? It's a weekend of work to automate: JSON logs go to Kafka, a Logstash consumer stores batches as hive-partitioned data in S3 with gzip compression, Athena tables sit over those S3 prefixes, and the Presto SQL dialect is used to query/cast/aggregate the data.
Much more reliable than beats and vendor-specific forwarders (Chronicle forwarder and FDR) in our experience. VRL is also pretty useful for "preparsing" massive logs, e.g. AWS CloudTrail and Imperva ABP.
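For the query end of that Kafka -> S3 -> Athena pipeline, a rough boto3 sketch (database, table, result bucket, partition column, and fields are all hypothetical):

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    QUERY = """
    SELECT date_trunc('hour', from_iso8601_timestamp(ts)) AS hour,
           count(*) AS events
    FROM logs.app_json
    WHERE dt = '2024-01-15'   -- hive partition column, prunes S3 prefixes
      AND level = 'error'
    GROUP BY 1
    ORDER BY 1
    """

    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "logs"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes, then print the result rows.
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        result = athena.get_query_results(QueryExecutionId=qid)
        for row in result["ResultSet"]["Rows"][1:]:  # first row is the header
            print([col.get("VarCharValue") for col in row["Data"]])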
Timber definitely intended to just rock out & demolish everything else out there with their agent/forwarder/aggregator tech. But it wasn't a competitive play against OTel, in my humble opinion. Timber's whole shtick is that it integrates with everything, with really flexible/good glue logic in-between. A competent multi-system (logging, metrics, eventually traces) fluentd++. OTel - I want to believe - would have been part of that original vision.
I’ve used this before, to great success. Nice and straightforward to configure, the vrl language is just powerful enough for its needs, the cli’s handy “check” feature helps you catch a bunch of config issues. Performance wise it’s never missed a beat and it’s resource efficient, strongly recommend.
We had to push metrics we scrape via Prometheus into DataDog (coincidence that they acquired this) and do a custom transform to map to a set of custom metrics.
Very straightforward in how it runs and the helm chart had all the right things in there
OTel support in Vector is an often-requested feature, across multiple threads. There are good noises & the occasional "we'll get to it", but so far there's just OTel log ingest support, which has been there for a while now. https://github.com/vectordotdev/vector/issues/17307
I'm excited for these front-end telemetry routers to keep going. Really hoping Vector can co-evolve with and grow with the rest of the telemetry ecosystem. Otel itself has really started in on the next front with OpAMP, Open Agent Management Protocol, to allow online reconfiguration. I'd love to know more about Vector's online management... quick scan seems to say it's rewriting your JSON config & doing a SIGHUP. https://opentelemetry.io/docs/specs/opamp/
Vector's configurability & fast-and-slim promise look amazing. Everyone would be so much better off if it can grow to interop well with OTel. Really hoping here.
In a sense, OTel is a big threat to Datadog, so I can imagine slow-rolling support is one way to manage that without looking actively hostile to it, similarly to how Datadog has other OTel "support" that doesn't play nicely with a lot of their more valuable tools/features.
Datadog has had a number of "come to jesus" moments in the past couple years where they've had to embrace OTel, but yeah, I confess to an inner fear telling me that the Vector acquisition is as much to hold the tech back as it is to develop it.
I’m personally still waiting for OTel stuff to just…evolve a bit more? There are some sharp edges, and a bunch of “bits” in that ecosystem where it isn’t clear how we’re supposed to hold them, and things don’t _quite_ work well enough yet.
Don’t get me wrong, I want to use OTEL, but it’s a struggle. In the meantime, I’ve still got normal apps and libraries outputting normal logs and normal prom metrics, so I’ve got to stick with that.
What aspects were you missing from OTEL? We swapped the agent side out from New Relic to OTEL in NodeJS, .NET and Python - I’ve not found any major missing feature?
Or are you thinking more on the UI/analysis/collection side?
Oh you need a collector? But maybe you don’t - because some libs will push it? Ok so we got that setup, but now half the traces don’t turn up? Or they do, but they’re missing the ids to link them together? I’ve got 30m traces from an AWS lib we use, but none of ours? Oh also our logs don’t come across? Because logs require some different handling or something and some intermediary didn’t support them yet? Grafana seemed to support some things, and not others. To say nothing of the absolute plethora of config options available on the collectors and exporters: there’s like 3 or 4 different ways to define sampling and filtering, in a different layer each and they all appear to cross interact, so you can accidentally choose configs that prevent you from getting data with no indication of where it’s gone missing.
I’m keen for it to all shake down a little bit, because I’d love to be able to just bang #[instrument] on all our functions and derive logs and metrics from trace data, but it seems things are a while off that yet.
Vector to me is more than just "high-performance" - it's a true swiss army knife for metrics and logging. We regularly use it to transform logs into metrics, convert metrics into different formats, push them to different datastores, filter them, etc. It's wild how flexible this program is. It has become my first choice for anything regarding gathering/aggregating/filtering/preprocessing observability data.
Is there a way to temporarily connect to Vector and select either sources or sinks to be duplicated into that stream (say, stdout or a TCP socket)? I'd love to find a use case for Logdy where I can just stream whatever is landing in Vector to Logdy[1] and literally see everything through a web UI. The use case would be debugging a complex observability pipeline, as Logdy serves a UI for everything that lands on it and allows you to parse and filter easily.
I'm just getting to know Vector. I have noticed that most Vector examples and discussions are targeted towards databases or complex multi-tenant applications. And it looks really cool!
Has anyone tried Vector in the context of autonomous vehicles, essentially a distributed system, where Vector would serve the purpose of aggregating the op-logs, system state, and input and output of every application at every instance?
I only learned about Vector after I had set up a new fluent-bit pipeline, and I have to say there's a lot of stuff in Vector that looks interesting; I wish I had time to play with it earlier. Might still do it when I have some downtime - it looks very interesting and capable, and could be fun to try on a new project.
Vector is great. We use it for log shipping and it has always performed wonderfully, and it replaced a logstash setup that was not really doing what we needed it to. I also feel like I'm only scratching the surface of Vector and would love to use it more.
What are some use cases people have had with it besides log shipping?
Without having thought much about this, surely datadog only want to store your data and have you pay for the storage/indexing/querying? I guess your worry is something like datadog making themselves the only possible backend? I don’t feel like that’s a very big risk – I think trying it would just lead to a fork of vector. Perhaps a more realistic risk is that vector would implicitly assume datadog’s constraints, eg (making these up without knowing much about datadog) field types or required information or the expected number of unique fields across all messages.
Yeah the trick is if they can lock you into a stack that sits everywhere in your apps, it’s very expensive to switch vendors, letting them extract high rents. This is what happened with the Datadog agents.
In that context, OTEL is an existential threat, because it makes them a commodity. Then it becomes relatively clear why they wouldn’t put OTEL support in the Vector roadmap.
I guess I’m surprised that your claim is basically that datadog’s advantage is in ingestion. I would have assumed they would be focusing on trying to make a product so good that people wouldn’t want to switch from it. Vector supporting multiple backends would be good for datadog if it can get more people in the door, so long as their product is compelling enough for people to stay.
I don’t know what exactly you mean about otel but elsewhere in this comments section someone linked to upscale, which uses vector to collect otel logs. Is that a counterexample?
My experience with using Datadog at (some) scale was that they focused on making it really, really easy to integrate their agent with your apps, and then once they had a large base of users with high switching costs they started rapidly raising prices.
In other words: My claim isn’t that they were better at ingestion but at onboarding and at creating switching costs.
Since that was how their leadership acted last time I used their code, I expect the same leadership to act the same way again with this other piece of code they own.
Given my experience with Datadog's pricing lock-in-and-switch, yeah, 100% I’d rather run agents that allow me to pick the collection backend than another tool from Datadog.
Not directly. Vector is a tool to build pipelines that receive, transform, and send data. But it doesn't index data to make it searchable. You can use it to ingest into Splunk, however.
Vector could be used to build something Splunk-like. For example, you can use it to ship logs into Kafka, then let it ingest that data into ClickHouse, and then use a frontend like Grafana to search logs using ClickHouse.
> you could ship logs into Splork, then let it ingest that data into FlorbHut, and then use a frontend like Glorply to search logs
Seriously though, is there a single OSS product that does all of this? Like, for a small multitenant app (i.e. not "web-scale"), one that doesn't force you to get a degree in observability just to get stuff done.
It's more like a fluentd alternative, with a lot of improvements: much better throughput if you're doing anything nontrivial, a nicer overall architecture and language, and a more 'batteries included' integrations ecosystem.
We're in the process of switching to this at work for some very high volume logs and I'm quite hopeful - other teams saw pretty decent improvements.
I don’t think hosted (SaaS?) Vector makes much sense. It’s meant to sit next to your deployed workloads, ingesting data and sending it elsewhere for storage/analysis/whatever (and it does make sense for that target to be hosted - think Datadog, Betterstack, etc). So you’d deploy it as a sidecar in a pod, or as another container in your deployment, or as a plain old executable on your server/VM, or whatever.
Also a UI is relatively pointless since you only mess with the config occasionally and otherwise just leave Vector running doing its thing. Want to see how it’s doing? Have your Prometheus scrape Vector for its own metrics and set up alerts/analysis using Prometheus itself or Grafana.
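If you want to eyeball that self-telemetry before wiring up Prometheus proper, something like this works (the address, path, and metric names are assumptions based on an internal_metrics source feeding a prometheus_exporter sink; check your own config):

    import urllib.request

    EXPORTER_URL = "http://127.0.0.1:9598/metrics"  # hypothetical address

    with urllib.request.urlopen(EXPORTER_URL, timeout=2) as resp:
        body = resp.read().decode()

    # Print a few throughput/error series; in production let Prometheus
    # scrape this endpoint and alert on it instead.
    for line in body.splitlines():
        if line.startswith(("vector_component_received_events_total",
                            "vector_component_sent_events_total",
                            "vector_component_errors_total")):
            print(line)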
We're building something similar at Tenzir, but more for operational security workloads. https://docs.tenzir.com
Differences to Vector:
- An agent has optional indexed storage, so you can store your data there and pick it up later. The storage is based on Apache Feather, Parquet's little brother.
- Pipeline operators work with both data frames (Arrow record batches) and chunks of bytes.
- Structured pipelines are multi-schema, i.e., a single pipeline can process streams of record batches with different schemas.
A nice recent contribution added an alternative to prometheus pushgateway that handles counters better: https://github.com/vectordotdev/vector/issues/10304#issuecom...