dterry Posted January 26, 2017 (edited) Hello all! I've recently been looking for a better way to store time series data. The traditional solutions I've seen for this are either "SQL DB with BLOBs" or "SQL DB with file paths and TDMS or another binary file type on disk somewhere". In my googling I did discover a few interesting time series (as opposed to hierarchical) databases. The main ones I came up with are:
daq.io - Cloud based, developed explicitly for LabVIEW
InfluxDB - Open source, runs locally on Linux. Basic LabVIEW interface on GitHub
openTSDB
graphite
I haven't delved too deeply into any of them, especially openTSDB and graphite, but was wondering if anyone had any insights or experience with them. Thanks! Drew Edited January 26, 2017 by dterry formatting
smithd Posted January 27, 2017 (edited) I've been looking into this recently as well. The other TDMS alternative is HDF5, which has been discussed around here not long ago. For my situation I'm responsible for two conceptual customers -- people who want to maintain the machine, and people who want to see the immediate output of the machine. I'm thinking HDF5 is a good fit for the immediate output data set, as it is easily movable (flat file), can contain lots of data types (even images), and is cross-platform enough that anyone can open the data from Python or MATLAB or whatever. The other customer (long-term maintenance) requires more of a data warehousing scheme to store basically everything, potentially with compaction of older data (i.e. average over a day, a week, a month). This is feasible with flat files but seems very unwieldy, so I've also been looking into time series databases. Here is essentially the evaluation I've made so far (inconclusive, but it may be helpful):
Basic DBs, requiring significant client dev work. The advantage is that they are all old, heavily used, and backed by large organizations.
MySQL can provide this, but it doesn't seem to be great at it. What we've tried is that each row is (ID, value, timestamp), where (ID, timestamp) is the unique primary key. What I've found is that complex queries basically take forever, so any analysis requires yanking out data in small chunks.
Postgres seems to handle this structure (way, way, way) better based on a quick benchmark, but I need to evaluate more.
Cassandra seemed like it would be a better fit, but I had a lot of trouble inserting data quickly. With an identical structure to MySQL/Postgres, Cassandra's out-of-box insert performance was the slowest of the three. Supposedly it should be able to go faster. The slow speed could also be due to the driver, which was a TCP package off the Tools Network of less than ideal quality.
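The narrow (ID, value, timestamp) layout described above can be sketched with any SQL engine; here is a minimal sketch using Python's built-in sqlite3 (table name, channel IDs, and timestamps are made up for illustration -- the actual benchmarks were against MySQL/Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        id    INTEGER NOT NULL,   -- channel ID
        ts    INTEGER NOT NULL,   -- epoch seconds
        value REAL    NOT NULL,
        PRIMARY KEY (id, ts)      -- composite (ID, timestamp) key as described
    )
""")

# Insert a few samples for channel 1, one per minute
rows = [(1, 1485475200 + 60 * i, 20.0 + i) for i in range(10)]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

# Typical time-range query: pull one channel's data in a window
window = conn.execute(
    "SELECT ts, value FROM samples WHERE id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (1, 1485475200, 1485475200 + 300),
).fetchall()

# Compaction-style query: per-day averages (the 'average over a day' idea)
daily = conn.execute(
    "SELECT id, ts / 86400 AS day, AVG(value) FROM samples GROUP BY id, day"
).fetchall()
```

The composite primary key makes time-range scans for a single channel cheap; the "complex queries take forever" problem tends to show up with joins or aggregations across many channels at once.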
There is an alternative called Scylla which I believe aims to be many times faster with the same query language/interface, but I haven't tried it.
More complete solutions:
KairosDB seems cool; it's a layer on top of Cassandra, where they've presumably done the hard work of optimizing for speed. It has a lot of nice functions built in, including a basic web UI for looking at queries. I ran this in a VM since I don't have a Linux machine, but it was still quite fast. I need to do a proper benchmark vs the above options.
InfluxDB seemed like a good platform (they claim to be very fast, although others claim they are full of crap), but their longevity scares me. Their 1.0 release is recent, and it sounds like they rebuilt half their codebase from scratch for it. I read various things on the series of tubes which make me wonder how long the company will survive. Could very well just be doom and gloom, though.
Prometheus.io only supports floats, which is mostly OK, and allows for tagging values with strings. They use LevelDB, which is a Google key-value storage format that's been around a few years. However, it's designed as a polling process which monitors the health of your servers and periodically fetches data from them. You can push data to it through a 'push gateway', but as far as the overall product goes, it doesn't seem designed for me.
Graphite, from what I read, is limited to a fixed-size database (like you want to keep the last 10M samples in a rolling buffer) and expects data to be timestamped at a fixed interval. This is partially described here.
opentsdb: InfluxDB's benchmarks show it as slower than Cassandra. It has to be deployed on top of HBase or Hadoop, and reading through the set-up process intimidated me, so I didn't investigate further, but I think the longevity checkmark is hit with Hadoop given how much of the world uses it.
Heroic: same as above, except it requires Cassandra, Elasticsearch, and Kafka, so I never got around to trying to set it up.
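For reference, InfluxDB ingests points as plain text in its "line protocol" (measurement name, optional tags, fields, and a timestamp). A minimal sketch of building such a line in Python -- the measurement and tag names here are invented for illustration, and escaping of spaces/commas in values is omitted for brevity:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Build one InfluxDB line-protocol record:
    measurement,tag1=v1,... field1=v1,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    head = f"{measurement},{tag_str}" if tags else measurement
    return f"{head} {field_str} {ts_ns}"

# Hypothetical DAQ sample: one temperature reading from a test rig
line = to_line_protocol(
    "temperature",
    {"rig": "rig01", "channel": "tc3"},
    {"value": 24.5},
    1485475200000000000,
)
```

Lines like this are normally POSTed to the database's HTTP write endpoint; batching many lines per request is the usual way to get decent insert throughput.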
This may also help; I found it during my search: https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit Long story short, we don't need this half of the requirement right now, so we're tabling the decision until later when we have a better idea of a specific project's needs. For example, some of the tools have automatic compression and conversion over time, while others are geared more towards keeping all data forever. I don't know right now which is going to be the better fit. Edited January 27, 2017 by smithd
dterry Posted January 27, 2017 Author WOW! Great post, glad I asked! You've done significantly more research than I have. 13 hours ago, smithd said:
Basic DBs, requiring significant client dev work. The advantage is that they are all old, heavily used, and backed by large organizations. MySQL can provide this, but it doesn't seem to be great at it. What we've tried is that each row is (ID, value, timestamp), where (ID, timestamp) is the unique primary key. What I've found is that complex queries basically take forever, so any analysis requires yanking out data in small chunks. Postgres seems to handle this structure (way, way, way) better based on a quick benchmark, but I need to evaluate more. Cassandra seemed like it would be a better fit, but I had a lot of trouble inserting data quickly. With an identical structure to MySQL/Postgres, Cassandra's out-of-box insert performance was the slowest of the three. Supposedly it should be able to go faster. The slow speed could also be due to the driver, which was a TCP package off the Tools Network of less than ideal quality. There is an alternative called Scylla which I believe aims to be many times faster with the same query language/interface, but I haven't tried it.
Agree with the idea that proven technology is safer, but the amount of work to get something like this working (and working efficiently) seems to negate the advantage of older tech. I hadn't looked into Cassandra or Scylla yet, but they seem promising (as long as write performance can be managed). 13 hours ago, smithd said:
More complete solutions: KairosDB seems cool; it's a layer on top of Cassandra, where they've presumably done the hard work of optimizing for speed. It has a lot of nice functions built in, including a basic web UI for looking at queries. I ran this in a VM since I don't have a Linux machine, but it was still quite fast. I need to do a proper benchmark vs the above options.
InfluxDB seemed like a good platform (they claim to be very fast, although others claim they are full of crap), but their longevity scares me. Their 1.0 release is recent, and it sounds like they rebuilt half their codebase from scratch for it. I read various things on the series of tubes which make me wonder how long the company will survive. Could very well just be doom and gloom, though. Prometheus.io only supports floats, which is mostly OK, and allows for tagging values with strings. They use LevelDB, which is a Google key-value storage format that's been around a few years. However, it's designed as a polling process which monitors the health of your servers and periodically fetches data from them. You can push data to it through a 'push gateway', but as far as the overall product goes, it doesn't seem designed for me. Graphite, from what I read, is limited to a fixed-size database (like you want to keep the last 10M samples in a rolling buffer) and expects data to be timestamped at a fixed interval. This is partially described here. opentsdb: InfluxDB's benchmarks show it as slower than Cassandra. It has to be deployed on top of HBase or Hadoop, and reading through the set-up process intimidated me, so I didn't investigate further, but I think the longevity checkmark is hit with Hadoop given how much of the world uses it. Heroic: same as above, except it requires Cassandra, Elasticsearch, and Kafka, so I never got around to trying to set it up.
Great additions to the list! I'll have to dive in and do some more research based on these. One thought I had, where performance or LabVIEW driver compatibility is a concern, is to use .NET or C libraries to access the databases. My guess is that since those are MUCH more popular, they may have had more resources/development time/optimization than a LabVIEW driver (as you mentioned for Cassandra). Have you tried this? How did you find that spreadsheet, by the way? I feel like my google-fu is lacking.
smithd Posted January 28, 2017 8 hours ago, dterry said:
Agree with the idea that proven technology is safer, but the amount of work to get something like this working (and working efficiently) seems to negate the advantage of older tech. I hadn't looked into Cassandra or Scylla yet, but they seem promising (as long as write performance can be managed).
Well, in this case (or at least for me) the data is what's important. If someone has to come along and rewrite every UI and SQL query I make, it could still be cheaper than losing the data because Influx's view of reliability is different from mine. 8 hours ago, dterry said:
One thought I had, where performance or LabVIEW driver compatibility is a concern, is to use .NET or C libraries to access the databases. My guess is that since those are MUCH more popular, they may have had more resources/development time/optimization than a LabVIEW driver (as you mentioned for Cassandra). Have you tried this? How did you find that spreadsheet, by the way? I feel like my google-fu is lacking.
I haven't yet tried the alternative drivers, but yeah, I would expect them to be better. I know there is an ODBC driver for Cassandra (even though it's not quite SQL) which I would expect to work, but it looks like the folks who maintain the Windows builds of Cassandra are having some sort of fight with Apache, and I had trouble finding the appropriate download. The Tools Network driver did its job but regularly dropped the connection, had errors, etc. As for the spreadsheet... I definitely found it on Google, but it was like last November and I bookmarked it, so I couldn't tell you what I was searching for :/
dterry Posted January 30, 2017 Author On 1/27/2017 at 8:30 PM, smithd said:
Well, in this case (or at least for me) the data is what's important. If someone has to come along and rewrite every UI and SQL query I make, it could still be cheaper than losing the data because Influx's view of reliability is different from mine.
I guess I just assumed the data could be reliably backed up, but that's probably a generous assumption. On 1/27/2017 at 8:30 PM, smithd said:
it looks like the folks who maintain the Windows builds of Cassandra are having some sort of fight with Apache, and I had trouble finding the appropriate download.
That sucks!
smithd Posted January 31, 2017 10 hours ago, dterry said:
I guess I just assumed the data could be reliably backed up, but that's probably a generous assumption.
Sure, but in what format? These cool little open source applications come and go all the time. There's the risk of losing the right version of the code and not being able to find it again (as mentioned, Influx changed its entire data storage format from <1 to v1), as well as the cost of migrating a data set to something new. I'm not saying that if Prometheus, for example, went away tomorrow the data would become unusable, but I am saying I would immediately feel the need to migrate it to something else to avoid those risks.
dterry Posted January 31, 2017 Author That's fair. Developing new applications is really a balancing act between stable technology and staying relevant. Think of all those poor XP-reliant applications after it was sunset!
JoeyG Posted April 29, 2019 (edited) Don't use KairosDB if you care about performance; use Cassandra directly. I've been working with it and it's nothing but problems. My advice is to use Cassandra directly and be done with it.
The first problem with KairosDB is that the documentation is awful. To be fair, Cassandra has some issues there as well, though if you work out how to do things in Cassandra, what it's good at, and apply it to appropriate problems, I can't say I really have any reason to complain. KairosDB's documentation is very vague and confusing about a lot of things, such as how tags work (does a match have to have all of the tags specified, or only one?); it is vague about performance, and some parts I think might just be wrong. There are smells in the API itself, such as the strange use of mutually exclusive keys to imply type, rather than using a type field.
It's not fast, and it doesn't optimise things on top of Cassandra. It makes an effort to do this, but it fails. The only thing it might be faster than is CSV or flat files; it's not Cassandra-fast. Its query performance in anything but the most trivial of circumstances is absolutely appalling. It makes terrible use of Cassandra, and you're liable to be able to do much better yourself. It doesn't fully use indexes and just has one very simple, crude schema that tries to do everything (performance-wise, it can't). It's three tables, each with three columns -- practically what you get if you search for blog tutorials on how to use Cassandra for time series. Almost everything is a blob (a couple of columns are text). Some columns appear unused; they're just noise, junk data.
If you use tags, it has to scan all the tag values. It takes all the rows from Cassandra for a metric and then scans the tags. If you do a query for the temperature over time in New York, it'll load every possible tag combination for the temperature metric from the database.
This isn't lightweight either; it has to deserialize all the values using a custom scheme. It'll not only load combinations you don't need but will also load up to three weeks' worth of tag combinations more than it needs. And this isn't even getting to datapoints yet. The values are binary encoded, making it impossible to use things such as Cassandra's aggregate functions; instead KairosDB replaces Cassandra for this and reinvents the wheel. That means Cassandra must send every row, every data point, back to KairosDB, which must then aggregate those data points itself.
If you use tags, then you will probably have big problems. If you use a lot of metrics, you might have problems. If you use aggregates or group by, you will probably have big problems. If you want something for very simple cases, or where you don't mind terrible performance, instability in some scenarios, losing data (for a long time, until it was fixed, it was just dropping data), etc., but want ease of use, then you might get by with KairosDB. But if you're serious about your usage and big data, don't touch it with a barge pole.
It also has a lot of problems with junk data. I've found databases 99.99% full of data put there by the KairosDB reporter. After the hell of clearing it all out (Cassandra really isn't suited to mass deletes) and turning off the reporter as the instructions state, it was still populating the database with GB of data. I cleared it all out, went and checked Cassandra, and found that just a few days later over half the data was KairosDB's reporting junk. There's a public ticket from IBM about this (oddly, they're using H2, which isn't meant for production) that you can find with a search, and lots of people concerned about read performance, which can get pretty bad in KairosDB. Data bloat can also get hairy, since it stores all tag values and metric names ever inserted, and there's not a brilliant set of strategies available for garbage collecting this data.
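The cost described here -- shipping every raw datapoint back to the middleware and aggregating there, instead of letting the database do it -- can be illustrated with a toy sketch in Python (the sample rate and window size are made up):

```python
from collections import defaultdict

def client_side_average(points, window_s):
    """Aggregate (timestamp, value) pairs into per-window averages.
    This is what a KairosDB-style layer must do after receiving
    every raw row, because the stored values are opaque blobs the
    database itself cannot aggregate."""
    buckets = defaultdict(list)
    for ts, value in points:          # every datapoint crosses the wire
        buckets[ts // window_s].append(value)
    return {w * window_s: sum(v) / len(v) for w, v in sorted(buckets.items())}

# 1 Hz data for ten minutes: 600 points transferred to compute 10 averages
points = [(t, float(t % 60)) for t in range(600)]
per_minute = client_side_average(points, 60)
```

With server-side aggregation only the 10 averaged rows would cross the wire; here the transfer and deserialization cost scales with the raw point count instead.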
I've implemented my own solution, and it doesn't even need to use that table for any of my use cases, with the exception of being able to know which metrics are in the system without a full table scan. I replaced the KairosDB daemon with a client in my language that connects to Cassandra directly, taking KairosDB out of the picture, and it is on average faster without any optimizations. In the cases where it doesn't do as well, I'm fairly certain that's down to quirks in the Cassandra driver for the programming language I'm using. That language is much slower than Java and loses the benefit of being a daemon, yet it's still faster on average; just porting is enough. Not being a daemon means it can't use resource pools so easily. It's not multi-threaded either, but it's still faster. Profiling also seems to point to the Cassandra driver, which probably needs a bit of tuning.
It's very apparent that KairosDB does very poorly at caching despite using a lot of memory, and it exposes very few opportunities to optimise. When I add a very basic cache for the first-phase lookup, my script performs much faster in all of my benchmarks and stress tests. Ultimately I'll need to fix the schema in Cassandra to get full performance, but even before then, adding a cache for data points will speed things up a thousandfold as well as substantially reduce resource usage. Simply adjusting the queries for one data case will allow it to fetch only the rows it needs in the first stage. If I can use the KairosDB schema better than KairosDB does in a thousand lines of code, then why am I using KairosDB at all? If I take it out entirely, I can also create a schema far better suited to my use cases.
PostgreSQL (perhaps with TimescaleDB), MySQL, Cassandra, and MongoDB can, in my experience, all do not too badly for a range of use cases as long as you know their ins and outs; they're relatively flexible.
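A first-phase lookup cache of the kind described -- memoizing the resolution of a metric and tag set to the row keys that hold its datapoints, so repeated queries skip the extra round trip -- can be sketched in Python. The lookup function below is a hypothetical stand-in for the real first-phase Cassandra query, not actual KairosDB code:

```python
from functools import lru_cache

LOOKUP_CALLS = 0  # counts simulated database round trips

@lru_cache(maxsize=4096)
def row_keys_for(metric, tags):
    """Memoized metric -> row-key resolution. 'tags' must be hashable,
    e.g. a sorted tuple of (key, value) pairs, so it can be a cache key."""
    global LOOKUP_CALLS
    LOOKUP_CALLS += 1  # only cache misses reach the database
    # Fake row key for illustration; a real lookup would query the index table
    return (f"{metric}:{dict(tags).get('rig', 'any')}",)

tags = (("rig", "rig01"),)
for _ in range(1000):      # 1000 queries for the same series...
    keys = row_keys_for("temperature", tags)
# ...cost only one real lookup
```

Since metric names and tag combinations are comparatively few and change rarely, this is the kind of lookup where even a tiny cache pays off disproportionately.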
From what I understand (I've only used it a little), things like Graphite will be very specific to certain use cases, so you'd need to evaluate them for those. KairosDB suffers a similar problem, though it exposes features it just can't handle internally while giving the impression of having at least a few decent features. The others I don't know anything about. If in doubt, always check the source code to ensure it's sane, and search for disaster stories. Edited April 29, 2019 by JoeyG