dterry

Time Series Database


Hello all!

I've recently been looking for a better way to store time series data. The traditional solutions I've seen for this are either "SQL DB with BLOBs" or "SQL DB with file paths pointing to TDMS or other binary files on disk somewhere".
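By the latter I mean something like the following (an illustrative sketch only; the names and columns are made up), where the database just indexes the runs and the actual waveforms live in TDMS files:

    -- Illustrative only: the DB indexes test runs, the waveforms stay in TDMS files
    CREATE TABLE test_runs (
        run_id     INT PRIMARY KEY,
        unit_id    VARCHAR(50),
        start_time DATETIME,
        data_path  VARCHAR(260)   -- e.g. a UNC path to the run's .tdms file
    );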

I did discover a few interesting time series (as opposed to hierarchical) databases in my googling. The main ones I came up with are:

  • daq.io - Cloud based, developed explicitly for LabVIEW
  • InfluxDB - Open source, runs locally on Linux.  Basic LV interface on GitHub
  • openTSDB
  • graphite

I haven't delved too deeply into any of them, especially openTSDB and graphite, but I was wondering if anyone had any insights or experience with them.

Thanks!

Drew

Edited by dterry
formatting


WOW!  Great post, glad I asked!  You've done significantly more research than I have.

13 hours ago, smithd said:

Basic DBs, requiring significant client dev work. The advantage is that they are all old, heavily used, and backed by large organizations.

  • MySQL can provide this, but it doesn't seem to be great at it. What we've tried is a schema where each row is (id, value, timestamp) and (id, timestamp) is the unique primary key. What I've found is that complex queries basically take forever, so any analysis requires yanking the data out in small chunks.
  • Postgres seems to handle this structure (way, way, way) better based on a quick benchmark, but I need to evaluate it more.
  • Cassandra seemed like it would be a better fit, but I had a lot of trouble inserting data quickly. With an identical structure to MySQL/Postgres, Cassandra's out-of-the-box insert performance was the slowest of the three. Supposedly it should be able to go faster. The slow speed could also be due to the driver, which was a TCP package off the Tools Network of less than ideal quality.
    • There is an alternative called Scylla which I believe aims to be many times faster with the same query language/interface, but I haven't tried it.

Agree with the idea that proven technology is safer.  But the amount of work to get something like this working (and working efficiently) seems to negate the advantage of older tech.  I hadn't looked into Cassandra or Scylla yet, but they seem promising (as long as write performance can be managed).
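For my own notes, here's roughly the narrow-row schema you're describing, plus the time-bucketed layout that seems to be the standard recommendation for Cassandra (and Scylla, which speaks the same CQL). The names are mine, the first table is MySQL-flavored (Postgres would want DOUBLE PRECISION and TIMESTAMPTZ), and I haven't benchmarked either:

    -- MySQL/Postgres style: one row per sample, (channel, timestamp) as the primary key
    CREATE TABLE samples (
        channel_id  INT          NOT NULL,
        sample_time TIMESTAMP(6) NOT NULL,   -- fractional-second timestamp
        value       DOUBLE       NOT NULL,
        PRIMARY KEY (channel_id, sample_time)
    );

    -- Typical read: one channel over a time window, pulled out in chunks
    SELECT sample_time, value
    FROM samples
    WHERE channel_id = 42
      AND sample_time BETWEEN '2017-01-26 00:00:00' AND '2017-01-27 00:00:00';

    -- CQL equivalent, bucketed by day so a single partition never grows unbounded
    CREATE TABLE samples_by_day (
        channel_id  int,
        day         date,
        sample_time timestamp,
        value       double,
        PRIMARY KEY ((channel_id, day), sample_time)
    ) WITH CLUSTERING ORDER BY (sample_time DESC);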

 

13 hours ago, smithd said:

More complete solutions:

  • KairosDB seems cool; it's a layer on top of Cassandra, where they've presumably done the hard work of optimizing for speed. It has a lot of nice functions built in, including a basic web UI for looking at queries. I ran this in a VM since I don't have a Linux machine, but it was still quite fast. I need to do a proper benchmark vs. the above options.
  • InfluxDB seemed like a good platform (they claim to be very fast, although others claim they are full of crap), but their longevity scares me. Their 1.0 release is recent, and it sounds like they rebuilt half their codebase from scratch for it. I read various things on the series of tubes which make me wonder how long the company will survive. Could very well just be doom and gloom though.
  • Prometheus.io only supports floats, which is mostly OK, and allows for tagging values with strings. It uses LevelDB, a Google key-value store that's been around a few years. However, it's designed as a polling process that monitors the health of your servers, periodically fetching data from them. You can push data to it through a 'push gateway', but as far as the overall product goes, it doesn't seem designed for me.
  • Graphite, from what I read, is limited to a fixed-size database (like you want to keep the last 10M samples in a rolling buffer) and expects data to be timestamped at a fixed interval. This is partially described here.
  • OpenTSDB: InfluxDB's benchmarks show it as slower than Cassandra. It has to be deployed on top of HBase or Hadoop, and reading through the set-up process intimidated me, so I didn't investigate further, but I think the longevity checkmark is hit with Hadoop given how much of the world uses it.
  • Heroic: same as above, except it requires Cassandra, Elasticsearch, and Kafka, so I never got around to trying to set it up.

Great additions to the list!  I'll have to dive in and do some more research based on these.

One thought I had, where performance or LabVIEW driver compatibility is a concern, is to potentially use .NET or C libraries to access the databases. My guess is that since those are MUCH more popular, they have had more resources/development time/optimization put into them than a LabVIEW driver (as you mentioned for Cassandra). Have you tried this?

How did you find that spreadsheet by the way?  I feel like my google-fu is lacking.

8 hours ago, dterry said:

Agree with the idea that proven technology is safer.  But the amount of work to get something like this working (and working efficiently) seems to negate the advantage of older tech.  I hadn't looked into Cassandra or Scylla yet, but they seem promising (as long as write performance can be managed).

Well, in this case (or at least for me) the data is what's important. If someone has to come along and rewrite every UI and SQL query I make, it could still be cheaper than losing the data because Influx's view of reliability is different from mine.

8 hours ago, dterry said:

One thought I had, where performance or LabVIEW driver compatibility is a concern, is to potentially use .NET or C libraries to access the databases. My guess is that since those are MUCH more popular, they have had more resources/development time/optimization put into them than a LabVIEW driver (as you mentioned for Cassandra). Have you tried this?

How did you find that spreadsheet by the way?  I feel like my google-fu is lacking.

I haven't yet tried the alternative drivers, but yeah, I would expect them to be better. I know there is an ODBC driver for Cassandra (even though it's not quite SQL) which I would expect to work, but it looks like the folks who maintain the Windows builds of Cassandra are having some sort of fight with Apache, and I had trouble finding the appropriate download. The Tools Network driver did its job but regularly dropped the connection, had errors, etc.

As for the spreadsheet... I definitely found it on Google, but it was like last November and I bookmarked it, so I couldn't tell you what I was searching for :/

 

On 1/27/2017 at 8:30 PM, smithd said:

Well, in this case (or at least for me) the data is what's important. If someone has to come along and rewrite every UI and SQL query I make, it could still be cheaper than losing the data because Influx's view of reliability is different from mine.

I guess I just assumed the data could be reliably backed up, but that's probably a generous assumption.

On 1/27/2017 at 8:30 PM, smithd said:

it looks like the folks who maintain the Windows builds of Cassandra are having some sort of fight with Apache, and I had trouble finding the appropriate download.

That sucks!

10 hours ago, dterry said:

I guess I just assumed the data could be reliably backed up, but that's probably a generous assumption.

Sure, but in what format? These cool little open source applications come and go all the time. There's the risk of losing the right version of the code and not being able to find it again (as mentioned, Influx changed its entire data storage format between pre-1.0 and v1.0), as well as the cost of migrating a data set to something new. I'm not saying that if Prometheus, for example, went away tomorrow the data would become unusable, but I am saying I would immediately feel the need to migrate it to something else to avoid those risks.


That's fair. Developing new applications is really a balancing act between stable technology and staying relevant. Think of all those poor XP-reliant applications after it was sunsetted!


Don't use KairosDB if you care about performance; use Cassandra directly.

I've been working with it and it's been nothing but problems. My advice is to use Cassandra directly and be done with it.

The first problem with KairosDB is that the documentation is awful. To be fair, Cassandra has some issues there as well, though once you work out how to do things in Cassandra and what it's good at, and apply it to appropriate problems, I can't say I really have any reason to complain.

KairosDB's documentation is very vague and confusing about a lot of things, such as how tags work (does a match have to include all of the specified tags or only one?); it is vague about performance, and some parts I think might just be wrong.

There are smells in the API itself, such as the strange use of mutually exclusive keys to imply a type, rather than an explicit type field for alignment.

It's not fast and it doesn't optimise things on top of Cassandra. It makes an effort to do this but it fails. The only thing it might be faster than is CSV or flat files. It's not Cassandra fast.

Its query performance in anything but the most trivial of circumstances is absolutely appalling. It makes terrible use of Cassandra, and you're likely to be able to do much better yourself.

It doesn't fully use indexes and just has one very simple, crude schema that tries to do everything (it can't, performance-wise). It's three tables, each with three columns: practically what you get if you search for blog tutorials on how to use Cassandra for time series. Almost everything is a blob (a couple of columns are text). Some columns appear unused; they're just noise, junk data.
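To illustrate (this isn't the literal DDL, just the shape of it), the data table amounts to something like:

    -- Everything opaque: the client packs metric, time bucket and tags into blobs
    CREATE TABLE data_points (
        row_key blob,   -- metric name + time bucket + encoded tag set
        offset  blob,   -- encoded timestamp offset within the bucket
        value   blob,   -- encoded data point
        PRIMARY KEY (row_key, offset)
    );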

If you use tags, it has to scan all the tag values. It takes all the rows from Cassandra for a metric and then scans the tags. If you do a query for the temperature over time in New York, it'll load every possible tag combination for the temperature metric from the database. This isn't lightweight either; it has to deserialize all the values using a custom scheme.

It'll not only load combinations you don't need, but will also load up to three weeks' worth of tag-combination rows more than it needs. And this isn't even getting to data points yet. The values are binary-encoded, making it impossible to use things such as aggregate functions; instead, KairosDB replaces Cassandra for this and reinvents the wheel. That means every row, every data point, must be sent back to KairosDB, which must then aggregate those data points itself.
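Compare that with a purpose-built table where the tags live in the primary key; Cassandra then only reads the series you asked for, and recent versions can even aggregate within the partition for you (a sketch, not my production schema):

    CREATE TABLE temperature_by_city (
        city        text,
        day         date,
        sample_time timestamp,
        value       double,
        PRIMARY KEY ((city, day), sample_time)
    );

    -- "Temperature in New York" touches one partition instead of every tag combination
    SELECT avg(value)
    FROM temperature_by_city
    WHERE city = 'New York' AND day = '2017-02-01';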

If you use tags, then you will probably have big problems. If you use a lot of metrics then you might have problems. If you use aggregates or group by then you will probably have big problems.

If you want something for very simple cases, where you don't mind terrible performance, instability in some scenarios, losing data (for a long time, until it was fixed, it was just dropping data), etc., and you only want ease of use, then you might get by with KairosDB. But if you're serious about your usage and big data, don't touch it with a barge pole.

It also has a lot of problems with junk data. I've found databases 99.99% full of data put there by the KairosDB reporter. After the hell of clearing it all out (Cassandra really isn't suited to mass deletes) and turning off the reporter as the instructions state, it was still populating the database with gigabytes of data. I cleared it all out, then went and checked Cassandra to find that just a few days later over half the data was KairosDB's reporting junk.

There's a public ticket from IBM about this (oddly, they're using H2, which isn't meant for production) that you can find with a search, along with lots of people concerned about read performance, which can get pretty bad in KairosDB.

Data bloat can also get hairy, since it stores every tag value and metric name ever inserted. There isn't a brilliant set of strategies available for garbage-collecting this data. I've implemented my own solution, and it doesn't even need that table for any of my use cases, except to list the metrics in the system without a full table scan.

I replaced the KairosDB daemon with a client in my language of choice that just connects to Cassandra directly, taking KairosDB out of the picture, and it is on average faster without any optimizations. In the cases where it doesn't do as well, I'm fairly certain that's down to quirks in the Cassandra driver for the programming language I'm using. That language is much slower than Java, and by not being a daemon it loses the ability to easily keep resource pools, yet it's still faster on average; just porting is enough. It's not multi-threaded either, but it's still faster. Profiling also seems to point to the Cassandra driver, which probably needs a bit of tuning. It's very apparent that KairosDB caches very poorly despite using a lot of memory, and it exposes very little opportunity to optimise.

When I add a very basic cache for the first-phase lookup, my script always performs much faster across all of my benchmarks and stress tests. Ultimately I'll need to fix the schema in Cassandra to get full performance, but even before then, adding a cache for data points will speed things up a thousandfold as well as substantially reduce resource usage. Simply adjusting the queries for one data case allows it to fetch only the rows it needs in the first stage.

If I can use the KairosDB schema better than KairosDB does in a thousand lines of code, then why am I using KairosDB at all? And if I take it out entirely, I can also create a schema far better suited to my use cases.

PostgreSQL (perhaps with TimescaleDB), MySQL, Cassandra, and MongoDB can, in my experience, all do reasonably well for a range of use cases as long as you know their ins and outs; they're relatively flexible. From what I understand (I've only used it a little), something like Graphite will be very specific to certain use cases, so you'd need to evaluate it for that. KairosDB suffers a similar problem, except it exposes features it just can't handle internally while giving the impression of having at least a few decent ones. The others I don't know anything about. If in doubt, always check the source code to make sure it's sane, and search for disaster stories.
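To give a flavour of the PostgreSQL + TimescaleDB route (a rough sketch with made-up names, assuming the extension is installed), the appeal is that everything stays plain SQL:

    CREATE EXTENSION IF NOT EXISTS timescaledb;

    CREATE TABLE conditions (
        sample_time TIMESTAMPTZ      NOT NULL,
        channel     TEXT             NOT NULL,
        value       DOUBLE PRECISION NOT NULL
    );

    -- Turn the plain table into a time-partitioned hypertable
    SELECT create_hypertable('conditions', 'sample_time');

    -- Downsampling is ordinary SQL
    SELECT time_bucket('1 minute', sample_time) AS minute, channel, avg(value)
    FROM conditions
    WHERE sample_time > now() - interval '1 hour'
    GROUP BY minute, channel
    ORDER BY minute;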

Edited by JoeyG




