Everything posted by smithd

  1. The primary key is (id, time), so I suppose it makes some sense that if you don't give it any ID in the where clause it has to do a table scan, but postgres is still taking 5 minutes even with an ID (or range of IDs). I don't know how postgres handles indices, but I think mysql just concatenates the values, so in this case it would be sorted by id and then by timestamp. That may also contribute to the slow insert performance, since I'm inserting more as a function of time (so it may have to fiddle with the index quite a bit) but querying more as a function of ID. It's worth mentioning I also tried adding a hash index on just the timestamp in postgres (no change in performance), but it occurs to me that if you're asking for a range of timestamps, it probably can't use that index anyway. The hardware it's running on is a relatively dinky test server: 32 GB of RAM and a single unraided spinning disk (relatively fast processor, but in this case the disk is the bottleneck). The reason I think this is an OK approximation is that this is for a long-term data storage application, so tables will never fit entirely in RAM and disk I/O will always be the bottleneck until buying several multi-TB SSDs becomes affordable.
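     For reference, a minimal sketch of the layout being described (postgres syntax; the table name and column types are borrowed/guessed from the later benchmark post):

        -- composite primary key: the btree is ordered by id first, then t
        CREATE TABLE dbldate (
            id  integer          NOT NULL,
            t   timestamp        NOT NULL,
            val double precision,
            PRIMARY KEY (id, t)
        );

        -- a hash index on t alone only answers equality lookups, so a
        -- "t BETWEEN x AND y" predicate can't use it; a btree index can
        CREATE INDEX dbldate_t_idx ON dbldate USING btree (t);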
  2. Small update: I did a more complex query and the results were less happy. Both tests ran to something on the order of 70 million rows in the table, and a given 10-minute period should have about 6000 entries per ID (100 ms apart), or about 3 million queried rows across all 500 IDs. In both databases a query for a specific ID and a specific time range is pretty quick. However, this breaks down terribly if you select a time range plus an ID range, which is what you might do if you wanted to, say, see the average of all the values associated with system X. So what I ran was: select avg, stddev, max, min of val where t between x and x+10 and id between 250 and 260, or something along those lines.

     Mysql performed this query slowly, about a minute for a warmed-up database server, but it did fine even on the whole range of IDs (no where clause for ID) and didn't have much of a slowdown at all. I don't think I saw a query take more than 80 seconds. I still think that's ridiculous for 3 million rows (let's guess it's something like 100 MB of data it has to process in the worst case I could imagine), but it's fantastic in comparison to postgres.

     With postgres, I tried the query with no ID first -- just timestamp -- and gave up after maybe 20 minutes. Some searching suggested that running "vacuum analyze" might help, so I ran that. Still no go. I added 'where id between 250 and 260' back in and it took about 6 minutes to run. I used the explain tool to understand the issue: postgres uses the index when the ID filter is present, but if you leave out the ID filter it does a full table scan. The concerning thing is that even in the case where it uses the index, it performs several times worse than mysql (which again is already bad). It's literally faster to select each ID individually than to do a bulk query in postgres. I'm not sure where to go from here, but again I thought I'd share.
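     Roughly the shape of the aggregate query described above, as a sketch with placeholder timestamps (using the dbldate layout from the later post):

        SELECT avg(val), stddev(val), max(val), min(val)
          FROM dbldate
         WHERE t BETWEEN '2016-10-01 00:00:00' AND '2016-10-01 00:10:00'
           AND id BETWEEN 250 AND 260;

        -- prefixing with EXPLAIN ANALYZE in postgres shows whether the (id, t)
        -- index is used; without the id predicate it falls back to a seq scan
        EXPLAIN ANALYZE
        SELECT avg(val) FROM dbldate
         WHERE t BETWEEN '2016-10-01 00:00:00' AND '2016-10-01 00:10:00';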
  3. So I took some time to do this right today. I applied the same basic optimizations to both mysql and postgres. I still need to find the right driver for cassandra; that's not going to happen today. For reference: both dbs use id (int), t (time), and val (dbl). I have more complex data, but most of it is dbl or bool.

     mysql: installed as part of wampserver (easy 64-bit install). Used innodb as the storage engine. Used several (5) connections via the TCP library (which I already know to be faster for small queries, especially those returning 0 rows). Set the buffer pool size to 10G and bumped the log file size to 256M. Most critically, changed innodb_flush_log_at_trx_commit to 2, so it flushes to disk lazily. This is the best configuration I've been able to find thus far, although I'm always happy to hear of additional optimizations.

     postgres: installed standalone. Used the settings from http://pgtune.leopard.in.ua/ for windows/data warehouse/32G, which basically matches what I've found elsewhere. For postgres I ran both your driver and the ODBC driver through the dbconn toolkit. Yours was about 10% faster.

     For the run I kept both on the same spinning disk (non-raid) and completely stopped each service while the other was running. Both applications were identical except for the driver. One loop generates inserts of 500 elements and shoves them into a size-10 queue. A parallel for loop opens 5 connections to the db, and each one repeatedly begins a transaction, writes 5 large inserts, and commits. The 500-element insert is very simple: insert into dbldate (id, t, val) values (.......). From what I understand, this should be the fastest way to insert data in both dbs short of using an intermediate file (which isn't my use case). The timestamp is fixed for each packet of 500 and forcibly incremented by 100 ms for each iteration; id is set to the i terminal (id=0->499). For both I measured two values: total elements shoved into the queue divided by total execution time (off by up to 10/N, but N is huge), and a chart of the execution time of each transaction of 5x500 elements.

     Mysql: 22 ms/packet after 8738 packets. It started low, <10 ms, but slowly rose. Most critically for a streaming app, there were HUGE jitter spikes of up to 11.5 seconds and a significant amount of noise in the 2-second range. These spikes were contributing to the slow growth in execution time, which, being an average over the whole run, lags the instantaneous numbers. This matches my experience in production, where the db keeps up on average but the spikes lead to data loss as buffers overflow. It's worth saying that I'd bet it's possible to find and reduce the source of these spikes, but I'm at the end of my skills for optimizing the database.

     Postgres is a lot more cheerful. Time per packet was 10 ms after 6445 iterations (vs 11 ms for ODBC), and while there are a small number of large jitter spikes, they only go up to 1-1.4 seconds, which isn't anywhere near as bad. The 'noise' floor (2 seconds for mysql) was about half a second for postgres. What's more, I'm not at the end of my rope for optimization, as I read there are ways to disable some of the safety checks to make it go faster (the same idea, I think, as what I did for mysql with the flush-log-at-transaction-commit setting). I know this isn't a perfect test, but keep in mind it's also a reflection of the level of effort required for each.

     Mysql has always been the super easy to get up and running database, but for this particular use case postgres was just significantly easier to configure, and to configure well. On the query side I didn't do as much work. I generated a simple query (id=250, t between x and x+5). In mysql this took 5 ms (using heidisql). Postgres took significantly longer -- 500 ms (using pgadmin). I then generated a more complex query (id between x and y, val > 0.5, time between x and x+5), and this took about 1 second with postgres and 750 ms with mysql.
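     For reference, a sketch of the shape of each transaction described above (placeholder timestamps and values; the real packets were generated programmatically):

        BEGIN;
        -- 5 statements like this per transaction, 500 rows per statement
        INSERT INTO dbldate (id, t, val) VALUES
            (0,   '2016-10-01 00:00:00.000', 0.123),
            (1,   '2016-10-01 00:00:00.000', 0.456),
            -- ... ids 2 through 498, same timestamp for the whole packet ...
            (499, '2016-10-01 00:00:00.000', 0.789);
        COMMIT;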
  4. I'm guessing it's the image refnums, because you can't really include them in a general-purpose library. If I remember correctly, the IMAQ refnum is itself licensed, so if you don't have the image acquisition software installed and activated, the library won't work. I've seen this issue with quick drop shortcuts and image refnums as well. It probably just needs a special case, but then you lose the image if that's something you care about.
  5. Even without OOP, NI sometimes changes the internal type information. Data flattened in 2010 should (probably) unflatten fine in 2014, but 2014 to 2010 may not work at all. This KB is what I've used to handle 2014->2013 communication: http://digital.ni.com/public.nsf/allkb/45E2D7BE36CE3E8B86257CCF0074D89B I don't know how this works with OOP. However, with objects I almost always add a 'flatten' and 'unflatten' method for the object, which generates a custom format string, and never deal with the automatic stuff.
  6. Sure, but in what format? These cool little open-source applications come and go all the time. There's the risk of losing the right version of the code and not being able to find it again (as mentioned, influx changed its entire data storage format between <1.0 and v1.0), as well as the cost of migrating a data set to something new. I'm not saying that if prometheus, for example, went away tomorrow the data would become unusable, but I am saying I would immediately feel the need to migrate it to something else to avoid those risks.
  7. Ah yes, that's right. What we do is more or less a migration from an ini file, so a, b, c are section, key, and value, with section.key being the unique identifier.
  8. What we've got on my current project is nothing too special: if your original table has columns (a, b, c), you add (t0, tF) to the end. t0 defaults to current_timestamp. A trigger runs on every insert that says update table set tF=current_timestamp where tF is null and a=NEW.a and b=NEW.b and c=NEW.c. Another trigger runs if you want to update a row, replacing the update with an insert. Another trigger replaces a delete with a call to update tF on the row.

     Then your query is either:
     (for the most recent value) select a,b,c from table where (filter a,b,c) and tF is null
     (for a selected time) select a,b,c from table where (filter a,b,c) and tF > selectedTime and t0 <= selectedTime

     For both, you can add "order by t0 desc limit 1", but from my recent experience this leads to sloppy table maintenance -- for example, we have tons of configurations with tF null and we just pick the one with the most recent t0. It works, but it's annoying to read and write queries for, and it makes me feel dirty.

     I may have some details wrong, but that's the gist of it. I couldn't find a nice article with a lot of actual SQL in it, but it looks like 'temporal database' might be the right phrase to hunt for: https://en.wikipedia.org/wiki/Temporal_database Books (or downloadable PDF) section: https://www2.cs.arizona.edu/~rts/publications.html https://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture/
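     A rough sketch of that scheme in postgres syntax (table and column names are placeholders, and I'm assuming the row is looked up by (a, b), i.e. section and key -- adjust the match to whatever columns form the logical key):

        CREATE TABLE config (
            a  text NOT NULL,
            b  text NOT NULL,
            c  text,
            t0 timestamptz NOT NULL DEFAULT current_timestamp,
            tF timestamptz              -- NULL means the row is still valid
        );

        -- before a new version lands, close out the currently-valid row for the same key
        CREATE FUNCTION close_previous() RETURNS trigger AS $$
        BEGIN
            UPDATE config SET tF = current_timestamp
             WHERE tF IS NULL AND a = NEW.a AND b = NEW.b;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER config_close_previous
            BEFORE INSERT ON config
            FOR EACH ROW EXECUTE PROCEDURE close_previous();

        -- most recent value:
        --   SELECT c FROM config WHERE a = 'section' AND b = 'key' AND tF IS NULL;
        -- value at a selected time:
        --   SELECT c FROM config WHERE a = 'section' AND b = 'key'
        --    AND t0 <= selectedTime AND (tF IS NULL OR tF > selectedTime);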
  9. Well, in this case (or at least for me) the data is what's important. If someone has to come along and re-write every UI and SQL query I make, it could still be cheaper than losing the data because influx's view of reliability is different from mine. I haven't yet tried the alternative drivers, but yeah, I would expect them to be better. I know there is an ODBC driver for cassandra (even though it's not quite SQL) which I would expect to work, but it looks like the folks who maintain the windows builds of cassandra are having some sort of fight with apache, and I had trouble finding the appropriate download. The tools network driver did its job but regularly dropped the connection, had errors, etc. As for the spreadsheet...I definitely found it on google, but it was like last november and I bookmarked it, so I couldn't tell you what I was searching for :/
  10. I would say the easiest way to think about SQL databases is as an excel workbook. If you can represent it in a workbook, you are very likely to be able to represent it, similarly, in a db. This is a nice intro along the same theme: http://schoolofdata.org/2013/11/07/sql-databases-vs-excel/

     If you're looking to try things out, you'll likely go along one of three free routes:
     postgres/mysql: server-oriented databases; everything is stored on a central server (or network of servers) and you ask it questions as the single source of truth.
     sqlite: a file database; it still uses SQL, but the engine is implemented as a library that runs on your local machine. To share the db across computers you must share the file.
     I can't say this with certainty, but if you need to use a vxworks cRIO and don't like suffering, mysql is the only option.

     One recommendation I have is to use a schema that allows for history. A nice writeup of the concept is here: https://martinfowler.com/eaaDev/timeNarrative.html The implementation should be pretty simple...basically add a "valid from" and "valid to" field to any table that needs history. If "valid to" is empty, the row is still valid. You can use triggers to create the timestamps so you never have to even think about it. You can also skip a step if every row stays valid forever: then you just keep a "valid from" field and select the newest row in your query.

     An alternative, more complex implementation is to have a table which only represents the current state and use a trigger to copy the previous value into an append-only log (sketched below). The first option is better if you regularly need to ask "what was the value last week" or specify "the new value will be Y starting on monday", while the log is more appropriate just for tracking -- something got fat-fingered and you want to see the previous value.
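     A rough sketch of that alternative (postgres syntax; the table and column names are made up for illustration):

        -- only the current state lives here
        CREATE TABLE settings (
            name  text PRIMARY KEY,
            value text
        );

        -- append-only log of superseded values
        CREATE TABLE settings_log (
            name       text,
            old_value  text,
            changed_at timestamptz NOT NULL DEFAULT current_timestamp
        );

        CREATE FUNCTION log_old_value() RETURNS trigger AS $$
        BEGIN
            INSERT INTO settings_log (name, old_value) VALUES (OLD.name, OLD.value);
            RETURN NULL;  -- return value is ignored for AFTER row triggers
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER settings_log_trigger
            AFTER UPDATE OR DELETE ON settings
            FOR EACH ROW EXECUTE PROCEDURE log_old_value();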
  11. I've been looking into this recently as well. The other TDMS alternative is HDF5, which has been discussed around here not long ago. For my situation I'm responsible for two conceptual customers -- people who want to maintain the machine, and people who want to see the immediate output of the machine. I'm thinking HDF5 is a good fit for the immediate output data set, as it is easily movable (flat file), can contain lots of data types (even images), and is cross-platform enough that anyone can open the data from python or matlab or whatever. The other customer (long-term maintenance) requires more of a data warehousing scheme to store basically everything, potentially with compaction of older data (i.e. average over a day, a week, a month). This is feasible with flat files but seems very unwieldy, so I've also been looking into time series databases. Here is essentially the evaluation I've made so far (which is inconclusive but may be helpful):

     Basic DBs, requiring significant client dev work. The advantage is that they are all old, heavily used, and backed by large organizations.
     mysql can provide this, but it doesn't seem to be great at it. What we've tried is that each row is (id, value, timestamp), where (id, timestamp) is the unique primary key. What I've found is that complex queries basically take forever, so any analysis requires yanking out data in small chunks.
     postgres seems to handle this structure (way, way, way) better based on a quick benchmark, but I need to evaluate more.
     cassandra seemed like it would be a better fit, but I had a lot of trouble inserting data quickly. With an identical structure to mysql/postgres, cassandra's out-of-box insert performance was the slowest of the three. Supposedly it should be able to go faster. The slow speed could also be due to the driver, which was a TCP package off the tools network of less than ideal quality. There is an alternative called Scylla which I believe aims to be many times faster with the same query language/interface, but I haven't tried it.

     More complete solutions:
     KairosDB seems cool; it's a layer on top of cassandra, where they've presumably done the hard work of optimizing for speed. It has a lot of nice functions built in, including a basic web UI for looking at queries. I ran it in a VM since I don't have a linux machine, but it was still quite fast. I need to do a proper benchmark vs the above options.
     InfluxDB seemed like a good platform (they claim to be very fast, although others claim they are full of crap), but their longevity scares me. Their 1.0 release is recent, and it sounds like they rebuilt half their codebase from scratch for it. I read various things on the series of tubes which make me wonder how long the company will survive. Could very well just be doom and gloom though.
     Prometheus.io only supports floats, which is mostly OK, and allows for tagging values with strings. It uses LevelDB, a google key-value storage format that's been around a few years. However, it's designed as a polling process which monitors the health of your servers by periodically fetching data from them. You can push data to it through a 'push gateway', but as far as the overall product goes, it doesn't seem designed for me.
     Graphite, from what I read, is limited to a fixed-size database (like keeping the last 10M samples in a rolling buffer) and expects data to be timestamped at a fixed interval. This is partially described here.
     opentsdb: InfluxDB's benchmarks show it as slower than cassandra. It has to be deployed on top of hbase or hadoop, and reading through the set-up process intimidated me, so I didn't investigate further, but I think the longevity checkmark is hit with hadoop given how much of the world uses it.
     Heroic: same as above, except it requires cassandra, elasticsearch, and kafka, so I never got around to trying to set it up.

     This may also help, I found it during my search: https://docs.google.com/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit

     Long story short, we don't need this half of the requirement right now, so we're tabling the decision until later when we have a better idea of a specific project's needs. For example, some of the tools have automatic compression and conversion over time, while others are geared more towards keeping all data forever. I don't know right now which is going to be the better fit.
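     As an aside, the "compaction of older data" part is straightforward in plain SQL if we end up on postgres. A sketch (table and column names are made up for illustration):

        -- daily rollup of the raw (id, t, val) data into a summary table
        CREATE TABLE dbldate_daily (
            id      integer,
            day     date,
            avg_val double precision,
            min_val double precision,
            max_val double precision
        );

        INSERT INTO dbldate_daily (id, day, avg_val, min_val, max_val)
        SELECT id, date_trunc('day', t)::date, avg(val), min(val), max(val)
          FROM dbldate
         WHERE t < now() - interval '30 days'
         GROUP BY id, date_trunc('day', t);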
  12. I kind of read the opposite, but fair enough. The specific use case is large time-series data, so I inserted data ~1000 rows at a time for 100s of GB of data. I used the same schema for both, but otherwise did not touch any setting or enable partitioning on either database (to make this work at all with the mysql database on our 'production' system, I spent several days reading through various methods for improving performance and implementing them -- limiting flushes to disk, using >20 separate connections to insert data, partitioning, etc.). I wanted to see the baseline, and mysql's out-of-box speed was embarrassingly slow once the data size (or the index) got larger than available ram. Postgres got slower but kept on going. I'm assuming they have better default management of the index, and it looks like they also have an interesting partitioning scheme.
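     For what it's worth, a rough sketch of time-based partitioning on the postgres side (declarative partitioning syntax from postgres 10+; the table and partition names are placeholders, and whether it actually helps this workload is untested):

        CREATE TABLE dbldate (
            id  integer          NOT NULL,
            t   timestamptz      NOT NULL,
            val double precision
        ) PARTITION BY RANGE (t);

        -- one partition per month; old partitions can be dropped or rolled up wholesale
        CREATE TABLE dbldate_2016_10 PARTITION OF dbldate
            FOR VALUES FROM ('2016-10-01') TO ('2016-11-01');
        CREATE TABLE dbldate_2016_11 PARTITION OF dbldate
            FOR VALUES FROM ('2016-11-01') TO ('2016-12-01');

        CREATE INDEX ON dbldate_2016_10 (id, t);
        CREATE INDEX ON dbldate_2016_11 (id, t);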
  13. Not if it's non-reentrant -- that will throw an error -- but I believe you can specifically allocate one instance inside a loop by opening a new reference with 0x80. I believe I've seen other people do that (and I've always been surprised it works). Personally, I always use exactly one pattern:
     -Use a static strict VI ref to get the VI name
     -Wire the static strict VI ref + VI name to Open VI Reference once and only once during the execution of the entire program, using 0xC0 (0x40 + 0x80)
     -Call Start Asynchronous Call as needed
     I've never had an issue with this pattern. The reason I suggested the 1-button dialog as an additional check is that I've seen situations where the debug information doesn't get set in a reentrant function -- for example, I definitely wouldn't look at the front panel, as I believe I've seen the front panel show the default value in some situations. I've also misplaced breakpoints when trying to debug, so putting in the dialog is a quick way to be doubly sure. Long story short, it's definitely not a limitation of the feature, but I have no clue why you're seeing the behavior you're seeing.
  14. I've never seen this issue myself. To simplify debugging, maybe just put in a 1-button dialog tied to the value (30) and make sure it pops up as you expect. Also, the Open VI Reference inside the loop concerns me. The point of 0x40 + 0x80 is to create a clone pool which can be executed in parallel; you're creating N clone pools but only using one clone from each pool.
  15. I did some testing with postgres and holy crap is it fast*. I'd be interested in trying out your library if the offer still stands. *for my use case, vs mysql, without any optimization
  16. Wow, I had no idea fractional times were so recent in mysql. That's crazy. As for timestamps, I had no issues using the standard db connectivity toolkit with timestamps at up to 1 ms of precision (using variant to data), but if you get the value as a string, the format is different from that returned by the database itself, if I remember correctly. We were using both the ODBC driver and the TCP driver, and the string formats were different in some way. Or maybe it was something with time zones? I could see that being an issue if you let labview handle the conversion for you.
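     (For context on the fractional-seconds point: mysql only added sub-second precision in 5.6.4, and you have to declare it explicitly; a small sketch, table name is a placeholder:)

        -- DATETIME(3) stores millisecond precision, DATETIME(6) microseconds;
        -- a bare DATETIME still truncates to whole seconds
        CREATE TABLE dbldate (
            id  INT NOT NULL,
            t   DATETIME(3) NOT NULL,
            val DOUBLE,
            PRIMARY KEY (id, t)
        );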
  17. If you went the DVR-inside-class route, you'd need a 'create' and a 'destroy' function. Create would initialize the references (and potentially call the parent 'create' method) while destroy would release the DVRs. You would pass the 'create'-d object into the task loop, where it could dispatch normally. Enum or string + variant is a reasonable way to implement a message system, but it doesn't work as well for just transferring data around. A better fit is user-defined events, which let you subscribe to several different data types in a single structure. As a nice bonus, the producer of the data doesn't have to be aware of the existence of any consumer -- there could be 10, there could be 0; it doesn't know or care. As a downside, if you're doing expensive processing and nobody is listening to the result, that's kind of a waste of processing power.
  18. Hrm... http://digital.ni.com/public.nsf/allkb/70C818F9E5CFCFDB8625703E007EEAAD This explains how I've seen it sometimes update. Maybe the older code I was looking at had "snapshot" turned off.
  19. Images are bizarre. My experience using them lately is that I basically don't trust anything they do without testing. In the first case, I don't think you need to copy the image into a local reference unless you want to do several processing steps on it (where the source reference could be updated between steps). If you just want to display the image, I think the image display makes a copy of the image data pointed to by the refnum. Usually. I also could swear I've seen it automatically update sometimes, but in the code I was working on last week, it didn't. See below.

     That having been said, unless there is a performance concern I'd just make the copy. I've personally found that it's easier and less painful to re-create dataflow manually through several data copies than it is to deal with the references -- copying a sequential chunk of memory is not expensive. It's worth mentioning my biggest problem with images: the references are named in a global namespace. I had several months of issues related to flattening and unflattening images at runtime, causing camera A to be mixed up with camera B, or the display suddenly getting populated with week-old data because I read from a file where the flattened images had the same name. That experience taught me to be super aggressive with copies, and all my images use GUIDs for names. Copy, copy, copy. Where performance is less of a concern, I've started using ImageToArray and passing around the 2D pixel array dataflow-style. Because seriously, screw globally named references.

     This seems more straightforward. If you dispose of all images, acquisition should fail, since there is no memory space to put new images into. Am I misunderstanding?
  20. From your description, it doesn't sound like you want to do what you are doing. None of the code in the image has any labels, so I can't really tell which bits are important. However, you said you have code chunks which talk to their device and generate data. I would argue that you just want to create N parallel loops and use a more dataflow-y communication mechanism to move data around (queues, events, notifiers, or channel wires). Any process which needs the measurement data can subscribe to the queue/notifier/event and receive its own copy of the data.

     If you decide you do want to use DVRs, one workaround I'd suggest is to introduce DVRs within the class rather than around the class. That is, the private data you're setting could be a big DVR'd cluster, and then your task can have finer-grained control over when it accesses that data. If a task just needs to read a copy of the information, it only needs to hold the lock long enough to copy the data out, and then it can release it. Finally, the property node syntax for data accessors does allow you to operate directly on a DVR, but the access is only as fine-grained as the data access method.
  21. Well, I was going to say that labview rendering relies on the CPU, so I'd think memory and CPU would be a bigger deal (cache on the atoms is tiny, so for even moderately sized images I'd bet processing would be expensive even with prefetching), but it turns out that on one benchmark my old atom tablet (about 2x the power of the one listed) gets a rating of "1%", where 100% is a mid-priced current-gen discrete card (~$250)...so yeah, maybe the GPU :/ Also, running a simple network benchmark uses 33% of the CPU (about 10% for "system", which I believe on windows 10 is the process actually reading from the network before it gets forwarded to the metro app). This was for a whopping 3 MB/s (although, over wifi).
  22. I thought build array did some preallocation, but not very much. I rearranged the code with manual preallocation, and you had to allocate thousands and thousands of elements for it to be faster. I had a stray thought about allocating a percentage of the existing array on top of a baseline and tried it out: for elements 0-10000 it allocates 1000 more when it runs out, then past 10000 elements it grows by 10% of the current array size. That made it about 30% faster than the indexing terminals. But, on the other hand, who cares in this situation? This use case doesn't seem like a big performance bottleneck.
  23. Evergreen is a term I've heard applied to browsers since they auto-update. I went ahead and applied it to web page applications.
  24. Those are all good features, but there are other aspects you're missing just with regards to the general design:
     -Encrypted/authenticated communication via https (since apparently NI can't provide a TLS api on the diagram)
     -Pretty much every popular language can talk https with json or xml payloads
     -The browser is sandboxed
     -Evergreen UI code
     -Accessibility everywhere, even across corporate networks through reverse proxies
     I definitely agree that there's a pretty big hump to get over. I'm building something that uses a lot of the above features but still provides a labview UI, simply because I don't need it to be fancy and I know labview. That having been said, I don't see a ton of risk in the concept. People have been building client-server labview code for a long time, javascript is a proven technology (even if []+{} != {}+[]), and far, far, far more people are successful with javascript than have ever even heard of labview (sadly, for some reason, success eludes newspaper website developers, who love javascript but are horribly bad at it, but I don't think their incompetence means the whole platform is bad).