shoneill Posted September 18, 2015

I'm currently investigating using TDMS as data storage for a new measurement method. In our routine, we sweep up to 3 outputs (with X, Y and Z points each) and record up to 24 channels, so we have X × Y × Z × 24 data points. We create the following:

- Up to X data points for 24 channels of data interleaved in the first dimension (multichannel 1D)
- Up to Y times this first dimension (making the data multichannel 2D)
- Up to Z times this second dimension (making the data multichannel 3D)

So in a sense, we create 4D data. Our old method of storing the data in memory fails quickly when the number of steps in each dimension increases, so we want to store the data in TDMS files instead. But looking at the files and trying to imagine what read speed will be like, I'm unsure how best to store this data. Do I need multiple TDMS files? A single file? How do I map the channels / dimensions to the internal TDMS structure?

As a further step, I would be investigating a routine for retrieving any subset of this data (1D or 2D slices from any combination of dimensions, but almost always one channel at a time).

Can anyone with more experience with TDMS files give some input and help a TDMS noob out?
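To make the layout described above concrete, here is a minimal sketch of the index arithmetic, assuming the samples form one flat stream in which the channel varies fastest, then X, then Y, then Z. The function names are hypothetical and are not part of any TDMS API.

```python
def flat_index(x, y, z, ch, nx, ny, n_channels=24):
    """Offset of one sample in a flat stream ordered channel, X, Y, Z (fastest first)."""
    return ((z * ny + y) * nx + x) * n_channels + ch


def x_line_indices(y, z, ch, nx, ny, n_channels=24):
    """Offsets of every sample along X for one channel at a fixed (y, z)."""
    return [flat_index(x, y, z, ch, nx, ny, n_channels) for x in range(nx)]
```

With this ordering, a single-channel line along X is a strided read (one sample every `n_channels` positions), which is one reason to consider storing each channel separately instead of interleaved.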
shoneill (Author) Posted September 18, 2015

One other point to consider is whether SQLite would be a better idea, given the high level of flexibility and efficiency we are trying to achieve when visualising the data.
Mads Posted September 18, 2015

Unless you need to run this on a target that does not have SQLite support, I would use that (the performance of Shaun's API in particular is impressive).

TDMS is fine if you can write the data as large continuous blocks. If you need to save different groups of data frequently, in smaller write operations, I would use a separate file for each group if using TDMS; otherwise the file will be too fragmented, and the read performance gets really bad.

We use proprietary binary formats ourselves because of this, as we a) need to support different RT targets, b) frequently write small fragments of group data into one and the same file, and c) need to search for and extract time periods fast. (It is 1500x (!) faster than the equivalent TDMS solution.)
shoneill (Author) Posted September 18, 2015

I'm less worried about file fragmentation; I should be able to write the data in more or less sensible chunks. I'm more worried about how to get back the data I want. I want to be able to request data for display by specifying which channel(s) and whether I want X vs Y or Y vs Z or Z vs X and so on. Coupled with the display scale (max-min X), I want to be able to do memory-efficient processing of the raw data before passing it back to be displayed. This should help significantly reduce the memory footprint when dealing with large datasets (and large means up to 1GB). We never need to display that much data, so the actual decimation in this approach will be significant (although I'd prefer a max-min decimation). My worry is how to manage reading from file to get the data into my decimation algorithm as efficiently as possible (both speed-wise and memory-footprint-wise).

I'll have to benchmark them I suppose. I looked at SQLite before, and because I have very limited SQL experience, it's the queries and proper data structure I'm unsure about there. Especially when dealing with custom data reading schemes, I have the feeling a SQL-like approach offers significant benefits.
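The max-min decimation mentioned above can be sketched as follows. This is only an illustration under assumptions: the data for one trace is already available as a Python sequence, and `min_max_decimate` is a hypothetical name. The idea is to split the trace into buckets (for example one per horizontal pixel) and keep only the minimum and maximum of each bucket, so peaks stay visible after reduction.

```python
def min_max_decimate(samples, n_buckets):
    """Reduce samples to at most 2 * n_buckets points, keeping each bucket's min and max."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(samples)
    if n <= 2 * n_buckets:
        return list(samples)  # already small enough, nothing to decimate
    out = []
    for b in range(n_buckets):
        start = b * n // n_buckets
        stop = (b + 1) * n // n_buckets
        bucket = samples[start:stop]
        lo_i = min(range(len(bucket)), key=bucket.__getitem__)
        hi_i = max(range(len(bucket)), key=bucket.__getitem__)
        # Emit min and max in their original order so the trace shape is preserved.
        for i in sorted({lo_i, hi_i}):
            out.append(bucket[i])
    return out
```

To keep the memory footprint low, the same function could be applied to each chunk as it is read from file, so the full raw trace never has to sit in memory at once.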
hooovahh Posted September 18, 2015

Huge fan of TDMS over here, so personally I'd probably go with that, but I've heard good things about SQLite, so that is probably an option too. With TDMS the write is very fast in just about all cases. The read is where it can be less efficient. As mentioned before, file fragmentation is the biggest cause of long read and open times. In my logging routines I would have an actor dedicated to logging, and among other things it would periodically close, defrag, and re-open the file to help with this issue. But if you write in decent-sized chunks you might not have an issue.

There are probably lots of ways to write a 4D array to a TDMS file. Obviously it is only supposed to be a 2D type of structure, something like an Excel worksheet. But just like Excel you can have another layer, which is groups. So here we have a way of logging a 3D array, where you have groups, channels, and samples. How you decide to implement that 4th dimension is up to you. You could have many groups, or many channels in a group. You'd then want to encapsulate that in your read routine so that, as you said, you request X vs Y and it takes care of where in the file it needs to read. Another neat benefit of TDMS is the offset and length options on read. So you can read chunks of the file if it is too large, or just as a way to be efficient if the software can only show you part of it at a time anyway.

Conceptualizing a 3D array of data can be difficult, let alone a 4D one. Regardless of file type and method, you are probably going to have a hard time knowing if it even works right. I wanted to write a test, but because I'm using made-up data I can't tell whether it works right.
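One way to encapsulate the read side is to turn a slice request into a list of (group, channel, offset, length) reads. The sketch below assumes one possible layout that nobody in the thread has confirmed: one group per Z step, one TDMS channel per measurement channel, and within a channel the samples ordered with X fastest, then Y. The naming scheme (`Z0000`, `ch00`) and the function `read_plan` are hypothetical.

```python
def read_plan(axis, ch, shape, x=0, y=0, z=0):
    """Return (group, channel, offset, length) reads for a 1D slice along one axis.

    shape is (nx, ny, nz). Assumed layout: group per Z step, channel per
    measurement channel, samples ordered X fastest, then Y.
    """
    nx, ny, nz = shape
    channel = f"ch{ch:02d}"
    if axis == "x":
        # Contiguous: one read of nx samples.
        return [(f"Z{z:04d}", channel, y * nx, nx)]
    if axis == "y":
        # Strided within one group: ny single-sample reads.
        return [(f"Z{z:04d}", channel, yi * nx + x, 1) for yi in range(ny)]
    if axis == "z":
        # One sample from the same offset in every group.
        return [(f"Z{zi:04d}", channel, y * nx + x, 1) for zi in range(nz)]
    raise ValueError("axis must be 'x', 'y' or 'z'")
```

Under this layout only slices along X are a single contiguous read; slices along Y or Z become many small reads, so the choice of which dimension goes where should probably follow the slices requested most often.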
ShaunR Posted September 18, 2015 (edited)

shoneill wrote: "I'll have to benchmark them I suppose. I looked at SQLite before, and because I have very limited SQL experience, it's the queries and proper data structure I'm unsure about there. Especially when dealing with custom data reading schemes, I have the feeling a SQL-like approach offers significant benefits."

There is a benchmark in the SQLite API for LabVIEW with which you can simulate your specific row and column counts, and an example of fast datalogging with on-screen display and decimation. The examples should give you a good feel for whether SQLite is an appropriate choice.

Generally, if it is high-speed streaming to disk (like video) I would say TDMS. Nothing beats TDMS for raw speed. For anything else: SQLite.

What is your expected throughput requirement?

(Edited September 18, 2015 by ShaunR)
eberaud Posted September 18, 2015

My application writes thousands of samples for approximately 1000 channels in a single group. The read/write operations can be a bit slow on a regular hard drive, but we use SSD drives or a RAM disk, and then it works perfectly and at very high speed. I'm a big fan of TDMS now. It was tough to get around the Advanced API palette at the beginning; it took a bit of understanding...
OlivierL Posted September 18, 2015

We are quite fans of TDMS here as well. Read speeds can definitely be an issue, but as pointed out by Manu, SSD helps a lot. Also, we have not tested it yet, but the 2015 version of the API now includes a "TDMS In Memory" palette, which should offer very fast access if you need it in your application without having to install external tools such as "RAM Disk".

As an aside, another tool we really like for viewing TDMS files is DIAdem. We use it mostly as an engineering tool, as we've had issues with the reporting feature in the past. It is a LOT faster and easier to use than Excel when it comes to crunching a lot of data and looking at many graphs quickly. Unfortunately, at the moment, it doesn't support display of 4D graphs, but I posted a question on the NI Forum about a possible way to implement such a feature through scripts. We don't have the skills or time to do it internally at the moment, but I would really like to know if anyone has created such a function and wants to share it. There is also a KB that you can look at here, but I do not think that it will meet your requirement for 4D display.
ShaunR Posted September 18, 2015 (edited)

Just as an afterthought: SQLite supports RTree spatial access methods too. Maybe relevant to your particular use case.

(Edited September 18, 2015 by ShaunR)
drjdpowell Posted September 18, 2015

I note that the SQL code to do arbitrary planar cuts through a 3D cube seems relatively straightforward, with a single large table and a simple WHERE clause ("SELECT … WHERE ABS(1.2*X+0.3*Y+0.9*Z) < tolerance", for example). So you should prototype that with a large dataset and see if the performance is sufficient.

Also, don't neglect the 3D picture control for visualization.
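A small prototype of that planar-cut query, using Python's built-in `sqlite3` module purely for illustration. The table name `samples`, its columns, the synthetic values, and the extra plane offset `d` (so the plane is a*x + b*y + c*z = d) are all assumptions made for this sketch, not anything prescribed in the thread.

```python
import sqlite3


def build_cube(conn, n):
    """Fill a single table with an n x n x n grid of synthetic samples for channel 0."""
    conn.execute(
        "CREATE TABLE samples (x REAL, y REAL, z REAL, ch INTEGER, value REAL)"
    )
    rows = [
        (x, y, z, 0, x + 10 * y + 100 * z)
        for x in range(n) for y in range(n) for z in range(n)
    ]
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?)", rows)


def planar_cut(conn, a, b, c, d, tolerance, ch=0):
    """Return the points of one channel lying within tolerance of a*x + b*y + c*z = d."""
    sql = (
        "SELECT x, y, z, value FROM samples "
        "WHERE ch = ? AND ABS(? * x + ? * y + ? * z - ?) < ? "
        "ORDER BY x, y, z"
    )
    return conn.execute(sql, (ch, a, b, c, d, tolerance)).fetchall()


conn = sqlite3.connect(":memory:")
build_cube(conn, 3)
middle_z_plane = planar_cut(conn, 0, 0, 1, 1, 0.5)  # the plane z = 1
```

A WHERE clause on a computed expression like this generally scans the whole table, so the timing on a realistically large dataset is the thing to measure before committing to the design.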
shoneill (Author) Posted September 22, 2015

Hmm, my initial testing seemed not to bode too well for TDMS. I was getting miserable write speeds. I was iterating through the data I wanted to write, appending new channels and creating new groups as required, and writing point by point. This yields terrible results.

I have since found the all-important "TDMS Set Channel Information" function, which lets me tell TDMS what I'm going to be writing, which in turn allows it to write in the most efficient way. It seems to be the very important missing piece of my puzzle.

It's a much more involved thing than I was expecting, and I find resources that really explain how to get the best out of any given situation (how your data is received versus how you want it saved) rather lacking on the internet.

I suppose I'll have to just get my hands dirty and experiment. I think I have a much better grasp of how to optimise things now.

Shane
hooovahh Posted September 22, 2015

shoneill wrote: "I suppose I'll have to just get my hands dirty and experiment. I think I have a much better grasp of how to optimise things now."

Yeah, I'm a big fan of TDMS and I still learn things every once in a while by experimenting. One thing that helps is, as you already noticed, writing chunks of data. Basically, call the Write function as few times as possible. If you are getting samples one at a time, put them into a buffer, then write when you have X samples. Got Y channels of the same data type which get new data at the same rate? Try writing X samples for Y channels in a 2D array. I think writing all the data in one group at a time helps too, but again that might have been a flawed test of mine. I think alternating between writing in multiple groups made for fragmented data.

Because of all of these best practices I usually end up writing an actor that takes care of the TDMS calls, which can do things like buffering, periodic defrag, and optimized writing techniques. A bit of a pain for sure when you are used to just writing to a text file and appending data, but the benefits are worth it in my situations.
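The buffering idea above, sketched in a language-neutral way: collect single samples and hand them to the writer only in chunks. `ChunkBuffer` and the `write_chunk` callback are hypothetical names; in a real application the callback would wrap whatever write call is in use (a TDMS Write in LabVIEW, for instance).

```python
class ChunkBuffer:
    """Collect single samples and pass them to write_chunk in blocks of chunk_size."""

    def __init__(self, write_chunk, chunk_size):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self._write_chunk = write_chunk
        self._chunk_size = chunk_size
        self._pending = []

    def append(self, sample):
        self._pending.append(sample)
        if len(self._pending) >= self._chunk_size:
            self.flush()

    def flush(self):
        """Write whatever is pending; call once more before closing the file."""
        if self._pending:
            self._write_chunk(self._pending)
            self._pending = []
```

For example, with `chunk_size=4` and ten appended samples, the writer is called three times (4, 4, then 2 samples on the final `flush()`) instead of ten.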
JKSH Posted September 23, 2015

hooovahh wrote: "One thing that helps is as you already noticed, writing chunks of data. Basically calling the Write function as few times as possible. If you are getting samples one at a time, put it into a buffer, then write when you have X samples. Got Y channels of the same data type which get new data at the same rate? Try writing X samples for Y channels in a 2D array. I think writing all the data in one group at a time helps too but again that might have been a flawed test of mine. I think it made for fragmented data, alternating between writing in multiple groups."

You're right, writing in chunks reduces fragmentation and improves read/write performance.

However, you can let the TDMS driver handle this for you instead of writing your own buffer code:

http://zone.ni.com/reference/en-XX/help/371361M-01/lvconcepts/fileio_tdms_file_buffering/
http://zone.ni.com/reference/en-XX/help/371361J-01/lvhowto/setting_tdms_buffersize/
hooovahh Posted September 23, 2015

JKSH wrote: "However, you can let TDMS driver handle this for you instead of writing your own buffer code:"

I never had good luck with this. Maybe it was older versions of TDMS, but it never seemed to work right. I can try it again and see if it works right.
eberaud Posted September 23, 2015

shoneill wrote: "I suppose I'll have to just get my hands dirty and experiment."

I think that's the key. I spent a lot of time tinkering around, but thanks to that I now have a good understanding of the TDMS API and how to optimize the R/W operations. If, like me, you do your own decimation after the read operation, there is a sweet spot where it starts being more efficient to read each sample you're interested in one by one instead of reading a big block and decimating it.
shoneill (Author) Posted September 25, 2015

Gah, problem time.

Our data requires the ability to pass back a running average at any time. This is proving to be a bit difficult. I'm able to save all of our "static" data into a TDMS file at really good speeds with no fragmentation; so far so good. I want to maintain a running average somewhere in the file, and I thought I could pre-allocate a group for this, fill it with dummy data, and then update it by setting the file write pointer as required and overwriting the already written data with newly calculated data (read, modify, write). The problem is that setting the file pointer requires the file to have been opened via the advanced Open primitive. If I do this, the "normal" functions don't seem to work.

We need a running average because some of our measurements last several hours, and giving no feedback during this time is not cool. As it is, generating the average when the full dataset is present is no problem; it's the running average I have trouble with. The data required for this running average could run into several hundred megabytes; we're dealing with potentially very large datasets here.

I know this mixed-mode behaviour isn't what TDMS is supposed to do, but does anyone have any smart ideas on how to do this without having to utilise a temporary external file (and copy the results over when finished)? A temporary file would require my data-retrieval routine to be aware of this extra file and pull in the current averaged data when required. More work than I was hoping for.
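Independent of where the running average ends up being stored, the update itself can be done incrementally. The sketch below assumes the running average is a point-by-point mean across repeated sweeps of equal length (the post does not spell this out), and `RunningMean` is a hypothetical name. Only the previous mean and the sweep count are needed, using mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k.

```python
class RunningMean:
    """Point-by-point running mean over repeated sweeps of equal length."""

    def __init__(self, n_points):
        self.count = 0
        self.mean = [0.0] * n_points

    def update(self, sweep):
        """Fold one new sweep into the mean without keeping earlier sweeps."""
        if len(sweep) != len(self.mean):
            raise ValueError("sweep length does not match")
        self.count += 1
        k = self.count
        self.mean = [m + (v - m) / k for m, v in zip(self.mean, sweep)]
```

Because each point's update depends only on that point's previous mean, the same read-modify-write could be applied one chunk at a time if the averaged data is too large to hold in memory.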
hooovahh Posted September 25, 2015

Is there a reason the running average needs to be in the file? Just curious why you don't have a circular buffer in your program and calculate the average with that. Even if you do really want it to be in a file for some reason, is there a reason you can't have two files? It sounds like the running average is filled with dummy data anyway and could be saved in a temp location.
eberaud Posted September 28, 2015

Also, I would abandon the regular API and rely only on the Advanced API. The regular API lacks a lot of flexibility; it's only intended for very basic usage.