
So many binary file options!


Recommended Posts

I've got a project going in (now) LabVIEW 8.20 that uses datalog files to store and review several MB of data. It works, it conserves disk space, and it's fast, but it's not very flexible. If I change the cluster typedef that defines a record, I lose the ability to read older data files, unless I save an outmoded version of the file reader that uses the old typedef.

I'm thinking of changing to TDMS files for more flexibility. I don't mind paying significantly more disk space, if that's an issue, but I would mind if the speed dropped more than, say, 10-20%. Does anybody know what the tradeoffs are -- or would you recommend an entirely different alternative?

I would just go ahead and do it and report the results, but it will be a lot of work, so I thought I'd ask first. In the larger of my two datalogs, each record consists of one 2-D and nine 1-D arrays, some I32 and some DBL. Each record uses about 500 kB of disk space. The data is decimated before it ever gets to a file, and I alternate between two files, overwriting as needed, to keep the total size reasonable while still having a reasonably representative sample of all the data.

This thread could be a good place to expound your general approach to filing data. Try not to be shocked by my ignorance ... :unsure:

Link to comment
would you recommend an entirely different alternative?

I wrote a well-performing HDF5 1.8 API for LabVIEW (Windows). It is flexible, it performs well, and there are APIs for C, C++, Java, Fortran, MATLAB, etc. The API is for LV 8.0 or newer and is currently at the pre-release stage. We're planning to release it when HDF5 1.8 is released; HDF5 1.8 is itself currently a pre-release version, although it's very stable. If you want to beta test the package, please let me know.

Link to comment
Try not to be shocked by my ignorance ... :unsure:

Same goes for me.

Am I being naive in thinking that the fastest you can achieve is just writing raw binary to disk using the OS's file-write functions? Of course, you would have to optimize this so you're writing optimally sized chunks of data at a time. My thinking has always been that this involves the least amount of code/manipulation between the data source and the disk, and would therefore be the most efficient. Is this not true?
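
For what it's worth, here is a minimal sketch of that chunked raw-binary idea in Python. The 64 KB chunk size is only an assumption for illustration; the optimum depends on the OS, the file system and the disk:

    import numpy as np

    CHUNK_BYTES = 64 * 1024  # assumed chunk size; tune for your OS/disk

    def write_raw(path, samples):
        # Flatten to raw bytes, then write one optimally sized chunk at a time.
        data = np.asarray(samples, dtype=np.float64).tobytes()
        with open(path, "wb") as f:
            for start in range(0, len(data), CHUNK_BYTES):
                f.write(data[start:start + CHUNK_BYTES])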

Gary

Link to comment

Advice on speed.

Pre-write your files. Whenever you are going to be spewing to disk at a high rate, you get better performance if you write the entire file with dummy data before the real data writing starts. Of course, you have to know how big the file will be.

This pre-writing gets the OS to allocate all of the disk space ahead of time and reduces the I/O to just the data going to disk.
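
A minimal sketch of this pre-write idea in Python, assuming the final file size is known up front:

    import os

    def prewrite(path, total_bytes, block=1024 * 1024):
        # Fill the file with dummy zeros so the OS allocates all the space now.
        zeros = b"\x00" * block
        with open(path, "wb") as f:
            written = 0
            while written < total_bytes:
                n = min(block, total_bytes - written)
                f.write(zeros[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())  # push the allocation out to disk

    # Later, reopen with "r+b" and overwrite the dummy data in place:
    #   with open(path, "r+b") as f:
    #       f.seek(record_offset)
    #       f.write(record_bytes)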

Ben

Link to comment
Advice on speed.

...

This pre-writing gets the OS to allocate all of the disk space ahead of time and reduces the I/O to just the data going to disk.

Interesting! It makes sense, but it never occurred to me that you should preallocate disk space for the same reason as you would preallocate memory. :worship:

Link to comment
I'm thinking of changing to TDMS files for more flexibility. I don't mind paying significantly more disk space, if that's an issue, but I would mind if the speed dropped more than, say, 10-20%. Does anybody know what the tradeoffs are -- or would you recommend an entirely different alternative? :unsure:
Actually, the S in TDMS stands for "Streaming", so I assume it should be optimized for speed; however, we shouldn't assume anything. With version 8.0 and earlier, the format was called TDM. It was changed and enhanced in 8.2 and is now called TDMS. I would be curious to see some benchmarks.
Link to comment

If you're logging/reading stable data, I'm going to recommend you use TDMS. That's what it was designed for, and it is very, very good at it. However, your comment was about data whose definition changes -- you change your cluster and suddenly you can't read old data. I'm not sure if TDMS has any functions for automatically handling such changes to the stored data.

But there is one feature in LabVIEW 8.2 that does recognize versions of data and automatically provides the ability to unflatten and mutate old data into the current format -- LabVIEW classes. When you edit a LabVIEW class' cluster, the .lvclass file creates a recipe for how to get from the old cluster to the new cluster based on the edits you make in the control editor. When LV class data is flattened, the version number of the class is written down as part of the flattened string, so when you unflatten, LV knows which recipe to apply.

Unflattening LVClass data is pretty fast if it is already in the current version. I can't really give benchmarks for unflattening old-version data, since it depends heavily upon what edits were made, but it tries to be as efficient as possible.
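
The .lvclass mechanics are internal to LabVIEW, but the version-number-plus-recipe idea is easy to sketch in ordinary code. Everything below is hypothetical illustration (field names, recipe contents), not the actual flattened-string format:

    CURRENT_VERSION = 3

    def _v1_to_v2(rec):
        # v1 -> v2: a "gain" field was added; fill in its default value.
        return {**rec, "gain": 1.0}

    def _v2_to_v3(rec):
        # v2 -> v3: "label" was renamed to "name"; carry the data across.
        rec = dict(rec)
        rec["name"] = rec.pop("label")
        return rec

    # One recipe per version step, applied in sequence.
    RECIPES = {1: _v1_to_v2, 2: _v2_to_v3}

    def unflatten(version, record):
        # The stored version number tells us which recipes still need to run.
        while version < CURRENT_VERSION:
            record = RECIPES[version](record)
            version += 1
        return record

    print(unflatten(1, {"label": "ch0", "data": [1, 2, 3]}))
    # -> {'data': [1, 2, 3], 'gain': 1.0, 'name': 'ch0'}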

Link to comment
I wrote a well-performing HDF5 1.8 API for LabVIEW (Windows). [...] If you want to beta test the package, please let me know.

Whoosh! That's the sound of your website going over my head. Well, mostly. Still, I might come back to this, depending on how I fare with TDMS.

If you're logging/reading stable data, I'm going to recommend you use TDMS. That's what it was designed for, and it is very, very good at it. However, your comment was about data whose definition changes -- you change your cluster and suddenly you can't read old data. I'm not sure if TDMS has any functions for automatically handling such changes to the stored data.

Actually, the TDMS file-write primitive can't handle my cluster as-is. I'd have to separate it back into its component arrays, though I could group a few sets of arrays that are the same length. I'd also have to add a numeric to keep track of the variable length of certain arrays: for some of my arrays it's always a certain number of points per write; for others, not. Datalog was very forgiving and user-friendly about that. The good side of TDMS is that every channel has a name, and when I modify my cluster, it tends to be by adding more arrays (more "channels"). So I guess I could just add a little error-handling routine for "channel name not found" whenever I modify/upgrade my file reader.
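
That fallback could be as small as this sketch, where a plain dict stands in for the channels in a TDMS file and the channel names are made up:

    def read_channel(channels, name, default=()):
        # Return the named channel's data, or a default if the file predates it.
        return channels.get(name, list(default))

    # An old file written before the "Temperature2" channel was added:
    old_file = {"Temperature1": [20.1, 20.3], "Pressure": [1.01, 1.02]}
    temp2 = read_channel(old_file, "Temperature2")  # -> [] instead of an error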

Alternatively, if I stick with datalogs, I can first examine the file date stamp and then decide which cluster typedef to use. Hmm. Sounds ugly, yet less work in the short and medium run ...

Link to comment
But there is one feature in LabVIEW 8.2 that does recognize versions of data and automatically provides the ability to unflatten and mutate old data into the current format -- LabVIEW classes. [...]

Well Aristos, that's a very nifty feature!

All other methods have the problem of "what do you rename to what", and like this you have a perfect solution for that. My compliments for the excellent idea (to whoever at NI came up with it).

Joris

Link to comment
When you edit a LabVIEW class' cluster, the .lvclass file creates a recipe for how to get from the old cluster to the new cluster based on the edits you make in the control editor. When LV class data is flattened, the version number of the class is written down as part of the flattened string, so when you unflatten, LV knows which recipe to apply. [...]

Stephen,

Is there any way to inspect the object version of the flattened data (either on disk or after reading it into memory)? This would be nice, since it would allow us to implement our own upgrade logic on top of what LabVIEW gives us for free.

Also, how does this feature handle modifications to type-definition elements of the object data? For example, suppose my object data contains a sub-cluster which is also a type definition; editing that typedef will effectively edit the object data.

This sounds like a very nice feature! I'm looking forward to giving this a try.

Thanks,

-Jim

Link to comment
Is there any way to inspect the object version of the flattened data (either on disk or after reading it into memory)? This would be nice, since it would allow us to implement our own upgrade logic on top of what LabVIEW gives us for free.

Didn't I put the flat string format into the online help for LV 8.2? I'm not going to go check at the moment, but I think so -- it should be in the section of the manual where we talk about the flat format of every other LV data type. If I didn't, remind me after New Year's so I can post the format on DevZone.

Eventually I'd like to post a VI toolkit for examining the mutation records themselves and tweaking them, but that'll be a while.

Also, how does this feature handle modifications to type-definition elements of the object data? For example, suppose my object data contains a sub-cluster which is also a type definition; editing that typedef will effectively edit the object data.

If you have a subcluster of three elements and you drag one of those elements out of the inner cluster and put it in the outer cluster, the data will move accordingly. If you pop up and do a Replace on a subelement, the data will be preserved if the types are compatible. Basically, the LVClass is fully cognizant of all the changes you can make and creates a recipe based on the before-you-start-editing and when-you-Apply-Changes-or-save states of the control editor, noting where every control moved. Be aware that "pop up and use Replace on the last element" is not the same as "delete the last element and then add a new element": the first preserves data from the original element to the new element (assuming type compatibility), while the latter resets the last element to whatever default value you set in the added control. It's pretty intuitive; you just need to be aware of what you're doing. And it's another reason to eventually expose the VIs, to get better feedback on the mutation records.

There are reported bugs in LV 8.2 having to do with typedefs -- if you use a typedef inside the private data cluster and then edit the typedef, this constitutes an edit on the LVClass and bumps the version number. When you unflatten data of the old version, the mutation does not preserve as much data as it ought to, but I'm not sure yet whether those are bugs or just the data being preserved to the same degree it gets preserved in any typedef edit. It's been CAR'd and I'm evaluating.

Link to comment
Actually, the S in TDMS stands for "Streaming", so I assume it should be optimized for speed; however, we shouldn't assume anything. With version 8.0 and earlier, the format was called TDM. It was changed and enhanced in 8.2 and is now called TDMS. I would be curious to see some benchmarks.

I use the TDMS format for a SCADA-like LV 8.2 application. Each tag is stored at a configurable interval with its status and timestamp. Properties can be added to each group or channel.

I understand that we now have two file formats:

TDM = XML-encoded header file holding the channel-group configuration and all their properties + additional binary files holding the channel data (it is also the native DIAdem storage format, so I do not think it will disappear)

TDMS = binary header file + additional binary files holding channel data

TDMS has a very clean interface and is available on more LabVIEW platforms than TDM.

>>> Does anybody know if it is available under LV 8.2 for Linux?

There is a LabVIEW TDMS-to-TDM file converter available with LV 8.2.

For TDMS, there is an update for the 'viewer' VI on the NI website.

There is also a TDM plugin available for Excel, which works very well (at least with my test files, which are not that complex in structure); the TDMS Excel plugin is still under construction at NI.

I use DIAdem to visualise the TDMS files (an import plugin is needed for DIAdem 10). They work very well together, but of course DIAdem is not widely used yet and not free.

Possible problem: your runtime (in case you build an executable) will get (much?) bigger when you include the USI library in the installer script (Universal Storage Interface, I guess). My target is an embedded controller where each MB counts; otherwise it doesn't matter so much.

Conclusion: I'm very pleased with TDMS, as it is very flexible and fast, and the number of library VIs is very limited. What I would like to test is performance for big chunks of data (2 GB+ files); if these tests are OK, I'm going to use TDMS for an EEG/EMG recording application that records about 32 x 2 GB files a day (EDF format). TDM was too slow for this application.

Link to comment
  • 2 weeks later...

I decided to go with classes, partly because it was relatively easy, given that I was already using datalogs. Using TDMS would have required me to stop bundling disparate data types into clusters. And I just know I'm gonna love the automatic revision-tracking feature. Thanks Aristos!

Link to comment
I decided to go with classes, partly because it was relatively easy, given that I was already using datalogs. Using TDMS would have required me to stop bundling disparate data types into clusters. And I just know I'm gonna love the automatic revision-tracking feature. Thanks Aristos!

You're welcome. Let me know if you have any issues.

Link to comment
  • 4 weeks later...

Short version

TDMS was created so people wouldn't have to decide between so many different file formats any more. It's not all done yet, but as far as saving measured data (= anything that can be looked at as waveforms / 1D arrays) goes, TDMS beats all other formats in LabVIEW in writing and reading performance (see the fine print below ;) ).

=> If it is reasonable for you to break down your data into properties and data types that TDMS can handle (1D/2D arrays of numerics, strings, and timestamps, plus all kinds of waveforms), we clearly recommend TDMS. If your data types are too complex for that, Datalog / Bytestream is your best bet.

Long version

Prior to making TDMS, we put together a set of reference use cases (ranging from 1-channel-super-fast to 10000-channels-single-point) and ran benchmarks on these use cases with all the different file formats we had. The result was that most formats were good at something, but every format had significant disadvantages. Some examples:

  • HDF5 is great for saving a few channels very fast. If you have 100 or more channels, though, or if you keep adding objects to the file (e.g. when you're storing FFT results), performance decreases exponentially with the number of objects.
  • Both Datalog and HDF5 maintain trees for random access, which creates hiccups in performance that usually exceed 0.5 seconds apiece. For streaming applications, 0.5 seconds is a very long time.
  • TDM was developed for DIAdem, where every file is loaded and saved in one piece. It stores every channel as a contiguous chunk of data. If you want to add a value to the end of a channel, you need to move all subsequent channels in order to make room for that value. We have done a few things to diminish this issue, but the bottom line is that TDM is not suitable for streaming at all.
  • TDM stores the descriptive data in an XML file. That creates the following issues:
    • You always have two files you need to copy, delete, email or whatever.
    • XML files are read, parsed and written in one piece. The performance of adding a new object decreases with the size of the XML file.
    • XML is slow (think 10000 channels).
    • The TDM component uses XPath to query for objects, which rules out using pretty much any special character (including blanks).

  • [...]

TDMS was built to eliminate all the issues listed above. Even though the "S" stands for "Streaming", TDMS in LabVIEW 8.20 beats all other file formats we have in LabVIEW in writing and reading performance. There are some areas we're still working on though, as you can see in the following fine print.

Fine print

  • With very low channel numbers and high throughput, HDF5 is still writing about 10% faster than TDMS.
  • Unstructured LabVIEW binary files in some cases beat TDMS by a few percentage points (but try reading them...).
  • If you store single values, we recommend that you gather e.g. 1000 values in an array and store that array; otherwise, reading performance will be very bad (a minimal buffering sketch follows this list). Note that a LabVIEW version that is coming up really soon will be able to do that automatically.
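
A sketch of that gathering step in plain Python, purely illustrative; the write_array callback stands in for whatever array-write call the file API provides, not the actual TDMS functions:

    class BufferedChannelWriter:
        # Gather single values and hand them to the file in blocks.
        def __init__(self, write_array, size=1000):
            self.write_array = write_array  # called with a full block of values
            self.size = size
            self.buf = []

        def write_value(self, x):
            self.buf.append(x)
            if len(self.buf) >= self.size:
                self.flush()

        def flush(self):
            # Call once more at shutdown so a partial block isn't lost.
            if self.buf:
                self.write_array(self.buf)
                self.buf = []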

If you figure out more of these, please don't hesitate to let me know.

Hope this helps,

Herbert

Link to comment
