File structures for large data sets

Gratch · May 16, 2007

I have an application that allows a user to perform a large number of repetitive tests as part of an experiment. Typically 1,000 to 5,000 tests are performed at a time, though recently a colleague performed 40,000 tests in a single experiment. Multiple sets of tests may be performed on one specimen.

Currently an ASCII summary data file is created for each experiment and the option exists for the user to save the data for each individual test, in ASCII again. A typical test may contain anywhere between 1,000 and 5,000+ data points for two channels, depending on test options selected.

The two options that are currently top of my list are to use a database (probably MySQL) or HDF5.

I have no experience of HDF5, other than the PDF file I read this afternoon.

I have some experience of MySQL, but have concerns over the design of the database structure.

A DB structure would probably have 1 table for calibration info, 1 table for test info and then I'm not decided on whether to have

a) one table for testdata, which potentially could end up with hundreds of thousands of data entries,

b) to create a new table for each experiment, or

c) to create a new DB for each specimen, with new test data tables for each experiment.
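A minimal sketch of what option (a) could look like. It assumes Python's built-in sqlite3 module standing in for MySQL, and it assumes each test's samples are stored as one binary blob per channel rather than one row per data point. All table and column names are hypothetical, not taken from the application described above.

```python
import sqlite3
from array import array

def create_schema(conn):
    # Hypothetical layout: one row per experiment, one row per test,
    # and one row per (test, channel) holding the samples as a blob.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS experiment (
            id INTEGER PRIMARY KEY,
            specimen TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS test (
            id INTEGER PRIMARY KEY,
            experiment_id INTEGER NOT NULL REFERENCES experiment(id),
            seq INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS testdata (
            test_id INTEGER NOT NULL REFERENCES test(id),
            channel INTEGER NOT NULL,
            samples BLOB NOT NULL,
            PRIMARY KEY (test_id, channel)
        );
        CREATE INDEX IF NOT EXISTS idx_test_experiment
            ON test(experiment_id);
    """)

def store_test(conn, experiment_id, seq, channels):
    # channels: list of float sequences, one sequence per channel.
    cur = conn.execute(
        "INSERT INTO test (experiment_id, seq) VALUES (?, ?)",
        (experiment_id, seq))
    test_id = cur.lastrowid
    for ch, values in enumerate(channels):
        blob = array("d", values).tobytes()  # packed float64 samples
        conn.execute(
            "INSERT INTO testdata (test_id, channel, samples) VALUES (?, ?, ?)",
            (test_id, ch, blob))
    return test_id

def load_channel(conn, test_id, channel):
    # Read one channel of one test back as a list of floats.
    row = conn.execute(
        "SELECT samples FROM testdata WHERE test_id = ? AND channel = ?",
        (test_id, channel)).fetchone()
    out = array("d")
    out.frombytes(row[0])
    return list(out)
```

With this layout a 40,000-test experiment on two channels gives 80,000 rows in testdata, and the index on experiment_id keeps per-experiment lookups from scanning the whole test table. Whether blobs or one row per data point is the better trade-off depends on whether individual samples ever need to be queried in SQL.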

My initial guess for an HDF5 structure would be something akin to (b) but that is a guess, as at the moment I know very little about it.

Am I on the right track with any of these?

Matt

Herbert · May 16, 2007

I think I answered this on Info-LabVIEW earlier today ... for the kind of dataset you describe, the TDMS file format and the TDM Streaming functions (subpalette on File I/O) would be a good solution. TDMS files are binary, so their disc footprint is going to be much smaller than ASCII. The file size is only limited by your hard disc size. Within the file, you can organize data in groups that you can assign names and properties to (so you can use a smaller number of files). LabVIEW comes with a viewer application for TDMS. TDMS is also supported in CVI, SignalExpress and DIAdem, plus we provide an Excel AddIn for TDMS as a free download. If you need more connectivity than that, there's also a C DLL and documentation of the file format available on ni.com.
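To put a rough number on the ASCII versus binary footprint, here is a small Python sketch. It does not use TDMS at all. It only compares a plain text encoding with raw float64 storage, and the text format (one value per line, six decimals in scientific notation) is an assumption chosen for illustration.

```python
import struct

def ascii_size(values):
    # One value per line, scientific notation with 6 decimals.
    return sum(len(f"{v:.6e}\n") for v in values)

def binary_size(values):
    # Raw little-endian float64: 8 bytes per value.
    return len(struct.pack(f"<{len(values)}d", *values))

# One channel of a 5,000-point test, filled with made-up values.
samples = [0.001 * i for i in range(5000)]
print("ASCII bytes: ", ascii_size(samples))
print("binary bytes:", binary_size(samples))
```

The exact ratio depends on how many digits the text format keeps. Note that the text form above also throws away precision that the 8-byte binary form retains.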

An SQL database might be a reasonable solution, too - if it is well designed. It'll certainly help you maintain the large number of tests that you are storing. HDF5 is probably a bad idea. It is great for storing few signals at a high speed, but it has some really bad performance issues when it comes to storing large numbers of data sets.

Hope that helps,

Herbert

Tomi Maila · May 16, 2007

QUOTE(Herbert @ May 15 2007, 09:50 PM)

HDF5 is probably a bad idea. It is great for storing few signals at a high speed, but it has some really bad performance issues when it comes to storing large numbers of data sets.

Herbert, could you please specify the performance issues and if possible refer to the source.

Tomi

Herbert · May 16, 2007

QUOTE(Tomi Maila @ May 15 2007, 02:24 PM)

Herbert, could you please specify the performance issues and if possible refer to the source.
Tomi

Tomi,

Prior to making TDMS, we ran a bunch of different benchmarks on a variety of file formats. Test cases included high-speed logging on 10 channels, single-value logging on 10000 channels, saving FFT results (the point being that you cannot append FFT channels) and more. HDF5 does great on small numbers of channels, but it started having issues when we had about 100 data channels, where a channel in HDF5 is a node with a bunch of properties and a 1D array. If you keep adding channels (as you have to in the FFT results use case), performance goes down exponentially (!) with the number of channels.

HDF5 furthermore tends to produce spikes in time consumption when writing. We contacted the HDF5 development team about that and they responded that it was a known issue they would be working on, but they couldn't give us a timeline for when it would be fixed.

Herbert

Tomi Maila · May 17, 2007

Herbert, this is valuable information to us. I need to test if this is still true for HDF5 1.8. We've been using HDF5 for small scale projects but were intending to use it for larger files as well. Perhaps I'll post a performance comparison of HDF5 1.8, TDM and TDMS to my blog some day...

I'd love to use TDMS but it doesn't suit our needs as it is today with only two hierarchy levels and lacking support for multidimensional arrays (3-15d) and scalars. Are you intending to extend the TDMS format to support these features?

Tomi

Herbert · May 17, 2007

Benchmarks

Every benchmark was run on a "clean" machine. The machine is what used to be a good office machine 2 years ago. It has software RAID, which is a minor influence on some of the benchmarks. Depending on what machine you use (e.g. what hard drive, single-processor vs. dual-processor etc.), results may obviously vary. If you see spikes in time consumption where my benchmarks don't show any, you might need a better hard disc / controller. The hard disc on Windows needs to be defragmented and at least half empty in order to achieve reproducible results. Disable on-demand virus scanning, and shut down any service that Windows can survive without. Load your benchmark VI, open the task manager, wait until processor usage stays at zero and hit run. Make sure you have plenty of memory, so your system never starts paging.

We did not care about the time it takes to write small amounts of data to disc. Windows will buffer that data and your application continues to run before the data is actually on disc. We only cared for sustained performance that you can hold up for an extended period of time. In order to achieve this "steady state", we stored at least 1000 scans in each of our benchmarks. The graphs in the attached PDF files show the number of scans stored on the x axis and the time it took for a single scan to be written on the y axis. The time consumed is only the time for the writing operation. Time consumed by acquisition and other parts of the application is not included.
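The measurement described above (time only the write call, once per scan, over at least 1000 scans) can be sketched in Python as follows. This is not the LabVIEW benchmark tool. It writes raw float64 data to a plain binary file, and the function name, parameters and the optional sync flag are all assumptions made for illustration.

```python
import os
import tempfile
import time
from array import array

def time_scans(path, n_scans, n_channels, samples_per_scan, sync=False):
    """Return the wall-clock time of each scan's write call, in seconds."""
    # Build one scan of dummy data up front so only the write is timed.
    scan = array("d", [0.0] * (n_channels * samples_per_scan)).tobytes()
    timings = []
    with open(path, "wb") as f:
        for _ in range(n_scans):
            start = time.perf_counter()
            f.write(scan)
            if sync:
                # Optional: push data through the OS cache on every scan.
                f.flush()
                os.fsync(f.fileno())
            timings.append(time.perf_counter() - start)
    return timings
```

Plotting the returned list against the scan index gives the same kind of graph as in the PDFs: scan number on the x axis, time per scan on the y axis. With sync left off, the operating system's buffering hides the cost of the first writes, which is why a long run is needed before the numbers settle.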

There are several things we were looking for in a benchmark:

Overall time to completion (duh).
Number and duration of spikes in time consumption. Minor spikes are normal and will occur with any file format on pretty much any system. Larger spikes can be a killer for high-speed streaming.
Any dependency of performance on file size and/or file contents. This is where we eliminated most existing formats from our list. Performance often degrades linearly or even exponentially when meta data is added.
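The three criteria above can be turned into numbers from a list of per-scan write times. The Python sketch below is one possible way to do that, not the analysis used for the attached graphs. The spike threshold (a multiple of the median) and the function names are assumptions.

```python
from statistics import median

def find_spikes(timings, factor=5.0):
    """Return (index, seconds) for every scan slower than factor * median."""
    if not timings:
        return []
    threshold = factor * median(timings)
    return [(i, t) for i, t in enumerate(timings) if t > threshold]

def trend_slope(timings):
    """Least-squares slope of write time against scan index.

    A slope near zero means performance does not depend on how much
    has already been written. A clearly positive slope means writes
    get slower as the file grows.
    """
    n = len(timings)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(timings) / n
    num = sum((i - mean_x) * (t - mean_y) for i, t in enumerate(timings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def summarize(timings, factor=5.0):
    """Collect overall time, spike count and growth trend in one dict."""
    spikes = find_spikes(timings, factor)
    return {
        "total_s": sum(timings),
        "n_spikes": len(spikes),
        "worst_s": max(timings) if timings else 0.0,
        "slope_s_per_scan": trend_slope(timings),
    }
```

A straight-line fit only catches linear degradation. An exponential slowdown would still show up as a positive slope, but plotting the timings is the more reliable way to tell the two apart.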

Source data always was a 1d array of analog waveforms with several waveform attributes set. Formats under test were:

TDMS
TDM
LabVIEW Bytestream
LabVIEW Datalog (datalog type 1d array of wfm)
NI HWS (NI format for Modular Instruments, reuses HDF5 codebase)
HDF5
LVM (ASCII based, Excel-friendly)

Some benchmarks only include a subset of these formats. The ones that are missing didn't perform well enough to fit in our graphs. HDF5 was tested only in the "Triggered Measurements" use case, because with the HDF5-based NI HWS format we already had a benchmark "on the safe side". The reason TDM goes down in flames in some benchmarks is that it stores channels as contiguous pieces of data.

Mainstream DAQ

The first benchmark is a mainstream DAQ use case: acquire 100 channels with 1000 samples per scan and do that 1000 times in a row. Note the spikes when Datalog and HWS/HDF5 are updating their lookup trees. TDMS beats bytestream by a small margin because of more efficient processing of waveform attributes.

http://forums.lavag.org/index.php?act=attach&type=post&id=5882

Modular Instruments

Acquire 10 channels with 100000 values per scan. Here's where HWS/HDF5 still has TDMS beat. They do that by using asynchronous, unbuffered Windows File I/O. According to MS, that's the fastest way of writing to disc on Windows. We're working on that for TDMS. An interesting detail is the first value in the upper diagram. Note that HWS/HDF5 takes almost a second to initially create the file.

http://forums.lavag.org/index.php?act=attach&type=post&id=5883

Industrial Automation

Acquire single values from 1000 channels. These are LabVIEW 8.20 benchmarks. With the 8.2.1 NI_MinimumBufferSize feature TDMS should look better than that, but I haven't run this test yet. Note that HWS/HDF5 takes about 3 seconds where all 3 native LabVIEW formats stay below 100ms.

http://forums.lavag.org/index.php?act=attach&type=post&id=5884

Triggered Measurements

In this use case, every scan creates a new group with a new set of channels. This typically occurs in triggered measurements, or when you're storing FFTs or other analysis results that you cannot just append. We acquire 1000 values from 16 channels per scan for this use case. Of all things, I've lost the original data for the HDF5 test, so I need to attach 2 diagrams. The first one is the 8.20 benchmark without HDF5:

http://forums.lavag.org/index.php?act=attach&type=post&id=5885

The second one is an older benchmark that was done with a purely G-based prototype of TDMS (work title TDS). I attached it because it has the HDF5 data in it. The reason HWS is faster than the underlying HDF5 is that it stores only a limited set of properties.

http://forums.lavag.org/index.php?act=attach&type=post&id=5886
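For orientation, the four write use cases differ mainly in how much data one scan carries and how many channels it is spread over. The Python sketch below works that out. It assumes the sample counts given above are per channel and that each sample is an 8-byte float64. Neither is stated explicitly in the benchmark descriptions.

```python
# Assumption: float64 samples. The data type is not stated in the thread.
BYTES_PER_SAMPLE = 8

# (channels, samples per channel per scan), as read from the descriptions.
USE_CASES = {
    "Mainstream DAQ": (100, 1000),
    "Modular Instruments": (10, 100000),
    "Industrial Automation": (1000, 1),
    "Triggered Measurements": (16, 1000),
}

def scan_payload(channels, samples_per_channel,
                 bytes_per_sample=BYTES_PER_SAMPLE):
    """Return (values per scan, bytes per scan) for one use case."""
    values = channels * samples_per_channel
    return values, values * bytes_per_sample

for name, (ch, n) in USE_CASES.items():
    values, nbytes = scan_payload(ch, n)
    print(f"{name}: {values} values, {nbytes} bytes per scan")
```

Under these assumptions the Industrial Automation case moves only a few kilobytes per scan across 1000 channels, so it mostly measures per-channel overhead, while the Modular Instruments case moves megabytes per scan and mostly measures raw disc throughput.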

Reading

I also have a bunch of reading benchmarks, e.g. read all meta data from a file, read a whole channel from a file, read a whole scan from a file etc. These are less exciting to look at though, because I only have aggregate numbers on that.

We also recently conducted a benchmark on how fast DIAdem can load and display data from multi-gigabyte files, where TDMS was the overall fastest reading format.

Hope that helps,

Herbert

QUOTE(Tomi Maila @ May 16 2007, 12:39 AM)

I'd love to use TDMS but it doesn't suit our needs as it is today with only two hierarchy levels and lacking support for multidimensional arrays (3-15d) and scalars. Are you intending to extend the TDMS format to support these features?
Tomi

Yes, we are planning to add these features. The underlying infrastructure (TDMS.DLL) is already fully equipped to do that, and the file format already has placeholders for all necessary information in it. The reason we don't have these things yet is that TDMS is used for data exchange with DIAdem, where deep hierarchies and multi-dimensional arrays are not supported. So every time we add something like this, we need to coordinate with other groups that use TDMS (CVI, SignalExpress, DIAdem...) to make sure everybody has an acceptable way of handling whatever is in the file. We're working on that.

Herbert :headbang:

Gary Rubin · May 17, 2007

Does Labview Bytestream just refer to this? http://forums.lavag.org/index.php?act=attach&type=post&id=5887

Thanks,

Gary

Herbert · May 17, 2007

QUOTE(Gary Rubin @ May 16 2007, 10:53 AM)

Does Labview Bytestream just refer to this? http://forums.lavag.org/index.php?act=attach&type=post&id=5887

Yes.

Herbert

Tomi Maila · May 17, 2007

Herbert, I've a LabVIEW interface for HDF5 1.8 alpha. Would you like to share the benchmark code so I could run the benchmarks with the new version of HDF5? I think you must have used HDF5 1.4 or earlier.

Tomi

Herbert · May 17, 2007

QUOTE(Tomi Maila @ May 16 2007, 11:28 AM)

Herbert, I've a LabVIEW interface for HDF5 1.8 alpha. Would you like to share the benchmark code so I could run the benchmarks with the new version of HDF5? I think you must have used HDF5 1.4 or earlier.
Tomi

Tomi,

I used HDF5 version 1.6.4. The LabVIEW API for that was never released to the public. I also don't have that code in my benchmark tool any more.

You might need to rip some stuff out of the code, e.g. DAQmx or HWS, depending on what you have on your machine. Adding a format is rather simple. Just add it to the typedef for the pulldown list and add new cases to the open, write and close case structs.
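The benchmark tool itself is a LabVIEW VI, so the following is only a text-language analogy of the extension pattern just described: one enumeration for the format list, plus one branch per format in each of the open, write and close steps. Every name in it is hypothetical, and the two formats shown (raw binary and tab-separated text) are stand-ins, not the formats from the benchmarks.

```python
import os
import tempfile
from array import array
from enum import Enum

class Format(Enum):
    # Plays the role of the typedef behind the pulldown list.
    RAW_BINARY = "raw"
    ASCII = "ascii"

def open_file(fmt, path):
    # "Open" case structure: one branch per format.
    if fmt is Format.RAW_BINARY:
        return open(path, "wb")
    if fmt is Format.ASCII:
        return open(path, "w")
    raise ValueError(f"unsupported format: {fmt}")

def write_scan(fmt, handle, values):
    # "Write" case structure: one branch per format.
    if fmt is Format.RAW_BINARY:
        handle.write(array("d", values).tobytes())
    elif fmt is Format.ASCII:
        handle.write("\t".join(repr(v) for v in values) + "\n")
    else:
        raise ValueError(f"unsupported format: {fmt}")

def close_file(fmt, handle):
    # "Close" case structure. Both sample formats just close the file;
    # a format that needs a footer or index would branch on fmt here.
    handle.close()
```

Adding a third format then means adding one member to Format and one branch to each of the three functions, which mirrors adding an entry to the typedef and a case to each case structure.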

Some remarks:

Hope that helps,

Herbert

http://forums.lavag.org/index.php?act=attach&type=post&id=5888

Tomi Maila · May 17, 2007

Thanks Herbert!

Tomi

Herbert · May 17, 2007

QUOTE(Tomi Maila @ May 16 2007, 12:39 PM)

Thanks Herbert!
Tomi

No problem. Let me know how version 1.8 holds up...

Herbert

CommonSense · June 23, 2008

QUOTE (Tomi Maila @ May 16 2007, 10:28 AM)

Herbert, I've a LabVIEW interface for HDF5 1.8 alpha. Would you like to share the benchmark code so I could run the benchmarks with the new version of HDF5? I think you must have used HDF5 1.4 or earlier.
Tomi

Tomi,

I am starting to build a LabVIEW v8.5.1 interface for HDF5 1.8 and I was wondering if it is possible (with the LabVIEW VIs available from http://www.me.mtu.edu/researchAreas/isp/lvhdf5/ ) to have 1 writer and several (5 max) readers working on the same HDF5 file?

From reading the HDF5 literature it seems to be able to support this scenario, but I have noticed some caveats that the LabVIEW VIs may not support 100% of the functionality.

I would greatly appreciate any pointers/lessons learned from when you were building your interface.

Thank you,

Karl
