
SQLite Time Series Downsampling



Good morning,

Firstly, thanks James for creating the excellent SQLite and JSONText libraries. 

I've been working on a proof of concept for storing time series data in an SQLite database. As it stands, I've decided on a simple table with two columns: Unix time (as my primary key) and Data as a JSON string containing an unknown number of channels, e.g.:

[attached image: example table rows]

The schema is:

CREATE TABLE [Data_1s](
  [Time] PRIMARY KEY, 
  [Data]);

CREATE INDEX [idx_time] ON [Data_1s]([Time]);

Currently my example dataset is 50 channels at 1 Hz for one day. I'm aiming to test this with a year's worth of data.
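(For scale, a year at 1 Hz is 86,400 × 365 ≈ 31.5 million rows, assuming no gaps.)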

 

Point Values

I'm getting very good performance when extracting an individual time row (e.g. via a slider between the start and end time of the data set):

[attached image]

The current query I'm using is based on an older form of storing the time (as a decimal), so I search for a 1 s period:

SELECT Time, Data 
FROM Data_1s 
WHERE Time
BETWEEN 1717606845 AND 1717606846
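With integer Unix times, an exact match on the primary key should do the same job without the 1 s window, e.g.:

SELECT Time, Data
FROM Data_1s
WHERE Time = 1717606845;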

I then collect the results, extract the individual channels from the JSON data and pop them into a map:

[attached image]

This can of course be optimised, but given it is sub-5 ms, it is plenty quick enough for interacting with the data via an HMI.
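As an aside, the JSON splitting could probably also be pushed into the query itself with SQLite's json_each (a rough sketch, relying on the same JSON1 functions as json_extract):

-- Return (channel, value) pairs for a single timestamp, so the channels
-- arrive already split instead of as one JSON string.
SELECT j.key AS Channel, j.value AS Value
FROM Data_1s, json_each(Data_1s.Data) AS j
WHERE Time = 1717606845;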

 

Graph

Anyway, when it comes to extracting XY data to display on a graph, I use the following to, for example, extract only Channel 0:

SELECT Time, json_extract(data,'$."Channel 0"') AS Channel_0
FROM Data_1s 
WHERE Time BETWEEN 1717656137 AND 1718860565

In the above example I read 1892 elements from the database, and it takes ~19 ms, which is fast enough for the user to drag a slider around and change the time window.

However, if I go for my full example data window, e.g.:

SELECT Time, json_extract(data,'$."Channel 0"') AS Channel_0
FROM Data_1s 
WHERE Time BETWEEN 1717571628 AND 1718860565

It takes 852 ms to read 86,400 elements. If I go for all 50 channels, it increases to 8473 ms.

Now, with a graph width of around 2000 pixels, there isn't much point in loading all that data into LabVIEW, so I implemented an averaging down-sampling query based on an interval size:

WITH TimeGroups AS (
  SELECT
    (Time/60)*60 AS TimeGroup,
    json_extract(Data, '$."Channel 0"') AS Channel_0
  FROM Data_1s
  WHERE Time BETWEEN 1717571628 AND 1718860565
)
SELECT
  TimeGroup,
  AVG(Channel_0) AS Avg_Channel_0
FROM TimeGroups
GROUP BY TimeGroup;

This takes 1535 ms to run and returns 1441 elements, which is worse than reading the 86,400 elements and letting LabVIEW manage the down sampling.


The questions I broadly have are:

  • Am I totally off base with the schema and using JSON to store the (unknown) number of channels?
  • Should I be looking at a different way of down sampling?
  • How are large datasets like stock prices stored and distributed to display on the internet? Some must have very long time series data spanning over decades! 
  • How do you store and quickly extract datasets for display?

Although I feel my questions are pretty broad rather than code-specific, I can package it up and share it (after I strip out a bunch of bloat, as it is still very much a POC) if that would help.

Thanks

Peter

19 hours ago, Peter Grant said:

This takes 1535 ms to run and returns 1441 elements, which is worse than reading the 86,400 elements and letting LabVIEW manage the down sampling.

It probably selects all elements before it applies the filter. You can get more insights with the EXPLAIN query: EXPLAIN (sqlite.org)

Without the database it's difficult to verify the behavior myself.
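For example, prefixing the down-sampling query with EXPLAIN QUERY PLAN should show whether the Time index is used (a sketch; the exact output depends on the SQLite version):

EXPLAIN QUERY PLAN
WITH TimeGroups AS (
  SELECT (Time/60)*60 AS TimeGroup,
         json_extract(Data, '$."Channel 0"') AS Channel_0
  FROM Data_1s
  WHERE Time BETWEEN 1717571628 AND 1718860565
)
SELECT TimeGroup, AVG(Channel_0) AS Avg_Channel_0
FROM TimeGroups
GROUP BY TimeGroup;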

19 hours ago, Peter Grant said:

Am I totally off base with the schema and using JSON to store the (unknown) number of channels?

It may be more efficient to query channels from a table than from JSON, especially when the channel names are indexed. That way, SQLite can optimize the queries better.

Find attached an example for a database to store each data point individually.
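A rough sketch of such a schema, inferred from the query below (the attached file may differ; column types here are assumptions):

-- Sketch only; see the attached dbtest.db.sql for the actual definition.
CREATE TABLE TimeSeries (
  Id   INTEGER PRIMARY KEY,
  Time INTEGER                    -- Unix time of the sample
);
CREATE TABLE Channel (
  Id   INTEGER PRIMARY KEY,
  Name TEXT UNIQUE                -- e.g. 'Channel 0'
);
CREATE TABLE ChannelData (
  Id        INTEGER PRIMARY KEY,
  ChannelId INTEGER REFERENCES Channel(Id),
  Value     REAL
);
CREATE TABLE TimeSeriesChannelData (
  TimeSeriesId  INTEGER REFERENCES TimeSeries(Id),
  ChannelDataId INTEGER REFERENCES ChannelData(Id),
  PRIMARY KEY (TimeSeriesId, ChannelDataId)
);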

Here is a query that will give you all data points for all time stamps:

SELECT TimeSeries.Time, Channel.Name, ChannelData.Value FROM TimeSeries
INNER JOIN TimeSeriesChannelData ON TimeSeries.Id == TimeSeriesChannelData.TimeSeriesId
INNER JOIN ChannelData ON TimeSeriesChannelData.ChannelDataId == ChannelData.Id
INNER JOIN Channel ON ChannelData.ChannelId == Channel.Id

[attached image: query result]

You can also transpose the table to get channels as columns. Unfortunately, SQLite does not have a built-in function for this, so the names are hard-coded (not viable if channel names are dynamic):

SELECT
	TimeSeries.Time,
	MAX(CASE WHEN Channel.Name = 'Channel 0' THEN ChannelData.Value END) AS 'Channel 0',
	MAX(CASE WHEN Channel.Name = 'Channel 1' THEN ChannelData.Value END) AS 'Channel 1',
	MAX(CASE WHEN Channel.Name = 'Channel 2' THEN ChannelData.Value END) AS 'Channel 2'
	FROM TimeSeries
INNER JOIN TimeSeriesChannelData ON TimeSeries.Id == TimeSeriesChannelData.TimeSeriesId
INNER JOIN ChannelData ON TimeSeriesChannelData.ChannelDataId == ChannelData.Id
INNER JOIN Channel ON ChannelData.ChannelId == Channel.Id
GROUP BY TimeSeries.Time

[attached image: query result]

20 hours ago, Peter Grant said:

Should I be looking at a different way of down sampling?

If query performance is important, you could perform the down sampling in the producer instead of the consumer (down sample as new data arrives). In this case you trade storage size for query performance, whichever is more important to you.
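A minimal sketch of what that could look like, assuming a hypothetical Data_1m table kept next to Data_1s (single channel shown for brevity):

-- Hypothetical 1-minute table, filled by the producer as data arrives.
CREATE TABLE IF NOT EXISTS Data_1m (
  Time INTEGER PRIMARY KEY,        -- start of the 60 s bucket
  Avg_Channel_0 REAL
);

-- Run periodically; re-aggregates from the last (possibly incomplete)
-- bucket onwards and overwrites it.
INSERT OR REPLACE INTO Data_1m (Time, Avg_Channel_0)
SELECT (Time/60)*60,
       AVG(json_extract(Data, '$."Channel 0"'))
FROM Data_1s
WHERE Time >= (SELECT COALESCE(MAX(Time), 0) FROM Data_1m)
GROUP BY Time/60;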

20 hours ago, Peter Grant said:

How are large datasets like stock prices stored and distributed to display on the internet? Some must have very long time series data spanning over decades!

Probably in a database 🤣

Seriously, though, these kinds of data are stored and processed in large computing facilities that have enough computing power to serve data in a fraction of the time a normal computer would need. They probably also use different database systems than SQLite, some of which may be better suited to these kinds of queries. I have seen applications for large time series data on MongoDB, for example.

20 hours ago, Peter Grant said:

How do you store and quickly extract datasets for display?

As computing power is limited, it is all about "appearing as if it were very fast". As mentioned before, you can pre-process your data so that it is readily available. This, of course, requires additional storage space and only works if you know how the data is used. In your case, you could pre-process the data into chunks of 2000 data points for display on the graph, store them next to the raw data and have them readily available.

There may be ways to optimize your implementation, but there is no magic bullet that will make your computer magically compute large datasets in split seconds on demand (unless you have the necessary computing power, in which case the magic bullet is called "money").

Attachment: dbtest.db.sql


You want a table per channel.

If you want to decimate, then use something like (rowid %% %d == 0), where %d is the decimation number of points. The graph display will do bilinear averaging if there are more points than pixels it can show, so don't bother with that unless you want a specific type of post-analysis. Be aware of aliasing, though.
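Applied to the Data_1s table from the first post, that would look roughly like this, where 43 stands in for %d (about 86,400 rows spread over ~2,000 pixels):

SELECT Time, json_extract(Data, '$."Channel 0"') AS Channel_0
FROM Data_1s
WHERE Time BETWEEN 1717571628 AND 1718860565
  AND (rowid % 43 == 0);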

[attached image: code excerpt]

The above is a section of code from the following example. You are basically doing a variation of it. It selects a range and displays a Decimation number of points from that range, but the range selection is obtained by zooming on the graph rather than with a slider. The query update rate is approximately 100 ms and it doesn't change much even for a few million data points in the DB.

[attached image: the example]

It was a few versions ago, but I did do some benchmarking of SQLite. So to give you some idea of what affects performance:

[attached images: benchmark results]

 

 

Edited by ShaunR

A non-JSON option you could try is:

CREATE TABLE TestData (
  Channel,
  Time, 
  Data,  -- individual reading at Time for Channel
  PRIMARY KEY (Channel,Time)
  ) WITHOUT ROWID

This is every reading sorted by a Primary Key that is Channel+Time.  This makes looking up a specific channel in a specific Time Range fast.  

BTW, you don't need to make an index on a Primary Key; there is already an implicit index.

You would select using something like:

SELECT (Time/60)*60, Avg(Data) FROM TestData
   WHERE Channel = ?
   AND Time BETWEEN ? AND ?
   GROUP BY Time/60
On 6/7/2024 at 8:31 AM, LogMAN said:

If query performance is important, you could perform the down sampling in the producer instead of the consumer (down sample as new data arrives). In this case you trade storage size for query performance, whichever is more important to you.

This was what I was originally planning, but when I found out about down sampling in SQLite I wanted to see if I could avoid the duplicate data. I think a combination of down sampling and multiple tables will result in the best overall performance.  

On 6/7/2024 at 8:31 AM, LogMAN said:

There may be ways to optimize your implementation, but there is no magic bullet that will make your computer magically compute large datasets in split seconds on demand (unless you have the necessary computing power, in which case the magic bullet is called "money").

Shame getting the money can't be provided with another magic bullet... where did I put my magic gun? 


Thank you for the overall table schema. I was thinking about reducing its size by using a channel index/name table as well, so thanks for the heads-up on that one.

On 6/7/2024 at 9:26 AM, ShaunR said:

If you want to decimate, then use something like (rowid %% %d == 0), where %d is the decimation number of points. The graph display will do bilinear averaging if there are more points than pixels it can show, so don't bother with that unless you want a specific type of post-analysis. Be aware of aliasing, though.

I had considered this, but I was worried about losing any important points between those decimations. I should have included this in the original post, but the plan is to give the user a choice between average and min/max so any outliers can be viewed.
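For the min/max variant, something along these lines on the TestData layout drjdpowell posted should work (a sketch, bucketed to 60 s like the averaging query):

SELECT (Time/60)*60 AS Bucket,
       MIN(Data) AS Min_Data,
       MAX(Data) AS Max_Data
FROM TestData
WHERE Channel = ?
  AND Time BETWEEN ? AND ?
GROUP BY Time/60;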

Doing a table per channel is an interesting idea. Will keep that in mind.

Thank you for sharing the benchmarking. 

13 hours ago, drjdpowell said:

This is every reading sorted by a Primary Key that is Channel+Time.  This makes looking up a specific channel in a specific Time Range fast.

Great, this works with my first method where I wasn't storing in JSON (similar to what LogMAN suggested, just in a single table) and makes a huge difference. I didn't know a primary key could be like that.

This really helped, thanks!

