Jump to content

Newbie question regarding SQL queries


Recommended Posts

Posted

My question relates to retrieving decimated data from the database.

 

Given the case where I have 1M X and 1M Y steps (a total of 1000000M data points per channel) how do I efficiently get a decimated overview of the data?

 

I can produce the correct output data be using

 

Select X,Y,Z,Float1 from P

WHERE X=0

GROUP BY Y/1000

 

This will output only 1x1000 data instead of 1x1000000 datapoints, one dimension quickly decimated  Problem is that is iterates over all data (and this takes quite some time).  If I do a different query to retrieve only 1000 normal points, it executes in under 100ms.

 

I would like to use a CTE to do this, thinking that I could address directly the 1000 elements I am looking for.

WITH RECURSIVE  cnt(x) AS (     SELECT 0     UNION ALL     SELECT x+1000 FROM cnt      LIMIT 1000  )WHAT GOES HERE?;

So if I can create a cte with column x containing my Y Indices then how do I get from this to directly accessing

Float1 FROM P WHERE X=0 AND Y=cte.x

 

SELECT Float1 from P WHERE X IN (SELECT x FROM cnt) AND Y=0 AND Z=0

Using the "IN" statement is apparently quite inefficient (and seems to return wrong values).  Any ideas?

 

In addition, accessing even a small number of data points from within an 8GB SQL File (indexed!) is taking upwards ot 30 seconds to execute at the moment.  With 1.5GB files it seemed to be in the region of a few milliseconds.  Does SQLite have a problem with files above a certain size?

Posted (edited)

My question relates to retrieving decimated data from the database.

 

Given the case where I have 1M X and 1M Y steps (a total of 1000000M data points per channel) how do I efficiently get a decimated overview of the data?

 

I can produce the correct output data be using

 

Select X,Y,Z,Float1 from P

WHERE X=0

GROUP BY Y/1000

 

This will output only 1x1000 data instead of 1x1000000 datapoints, one dimension quickly decimated  Problem is that is iterates over all data (and this takes quite some time).  If I do a different query to retrieve only 1000 normal points, it executes in under 100ms.

 

I would like to use a CTE to do this, thinking that I could address directly the 1000 elements I am looking for.

WITH RECURSIVE  cnt(x) AS (     SELECT 0     UNION ALL     SELECT x+1000 FROM cnt      LIMIT 1000  )WHAT GOES HERE?;

So if I can create a cte with column x containing my Y Indices then how do I get from this to directly accessing

Float1 FROM P WHERE X=0 AND Y=cte.x

 

SELECT Float1 from P WHERE X IN (SELECT x FROM cnt) AND Y=0 AND Z=0

Using the "IN" statement is apparently quite inefficient (and seems to return wrong values).  Any ideas?

 

In addition, accessing even a small number of data points from within an 8GB SQL File (indexed!) is taking upwards ot 30 seconds to execute at the moment.  With 1.5GB files it seemed to be in the region of a few milliseconds.  Does SQLite have a problem with files above a certain size?

 

Is this SQLite only? Or a more general SQL question since performance varies markedly from platform to platform and some responses may not be valid for MSSQL, for example.

 

Group by is probably the wrong approach for decimation. You'd have to execute the Explain query to find out what the internal lookups are. I expect you are forcing one or more full table scans.

 

In the Data Logging Example I use the following expression to return decimated data. Decimating 1M datapoints is a few hundred ms on an SSD but I don't know what that file size is, off-hand. The "rowid" decimation is quite obvious and means that SQLite can skip entire sections without querying and testing the data.

 

WHERE (x between ? AND ?) AND (rowid % 500 == 0)

 

What does this approach yield with your dataset?

 

If X,Y,Z are coordinates, I would highly recommend using the R-TRee features of SQLite if you haven't tried it already. If your SQL queries are to find intersections and/or contained/overlapped polygons then the R-Tree module is far more performant that straight SQL.

Edited by ShaunR
Posted (edited)

At the moment, it's a SQLite specific question.  I'm amazingly inexperienced (in practical terms) when it comes to Databases so even pointing out something which might be completely obvious to others could help me no end.

 

Doesn't rowid % X not force a MOD calculation on each and every ROW?

 

PS It seems that rowid % 1000 is significantly more efficient that X % 1000 (Defined as Integer).  This although both have indexes.  Is this because rowid is "UNIQUE"?

Also, I need to correct my previous statement.  With multiple WHERE conditions, of course the MOD will only be executed on those records which are returned by the first WHERE condition.... so not EVERY row will be calculated.

Edited by shoneill
Posted (edited)

 Is this because rowid is "UNIQUE"?

 

Run Explain. (SQLite_Explain Example.vi)  on both queries. It will "explain" how it processes the query and what optimisations are performed (like if an index is used or not) and you can see what the differences are.

Edited by ShaunR
Posted

I have been looking into that and have seen that it has been using a - to put it politely - eccentric choice of index for the query.

 

Of course it may be my mistake that I created an Index on X, an Index on Y an Index on Z AND and Index on X,Y,Z.

 

Whenever I make a query like

SELECT Float1 FROM P WHERE (X BETWEEN 0 AND 50) AND (Y BETWEEN 0 and 50) AND Z=0;

it always takes the last index (Z here).  Using the "+" operator to ignore a given index ( AND +Z=0 will tell SQLITE NOT to use the Z index) I canf orce the "earlier" indices.

Only by actually DROPPING the X, Y and Z Indexes does SQLITE perform the query utilising the X,Y,Z Index (and is then really quite fast).

 

Is it a mistake to have a single table column in more than one Index?

Posted

Of course it may be my mistake that I created an Index on X, an Index on Y an Index on Z AND and Index on X,Y,Z.

 

Remember that indexes take up memory, and memory/disk-access can be bottlenecks.  If your table rows are just is X,Y,Z,Float then you may be losing out by indexing everything.  How many points can you store per GB?

 

I actually am working on a similar SQLite application at the moment.   Here is my “big†table, which stores spectrometry spectral points:

 

CREATE TABLE IF NOT EXISTS SpecPoints (  -- Individual points in the spectra

    Time INT,  -- Time in ms, Relative to Time_Offset Attribute (allows 4-byte Integers to store +-24days)

    WN INT,   -- Wavenumber, in units of 0.005 cm^-1, offset 40,000 cm^-1  (this makes it three bytes up to 125nm)

    Y INT,   -- y= 0.5-Transmission, units of 0.0000001 (allows -0.34 to 1.34 Transmission range to fit in 3 bytes)

    CONSTRAINT pk PRIMARY KEY (Time,WN) ON CONFLICT REPLACE

     ) WITHOUT ROWID;

 

I’m using a “WITHOUT ROWID†table to physically store the rows in order of Time and WN and I’m not creating any extra indexes.  I’m also using scaled integers to pack things tighter than 8-byte Floats (SQLite store integers in the smallest size of I8, I16, I24, I32, I48, or I64).   I’m doing averaging over Time bins and Wavenumber regions-of-interest, rather than decimation, so this may not apply to you, but note that a smaller database file makes everything faster.  

Posted (edited)

I have been looking into that and have seen that it has been using a - to put it politely - eccentric choice of index for the query.

 

Of course it may be my mistake that I created an Index on X, an Index on Y an Index on Z AND and Index on X,Y,Z.

 

Whenever I make a query like

SELECT Float1 FROM P WHERE (X BETWEEN 0 AND 50) AND (Y BETWEEN 0 and 50) AND Z=0;

it always takes the last index (Z here).  Using the "+" operator to ignore a given index ( AND +Z=0 will tell SQLITE NOT to use the Z index) I canf orce the "earlier" indices.

Only by actually DROPPING the X, Y and Z Indexes does SQLITE perform the query utilising the X,Y,Z Index (and is then really quite fast).

 

Is it a mistake to have a single table column in more than one Index?

 

Standard Indexes on individual columns isn't very useful. In its simplest form you can consider an index to be an alias to a group so creating an index for each column is a little pointless. An index on XY and/or XZ makes more sense.

 

Creating a partial index on single columns is very useful, though, since you are pre-mapping the alias to a subset of the rows of the table. Making an index on Z=0 would probably be useful in your case.

I’m using a “WITHOUT ROWID†table to physically store the rows in order of Time and WN

 

Can you explain that? I thought "WITHOUT ROWID" was purely an optimisation.

Edited by ShaunR
Posted (edited)

This seems to be working moderatly successfully:

 

WITH RECURSIVE CnT(a) AS
  (
     SELECT 400000
     UNION ALL
     SELECT a+200 FROM CnT
     LIMIT 1000
  )
  SELECT X,Y,Z,Float1 FROM P JOIN CnT ON CnT.a=P.X WHERE Y=0;

 

It gives me back 1000 decimated values from a 12.5GB file in just under 400msec (Int his example from X Index 400k to 600k).  I might be able to work with that.

Edited by shoneill
Posted

Can you explain that? I thought "WITHOUT ROWID" was purely an optimisation.

It is, but it’s a big one in this case.   Compare it to a regular table which is:

rowID, Time, WN, Y

Along with an index on (Time,WN) which is a second table, equal number of rows, of

Time, WN, rowID.

 

The index table is a significant fraction of the primary table in size.  The Without RowID table eliminates the need for this index.

Posted (edited)

It is, but it’s a big one in this case.   Compare it to a regular table which is:

rowID, Time, WN, Y

Along with an index on (Time,WN) which is a second table, equal number of rows, of

Time, WN, rowID.

 

The index table is a significant fraction of the primary table in size.  The Without RowID table eliminates the need for this index.

 

Right. Files size, space saving, I get (Uin64 per row). But you don't have to add an index and the bit I'm really not getting is how WITHOUT ROWID enables you to store "in order of Time and WN". Insertion order is the same with or without a rowID, is it not?

Edited by ShaunR
Posted

But you don't have to add an index and the bit I'm really not getting is how WITHOUT ROWID enables you to store "in order of Time and WN". Insertion order is the same with or without a rowID, is it not?

The point of a WITHOUT ROWID table is when you want a primary key that is non-integer or composite, so the alternative is a regular rowID table with an index.  Re the “order†stuff I was being sloppy; rather, I mean the primary data B-tree uses Time,WN as it’s index (rather than rowID), so one can directly search on Time,WN without needing an separate index.  

  • Like 1
Posted (edited)

The point of a WITHOUT ROWID table is when you want a primary key that is non-integer or composite, so the alternative is a regular rowID table with an index.  Re the “order†stuff I was being sloppy; rather, I mean the primary data B-tree uses Time,WN as it’s index (rather than rowID), so one can directly search on Time,WN without needing an separate index.  

 

Ahh. I get it. Yes that would be a useful optimisation for this scenario. The Time,WN might not be unique but if that's not an issue I can see it simplifies things greatly. It's taking advantage of the hash table lookup under the hood.. I can think of a few more uses for that too. I wonder what the performance difference is between that and a value table lookup like, say, LevelDB.

Edited by ShaunR
Posted (edited)

This seems to be working moderatly successfully:

 

WITH RECURSIVE CnT(a) AS

  (

     SELECT 400000

     UNION ALL

     SELECT a+200 FROM CnT

     LIMIT 1000

  )

  SELECT X,Y,Z,Float1 FROM P JOIN CnT ON CnT.a=P.X WHERE Y=0;

 

It gives me back 1000 decimated values from a 12.5GB file in just under 400msec (Int his example from X Index 400k to 600k).  I might be able to work with that.

 

To continue on from this, this method of getting 1D decimated data seems to be relatively efficient.  I am unaware of the internal specifics of the database operations but wouldn't a JOIN be more expensice than actually being able to directly SELECT the items I require from within the CTE and just return them?  Problem there is that it would normally use a "UNION" but this doesn't work in a CTE.

 

Is there any way to access directly the P.Float1 values I want (Once I have the index, I can directly access the item I require) instead of going the "JOIN" route, or is my understanding of the cost of a JOIN completely incorrect?

Remember that indexes take up memory, and memory/disk-access can be bottlenecks.  If your table rows are just is X,Y,Z,Float then you may be losing out by indexing everything.  How many points can you store per GB?

 

I actually am working on a similar SQLite application at the moment.   Here is my “big†table, which stores spectrometry spectral points:

 

CREATE TABLE IF NOT EXISTS SpecPoints (  -- Individual points in the spectra

    Time INT,  -- Time in ms, Relative to Time_Offset Attribute (allows 4-byte Integers to store +-24days)

    WN INT,   -- Wavenumber, in units of 0.005 cm^-1, offset 40,000 cm^-1  (this makes it three bytes up to 125nm)

    Y INT,   -- y= 0.5-Transmission, units of 0.0000001 (allows -0.34 to 1.34 Transmission range to fit in 3 bytes)

    CONSTRAINT pk PRIMARY KEY (Time,WN) ON CONFLICT REPLACE

     ) WITHOUT ROWID;

 

I’m using a “WITHOUT ROWID†table to physically store the rows in order of Time and WN and I’m not creating any extra indexes.  I’m also using scaled integers to pack things tighter than 8-byte Floats (SQLite store integers in the smallest size of I8, I16, I24, I32, I48, or I64).   I’m doing averaging over Time bins and Wavenumber regions-of-interest, rather than decimation, so this may not apply to you, but note that a smaller database file makes everything faster.  

That's interesting.

 

Are there any INSERT performance consequences of this?  Is having a UNIQUE index more efficient for an INSERT than a non-unique one?

 

PS Yeah, having a compund PRIMARY KEY significantly slows down INSERTS by a factor of 7-8 for my testing.  This rules it out for me unfortunately, shame really because that would have been cool.  INSERTING identical data packets (100k at a time) which represent 5 seconds of data takes 1.2 seconds with default index and starts off OK but ends up at 8 seconds for a compound index of X,Y and Z.  For slower data rates it would have been cool.

Edited by shoneill
Posted (edited)

To continue on from this, this method of getting 1D decimated data seems to be relatively efficient.  I am unaware of the internal specifics of the database operations but wouldn't a JOIN be more expensice than actually being able to directly SELECT the items I require from within the CTE and just return them?  Problem there is that it would normally use a "UNION" but this doesn't work in a CTE.

 

Is there any way to access directly the P.Float1 values I want (Once I have the index, I can directly access the item I require) instead of going the "JOIN" route, or is my understanding of the cost of a JOIN completely incorrect?

UNION and JOIN are two different things (JOIN is an alias for "LEFT JOIN" - you can have other types). A JOIN maps columns from one table to another for indirection. A UNION just appends data. The union is used in the WITH RECURSIVE so as to create an ordered queue which bestows the tree walking behaviour- it's a fortuitous slight of hand.

 

That's interesting.

 

Are there any INSERT performance consequences of this?  Is having a UNIQUE index more efficient for an INSERT than a non-unique one?

 

PS Yeah, having a compund PRIMARY KEY significantly slows down INSERTS by a factor of 7-8 for my testing.  This rules it out for me unfortunately, shame really because that would have been cool.  INSERTING identical data packets (100k at a time) which represent 5 seconds of data takes 1.2 seconds with default index and starts off OK but ends up at 8 seconds for a compound index of X,Y and Z.  For slower data rates it would have been cool.

How many columns? Benchmarking 100K x 3 columns (xyz) runs at about 250ms using my benchmark. Are you saving to a huge single table as if it were a spreadsheet? I get that INSERT rate (1.2 secs) at about 25 columns.

Edited by ShaunR
Posted

My INSERTS are with 3x Index (X,Y,Z) and 24 REAL values.  Initial INSERTS take about 1s, but rises rapidly with more data to 7-8 seconds.

 

Regarding the JOIN vs UNION..... I was simply pondering if it wouldn't somehow be possible to do something like:

 

WITH RECURSIVE CnT(a,Float) AS
  (
     SELECT 400000,0
     UNION ALL
     SELECT a+200 FROM CnT UNION SELECT Float1 FROM P WHERE X=CnT.a AND Y=0;
     LIMIT 1000
  )
  SELECT Float1 FROM CnT;

 

Unfortunately, the second UNION is not allowed.....

Posted

My INSERTS are with 3x Index (X,Y,Z) and 24 REAL values.  Initial INSERTS take about 1s, but rises rapidly with more data to 7-8 seconds.

There is something not quite right there. The file size should make no difference to the INSERT performance.

 

This is inserting 100K records with 28 columns. Inserting 244 times increases the file size from 0 to 12GB.(I just disabled the drop, create and select in the Speed Example). There is jitter due to other things happening but it is not increasing as the file grows.

 

post-15232-0-11290000-1452556384.png

Posted

There is something not quite right there. The file size should make no difference to the INSERT performance.

 

This is inserting 100K records with 28 columns. Inserting 244 times increases the file size from 0 to 12GB.(I just disabled the drop, create and select in the Speed Example). There is jitter due to other things happening but it is not increasing as the file grows.

 

attachicon.gifUntitled.png

My remarks on degrading performance relate purely to the suggested replacement for the PRIMARY KEY by drjdpowell.  I see similar write performance (fast, constant) when using a normal "rowid" primary key.  As soon as I introduce my X,Y,Z Primary key as suggested ealier in the thread, I see significantly worse performance to the point that it rules out SQLite for the required function.  I really need 20k INSERTS per second.  At the moment I'm resigned to writing with standard rowid Index and then adding the X,Y,Z afterwards for future read access.

 

 

Standard "rowid" Primary KeyIndex

post-3076-0-38318900-1452587686.png

 

Compund (X,Y,Z) Primary Key Index

post-3076-0-46730200-1452588133.png

  • Like 1
Posted

My remarks on degrading performance relate purely to the suggested replacement for the PRIMARY KEY by drjdpowell.  I see similar write performance (fast, constant) when using a normal "rowid" primary key.  As soon as I introduce my X,Y,Z Primary key as suggested ealier in the thread, I see significantly worse performance to the point that it rules out SQLite for the required function.  I really need 20k INSERTS per second.  At the moment I'm resigned to writing with standard rowid Index and then adding the X,Y,Z afterwards for future read access.

 

 

Standard "rowid" Primary KeyIndex

attachicon.gifSQLite write speed (100k).png

 

Compund (X,Y,Z) Primary Key Index

attachicon.gifSQLite write speed (100k) Compound Primary.png

 

That's interesting but not surprising. I might add some more benchmarks around this to those for rows and bulk inserts. It would be a useful metric to see what the performance overhead is for varying Indexes.

 

20K/sec bulk INSERT is easily achievable. I'm not sure if you missed a zero off of that but 20K x 27 cols is about 100ms for me.

Posted (edited)

 

20K/sec bulk INSERT is easily achievable. I'm not sure if you missed a zero off of that but 20K x 27 cols is about 100ms for me.

No, no zeros missing, just very large transactions.  Each datapoint actually represents 100k data points x 27 cols in one big transaction.

 

Overall throughput must be 20k per second or more.  I should have been more specific about that, sorry.

 

I'm currently running into all kinds of weird benchmarking results due to OS buffering of files.  One one hand it's good that Win7 uses as much RAM as possible to buffer, on the other hand, it makes benchmarking really difficult...... 1000 datapoints in 100ms from a magnetic HDD is VARY fast

Edited by shoneill
Posted (edited)

No, no zeros missing, just very large transactions.  Each datapoint actually represents 100k data points x 27 cols in one big transaction.

 

Overall throughput must be 20k per second or more.  I should have been more specific about that, sorry.

 

I'm currently running into all kinds of weird benchmarking results due to OS buffering of files.  One one hand it's good that Win7 uses as much RAM as possible to buffer, on the other hand, it makes benchmarking really difficult...... 1000 datapoints in 100ms from a magnetic HDD is VARY fast

If you are running in SYNC=FULL (the default) then SQLite is using Write-through and Windows buffering is bypassed since it breaks ACID. This makes a big difference on mechanical drives-not so much on SSDs.You can tweak more performance by not writing a journal (JOURNAL=OFF) and setting SYNC=OFF at the expense of catastrophic failure integrity.

Edited by ShaunR
Posted

PRAGMA synchronous=OFF;
PRAGMA count_changes=OFF;
PRAGMA journal_mode=WAL;
PRAGMA temp_store=MEMORY;
PRAGMA cache_size=100000;

 

That's my initialise step before creating tables.

Posted (edited)

You have all the easy bases covered (you are aware that not all of those are sticky and have to be used when opening, not just when tables are created?).

 

At this point I usually look for a more suitable alternative for the use case. TDMS would be far superior for streaming the data but I expect you have hit a brick wall there too with the decimation, making the idea of doing it in a query attractive by its sheer simplicity.

 

If SQLite is the "closest" of all the considered options then you would have to really dive in and get your hands dirty. I'm pretty sure you are already close enough that you could probably get there eventually but it's a whole domains worth of knowledge in and of itself. If we ignore physical constraints like disks, then  there is a whole raft of low level configurations of how SQLite operates so it would be a case of investigating forcing manual optimisation of query plans memory-mapped IO or even writing your own "decimation" function or extension to name just a couple.

 

Have you tried the SQLite mailing lists?

Edited by ShaunR
Posted

No not yet.  I wanted to get some idea of the workings of SQLite before going that route.  As it turns out, we may not be going SQLite after all.  Higher powers and all.  :throwpc:

 

Regarding the memory mapped I/O.... We will need concurrent access to the file (Reading while writing) and if I have understood your link properly, this may not work with mmap.  Additionally, I'm currently investigating how the Windows 7 own "standby" memory is massively influencing my results.  Caching of files accessed (currently 12GB on my machine) is making all of my disk reads WAY faster than they should be.  I mean, it's a cool feature, but I'm having trouble creating a "worst case" scenario where the data is not in memory.....  Well, I can benchmark ONCE but after that, bam Cached.

Posted

No not yet.  I wanted to get some idea of the workings of SQLite before going that route.  As it turns out, we may not be going SQLite after all.  Higher powers and all.  :throwpc:

 

Regarding the memory mapped I/O.... We will need concurrent access to the file (Reading while writing) and if I have understood your link properly, this may not work with mmap.  Additionally, I'm currently investigating how the Windows 7 own "standby" memory is massively influencing my results.  Caching of files accessed (currently 12GB on my machine) is making all of my disk reads WAY faster than they should be.  I mean, it's a cool feature, but I'm having trouble creating a "worst case" scenario where the data is not in memory.....  Well, I can benchmark ONCE but after that, bam Cached.

 

I'm not sure what bit you read that said memory mapped IO removes concurrency. That doesn't seem right.

 

Turn PRAGMA synchronous=FULL and turn off "Write Cache" in device manager for the worst case but the behaviour you describe sounds more like LabVIEW memory allocation than OS disk caching. I haven't seen this specific behaviour but it was a few years ago that I used SQLite with anything other than a SSD or USB memory stick.

 

Anyway. It's all academic if the meeting-monkeys deign another course.

Posted

I'm not sure what bit you read that said memory mapped IO removes concurrency. That doesn't seem right.

 

Turn PRAGMA synchronous=FULL and turn off "Write Cache" in device manager for the worst case but the behaviour you describe sounds more like LabVIEW memory allocation than OS disk caching. I haven't seen this specific behaviour but it was a few years ago that I used SQLite with anything other than a SSD or USB memory stick.

 

Anyway. It's all academic if the meeting-monkeys deign another course.

 

It was this part, although I'm learning a lot about the internals of Win7 the last few days.

The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.

 

 I have since also found the tool RamMap from Sysinternals which allows a user to delete file caches from Windows so that we can effectively do "worst-case" file access tests without having to restart the PC.  This has given me some very useful data )no more influence from file buffering).  If you open up the "performance monitor" in Windows it shows "Standby" memory.  This is memory which is technically NOT used but in which Win7 caches file reads and writes so that if the act is repeated, it can do so from RAM instead of having to go to the disk.  IT's basically a huge write and read cache for the HDD.  Mine was 12GB last time I looked!

 

In one sense, this allows 32-bit LV to "access" more than 4GB of memory (just not all at once).  Simply write to a file and re-read piece for piece.  It's still not as fast as actual memory access, but it's a whole lot faster than actual disk access.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Unfortunately, your content contains terms that we do not allow. Please edit your content to remove the highlighted words below.
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.