Advice please: best way of interrogating large dataset?

I want to adopt more accurate tidal heights by accessing a database of a year’s forecasts from the UKHO: a CSV file of around 8 MB with tidal heights at 5-minute intervals for one year. I’ve got the choice of either executing a SQL query on this file every 5 minutes, or pre-processing it into JSON format (which makes it about 100 MB, although this could be split into 12 monthly datasets) and then using that to define a huge array which I can filter to get the current period.
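
For what it’s worth, this is roughly the pre-processing I had in mind for the SQL route: load the CSV into SQLite once so that the every-5-minutes query is cheap. It’s only a sketch; the file, table and column names are placeholders, and I’m assuming the CSV has no header row and uses DD/MM/YYYY timestamps.

```python
# Sketch only: one-off load of the UKHO CSV into a SQLite file that can then be
# queried every 5 minutes. File/table/column names are placeholders.
import csv
import sqlite3
from datetime import datetime

def build_db(csv_path="tides.csv", db_path="tides.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tides (datetime TEXT PRIMARY KEY, height_mm INTEGER)"
    )
    with open(csv_path, newline="") as f:
        for dt_str, height in csv.reader(f):  # assumes two columns, no header
            # store timestamps in ISO format so they sort and compare correctly
            iso = datetime.strptime(dt_str, "%d/%m/%Y %H:%M").isoformat(sep=" ")
            conn.execute("INSERT OR REPLACE INTO tides VALUES (?, ?)", (iso, int(height)))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    build_db()
```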

Can anyone offer any guidance on the best way and any problems I need to be aware of?

I’m currently accessing the UKHO API, which only gives the times & heights of high & low tides; I interpolate between these assuming the tidal curve is a simple sine wave - which it really isn’t in our estuary. The UKHO use the Admiralty Harmonic Method for forecasting, which is as good as it gets.
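
For clarity, this is essentially the interpolation I’m doing at the moment: treat each rise or fall between consecutive turning points as half a cosine wave. A simplified sketch with my own variable names:

```python
# Simplified sketch of my current method: between one high/low water turning
# point (t_prev, h_prev) and the next (t_next, h_next), assume the height
# follows half a cosine cycle.
import math
from datetime import datetime

def sine_interp(t: datetime, t_prev: datetime, h_prev: float,
                t_next: datetime, h_next: float) -> float:
    frac = (t - t_prev).total_seconds() / (t_next - t_prev).total_seconds()
    # (1 - cos) / 2 runs smoothly from 0 to 1 over the half cycle
    return h_prev + (h_next - h_prev) * (1 - math.cos(math.pi * frac)) / 2
```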
My HA is running on a virtual machine on a PC, currently using about 20% of the 4 GB of RAM allocated, and the processor load is normally light (<10%). I’m only using around 3% of the allocated HD storage (which is huge, about 1 TB I think).
The CSV file is a very simple list of datetime/height pairs, and the processed JSON file looks like the sample below (from some old data). I imagine I’d !include it into some sensors, but I haven’t worked out any details yet.

[
  {
    "Datetime": "01/01/2002 00:00",
    "Height": "6684"
  },
  {
    "Datetime": "01/01/2002 00:10",
    "Height": "6639"
  },
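
If I went the JSON route, I imagine the lookup would be something like this rough sketch: load a month’s array once and pick the entry nearest the current time. The file name is a placeholder; the field names match the sample above.

```python
# Sketch only: load one month of the pre-processed JSON and look up the
# sample at (or just after) a given time.
import json
from bisect import bisect_left
from datetime import datetime

with open("tides_2002_01.json") as f:      # placeholder file name
    data = json.load(f)

times = [datetime.strptime(d["Datetime"], "%d/%m/%Y %H:%M") for d in data]

def height_at(when: datetime) -> int:
    i = min(bisect_left(times, when), len(times) - 1)
    return int(data[i]["Height"])

print(height_at(datetime(2002, 1, 1, 0, 7)))   # nearest following sample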

And in case you are interested, these are some of the things I’m doing with the data:

Hi
I don’t have suggestions for what to do with a large data set in HA, but a couple of alternatives from the data analysis point of view:

  1. Do you really need the 5-minute resolution? I would presume that at that kind of timescale the height is a pretty linear function of time. Two sides to this – (a) do you really need to look up the tide at that interval and (b) what precision do you really need?

You could maybe keep the real data at just a 1-hour frequency and interpolate (linear, cubic spline, whatever) in between? (Something like the sketch after this list.)

  2. You could fit some sort of analytic function to e.g. each day’s data and just have the parameters of the function in a lookup table for each day. You could try some genetic algorithm function fitters if you can’t guess reasonable function classes.
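
Something along these lines, as a very rough sketch; the hourly heights here are just a synthetic tide-like curve standing in for your real data:

```python
# Sketch: keep one real sample per hour and fill in between with a cubic spline.
import numpy as np
from scipy.interpolate import CubicSpline

hourly_times = np.arange(0, 25 * 3600, 3600)        # 24 hours, one point per hour (seconds)
hourly_heights = 4000 + 2500 * np.sin(2 * np.pi * hourly_times / (12.42 * 3600))  # fake data

spline = CubicSpline(hourly_times, hourly_heights)

fine_times = np.arange(0, 24 * 3600, 300)           # evaluate at 5-minute resolution
fine_heights = spline(fine_times)
```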

Thanks for your thoughtful input. I could use cubic spline interpolation on a less frequent dataset, but my experiments so far have not been promising: I tried 30-minute data and didn’t get close enough. The trouble is launching the RNLI lifeboat over a shallow spot at spring low tides in order to get out to sea. Currently the station uses a dedicated programme that uses the proper Admiralty method, but it has proved impossible to get the data out of it and into my rather useful interface. The UKHO do provide an API to the fine-level data, but at an unaffordable price. Their CSV file has exactly the same data at only 10% of that price. Hence my plan…

But then can’t you just get the exact threshold times for each day as your data points? The water is deep enough between time_a and time_b, plus a lower-res approximation of the actual tide height for the whole day? But maybe you need a different depth for different boats, in which case maybe this isn’t sustainable.
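
As a rough sketch of what I mean (the threshold and the data layout are made up):

```python
# Sketch: find the windows during which the predicted height is at or above
# a launch threshold. `series` is a list of (datetime, height_mm) pairs.
def launch_windows(series, threshold_mm=5500):
    windows, start = [], None
    for t, h in series:
        if h >= threshold_mm and start is None:
            start = t                       # water has just become deep enough
        elif h < threshold_mm and start is not None:
            windows.append((start, t))      # window has just closed
            start = None
    if start is not None:
        windows.append((start, series[-1][0]))
    return windows
```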

Yes, that’s one of the things I’d do - but that still necessitates interrogating the large dataset in one of the two formats. If I can do this for the restrictions, I feel I might as well do it for all the heights, which would then allow me to carry out another pet idea. I have in mind building up knowledge of how the wind speed & direction affect heights in this estuary. “Everyone knows” it does, in a relatively straightforward and significant way, but nobody that I know has quantified it. If I can pull off this trick, then we start to get really accurate knowledge of the no-launching times.
I’m really impressed with the thoughtful suggestions here, but I’m still hoping someone can tell me what the limits are on file sizes or performance, or whatever else is going to trip me up!

By the way – I assumed that you are aware of this, but maybe not – there is the SQL integration, which can query pretty much any database. I have about 50 or 60 SQL sensors which query the 2.5 GB HA database every minute, and this doesn’t seem to put any serious load on the HA system running on an Odroid ARM-based machine. It does have an SSD though.
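
As far as I know the integration can also point at a separate database via db_url, so for your tide table the query could be something along these lines. I’ve written it here as a quick Python/sqlite3 test just to show the shape; the table and column names simply follow the earlier SQLite sketch.

```python
# Sketch: "latest height at or before now" query against an assumed tides.db.
import sqlite3

conn = sqlite3.connect("tides.db")
row = conn.execute(
    "SELECT height_mm FROM tides "
    "WHERE datetime <= datetime('now', 'localtime') "
    "ORDER BY datetime DESC LIMIT 1"
).fetchone()
print(row[0] if row else None)
```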

Thanks, just what I needed to hear. I’ll learn how to make SQL sensors then.