Faster GTFS schedule lookups

Because SQLite’s write performance isn’t great, the initial database build is rather slow, and on slow storage the schedule queries themselves are pretty dismal too. Here’s a workaround to alleviate the long SQLite database build process and speed up the schedule queries:

Background:

Converting my local transit company’s GTFS files takes around 10 minutes on HASS.io running on my Raspberry Pi 3 with its Samsung EVO+ SD card, and then the schedule queries freeze access to the card every time they run. Argh!

Doing the conversion using a Docker instance of HA running on a MacBook Pro (2015) with its SSD still takes around 4 minutes, yet the schedule queries take about 5 seconds from the command line. Bummer.

The component’s schedule query is pretty beefy with 5 JOINs on large tables:

SELECT trip.trip_id,
       trip.route_id,
       time(origin_stop_time.departure_time),
       time(destination_stop_time.arrival_time),
       time(origin_stop_time.arrival_time),
       time(origin_stop_time.departure_time),
       origin_stop_time.drop_off_type,
       origin_stop_time.pickup_type,
       origin_stop_time.shape_dist_traveled,
       origin_stop_time.stop_headsign,
       origin_stop_time.stop_sequence,
       time(destination_stop_time.arrival_time),
       time(destination_stop_time.departure_time),
       destination_stop_time.drop_off_type,
       destination_stop_time.pickup_type,
       destination_stop_time.shape_dist_traveled,
       destination_stop_time.stop_headsign,
       destination_stop_time.stop_sequence
FROM trips trip
INNER JOIN calendar calendar
           ON trip.service_id = calendar.service_id
INNER JOIN stop_times origin_stop_time
           ON trip.trip_id = origin_stop_time.trip_id
INNER JOIN stops start_station
           ON origin_stop_time.stop_id = start_station.stop_id
INNER JOIN stop_times destination_stop_time
           ON trip.trip_id = destination_stop_time.trip_id
INNER JOIN stops end_station
           ON destination_stop_time.stop_id = end_station.stop_id
WHERE calendar.tuesday = 1
      AND time(origin_stop_time.departure_time) > time('14:00:00')
      AND start_station.stop_id = 3333
      AND end_station.stop_id = 5555
      AND origin_stop_time.stop_sequence < destination_stop_time.stop_sequence
      AND calendar.start_date <= '2018-01-02'
      AND calendar.end_date >= '2018-01-02'
ORDER BY origin_stop_time.departure_time
LIMIT 1;

Initial solution:

Adding the following missing indexes on the large stop_times table didn’t speed things up much, since the query wraps the departure_time column in the time() function, which prevents SQLite from using an index on that column:

CREATE INDEX idx_departure_time ON stop_times(departure_time);
CREATE INDEX idx_stop_sequence ON stop_times(stop_sequence);
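
To see what the planner actually does, EXPLAIN QUERY PLAN can be prefixed to any statement; a minimal sketch against stop_times (exact output wording varies between SQLite versions):

EXPLAIN QUERY PLAN
SELECT trip_id
FROM stop_times
WHERE time(departure_time) > time('14:00:00');
-- Reports a full table scan of stop_times: wrapping departure_time
-- in time() keeps idx_departure_time from being used.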

Better solution:

To speed things up, I still build the database on my laptop with its much faster storage, then trim the (~90 MB) database down to only the stops I care about, shrinking it to a much smaller footprint (~10 MB) before moving it to the Raspberry Pi where my production instance of HASS.io runs. Alternatively, these steps could also be done directly on the Raspberry Pi’s SD card… if you like watching grass grow.

In SQLite (the first .backup keeps an untouched copy, the last one writes out the trimmed file):

.backup gtfs-backup.sqlite
DELETE FROM stop_times WHERE stop_id NOT IN (3333, 5555); -- keep only wanted stops
DELETE FROM stops WHERE stop_id NOT IN (3333, 5555);
VACUUM; -- reclaim the freed pages and shrink the file
.backup gtfs-small.sqlite
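
Before copying the trimmed file over, a quick sanity check (assuming the same two stop IDs as above) confirms that only the intended rows remain:

SELECT COUNT(*) FROM stops;       -- should return 2
SELECT COUNT(*) FROM stop_times;  -- should be far smaller than before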

At the command prompt:

cp ./gtfs-small.sqlite /Volumes/config/gtfs/NAME_OF_DATA_SOURCE.sqlite

Then restart HA. The above query then runs almost instantly with no freezing of the SD card. :slight_smile:

I intended to do the same thing after the GTFS component started failing to initialize because it was taking too much time… There’s definitely huge room for improvement in the GTFS component itself. Thanks for the information regarding the indexes, though; that would have been my first attempt too :smiley:

After a few more tests, I realized that I had not tried all possible missing indexes. Adding these helped reduce the initial lookup time substantially:

CREATE INDEX idx_trips_service_id ON trips(service_id);
CREATE INDEX idx_stop_times_stop_id ON stop_times(stop_id);
CREATE INDEX idx_stop_times_trip_id ON stop_times(trip_id);
CREATE INDEX idx_stop_times_stop_sequence ON stop_times(stop_sequence);
CREATE INDEX idx_stop_times_departure_time ON stop_times(departure_time);

Still, reducing the data set to only the stops needed provides a further speed-up and a smaller database file to handle. I’ve noted all my findings in my repo here.
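
As a general SQLite tip that may also help here: once the indexes exist, running ANALYZE gathers table and index statistics that the query planner consults when choosing between them.

ANALYZE;
-- Stores statistics in the internal sqlite_stat1 table so the
-- planner can pick the most selective index for each join.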

Thanks to the improved speed, I went ahead and slightly modified the code for the GTFS component to add a few key features:

  • Added support for the next few departures.
  • Simplified query, making it cacheable in order to increase I/O performance.
  • Allow querying departures that happen after midnight (see the sketch below).
  • Correctly set the arrival date to tomorrow instead of earlier today when a bus leaves before midnight but arrives after… no time-travel buses allowed! :stuck_out_tongue:
  • Return null instead of zero when a departure is not found/available.
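
For context on the after-midnight items: GTFS encodes trips that run past midnight with hour values of 24 and beyond (e.g. 25:15:00). Assuming zero-padded times, as most feeds use, a plain string comparison orders them correctly, which also lets the departure_time index be used; a rough sketch, not the component’s actual query:

SELECT trip_id, departure_time
FROM stop_times
WHERE departure_time > '23:00:00'  -- string comparison, no time() wrapper
ORDER BY departure_time
LIMIT 5;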

Hello renemarc, thank you for your GitHub repo. You have been a great inspiration for me; thank you again.

I’m very interested in your GTFS modifications, and I would like to do as you did and reduce my database size.

Could you please explain how to connect to SQLite? You said it’s done from a command line, but unfortunately I can’t find it.

Thanks

I have managed to trim the database size with sqlite3 on the PC using your code.

Thanks.

Hey @Romquenin, glad you found my repo useful! I like what you did with your French adaptation. :slight_smile:

For reference, using a command-line client on Windows, Mac or Linux, a database can be opened using something like sqlite3 databasefile.sqlite. Once you’re in, here’s a handy cheat sheet of useful SQLite-specific commands.
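
For instance, a minimal session: .tables lists the available tables, .schema shows a table’s definition, and .quit exits (the file name is just a placeholder):

sqlite3 NAME_OF_DATA_SOURCE.sqlite
.tables
.schema stop_times
.quit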

You’ve reminded me that I should be posting my GTFS sensor changes to the official Home Assistant code and documentation repos so that we can all benefit!

Instead of just pushing my hack-ish changes, I’ve decided to try to improve this sensor’s performance. I’m gradually updating this component and submitting PRs to Home Assistant, with the following tasks in mind:

Would anyone using this GTFS sensor like to review the above PRs?


Hi renemarc,

I am trying to get this working for the bus lines in Canberra, Australia, but I am getting some odd results. I added your component as custom_components/gtfs/sensor.py.

I don’t have a separate environment, so I just copied the gtfs.zip file provided by ACTION into the \gtfs directory of Home Assistant (I am using HASS.io) and did a restart after adding the sensors.

I get a whole lot of errors on start-up about how it can’t create the tables because they already exist, but when I go into the SQLite file that is created, the tables are there but they are all empty.

Any ideas?

Hi @simonhq,

I tried ACTION’s GTFS data set (also TransitFeeds’) and the SQLite database does get created mostly properly, but that takes about 10 minutes, give or take, so just delete any prematurely aborted SQLite database and try again.

However, the routes table is missing data in the agency_id column. This happens with my local agency as well. The manual fix is to open the SQLite database and run the following query; it will fill in the gaps:

UPDATE routes
SET agency_id = (
   SELECT agency_id
   FROM agency
   WHERE feed_id = 1
);
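
To verify the fix, counting the affected rows before and after should show the count drop to zero (this assumes the missing values are stored as NULL; adjust the check if your feed leaves empty strings instead):

SELECT COUNT(*) FROM routes WHERE agency_id IS NULL;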

Do tell me if that fixed it for you, to make sure that I’m not introducing bugs into the code base! :smiley:

I’ll be submitting a code change that outputs more debug info to the log file during this initial build, and that maybe auto-runs the above query if needed (or allows custom post-creation queries).

Hi @renemarc

Thanks for the help, I did get it working. What I did was remove your custom_component and let the database build with the ‘normal’ component, then I followed your suggested steps to make the database more efficient (remove stops and fill in the agency_id), and then I put your custom_component back to get the added functionality.

It is working well, though I have found that some of your cool customisations don’t work with Lovelace; maybe a custom card could be the answer. Happy to collaborate if that is on your list of things to do.

Thanks again
Simon

Hey @renemarc

My area just redid their bus routes and released a new file, so I went through and updated the new database like the old one, as per your instructions, and it is good, small and fast, but…

I kept getting the error in the logs about the position keys, so I ended up removing your custom_component. Now I am only getting the next bus coming (not any of the future ones), and for some reason it is using UTC to determine the next bus, so I am getting information from 10 hours in the past.

so my questions are:

a) Do I have to do something special with the custom_component for it to not error? I saw you have been putting PRs in; is the custom_component still necessary?

b) Do I have to add a setting for it to use the time_zone in configuration.yaml?

btw: I am using HASS.io.

Thanks


Is there a way to use that card without AppDaemon? Is it possible to make it in Lovelace? Right now I just have the countdowns, but I really like the look and information in your card and would really like to implement it.