Alternative to Scrape - Docker-Mqtt?

Hardware:

  • RPi 4 - 4 GB RAM
  • SSD drive

Core:

  • 2022.5.5

Well, I’m hoping someone has had a similar problem and found a solution.

I’ve been using HA’s Scrape to send a notification to my kid’s phone whenever some manga that he follows has an update.

For instance:

  - platform: scrape
    name: Yofukashi no Uta Last Chapter
    resource: https://readmangabat.com/read-km388705
    select: .story-info-right-extent > p:nth-child(4)

The problem is that I noticed all those Scrape sensors were affecting my HA performance.
Even with only 13 of them, the Integration Startup Time for Scrape is still 331 s. It even reached a staggering 700 s before I started deleting the ones he wasn’t reading that much.

So I thought of a workaround. Since I have a Synology NAS on 24/7, could it be an alternative to run Scrapy-Splash in Docker there and feed the data to HA as an MQTT sensor?
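For reference, the HA side of that idea would just be an MQTT sensor listening on a topic; the Scrapy job on the NAS would publish the scraped chapter text to it. A sketch, using the platform-style MQTT sensor config that Core 2022.5 expects (the topic name is a made-up example):

```yaml
sensor:
  - platform: mqtt
    name: "Yofukashi no Uta Last Chapter"
    # Topic name is hypothetical — use whatever the Scrapy job publishes to.
    state_topic: "manga/yofukashi/last_chapter"
```

Before wiring up Scrapy, the sensor could be tested from the NAS with something like `mosquitto_pub -t manga/yofukashi/last_chapter -m "Chapter 117"`.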

Would that even be possible?

I admit I haven’t had much luck with Scrapy beyond the shell commands… :unamused:

Any help or idea would be great!

My first thought would be to reduce the refresh interval, since in my view this kind of information only needs checking so often (once a day?). However, I could not find out how to control it. The other performance factor is the page(s) you are scraping: if they are big, each sensor will still take a while.


So before trying a new platform for scraping, I’d personally look for literally any other option besides scraping the website. A few things come to mind, for example:

  1. Is there an RSS feed anywhere you could connect to feedreader? I didn’t see one at the site in that scrape sensor, but maybe it exists elsewhere on that website, or is maintained by someone else?
  2. Is there a subreddit for the manga by any chance? You could use the Reddit integration to get data from it into HA and watch for posts with specific terms.
  3. Is there an app people use to get notified about updates to their manga? I’m not familiar with the scene, but I would be pretty surprised if there wasn’t a go-to tool for this. And if so, does HA integrate with it?
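If a feed does turn up, the feedreader side is tiny. A sketch, with a placeholder feed URL (no real feed was found at the scraped site):

```yaml
feedreader:
  urls:
    # Hypothetical feed URL — substitute whatever RSS/Atom feed you find.
    - https://example.com/yofukashi-no-uta/feed.xml
  scan_interval:
    hours: 12
```

New entries then fire the `feedreader` event, which an automation can listen for to send the notification.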

Also, maybe get creative with a REST sensor? For example, I notice that at the site in the scrape sensor, the URL for every chapter looks like this:
https://readmangabat.com/read-km388705-chap-117
Everything is the same except the number at the end. So for this particular site, I might do something like this:

rest:
  - scan_interval: 3600
    # "| int" strips the ".0" that input_number states carry, so the URL is well-formed
    resource_template: "https://readmangabat.com/read-km388705-chap-{{ states('input_number.manga_chapter') | int }}"
    sensor:
      - name: New manga chapter
        value_template: "OK"

And then an automation like this:

trigger:
  platform: state
  entity_id: sensor.new_manga_chapter
  not_to: ['unknown', 'unavailable']
action:
  - service: notify.kids_phone
    data:
      message: There's a new chapter!
  - service: input_number.increment
    data:
      entity_id: input_number.manga_chapter

And then, obviously, create the input number helper named “manga chapter”.
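For completeness, if that helper is created in YAML rather than through the UI, it could look like this (the max is an arbitrary guess):

```yaml
input_number:
  manga_chapter:
    name: Manga chapter
    min: 1
    max: 10000
    step: 1
```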

I mean, it’s a little gross, since this REST sensor will return an error maybe 99% of the time: it’s basically polling an endpoint it knows will eventually exist, until it does. Once it does, it notifies and then begins polling the next endpoint (so back to an error). But it should definitely be more performant than scraping.

This will generate a ton of log errors though. To handle that you have two options:

  1. Set log level to critical or fatal for homeassistant.components.rest in logger to drop those
  2. Use a command line sensor with curl instead. That way you can use curl’s exit code to set the sensor state properly, instead of having it either error out or work.
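A sketch of option 2, assuming the site returns a non-200 status for chapters that don’t exist yet (the chapter number is hardcoded here for illustration; the same template trick as the REST sensor could be used in the command):

```yaml
sensor:
  - platform: command_line
    name: New manga chapter
    scan_interval: 3600
    # curl -f exits non-zero on HTTP errors such as 404, so the sensor state
    # becomes whichever string the shell prints.
    command: >
      curl -sf -o /dev/null
      "https://readmangabat.com/read-km388705-chap-117"
      && echo OK || echo pending
```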

This is all specific to this one site, though; I don’t know what other sites you’re looking at. Perhaps a similar pattern could be found for them too?


It is indeed… so what you basically need is to know whether a page exists.
As another idea, you could run a command line, e.g.

curl -s "https://readmangabat.com/read-km388705-chap-117" | grep something

and use the output to determine whether it is a new page or not?
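Expanding on that: rather than grepping the body, curl can report just the HTTP status code. A small sketch, assuming the site answers 200 for published chapters and an error status otherwise (the network call is shown commented out so the mapping logic stands alone):

```shell
#!/bin/sh
# Map an HTTP status code to a sensor-friendly state string.
chapter_state() {
  if [ "$1" = "200" ]; then
    echo "available"
  else
    echo "pending"
  fi
}

# Fetch only the status code, discarding the page body (network call):
# status=$(curl -o /dev/null -s -w '%{http_code}' \
#     "https://readmangabat.com/read-km388705-chap-117")
# chapter_state "$status"

chapter_state 200   # prints "available"
chapter_state 404   # prints "pending"
```

The printed string could then feed a command line sensor directly, since that sensor’s state is simply the command’s stdout.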

Yeah, the scan interval was my first thought too. It seems to run every 30 s, and that’s certainly too often. Once a day would be more than enough.

The automations I have only look at state changes, so limiting the notification to a certain period would work quite nicely. Alas, Scrape’s documentation does not mention a scan_interval option, and I’m not sure how I could even check its polling activity…
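One thing that might be worth trying anyway: `scan_interval` is a generic option accepted by most YAML sensor platforms even when an integration’s docs don’t list it, so it may well work on Scrape directly (86400 s = once a day):

```yaml
sensor:
  - platform: scrape
    scan_interval: 86400
    name: Yofukashi no Uta Last Chapter
    resource: https://readmangabat.com/read-km388705
    select: .story-info-right-extent > p:nth-child(4)
```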

Those seem like quite some nice ideas!

I’ve never used RSS, but it could well be a pragmatic solution.
The same goes for the REST sensor.

Will try some experiments with both.