Scrape dynamic socket.io data with Python

In my other post here: Multiscrape/scrape help - #5 by tanderson1992, I’ve been trying to restore my previously-working multiscrape setup for a radio station after the station changed names and redesigned the website. The new site name is www.lifeomaha.com. After messing with multiscrape for 2 days and even installing Python/PyCharm to learn how to scrape with BeautifulSoup, I finally realized the scraper will never work because the content is generated dynamically.

Today I checked every link in the Console until I determined the site is getting its artist/song/image data via socket.io. This is well outside my knowledge level, but I managed to cobble together a Python script to retrieve the data I need. It’s not pretty, but it works:

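from websocket import create_connection

# Connect straight to the socket.io endpoint over WebSocket (Engine.IO protocol v3)
ws = create_connection("wss://myfaithmedia.org/socket.io/?EIO=3&transport=websocket")
# Emit the "site" event for station KGBI
ws.send('42["site","KGBI"]')
try:
    # Print every frame until the connection drops or the script is interrupted
    while True:
        result = ws.recv()
        print(result)
finally:
    ws.close()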

The output looks like this:

[user]\PycharmProjects\Websocket\.venv\Scripts\python.exe [user]\PycharmProjects\Websocket\main.py 
0{"sid":"zRSKxjUQU2RqW8OWABFL","upgrades":[],"pingInterval":25000,"pingTimeout":60000}
40
42["connected",{"date":"2023-12-23 20:58:30","name":"Go Tell It On The Mountain","artist":"Sarah Reeves","image":"https://www.lifeomaha.com/wp-content/themes/gravity-global-ktis/assets/images/placeholders/gravatar-life.png","type":"SONG"}]

The script will keep running and occasionally receive a song change like this:

42["song_change",{"date":"2023-12-23 20:24:14","name":"Hark!  The Herald Angels Sing","artist":"Building 429","image":"https://www.lifeomaha.com/wp-content/themes/gravity-global-ktis/assets/images/placeholders/gravatar-life.png","type":"SONG"}]

Eventually it times out, probably because it’s not sending a “2” ping every 25 seconds (the pingInterval in the handshake) like the browser does. I can’t just receive once and quit, because the first message is only the “0” handshake and the connection would close before the song data arrives. The while True loop keeps the connection open, but for longer than I need.
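One way around both problems is to read frames only until the first one that starts with 42 and then close. The sketch below assumes the frame format shown in the output above; `parse_event` and `first_event` are hypothetical helper names, not part of websocket-client.

```python
import json


def parse_event(frame):
    """Return (event_name, payload) for a socket.io event frame, else None.

    Event frames start with "42" followed by a JSON array such as
    ["connected", {...}]. Other frames ("0{...}", "40", "3") are skipped.
    """
    if not frame.startswith("42"):
        return None
    data = json.loads(frame[2:])
    event = data[0]
    payload = data[1] if len(data) > 1 else None
    return event, payload


def first_event(ws, max_frames=10):
    """Read from a websocket-like object until an event frame arrives."""
    for _ in range(max_frames):
        parsed = parse_event(ws.recv())
        if parsed is not None:
            return parsed
    raise RuntimeError("no event frame received")
```

With websocket-client, `ws` would be the object returned by `create_connection(url, timeout=10)`, so a stalled server raises instead of hanging. A long-running listener would additionally need to send the “2” ping frame every pingInterval milliseconds (25000 in the handshake above).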

Now that I know I can reliably get the data I want, how can I use this to pass the Artist, Song and Image to HA entities so that I can display it on the dashboard? I was hoping one of the json elements in the console would have the data, because there are enough REST examples here that I think I could get it to work, but that doesn’t seem to be the case. I don’t know where to go from here with the socket.io data I’ve received. Any ideas?

There could be other options, but I think you either need to write your own integration or run this separate script of yours and push the sensor values via the REST API.

Thank you, that is where I landed as well after much searching here and the internet in general. The working version of my script is below.

For anyone interested, I couldn’t figure out how to get my payload directly into HA, because the leading 42 and the ["connected", prefix caused errors. I feel like I might have gone the long way, but I trimmed the payload to remove the problematic characters, obtained each value I needed as plain text, then manually turned each one back into JSON. I then posted each one via the REST API. After running it a few times and watching it immediately change the HA values to the current song playing, I removed my multiscrape integration from configuration.yaml, restarted HA, then moved the script to my RPi.

After installing the necessary items (pip, websocket-client, requests) on the Pi, I made a cron job that looks like this:

* * * * * kgbi.py
* * * * * sleep 30; kgbi.py

It’s been running about 6 hours now, and as far as I can tell the script is working well. I’d like to build in some error handling for the inevitable time the websocket fails (which it did once during my testing with a 502), but for now here’s my working script:
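For the error handling, one option is a small retry wrapper around whatever function does the fetch. This is only a sketch: `with_retries` is a hypothetical helper, and it assumes a failed connection (such as the 502) surfaces as an ordinary Python exception.

```python
import time


def with_retries(fetch, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call fetch() up to `attempts` times (attempts >= 1 assumed).

    Sleeps `delay` seconds between failed tries and re-raises the last
    error if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except exceptions as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Since cron starts the script again every 30 seconds anyway, a couple of quick retries is probably enough; the main benefit is that a single bad response no longer ends the run with a traceback.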

import json

from requests import post
from websocket import create_connection

ws = create_connection("wss://myfaithmedia.org/socket.io/?EIO=3&transport=websocket")
ws.send('42["site","KGBI"]')
# Instead of "while True": receive 3 frames and stop, so result ends up holding the line with the song info
for i in range(3):
    result = ws.recv()
# print(result)
ws.close()
# print(result)

# result = ''.join(result.split())
newresult = result[15:-1]
json_object = json.loads(newresult)
# print(json_object)

artist = json_object["artist"]
songtitle = json_object["name"]
imgurl = json_object["image"]
# print(artist + " - " + songtitle)
# print('url: ' + imgurl)

# Send code to HomeAssistant
url1 = "http://localhost:8123/api/states/sensor.life_100_artist"
url2 = "http://localhost:8123/api/states/sensor.life_100_song"
url3 = "http://localhost:8123/api/states/sensor.life_100_icon"
token = "[token] (obtained from my username menu in HA)"
headers = {
  "Authorization": "Bearer " + token,
  "content-type": "application/json",
}

artiststate = "{\"state\": \"" + artist + "\"}"
songtitlestate = "{\"state\": \"" + songtitle + "\"}"
imgurlstate = "{\"state\": \"" + imgurl + "\"}"
response1 = post(url1, headers=headers, data=artiststate)
response2 = post(url2, headers=headers, data=songtitlestate)
response3 = post(url3, headers=headers, data=imgurlstate)

I left in some of the intermediate steps I used while figuring this out. Many thanks to those here and across the web who actually know what they’re doing and gave me some working code to modify.

I’m interested in any suggestions to improve the script, and I’m intrigued about learning to write an integration. I’ll be looking at that next.
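One possible improvement, offered as a sketch: the hand-built JSON strings break if a title or artist contains a double quote or a backslash. Letting `json.dumps` build the body avoids that. `state_body` is a hypothetical helper name.

```python
import json


def state_body(value):
    """Build the JSON body for a Home Assistant state update.

    json.dumps escapes quotes and backslashes that would corrupt a
    string assembled by concatenation.
    """
    return json.dumps({"state": value})


# Example with a title that would break the concatenated version
body = state_body('Hark! "The Herald" Angels Sing')
```

The result can be passed as `data=body` in the existing `post(...)` calls. Alternatively, requests accepts `json={"state": artist}` in place of `data=` and does the serialization itself.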

I just found another error I need to capture: the link for the image was correct but gave a null result. In HA I get a broken image icon. I need to learn how to detect this and replace it with their own placeholder image that’s used when the station doesn’t have cover art.
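A rough sketch of one way to detect this, assuming a “null result” means the URL does not return a non-empty image. The placeholder URL is the one from the output earlier in the thread, and treating it as the station’s default is an assumption; `image_is_usable` and `choose_image` are hypothetical names.

```python
import urllib.request

# Assumption: this is the placeholder the station uses when there is no cover art
PLACEHOLDER = (
    "https://www.lifeomaha.com/wp-content/themes/gravity-global-ktis/"
    "assets/images/placeholders/gravatar-life.png"
)


def image_is_usable(url, timeout=5):
    """True if the URL answers 200 with an image content type and a non-empty body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            is_image = resp.headers.get_content_type().startswith("image/")
            return resp.status == 200 and is_image and len(resp.read(1)) > 0
    except (OSError, ValueError):
        # Covers HTTP errors, network failures, timeouts and malformed URLs
        return False


def choose_image(url, is_usable=image_is_usable, placeholder=PLACEHOLDER):
    """Return url if it points at a usable image, otherwise the placeholder."""
    return url if url and is_usable(url) else placeholder
```

In the script this would become `imgurl = choose_image(json_object["image"])` before building the state for the icon sensor.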