Extracting data from emails (text, PDF) - template repository or equivalent?

I was checking whether HA has a way to get data from emails, and it does: the IMAP integration (IMAP - Home Assistant).

I set up the seventeentrack integration yesterday, and the updates do indeed seem to be very delayed, because 17track does not refresh the status very often.

Using the information from the emails themselves would therefore be more direct.

However, every tracking service has its own email templates, status wording, etc., and quite a few sensors would be required to cover all of them.

It would be better to have some kind of integration/component that maps mails to events or sensors based on a shared configuration file and a local configuration file. That way the shared configuration file could be updated over time by the community, which makes reuse easier.
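To make the idea a bit more concrete, here is a minimal sketch of how a shared template file and a local override file could be merged and then used to map a mail to a status. Everything in it is an assumption for illustration: the carrier name `examplepost`, the template keys (`sender`, `tracking`, `statuses`), and the helpers `merge_templates` and `match_email` are hypothetical and not part of any existing HA integration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical shared templates, as a community-maintained file might define them.
SHARED_TEMPLATES = {
    "examplepost": {
        "sender": r"noreply@examplepost\.test$",
        "tracking": r"Tracking number:\s*(?P<tracking>[A-Z0-9]+)",
        "statuses": {
            "delivered": r"has been delivered",
            "in_transit": r"is on its way",
        },
    },
}

# Hypothetical local overrides, e.g. for wording the shared file does not know yet.
LOCAL_OVERRIDES = {
    "examplepost": {"statuses": {"in_transit": r"is on its way|left the depot"}},
}


def merge_templates(shared, local):
    """Merge per-carrier templates; local values win over shared ones."""
    merged = {}
    for name in shared.keys() | local.keys():
        base = shared.get(name, {})
        over = local.get(name, {})
        entry = {**base, **over}
        # Status patterns are merged key by key instead of replaced wholesale.
        entry["statuses"] = {**base.get("statuses", {}), **over.get("statuses", {})}
        merged[name] = entry
    return merged


@dataclass
class ParcelEvent:
    carrier: str
    tracking: Optional[str]
    status: Optional[str]


def match_email(sender, body, templates):
    """Return a ParcelEvent for the first carrier whose sender pattern matches."""
    for carrier, tpl in templates.items():
        sender_pattern = tpl.get("sender")
        if not sender_pattern or not re.search(sender_pattern, sender, re.IGNORECASE):
            continue
        tracking_pattern = tpl.get("tracking")
        found = re.search(tracking_pattern, body) if tracking_pattern else None
        tracking = found.group("tracking") if found else None
        # The first status whose pattern occurs in the body is reported.
        status = next(
            (name for name, pattern in tpl["statuses"].items()
             if re.search(pattern, body, re.IGNORECASE)),
            None,
        )
        return ParcelEvent(carrier, tracking, status)
    return None
```

In a real component the two dictionaries would presumably be loaded from YAML files, and the resulting `ParcelEvent` would be turned into an HA event or sensor state; that part is left out here.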

The first one is interesting because it shares a method to get info from PDFs attached to emails.
That makes me think about PDFs that are not attached to the email directly but have to be downloaded from an account. Possibly that could also be covered by the evolved component/implementation, but it would be more complex: it requires logging in on the target platform and then either following the link in the email or going to the appropriate page to get the PDF.
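For the simpler case, where the PDF is attached to the mail itself, the attachment can be pulled out with Python's standard `email` package. This is only a sketch under the assumption that the raw message bytes are already available (for example from an IMAP fetch); the helper name `pdf_attachments` is made up for illustration.

```python
from email import message_from_bytes, policy


def pdf_attachments(raw_email: bytes):
    """Return (filename, content) pairs for every PDF attached to a raw email."""
    # policy.default gives an EmailMessage, which provides iter_attachments().
    msg = message_from_bytes(raw_email, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            # get_content() returns the decoded bytes for application/* parts.
            found.append((part.get_filename() or "attachment.pdf", part.get_content()))
    return found
```

The returned bytes could then be handed to a PDF library; with PyMuPDF, for instance, `fitz.open(stream=data, filetype="pdf")` opens a document from memory. PDFs that sit behind a login are not covered by this sketch.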

Is anyone aware of an existing effort along these lines (e.g. a HACS-compatible component)?

Need to extract email addresses from a PDF? Here’s a straightforward method using Python. With the PyMuPDF library, you can easily read the text of a PDF, while regular expressions (regex) help identify email addresses in that text.

Here’s a simple script to get you started:

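import re

import fitz  # PyMuPDF (pip install pymupdf)

# Simple pattern for common addresses; it does not cover every valid form.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def extract_emails(pdf_path):
    """Return the set of email addresses found in the text of a PDF."""
    emails = set()
    # The context manager closes the document even if reading a page fails.
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            # get_text() returns the plain text of the page
            emails.update(EMAIL_PATTERN.findall(page.get_text()))
    return emails

# Usage
pdf_path = 'example.pdf'
print(extract_emails(pdf_path))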

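This script extracts email addresses from each page of a PDF and collects them in a set, so duplicates are dropped automatically. It only sees text that PyMuPDF can extract, so scanned (image-only) PDFs would need OCR first. Handy for quickly gathering contact information from documents.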