Parsing PDF email attachments

:wave: Hi folks, just wanted to share a little holiday project: 2 new components that together enable the extraction of data from PDFs attached to emails.

Here’s a screenshot of some data I’m pulling out of my utility bill:

The PDF component is brand new. It uses PyPDF2 to parse PDF files; I chose it because the other popular PDF libraries require gcc or musl-dev to build, and those are not available in the Alpine Docker image. PyPDF2 exports all text from a given page as a single string, so I added a custom regex_search option to allow for quick data extraction.

The IMAP attachment component is a simple extension of the official IMAP Email Content component, with the ability to store attachments in a local directory.

Here’s an example config of these components working together. In this example an email from [email protected] has a PDF attachment with a filename of utility-bill.pdf, and the PDF has a line of text like this:

Water Consumption Charge     15  x  $  2.2159             33.24 --------------Balance

Here’s the config:

# Example configuration.yaml entry
homeassistant:
  allowlist_external_dirs:
    - /config/attachments

sensor:
  - platform: imap_attachment
    name: Utility Bill Email
    server: imap.gmail.com
    port: 993
    username: [email protected]
    password: hunter2
    senders:
      - [email protected]

  - platform: pdf
    name: Water Usage Cost
    file_path: /config/attachments/utility-bill.pdf
    unit_of_measurement: $
    regex_search: 'Water Consumption Charge\s+([\d.]+)\s+x\s+\$\s+([\d.]+)\s+([\d.]+)\s-+'
    regex_match_index: 3
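
To sanity-check the pattern outside Home Assistant, here is a minimal Python sketch that runs the same regex against the sample line from above. It assumes regex_match_index selects a capture group (so 3 means the third parenthesised group); the extract_group helper is purely illustrative and not part of the component.

```python
import re

# The sample line of text from the PDF, as quoted earlier in this post.
SAMPLE_LINE = (
    "Water Consumption Charge     15  x  $  2.2159"
    "             33.24 --------------Balance"
)

# Same pattern as the regex_search value in the config.
PATTERN = r"Water Consumption Charge\s+([\d.]+)\s+x\s+\$\s+([\d.]+)\s+([\d.]+)\s-+"


def extract_group(text, pattern, index):
    """Return capture group `index` of the first match, or None if no match.

    Hypothetical helper: assumes regex_match_index maps to a capture group.
    """
    match = re.search(pattern, text)
    return match.group(index) if match else None


units = extract_group(SAMPLE_LINE, PATTERN, 1)  # quantity billed
rate = extract_group(SAMPLE_LINE, PATTERN, 2)   # price per unit
total = extract_group(SAMPLE_LINE, PATTERN, 3)  # line total

print(units, rate, total)
```

Under that assumption, group 3 is the line total, which is the value the Water Usage Cost sensor would report.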

That’s about it. For those that are familiar with writing components, is there anything in the code that sticks out that could or should be fixed or changed? Something I’d like to do is dynamically link storage_path to file_path so that the PDF sensor could use a value like this:

file_path: states.sensor.imap_gmail_attachments.attributes['attachment_paths'][0]

… not quite sure how to use state in configuration.yaml though, or if it’s even possible. Feedback welcome!

Thanks for reading, happy holidays! :christmas_tree:

WOW! That’s very interesting.

This is really cool. I have a use case where our kids' schedules and homework are published in a PDF attached to an email. The file name would not be the same every time, as the week number is included in the original file name. That, however, can be solved some other way. My question is: how do I install the PyPDF2 component? I found the package and unpacked it to my custom_components folder. So do I just add it to my configuration.yaml to get it going?

@BitCyco if you have also installed https://github.com/emcniece/ha_pdf/blob/d439142da493d10b97788941b7c2473164ded306/manifest.json then HA will automatically pip-install any packages defined in the "requirements: []" list on reboot.

Hi @emcniece, this looks awesome! I’m trying to use it to fetch image attachments from an email address (to use for a “camera”). However, when I run a configuration validation after setting up, it just spins without ever finishing. I did get the standard IMAP (non-attachment) sensor working fine. Any ideas?

Sure - do your logs say anything related to this plugin? If you can use the same IMAP details with the actual IMAP sensor then it shouldn’t be an authentication problem.

Nothing I can see in my logs about it… but happy to be directed to other logs (I was just checking the log page under Configuration).

yaml is as follows (currently commented out):

#sensor for Reolink camera email
sensor:
  - platform: imap
    server: imap.***.com
    port: 993
    username: ***.***@***.com
    password: ***
    search: FROM <***@***.com>
#  - platform: imap_attachment
#    name: 'Reolink email images'
#    server: imap.***.com
#    port: 993
#    username: ***.***@***.com
#    password: ***

I do get the standard custom integration warning in my log, but that’s all:

WARNING (MainThread) [homeassistant.loader] You are using a custom integration for imap_attachment which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant.

Hm… perhaps the integration needs to handle a few more cases. Docs are at https://github.com/emcniece/ha_imap_attachment - have you included the storage_path: parameter for the sensor, and does it match the path you provided in the homeassistant: allowlist_external_dirs: config?

I’ve opened the first issue to help track this report: https://github.com/emcniece/ha_imap_attachment/issues/1

Your project looks very nice.
So, should the PDF file be downloaded manually to a specific directory?
I tried the standard IMAP email content sensor, but didn’t get any results (besides the unread_mail sensor).

You can use the PDF file sensor on its own if you wish, in which case you must give it a specific file to open.

The default IMAP email sensor does not download attachments. If you use https://github.com/emcniece/ha_imap_attachment the email attachments will automatically download to the /config/attachments directory by default, if Home Assistant has been granted write access. If the attachments are always PDF files with the same name, then the PDF file sensor will automatically read new values each time the file is overwritten.

I see that my 3 main utility suppliers use PDF attachments named after the invoice number, so each file name is new.
I created 3 PDF sensors with different names, but I guess I need to make 3 subdirectories and target them in my config file so that each supplier populates its own directory. Will the PDF sensor then read the most recent PDF file in that directory?

/config/attachments/supplier1
/config/attachments/supplier2
/config/attachments/supplier3
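
For the "most recent file" part, here is a minimal Python sketch that picks the newest PDF in a directory by modification time. The newest_pdf helper and the invoice file names are hypothetical. Nothing in this thread says the PDF sensor does this on its own, so treat it as a standalone helper (for example, to copy the newest invoice to a fixed file name that file_path points at).

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def newest_pdf(directory: str) -> Optional[Path]:
    """Return the most recently modified .pdf file in `directory`, or None."""
    pdfs = [p for p in Path(directory).glob("*.pdf") if p.is_file()]
    return max(pdfs, key=lambda p: p.stat().st_mtime, default=None)


# Demo with throwaway files standing in for /config/attachments/supplier1.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in (("invoice-1001.pdf", 1_000_000), ("invoice-1002.pdf", 2_000_000)):
        path = Path(tmp) / name
        path.write_bytes(b"%PDF-1.4\n")
        os.utime(path, (mtime, mtime))  # force a known modification time
    latest = newest_pdf(tmp)
    latest_name = latest.name if latest else None

print(latest_name)
```

Sorting by modification time rather than by file name avoids depending on how each supplier numbers its invoices.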

And how can I tell the regex to search for the value on the line below the label rather than on the same line? I attached what the important lines from my invoice look like.
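
On the value-below-the-label question, a minimal sketch of one approach: in Python regexes `\s` also matches newlines, so `\s+` between the label and the capture group can span a line break. This assumes the component uses Python's re module and that the extracted text keeps the label directly before the value; the invoice text below is made up for illustration.

```python
import re

# Hypothetical extracted text: the label is on one line, the value on the next.
INVOICE_TEXT = "Invoice no. 12345\nAmount due\n    87.50\nDue date 01.04.2021\n"

# \s+ matches spaces AND newlines, so it bridges the line break after the label.
PATTERN = r"Amount due\s+([\d.]+)"

match = re.search(PATTERN, INVOICE_TEXT)
amount = match.group(1) if match else None

print(amount)
```

If the PDF extraction puts other text between the label and the value, the pattern would need to skip that text explicitly.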

Thank you

Later edit: I restarted HA after cloning both custom components, and after adding this sensor I cannot run a config check. It stays on "checking" forever.

#  - platform: imap_attachment
#    name: vodafone_mail
#    storage_path: /config/attachments
#    server: imap.gmail.com
#    port: 993
#    username: !secret gmailuser
#    password: !secret gmailpass
#    senders:
#      - [email protected]

I’m having some issues getting this working too. It completely messes up MQTT discovery when I enable the sensor in yaml:

2021-03-17 17:33:35 WARNING (MainThread) [homeassistant.loader] You are using a custom integration imap_attachment which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant
2021-03-17 17:33:35 WARNING (MainThread) [homeassistant.loader] No 'version' key in the manifest file for custom integration 'imap_attachment'. This will not be allowed in a future version of Home Assistant. Please report this to the maintainer of 'imap_attachment'
2021-03-17 17:33:35 ERROR (MainThread) [homeassistant.util.logging] Exception in async_discovery_message_received when handling msg on 'homeassistant/sensor/347538_status/config': '{"name":"Bedroom Amp status","stat_t":"tele/tasmota-bedroom_amp/HASS_STATE","avty_t":"tele/tasmota-bedroom_amp/LWT","pl_avail":"Online","pl_not_avail":"Offline","json_attr_t":"tele/tasmota-bedroom_amp/HASS_STATE","unit_of_meas":"%","val_tpl":"{{value_json['RSSI']}}","ic":"mdi:information-outline","uniq_id":"347538_status","dev":{"ids":["347538"],"name":"Bedroom Amp","mdl":"Sonoff S2X","sw":"9.1.0(tasmota)","mf":"Tasmota"}}'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/util/package.py", line 37, in is_installed
    req = pkg_resources.Requirement.parse(package)
  File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3139, in parse
    req, = parse_requirements(s)
ValueError: not enough values to unpack (expected 1, got 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/mqtt/discovery.py", line 161, in async_discovery_message_received
    await async_process_discovery_payload(component, discovery_id, payload)
  File "/usr/src/homeassistant/homeassistant/components/mqtt/discovery.py", line 222, in async_process_discovery_payload
    await hass.config_entries.async_forward_entry_setup(
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 895, in async_forward_entry_setup
    result = await async_setup_component(self.hass, domain, self._hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 895, in async_forward_entry_setup
    result = await async_setup_component(self.hass, domain, self._hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 895, in async_forward_entry_setup
    result = await async_setup_component(self.hass, domain, self._hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 895, in async_forward_entry_setup
    result = await async_setup_component(self.hass, domain, self._hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 895, in async_forward_entry_setup
    result = await async_setup_component(self.hass, domain, self._hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/helpers/discovery.py", line 195, in async_load_platform
    setup_success = await setup.async_setup_component(hass, component, hass_config)
  File "/usr/src/homeassistant/homeassistant/setup.py", line 57, in async_setup_component
    return await setup_tasks[domain]  # type: ignore
  File "/usr/src/homeassistant/homeassistant/setup.py", line 64, in async_setup_component
    return await task  # type: ignore
  File "/usr/src/homeassistant/homeassistant/setup.py", line 174, in _async_setup_component
    processed_config = await conf_util.async_process_component_config(
  File "/usr/src/homeassistant/homeassistant/config.py", line 828, in async_process_component_config
    p_integration = await async_get_integration_with_requirements(hass, p_name)
  File "/usr/src/homeassistant/homeassistant/requirements.py", line 79, in async_get_integration_with_requirements
    await async_process_requirements(
  File "/usr/src/homeassistant/homeassistant/requirements.py", line 126, in async_process_requirements
    if pkg_util.is_installed(req):
  File "/usr/src/homeassistant/homeassistant/util/package.py", line 41, in is_installed
    req = pkg_resources.Requirement.parse(urlparse(package).fragment)
  File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3139, in parse
    req, = parse_requirements(s)
ValueError: not enough values to unpack (expected 1, got 0)

Hello,

I have been trying to use “PDF File Sensor” to get the current price of my electricity in a country where the providers all publish their rates only in “nice” (but hardly usable) PDFs.
Like this one: https://www.luminus.be/fr/particuliers/electricite-gaz-naturel/-/media/general/pricelists/fr/gcf0c_fr_comfyflex_gas.pdf
Downloading the pdf locally with the Downloader works like a charm. Parsing the pdf works as well until I try to use RegEx to get the right number.

I have no experience at all with regex, so I scratched my head and played around for a couple of hours with regex101, where I pasted the text from the PDF.
The second match on this expression /[\d]{2},[\d]{2}/g is supposed to give me the number I need: 24,27

So I try translating it into my code

sensor:
  - platform: pdf
    name: gas_price_production_per_kwh_scraped
    file_path: '/media/downloads/luminus.pdf'
    regex_search: '[\d]{2},[\d]{2}/g'
    regex_match_index: 2

but that just does not generate any sensor.
How should I format my RegEx here?
(I know the RegEx is the issue as the same code with regex_search: '[\d][\d],[\d][\d]' and no regex_match_index does give me the first amount from the pdf, ie 69,95.)
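
A minimal Python sketch of what may be going on, assuming the component uses Python's re module: the /.../g delimiters and flag are regex101 notation, not part of the pattern, so a trailing `/g` inside regex_search would be matched as literal characters. The sample text is hypothetical and only reuses the two amounts mentioned above. If regex_match_index selects a capture group (as the original example suggests), the second amount can be reached by skipping past the first one.

```python
import re

# Hypothetical stand-in for the text extracted from the price list PDF.
SAMPLE_TEXT = "Fixed fee 69,95\nEnergy price 24,27\n"

# With "/g" left in, the pattern needs a literal "/g" after the number.
with_flag = re.search(r"[\d]{2},[\d]{2}/g", SAMPLE_TEXT)

# Without it, findall returns every occurrence; index 1 is the second one.
all_amounts = re.findall(r"\d{2},\d{2}", SAMPLE_TEXT)
second_amount = all_amounts[1]

# Single-search alternative: skip the first amount, capture the next in group 1.
skip_first = re.search(r"\d{2},\d{2}[\s\S]*?(\d{2},\d{2})", SAMPLE_TEXT)
second_via_group = skip_first.group(1) if skip_first else None

# Decimal commas need converting before the value can be used as a number.
price = float(second_amount.replace(",", "."))

print(with_flag, all_amounts, second_via_group, price)
```

Under the capture-group assumption, the last pattern with regex_match_index: 1 would be the closest equivalent to "second match" on regex101.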

Thank you

Hey, do you have any plans to get this into HACS? And can this be used as a PDF viewer?
Thanks