Home Assistant Add-on: Promtail

CentralCommand · March 26, 2021, 8:09pm

Promtail is an agent which ships the contents of local logs to a private Loki instance or Grafana Cloud. It is usually deployed to every machine that runs applications that need to be monitored.

This addon requires supervisor version 2021.03.8 as it relies on the new journald capability just added. This is the current stable release as of 4/5. If you haven’t updated yet, make sure you update first.

About

By default this addon version of Promtail will tail logs from the systemd journal. This includes all logs from all addons, supervisor, and home assistant. It can also scrape local log files in /share or /ssl if you have a particular add-on that logs to a file instead of to stdout.

How do I use it?

Promtail is a central piece of what’s known as the PLG Stack for application monitoring - Promtail, Loki and Grafana. I’m sure a lot of you are already familiar with Grafana’s data analysis and visualization tools, either from the great community add-on or from using it in some other aspect of your life.

But Grafana is also central to system monitoring. The same company also owns Loki and Promtail, which are used to collect and aggregate logs and other metrics from your systems. Then Grafana can pull in this information from Loki so you can explore, analyze, and create metrics and alerts. Grafana isn’t the only tool that can read from Loki, but it is usually used in this stack since it’s all designed to work well together.

Essentially the process you probably want to set up is this:

Promtail scrapes your logs and feeds them to Loki
Loki aggregates and indexes and makes its API available
Add Loki as a data source to Grafana and explore
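
As a rough sketch of the first step, a bare Promtail configuration that tails the journal and pushes to Loki could look something like the following. The Loki hostname `loki.example.local` and the positions path are placeholders I made up for illustration, and the add-on generates its own base config, so treat this as a picture of the moving parts rather than something to paste in as-is.

```yaml
# Sketch only: a standalone Promtail config, not the add-on's generated file.
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # hypothetical path for the read-position file

clients:
  # Placeholder URL: point this at your own Loki instance.
  - url: http://loki.example.local:3100/loki/api/v1/push

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      # Expose the journal's CONTAINER_NAME field as a container_name label.
      - source_labels: ['__journal_container_name']
        target_label: container_name
```

Once entries arrive in Loki with a `container_name` label, steps 2 and 3 come down to adding Loki as a data source in Grafana and querying on that label.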

Great! Where’s Loki?

Also in this repository! You can find it here.

Anything else I need to know?

Before making any complicated scrape configs I’d recommend reading the Loki best practices guide and learning about LogQL and what it can do. Less is more in the scraping stage.

Other than that the readme and documentation cover all the options. If you need help, you can:

Comment here
Open an issue in the repository
Ask for help in the #add-ons channel of the HA discord (I’m CentralCommand#0913 there).

Also big thanks to @massive for letting me know the right way to do this with the journal, and for providing the journal scraping configuration. That’s what this add-on uses as its default scraping configuration now.

HA relevant scrape config examples

Before journald support was released for addons I had to test purely using additional scrape configs looking at log files other addons pumped out. I don’t use any of these anymore now that journald support exists but I figured I’d share them. I thought people needing to configure additional scrape configs might find it useful to have a few other HA relevant examples at their disposal (in addition to what’s in the promtail docs).

Caddy 2 access logs

You can get Caddy 2 to log all access to the file of your choice and then scrape it with Promtail. Add something like this to your Caddyfile:

:443 {
   log {
      output file /ssl/caddy/logs/caddy.log {
         roll_size 20MiB
         roll_keep_for 168h
      }
   }
}

Then you can add a scrape config like this:

- job_name: caddy
  pipeline_stages:
    - json:
        expressions:
          stream: level
          status_code: status
          host: request.host
          time: ts
    - labels:
        stream:
        status_code:
        host:
    - timestamp:
        source: time
        format: Unix
  static_configs:
    - targets:
        - localhost
      labels:
        job: caddy
        __path__: /ssl/caddy/logs/caddy.log

This one was structured so it was nice and easy; the others were tougher.

Zigbee2MQTT logs

The Zigbee2MQTT add-on dumps out all of its logs to a folder called log in its folder (/share/zigbee2mqtt/log by default). This includes every MQTT message it publishes. It’s not structured, but it’s scrapable. Here’s the config I used when testing:

- job_name: zigbee2mqtt
  pipeline_stages:
    - regex:
        expression: '^(?P<stream>\S+)\s+(?P<time>\d{4}(?:-\d\d){2} \d\d(?::\d\d){2}):\s+(?P<content>.*)$'
    - regex:
        expression: '^(?P<mqtt_event>MQTT publish):\s+.*$'
        source: content
    - labels:
        stream:
        mqtt_event:
  static_configs:
    - targets:
        - localhost
      labels:
        job: zigbee2mqtt
        __path__: /share/zigbee2mqtt/log/*/log*.txt

Home Assistant log file

This actually won’t work anymore since /config isn’t being mapped by the addon. But I thought it was a good reference since it was a bit tricky to figure out. The multiline bit at the top causes it to suck up stack traces into the log line they are generated from. Otherwise it would’ve been really tough to read these logs with each line of the stack trace as a separate log entry.

- job_name: homeassistant
  pipeline_stages:
    - multiline:
        firstline: '^\d{4}(?:-\d\d){2} \d\d(?::\d\d){2} '
    - regex:
        expression: '^(?s)(?P<time>\d{4}(?:-\d\d){2} \d\d(?::\d\d){2})\s+(?P<stream>\S+)\s+\((?P<thread>[^)]+)\)\s+\[(?P<component>[^\]]+)\].*$'
    - labels:
        stream:
        time:
        component:
  static_configs:
    - targets:
        - localhost
      labels:
        job: homeassistant
        __path__: /config/home-assistant.log

massive · March 27, 2021, 7:37am

@CentralCommand Great work, and thank you for taking time to implement this. I’m looking forward to migrating my Portainer based installation to this once the journald support lands in stable Supervisor.

CentralCommand · March 29, 2021, 5:59pm

So I’m realizing that the logs from supervisor and homeassistant actually aren’t more parseable from the journal. The problem is the errors. Anytime HA or supervisor encounters an exception it writes out a log entry per line in the traceback. Which ends up looking really gross in Grafana since each line of the traceback shows up separately, and in reversed order.

When I was testing with the home-assistant.log file I was able to fix this with a multiline directive. But now I’m not really sure how to do that with the journal. I can’t really seem to get it to recognize the multiline directive at all. And even if I did since all the log entries come in from one single job there’s not really a good way to process some logs differently from others.

I’m going to have to do some thinking about this. Is this something you’ve solved or are you just living with the confusing tracebacks currently?

CentralCommand · March 29, 2021, 8:23pm

Ah ha! Figured it out. The match directive allows you to set separate pipelines for different subsets of the logs. And multiline works fine but I was missing the escape sequence at the beginning that color codes it. So throwing this on as the pipeline stage fixes up the supervisor and HA logs:

pipeline_stages:
  - match:
      selector: '{container_name=~"homeassistant|hassio_supervisor"}'
      stages:
        - multiline:
            firstline: '^\x{001b}\[\d+m\d{4}(?:-\d\d){2}\s+\d\d(?::\d\d){2}\s+'

Gonna go ahead and update the default config with that so people can use it.
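
To illustrate the point about separate pipelines, here is a sketch with two match blocks side by side. The first is the stage above. The second is a hypothetical addition that drops DEBUG lines from an add-on whose container name (`addon_example_noisy`) is made up for the example.

```yaml
pipeline_stages:
  # Only supervisor and HA entries get the multiline treatment.
  - match:
      selector: '{container_name=~"homeassistant|hassio_supervisor"}'
      stages:
        - multiline:
            firstline: '^\x{001b}\[\d+m\d{4}(?:-\d\d){2}\s+\d\d(?::\d\d){2}\s+'
  # Hypothetical second subset: discard any line containing "DEBUG".
  - match:
      selector: '{container_name="addon_example_noisy"} |= "DEBUG"'
      action: drop
```

Entries that match neither selector pass through untouched, which is what makes this workable when everything arrives from a single journal job.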

massive · April 6, 2021, 10:43am

This works beautifully! I replaced custom Portainer based stuff with this.

I’ve tweaked pipeline stages config a bit, to extract more data from the log lines. I’m also using AppDaemon, and getting information from that.

- multiline:
    firstline: '^\d{4}(?:-\d\d){2} \d\d(?::\d\d){2} '  
- match:
    selector: '{container_name="homeassistant"}'
    stages:
      - regex:
          expression: '^.*?(?P<date_local>[\d-]+) (?P<time_local>[\d:]+) (?P<level>[\w]+) \((?P<thread>[\w]+)\) \[(?P<ha_component>[\w\.]+)\] (?P<message>.+)'
      - labels:
          date_local:
          time_local:
          level:
          thread:
          ha_component:
          message:
- match:
    selector: '{container_name="addon_a0d7b954_appdaemon"}'
    stages:
      - regex:
          expression: '^(?P<date_local>[\d-]+) (?P<time_local>[\d:\.]+) (?P<level>[\w]+) (?P<appdaemon_component>[\w\.]+): (?P<message>.+)'
      - labels:
          date_local:
          time_local:
          level:
          appdaemon_component:
          message:

CentralCommand · April 6, 2021, 12:04pm

Nice! Although something to be aware of with that config. When I started experimenting the first thing I did was dig in deep with regexes to parse out all the pieces into labels. But then at some point I found Loki’s best practices guide and realized I was going to create issues for myself with the dynamic labels, especially the ones coming from unbounded values. I actually ended up deleting all my carefully crafted regexes and going back to your config at that point lol. Seemed good to turn less into labels and just query on the text content in grafana.

So of your labels above, level is fine. ha_component and appdaemon_component are probably fine, definitely should be less than 1000 values and I imagine those are quite helpful. thread I’m not sure about, I don’t see a whole lot of unique values for that but I don’t know what the bound is, also not sure when I would search by that personally.

date_local and time_local I would definitely think about removing as labels. Instead what you should be able to do I think is use the timestamp stage to set the timestamp of the log entry from that date/time value. message I would strongly recommend removing as it is very long and completely dynamic. Instead you can use the output stage to reset the contents of the message to just that part and remove the bits you’ve already processed into labels.
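
Roughly what I mean, as an untested sketch based on the homeassistant block above. The extracted key name `timestamp` and the `location` value are my own placeholders, and the `format` layout assumes the log prints date and time without milliseconds. The lazy `.*?` is there to skip any leading color escape sequence without swallowing part of the date, and `(?s)` lets `message` keep a multiline traceback.

```yaml
- match:
    selector: '{container_name="homeassistant"}'
    stages:
      - regex:
          # Date and time are captured together so the timestamp stage can parse them.
          expression: '(?s)^.*?(?P<timestamp>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) (?P<level>\w+) \((?P<thread>\w+)\) \[(?P<ha_component>[\w.]+)\] (?P<message>.+)'
      # Only the bounded values become labels.
      - labels:
          level:
          ha_component:
      # Use the parsed value as the entry's timestamp instead of labelling it.
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05'
          location: 'Europe/London'   # placeholder: set to the timezone HA logs in
      # Replace the log line with just the message part.
      - output:
          source: message
```

With this, date, time, thread and message never become labels, so the number of streams stays bounded by level × ha_component.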

massive · April 6, 2021, 12:53pm

That makes perfect sense. I’ve updated my config accordingly. Thanks for the tips!

massive · April 13, 2021, 7:36am

Good day!

I just upgraded Promtail and Loki addons to the most recent version. However, both refuse to start after the upgrade. I’m seeing from the changelog that changes have been done to AppArmor, and these seem to be the cause for the issue.

I’m pasting relevant bits of HA logs below. This is about promtail, but loki gives an identical error.

21-04-13 07:26:57 INFO (SyncWorker_3) [supervisor.docker.interface] Updating image ghcr.io/mdegat01/promtail/amd64:1.3.1 to ghcr.io/mdegat01/promtail/amd64:1.5.1
21-04-13 07:26:57 INFO (SyncWorker_3) [supervisor.docker.interface] Downloading docker image ghcr.io/mdegat01/promtail/amd64 with tag 1.5.1.
21-04-13 07:27:04 INFO (SyncWorker_3) [supervisor.docker.interface] Stopping addon_39bd2704_promtail application
21-04-13 07:27:05 INFO (SyncWorker_3) [supervisor.docker.interface] Cleaning addon_39bd2704_promtail application
21-04-13 07:27:05 INFO (MainThread) [supervisor.addons] Add-on '39bd2704_promtail' successfully updated
21-04-13 07:27:05 INFO (SyncWorker_1) [supervisor.docker.interface] Cleanup images: ['ghcr.io/mdegat01/promtail/amd64:1.3.1']
21-04-13 07:27:05 INFO (MainThread) [supervisor.host.apparmor] Adding/updating AppArmor profile: 39bd2704_promtail
21-04-13 07:27:05 INFO (MainThread) [supervisor.host.services] Reloading local service hassio-apparmor.service
21-04-13 07:27:06 ERROR (SyncWorker_7) [supervisor.docker] Can't start addon_39bd2704_promtail: 500 Server Error for http+docker://localhost/v1.41/containers/84d36db00bff465de12016f21aef106c1a5cfcc4f4a88389a58e8e65503d8bb9/start: Internal Server Error ("OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/exec: no such file or directory: unknown")

Any thoughts?

CentralCommand · April 13, 2021, 2:58pm

That’s strange, it looks like it can’t apply the profile at all. What’s your setup look like? Are you running HAOS or supervised?

CentralCommand · April 13, 2021, 3:25pm

Ok I am seeing an apparmor related problem on amd64. I’m not seeing the same one but I am going to get to work on that now and push an update. I can’t reproduce your issue though where the apparmor profile literally won’t install. I’ll need some more info about your system I think to try and figure out what is going on there.

Btw, just to check, in the supervisor tab it doesn’t say “unsupported system” right? Need to ask because I’m worried it’s something with supervisor rather than the addon based on the message and some googling. But need more info first.

CentralCommand · April 13, 2021, 6:32pm

Ok I’m pushing an update now that removes the custom apparmor profile while I work through some of these amd64 issues. I’m not sure why this wasn’t showing up before but it’s behaving very differently there.

I actually have a different repository with beta versions of all my addons here. I’m going to iterate and release an update there, when I do I’ll PM you @massive and ask you to install it to ensure it works fine before releasing to stable. The beta versions are totally separate addons so they can just be installed and run alongside with no impact on the primary version.

massive · April 14, 2021, 5:58am

Yes, indeed. I’m running supervised HA on top of Ubuntu 20.04 LTS, which is unfortunately “unsupported” (this wasn’t clearly stated when I first installed HA on top of Ubuntu a few years ago). That being said, I’ve never had any issues with other add-ons, even though I’m running about 20 of them at any given time.

FWIW it seems that the latest version bump (apparmor removal) fixes the issue for now

CentralCommand · April 14, 2021, 12:52pm

That actually doesn’t surprise me. I saw in the documentation that there was a way to add a custom apparmor profile for addons and they recommended it for increased security. The first thing I did was look around for examples of other addons using this capability and I found that very very few are. In fact between the core addons repo and the community addons repo there is only one with a custom apparmor profile - Dnsmasq. I’m guessing you might have an issue installing that one as well.

So I did update the beta build. I ran it with the updated apparmor profile on an amd64 test machine for some time so it should work there in theory but now I’m thinking there’s a more fundamental issue. When you click “Learn more” in the unsupported system message, what does it say? Is apparmor one of the things mentioned?

bulbur · November 17, 2022, 7:56pm

If anyone else stumbles over the problem that the pipeline stage posted above doesn’t extract the labels: It looks like the home assistant logs didn’t have milliseconds in the date back then.
This works for me now (2022.11.2)

- multiline:
    firstline: '^\d{4}(?:-\d\d){2} \d\d(?::\d\d){2}\.\d\d\d '
- match:
    selector: '{container_name="homeassistant"}'
    stages:
      - regex:
          expression: '^.*?(?P<date_local>[\d-]+) (?P<time_local>[\d:.]+) (?P<level>[\w]+) \((?P<thread>[\w]+)\) \[(?P<ha_component>[\w\.]+)\] (?P<message>.+)'
      - labels:
          level:
          ha_component:
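
If you want a first-line pattern that survives both the old and the new log format, something along these lines might do it. This is an untested sketch: it assumes the only differences between formats are an optional color escape prefix (as in the supervisor and HA stage earlier in the thread) and an optional three-digit millisecond suffix.

```yaml
- multiline:
    # Optional ANSI color prefix, then date, time, and optional .mmm milliseconds.
    firstline: '^(?:\x{001b}\[\d+m)?\d{4}(?:-\d\d){2} \d\d(?::\d\d){2}(?:\.\d{3})? '
```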

Vodros · May 8, 2023, 4:55pm

It seems like the latest update of HA OS breaks Promtail. I’ve opened an issue on GitHub: Error reading journal position · Issue #221 · mdegat01/addon-promtail · GitHub
Anyone experiencing the same issue?

piersdd · May 14, 2023, 9:51am

Not sure about broken (being a new user), however, Loki refuses to start:

mkdir /data/loki/chunks: permission denied
…
error initialising module: store

Vodros · May 15, 2023, 3:32pm

Sounds like permission issues. Check that your docker user has permission to write to that directory.

johntdyer · May 15, 2023, 4:01pm

yea @Vodros, I can confirm it’s for sure borked since the systemd update… The fix is to add some env variable to systemd but I’m not sure how that applies on HassOS and as such I am personally not sure how best to proceed…

ChrisHaPunkt · May 20, 2023, 7:47am

Can you tell what env variables you are talking about? Would like to test.

gregscher · September 9, 2023, 12:54am

Check out this thread. It provides a solution to this issue. I just tried it and it worked perfectly.

github.com/mdegat01/addon-loki

Permission error creating chunks directory

opened 03:10PM - 06 Nov 22 UTC

keithskillicorn

When starting Loki addon I get the following error and Loki stops:

s6-rc: info…: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/00-banner.sh
-----------------------------------------------------------
Add-on: Loki
Loki for Home Assistant
-----------------------------------------------------------
Add-on version: 1.11.2
You are running the latest version of this add-on.
System: Home Assistant OS 9.3 (amd64 / generic-x86-64)
Home Assistant Core: 2022.10.5
Home Assistant Supervisor: 2022.10.2
-----------------------------------------------------------
Please, share the above information when looking for help
or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
cont-init: info: /etc/cont-init.d/00-banner.sh exited 0
cont-init: info: running /etc/cont-init.d/01-log-level.sh
Log level is set to INFO
cont-init: info: /etc/cont-init.d/01-log-level.sh exited 0
cont-init: info: running /etc/cont-init.d/nginx.sh
cont-init: info: /etc/cont-init.d/nginx.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun loki (no readiness notification)
services-up: info: copying legacy longrun nginx (no readiness notification)
s6-rc: info: service legacy-services successfully started
[14:29:37] INFO: Starting Loki...
[14:29:37] INFO: Using default config
[14:29:37] INFO: Retention period set to 30d
[14:29:37] INFO: Loki log level set to info
[14:29:37] INFO: Handing over control to Loki...
level=info ts=2022-11-04T14:29:38.253016834Z caller=main.go:103 msg="Starting Loki" version="(version=2.6.1, branch=HEAD, revision=6bd05c9a4)"
level=info ts=2022-11-04T14:29:38.253836497Z caller=modules.go:736 msg="RulerStorage is not configured in single binary mode and will not be started."
level=info ts=2022-11-04T14:29:38.254322253Z caller=server.go:288 http=127.0.0.1:8080 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-11-04T14:29:38.259738005Z caller=modules.go:962 msg="failed to initialize usage report" err="mkdir /data/loki/chunks: permission denied"
mkdir /data/loki/chunks: permission denied
error initialising module: compactor
github.com/grafana/dskit/modules.(*Manager).initModule
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:341
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:255
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1581
level=error ts=2022-11-04T14:29:38.262652238Z caller=log.go:103 msg="error running loki" err="mkdir /data/loki/chunks: permission denied\nerror initialising module: compactor\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:341\nmain.main\n\t/src/loki/cmd/loki/main.go:105\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"
[14:29:38] WARNING: Halt add-on with exit code 1
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped

The only problem I have now is that Promtail doesn’t seem to be scraping anything from Home assistant as I get this message in Loki…

level=info ts=2023-09-09T00:45:08.034773267Z caller=metrics.go:170 component=frontend org_id=fake latency=fast query_type=labels length=10m0s duration=8.598759ms status=200 label= throughput=0B total_bytes=0B total_entries=0

and this message in Grafana:

Data source connected, but no labels were received. Verify that Loki and Promtail are correctly configured.

and this is my log in Promtail

s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/00-banner.sh

Add-on: Promtail
Promtail for Home Assistant

Add-on version: 2.2.0
You are running the latest version of this add-on.
System: Home Assistant OS 10.5 (aarch64 / yellow)
Home Assistant Core: 2023.9.0
Home Assistant Supervisor: 2023.08.3

Please, share the above information when looking for help
or support in, e.g., GitHub, forums or the Discord chat.

cont-init: info: /etc/cont-init.d/00-banner.sh exited 0
cont-init: info: running /etc/cont-init.d/01-log-level.sh
Log level is set to INFO
cont-init: info: /etc/cont-init.d/01-log-level.sh exited 0
cont-init: info: running /etc/cont-init.d/02-set-timezone.sh
[20:00:53] INFO: Configuring timezone
cont-init: info: /etc/cont-init.d/02-set-timezone.sh exited 0
cont-init: info: running /etc/cont-init.d/promtail_setup.sh
[20:00:53] INFO: Setting base config for promtail...
cont-init: info: /etc/cont-init.d/promtail_setup.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun promtail (no readiness notification)
s6-rc: info: service legacy-services successfully started
[20:00:54] INFO: Starting Promtail...
[20:00:54] INFO: Promtail log level set to info
[20:00:55] INFO: Handing over control to Promtail...
level=info ts=2023-09-09T00:00:55.372651311Z caller=server.go:288 http=[::]:9080 grpc=[::]:40303 msg="server listening on addresses"
level=info ts=2023-09-09T00:00:55.374384737Z caller=main.go:121 msg="Starting Promtail" version="(version=2.6.1, branch=HEAD, revision=6bd05c9)"