It’s rather difficult to manage complex networks of autonomous nodes. n*(n-1) meshes always give me headaches, so my approach has been not to look too closely at the detail, but to trust the individual nodes to “do the right thing” and just help them see enough RF energy from their neighbours.
Adding and moving routers is basically what I’d do as well, which makes me wonder about an external factor like an RF noise source…
A repeated time between failures of exactly 600 seconds does sound like a protocol issue though, hence my IKEA firmware update thought (ZHA + TRÅDFRI + SiLabs coordinator works well here).
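
If it helps to confirm the 600-second pattern before blaming firmware, here is a rough sketch for checking the gaps between failures. The timestamps are made up for illustration, and the helper names `failure_intervals` and `looks_periodic` are my own, not part of ZHA or Home Assistant. You would paste in the times your own log shows.

```python
from datetime import datetime

def failure_intervals(timestamps):
    """Return the gaps in seconds between consecutive failure times."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def looks_periodic(intervals, period=600.0, tolerance=2.0):
    """True if every gap is within `tolerance` seconds of `period`."""
    return bool(intervals) and all(abs(i - period) <= tolerance for i in intervals)

# Hypothetical failure times copied by hand from a log (assumed ISO format).
failures = [
    "2024-01-01T10:00:00",
    "2024-01-01T10:10:00",
    "2024-01-01T10:20:01",
    "2024-01-01T10:30:00",
]

gaps = failure_intervals(failures)
print(gaps, looks_periodic(gaps))
```

Gaps that stay within a second or two of 600 over many failures point towards a timer in the protocol or firmware. An RF noise source would more likely give irregular gaps, unless the source itself runs on a schedule.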