When a VM is stopped, Proxmox removes its dirty bitmap, forcing a full read of all disks on every backup for as long as the VM stays stopped.

In real-life situations, where junior employees on the staff have permission to stop a VM, option 4 “Archive node” or option 5 “Pool based exclusion” below are the only ones that really seem to work.

The “skip stopped” patch is the ideal solution if you are technically advanced enough (e.g. a cloud company), but it does not work well in an enterprise setting.

Options to avoid excessive backups and IO overload

  1. Live with it
    Just accept it, and be constantly annoyed by slow backup procedures and by backup jobs that suddenly don’t complete during the night, when many VMs have been stopped, IO has become too bad, and no-one really knows why. Even seniors tend to forget that stopped VMs become an IO problem in Proxmox. Upgrade storage, or migrate away from Proxmox, when the problem becomes too big or expensive.
  2. Detach
    Detach the disks on stopped VMs, since PBS does not back up detached disks. In a small setup this might be a good solution: backups of a VM are effectively paused while its disks are detached and automatically resume when you attach them again, which you have to do before starting the VM (see the first sketch after this list). But no-one wants to do this, because a detached/unused disk is expected to be unimportant and might get removed, and it is not something you should teach a junior to do. You might also not reattach everything correctly before starting the VM, and if other detached disks are still lying around from a previous storage, you are now in a complete mess, without knowing which disks are the correct ones to attach.
  3. Exclude VMs
    Manually exclude stopped VMs in the backup job. A no-go, because juniors WILL forget to re-include them after starting a VM, and we cannot have running VMs without backups. This is a “disaster to come” method, but it is the one that is naturally chosen, until a more experienced senior/consultant changes it.
  4. Archive node
    Rename a node to “proxX-nobackup” and implement a company rule that every VM that stays stopped beyond the current day must be moved there (see the second sketch after this list). The name is important, to make sure juniors are completely aware that there are no backups on this node. Don’t run backups on this node. This rather annoying rule almost fixes the problem in practical use cases. When anyone discovers a stopped VM left somewhere, they just move it, and IO rarely becomes a problem, because the stopped VMs are moved away from the backups before there are a lot of them.
  5. Pool based exclusion
    Similar to the archive node, you can implement the same idea much more easily by moving stopped VMs into a separate “No-backup” pool on the same node, and not backing up this pool in your backup jobs (see the third sketch after this list). Easy to implement, but it is not very visible that a VM is in this specific No-backup pool, so you will have situations where a VM gets started again and is left in the “No-backup” pool, without backups, after the start. Still a bit better than the “Exclude VMs” method, as someone might at some point discover a started VM left in the “No-backup” pool and move it to the correct pool, and it is very easy for a senior to quickly check the No-backup pool. Another downside is that you have to set a specific pool to back up (e.g. “Default”); you cannot exclude a pool. So if a third pool is created, it will not be backed up unless you create another backup job for that pool. And the even bigger problem: by default, a newly created VM is not added to any pool unless you select one, and there is no option to back up all VMs that are not in a pool, so if the pool is forgotten at VM creation, there are no backups of that VM.
  6. Patch to skip stopped VMs
    Claudio Luck’s patch, or something similar, is perfect in 99% of the cases I see. We want one backup after the VM was stopped, and then we don’t want to read or touch the VM or its backups until the VM is started again. But it is a code patch, not something anyone wants in their backup procedures.
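
Below is a minimal sketch of the “Detach” option from point 2, using the Proxmox VE HTTP API from Python with requests. The host name, node name, API token, VM ID, disk key and volume name are all assumptions you will have to adapt. It relies on the documented behavior of the /nodes/{node}/qemu/{vmid}/config endpoint: deleting a disk key detaches the volume to an “unusedN” entry, it does not destroy it.

    import requests

    # Assumptions for the example: host "pve1.example.com", node "pve1",
    # VM 100 with one disk on key "scsi0". Adapt token and names to your
    # setup; add verify=False only if you must (self-signed certificates).
    API = "https://pve1.example.com:8006/api2/json"
    HEADERS = {"Authorization": "PVEAPIToken=root@pam!backup=<token-secret>"}

    def detach_disk(node, vmid, disk_key):
        # Deleting a disk key from the VM config detaches the volume;
        # Proxmox keeps it as an "unusedN" entry instead of destroying it.
        r = requests.put(f"{API}/nodes/{node}/qemu/{vmid}/config",
                         headers=HEADERS, data={"delete": disk_key})
        r.raise_for_status()

    def reattach_disk(node, vmid, disk_key, volume):
        # Reattach by assigning the volume back to its original key.
        r = requests.put(f"{API}/nodes/{node}/qemu/{vmid}/config",
                         headers=HEADERS, data={disk_key: volume})
        r.raise_for_status()

    detach_disk("pve1", 100, "scsi0")      # backups now skip this disk
    # ... later, before the VM is started again:
    reattach_disk("pve1", 100, "scsi0", "local-lvm:vm-100-disk-0")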
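
For the “Archive node” option from point 4, moving a stopped VM is a plain offline migration via the /nodes/{node}/qemu/{vmid}/migrate endpoint. A sketch under the same assumptions as above, with a hypothetical archive node named “prox9-nobackup”:

    import requests

    API = "https://pve1.example.com:8006/api2/json"
    HEADERS = {"Authorization": "PVEAPIToken=root@pam!backup=<token-secret>"}

    # Offline migration is enough, since the VM is stopped anyway.
    r = requests.post(f"{API}/nodes/pve1/qemu/100/migrate",
                      headers=HEADERS, data={"target": "prox9-nobackup"})
    r.raise_for_status()
    print("migration task:", r.json()["data"])  # a task ID (UPID) you can poll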
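
And for the “Pool based exclusion” option from point 5, a sketch of moving a VM between pools with the PUT /pools/{poolid} endpoint and its delete flag. A VM can only be a member of one pool, so it has to be removed from the old pool before it is added to the new one; the pool names “Default” and “No-backup” are this article’s examples, everything else is the same assumptions as in the first sketch:

    import requests

    API = "https://pve1.example.com:8006/api2/json"
    HEADERS = {"Authorization": "PVEAPIToken=root@pam!backup=<token-secret>"}

    def move_to_pool(vmid, old_pool, new_pool):
        # Remove the VM from its current pool ...
        requests.put(f"{API}/pools/{old_pool}", headers=HEADERS,
                     data={"vms": str(vmid), "delete": 1}).raise_for_status()
        # ... then add it to the target pool.
        requests.put(f"{API}/pools/{new_pool}", headers=HEADERS,
                     data={"vms": str(vmid)}).raise_for_status()

    move_to_pool(100, "Default", "No-backup")   # stop -> park in No-backup
    move_to_pool(100, "No-backup", "Default")   # start -> back to Default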

Monitor your backups

If you also implement the triggers below in your monitoring software (Zabbix or similar), you have a proper setup (see the sketch after the list):

  • Xxx is not in a pool > 1 hour (if you use pool based backups, all VMs must be in a pool)
  • Xxx is stopped in Default pool > 12 hours (will create excessive read IO)
  • Xxx is running in No-backup pool > 12 hours (if it is running there must be backups, and they will be fast)
  • Xxx is running, is not in the “No-backup” pool, and the latest backup is more than 3 days old (this situation should not happen, but maybe the VM was left outside of a pool or there is an error in the pool setup)
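
A sketch of the checks behind these triggers, e.g. as a script run by a Zabbix item or from cron. The duration thresholds (> 1 hour, > 12 hours, > 3 days) are best expressed in the trigger configuration itself, so the script only reports the current state. The pool names, the PBS storage name “pbs” and the API details are the same assumptions as in the earlier sketches:

    import time
    import requests

    API = "https://pve1.example.com:8006/api2/json"
    HEADERS = {"Authorization": "PVEAPIToken=root@pam!monitor=<token-secret>"}

    def get(path):
        r = requests.get(f"{API}{path}", headers=HEADERS)
        r.raise_for_status()
        return r.json()["data"]

    # Newest backup timestamp per VMID, from the PBS-backed storage "pbs".
    last_backup = {}
    for node in get("/nodes"):
        for vol in get(f'/nodes/{node["node"]}/storage/pbs/content?content=backup'):
            if "vmid" in vol:
                vmid = int(vol["vmid"])
                last_backup[vmid] = max(last_backup.get(vmid, 0), vol["ctime"])

    # /cluster/resources lists every VM with status, node and pool; the
    # "pool" key is simply absent when the VM is not in any pool.
    now = time.time()
    for vm in get("/cluster/resources?type=vm"):
        vmid, status, pool = vm["vmid"], vm["status"], vm.get("pool")
        name = f'{vmid} ({vm.get("name", "?")})'
        if pool is None:
            print(f"WARN {name}: not in a pool")
        elif status == "stopped" and pool == "Default":
            print(f"WARN {name}: stopped in Default pool (excessive read IO)")
        elif status == "running" and pool == "No-backup":
            print(f"WARN {name}: running in No-backup pool (no backups)")
        if (status == "running" and pool != "No-backup"
                and now - last_backup.get(vmid, 0) > 3 * 24 * 3600):
            print(f"WARN {name}: latest backup older than 3 days")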

Links

  • Persistent dirty bitmaps will probably get added to Proxmox and PBS at some point. Check Proxmox bug 3233.