Duplicity replacement --> rsync Time Machine-like backups

Right, but there will be a new full backup done each week.

From the doc: Each backup is in its own folder named after the current timestamp. Files that haven’t changed from one backup to the next are hard-linked to the previous backup, so they take very little extra space.

Therefore, changed files are simply copied into the current backup folder.
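
Roughly, the mechanism looks like this (a minimal sketch, assuming an example destination of /mnt/backup/myserver and /etc as the only source; the real rsync-time-backup script adds locking, a marker file, exclusion handling and error checking):

    # Each run writes into a new timestamped folder; unchanged files are
    # hard-linked against the previous backup via --link-dest, so only
    # changed files consume new space.
    DEST=/mnt/backup/myserver
    PREVIOUS=$(ls -1d "$DEST"/????-??-??-?????? 2>/dev/null | tail -n 1)
    NEW="$DEST/$(date +%Y-%m-%d-%H%M%S)"

    rsync -aAX ${PREVIOUS:+--link-dest="$PREVIOUS"} /etc "$NEW"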

NB: Older versions are kept (versioning)! The script automatically deletes old backups using the following logic:

  • Within the last 24 hours, all the backups are kept.
  • Within the last 31 days, the most recent backup of each day is kept.
  • After 31 days, only the most recent backup of each month is kept.
Additionally, if the backup destination directory is full, the oldest backups are deleted until enough space is available.
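
For illustration, the pruning rules above could be sketched in shell like this (a simplified dry run that only echoes the rm commands; this is not the actual code of rsync_tmbackup and the paths are examples):

    DEST=/mnt/backup/myserver
    now=$(date +%s)
    declare -A kept_day kept_month

    # Walk the backups newest first, so the first one seen for a given day
    # or month is the one that gets kept.
    for dir in $(ls -1dr "$DEST"/????-??-??-?????? 2>/dev/null); do
        name=${dir##*/}                                   # e.g. 2018-01-21-230030
        ts=$(date -d "${name:0:10} ${name:11:2}:${name:13:2}:${name:15:2}" +%s)
        age=$(( now - ts ))
        day=${name:0:10}; month=${name:0:7}

        if   (( age < 86400 )); then
            :                                             # < 24 h old: keep everything
        elif (( age < 31 * 86400 )); then
            [[ ${kept_day[$day]} ]]     && echo rm -rf "$dir" || kept_day[$day]=1
        else
            [[ ${kept_month[$month]} ]] && echo rm -rf "$dir" || kept_month[$month]=1
        fi
    done
    # On top of this, when the destination is full the oldest backups are
    # removed until enough space is available (not shown here).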

Is it clearer?

1 Like

I’m sure you’re right, but I know ZFS requires a lot of RAM… That’s not an all-round solution. My old 4 GB RAM server would not survive it :slight_smile:

The RAM isn’t as big a deal as the hacking that would be needed to get Neth installed on ZFS in a way that’s stable, repeatable, and would survive upgrades without any extra work. Other Linux flavors make this a bit easier, but not CentOS. I do like the idea (and, again, this use case would be perfect for snapshots and replication, though it would require another ZFS box to send them to), but it would take a lot more work before I think it’d be safe to use for production.

@pagaille I opened a pull request for you to create the marker automatically. I also cleaned up some unused variables.

While I really appreciate your efforts, I don’t think that this is ready for inclusion in the core backup-data of NethServer.
Don’t get me wrong, I quickly reviewed the script and I think that we should go ahead, but I need to involve more people (@giacomo, @Stll0).

The first thing we must check is the inclusion/exclusion syntax of duplicity vs rsync.
I fear that we may leave some files out of the backup, or include too much when there are exclusions.
We may adopt a syntax for the include/exclude files in /etc/backup-data.d/ and convert them appropriately to the format needed by the tool we use.
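
To make the concern concrete, a naive mapping might look like the sketch below (hedged illustration: the custom.include / custom.exclude file names and the destination path are assumptions, and whether the patterns translate one-to-one is exactly what would need checking):

    # Duplicity and rsync interpret patterns differently (anchoring,
    # wildcards, directory semantics), so a pass-through like this has to be
    # verified carefully against both tools.
    rsync -aAX -r \
          --files-from=/etc/backup-data.d/custom.include \
          --exclude-from=/etc/backup-data.d/custom.exclude \
          / /mnt/backup/myserver/2018-01-22-230040
    # Note: with --files-from, recursion is no longer implied, hence the
    # explicit -r in addition to -a.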

Then we must work on the restore interface.
We may need a function (a button in the interface) to format the local USB disk as needed.
I see that rsync-time-backup has some checks for the destination, but I don’t know if those are enough.
I didn’t test backup over SSH; I think it would be a good option.

@pagaille would you like to coordinate our efforts?

6 Likes

Thanks so much for your interest @filippo_carletti! My first impression was that you weren’t convinced of the advantages of this script :slight_smile:

Sure. As stated above, this is a work in progress that needs a UI and a lot of testing.

It has been running here for some weeks now, so far without any problem.

I already took care of this. I adapted the handling of the include and exclude files in the script so that rsync can use them. As far as I can tell, it works. The file format doesn’t have to be modified.

Yep. I’ve got an idea, I’ll try to write a draft.

Yes. It is currently missing.

Since nethserver-backup takes care of the mounting, I see no reason why something would go wrong.

Me too! It is a nice and easy alternative to WebDAV!

Sure! How should I begin?

2 Likes

We need a little time to test it on a couple of servers, then we can continue with the work!

You could go deeper into the matter at FOSDEM! What do you think? It looks like a great topic to dive into.

In public ???! Naaah, I’m the man in the shadows :slight_smile:

1 Like

LOL.
We can meet in a reserved room at FOSDEM.
I put your rsync backup into production and I’m keeping an eye on it.
I would like to work on the restore.
Open issues:

  • one-filesystem option
  • compress
  • rsync over ssh (or sshfs)
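
For reference, all three points map onto existing rsync options (an illustrative sketch with an invented remote host; how they interact with the script’s own options still needs testing):

    #   -x / --one-file-system : don't cross filesystem boundaries
    #   -z                     : compress the data during the transfer only
    #                            (nothing is compressed on the destination disk)
    #   -e ssh                 : send the backup to a remote host over ssh
    rsync -aAX -x -z -e ssh \
          /etc backupuser@backuphost:/mnt/backup/myserver/2018-01-22-230040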

In that case: of course.

It’s been running for a month here. It has started expiring backups:

rsync_tmbackup: Previous backup found - doing incremental backup from /mnt/backup/mattlabs/2018-01-21-230030
rsync_tmbackup: Creating destination /mnt/backup/mattlabs/2018-01-22-230040
rsync_tmbackup: Expiring /mnt/backup/mattlabs/2018-01-21-125033
rsync_tmbackup: Expiring /mnt/backup/mattlabs/2018-01-21-122849
rsync_tmbackup: Expiring /mnt/backup/mattlabs/2018-01-21-121344
rsync_tmbackup: Starting backup... 

I like the beauty and simplicity of that script :slight_smile:

1 Like

Just a suggestion: expiration should be done after the current backup job has finished;
otherwise, if your retention policy is very “short”, you could find yourself with no good backup.

my 2c

I feel you :slight_smile:

The script proceeds as follows:

The script automatically deletes old backups using the following logic:

  • Within the last 24 hours, all the backups are kept.
  • Within the last 31 days, the most recent backup of each day is kept.
  • After 31 days, only the most recent backup of each month is kept.

Additionally, if (and only if) the backup destination directory is full, the oldest backups are deleted until enough space is available.

Therefore I believe that we are on the safe side.

1 Like

I took note that the Retention policy setting is ignored.

Absolutely. I love the dumb approach: HDD not full? --> fill it. HDD full? --> delete the oldest backups until there is enough space. The user doesn’t have to worry and enjoys as many backups as the device can hold.
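
That “fill it, then free the oldest” behaviour is easy to picture in shell (a rough sketch with example paths and an arbitrary 2 GiB threshold, not the script’s actual code; careful, it really deletes backups):

    DEST=/mnt/backup/mattlabs
    needed=$(( 2 * 1024 * 1024 * 1024 ))                  # hypothetical space requirement
    while :; do
        avail=$(df --output=avail -B1 "$DEST" | tail -n 1 | tr -d ' ')
        [ "$avail" -ge "$needed" ] && break               # enough room: stop
        oldest=$(ls -1d "$DEST"/????-??-??-?????? 2>/dev/null | head -n 1)
        [ -n "$oldest" ] || break                         # nothing left to delete
        rm -rf -- "$oldest"                               # drop the oldest backup
    done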

Have a look at the Apple doc for Time Machine, I think we should mimic that: https://support.apple.com/en-us/HT201250

3 Likes

Fully agree.

Do you know how Time Machine performs a restore? I’ve read some user documentation, but I can’t quickly figure out how it works behind the scenes (I don’t know macOS).

We could scan the first level of the backup disk to quickly know the dates of all backups (find /mnt/backup/servername/ -maxdepth 1 -type d), but traversing the whole disk to find all files will be slow.
Does Time Machine ask the user to wait? Or does it keep a cache of the backups?
NethServer keeps a cache, using duc (the same tool used to measure disk usage).

Yes, I saw that; I disabled it :wink: it took twice the time needed for the backup itself.

I would simply read and display the folder tree on the disk. The only real issue with doing so is searching for every copy (backup) of a given file. But in my experience, I have used that function maybe 10 times in 10 years.
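
For what it’s worth, looking up every copy of one file is just a loop over the backup folders (a sketch using the example paths from the log above and /etc/passwd as a stand-in file; it assumes the backup folders mirror the source tree):

    # Copies that share the same inode are hard links, i.e. the file did not
    # change between those backups.
    for backup in $(ls -1dr /mnt/backup/mattlabs/????-??-??-??????); do
        f="$backup/etc/passwd"
        [ -e "$f" ] && ls -li "$f"
    done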

OS X restores files or a full installation by simply copying the latest (or another) backup folder, then probably restores databases and does other tricks, like we do.

ACLs and extended attributes are kept either by making the use of an Apple file system mandatory, OR by creating a “sparsebundle” monolithic file containing an HFS (Apple) filesystem on top of a non-Apple file system. That’s a clever and controlled solution.

The indexing (for searching) is done by the general file indexer, Spotlight, which also indexes file contents on the fly.

Maybe we could use some Lucene indexer for that purpose, but I believe it is really not a priority.

This is a nice idea, but will it put additional stress on the backup process? Also, you would need to add a class to collect the info and another one to read it.

Resource-wise, will this prove its utility if it is integrated into the backup?
Don’t get me wrong, I just ask as a way to find the best approach. :slight_smile:

I agree.

My general feeling is that we don’t really need to create any index of any sort for the backups. Just browse the file and folder tree on the backup disk, done.

I believe the way Duplicity stored the files on the disk made the indexing step mandatory. But it is not needed here.

On the other hand, I guess we could use the log file (if you want to have the files parsed and indexed).
Rsync can output its activity or even generate the listing in a file in a specified format, if I’m not mistaken.

This way you have the index files generated from the log. Easy to parse and use.
Even tar (or most archivers) can generate a listing of the files (including the paths) if needed.
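
Indeed, rsync’s --log-file together with --log-file-format can already produce something close to an index (a minimal sketch; the log path and the format string are just examples):

    # Write one line per transferred file: %i = itemised change flags,
    # %l = file length, %n = file name.
    rsync -aAX \
          --log-file=/mnt/backup/mattlabs/2018-01-22-230040.log \
          --log-file-format='%i %l %n' \
          /etc /mnt/backup/mattlabs/2018-01-22-230040

    # The log can then be grepped like a per-backup index, e.g.:
    grep 'etc/passwd$' /mnt/backup/mattlabs/2018-01-22-230040.log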

3 Likes

Brilliant !!