ffmpeg part two - Electric Boogaloo

I just attended the Watkins Glen opening day for the second year. It was, again, a blast.

I made some slight adjustments to my ffmpeg assembly procedure from last year.

Dashcam saves video in 5-minute chunks

Instead of creating .list files, I simply used a pipe as input:

for fo in AMBA091*; do echo file "$fo"; done \
    | ffmpeg -f concat -i - -c copy Front-Track1.mov

Front and Rear videos need to be combined

Much like last year, I made short samples to confirm whether any offsets were needed. However, I decided to move the overlaid video to the bottom-right corner to cover the timestamps, since they were incorrect on some videos (well, correct, just not for this time zone).

The scaling math is basically the same as before, but instead of a left offset of 70, we want a right offset of 70 expressed in left-hand coordinates. That works out to:

1920 - 480 - 70 = 1370

After the usual synchronization samples were made, it was time to perform the final assembly.

I used a slightly different file layout this time, keeping the front and rear videos separated. I used a loop to assemble them into a combined video:

$ time for FILE in Track{1,2,3}; do ffmpeg -i Front/DCIM/*${FILE}*mov -vf "movie=Rear/DCIM/Rear-${FILE}-cat.mov, scale=480:-1 [inner]; [in][inner] overlay=1370:740 [out]" -strict -2 ${FILE}.mov; done

You're going to want a good CPU. This is the execution time for just under 48 minutes of video on an Intel i5-2520M:

real    172m2.494s
user    619m51.494s
sys     1m37.383s

Final result

You can see the resulting videos on youtube: Part 1, Part 2, and Part 3. Part 3 has some bad sound. I'm not sure why.

Intel GPU Scaling mode

I was attempting to run my laptop at a lower resolution than its panel's native one. However, by default the video is scaled to fill the panel, which distorts the image (fonts look bad, etc.).

On Linux (with Xorg, anyway), this behaviour can be tweaked with xrandr:

$ xrandr --output LVDS1 --set "scaling mode" "Center"

This is not a persistent setting, which is fine for my purposes.
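
If you did want it applied at every login, one option (a sketch, not something I actually use) would be to put the same command in an X startup file such as ~/.xprofile:

# ~/.xprofile (sketch): re-apply the scaling mode at session start
xrandr --output LVDS1 --set "scaling mode" "Center"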

Thanks to the Arch Linux Wiki article on Intel Graphics for documenting this.

My failed experiment with CalDAV/CardDAV

In an ongoing quest to lessen my Google dependency, I decided to self-host my Calendar and Contacts using Baïkal.

Installing and configuring Baïkal is sufficiently documented elsewhere. This post is a (somewhat short) account of why I'm giving up on self-hosted contacts and calendars.

Google

The problems can be summed up into these bullet points:

  • It is assumed (and practically required) to use Google Play Store
  • Google Play Store requires a Google Account
  • Google Account means you have Mail, Calendar, and Contacts

Simply adding your Google account to your phone causes Mail, Calendar, and Contacts to sync. Mail you can disable and use an alternate client, since that data is housed inside the Gmail app and isn't exposed system-wide for other apps to use.

Calendars and Contacts, even if you disable sync, are still "there". Some apps might add events to the "first" calendar without asking (which may or may not be the one synced to Google, rather than your self-hosted calendar). Updating contacts sometimes adds those updates to the Google Contacts list. There is no apparent way to move items from one contacts/calendar account to another.

Summary

As it stands now, you can either have:

  • self-hosted contacts and calendars with probably most of your data, accepting that you will miss some events and people.

  • Google Contacts and Calendar with all of your data.

As much as I preferred self-hosting, it simply isn't practical until you can completely remove Google Contacts/Calendars from your device, and until the management apps provide the ability to move events between accounts.

AWStats from multiple hosts

I decided I wanted some stats. There are a few options: Use a service (Google Analytics, etc) or parse your logs. Both have pros and cons. This article isn't supposed to help you decide.

I just wanted simple stats based on logs: It's non-intrusive to visitors, doesn't send their browsing habits to third parties (other than what they send themselves), and uses the apache log data I've already got for the entire year.

I'm mainly interested in seeing how many people actually read these articles, as well as what search terms referred them here.

Fix your logs

I've got seven virtualhosts spread across four virtual machines. My first problem was that all of them were logging to /var/log/httpd/access_log. After a lot of grep work, I managed to split those out into individual access logs: /var/log/httpd/access_log.chrisirwin.ca, for example.

My biggest problem was that a lot of log entries didn't actually indicate which virtualhost they were from. I ended up spending a few hours coming up with a bunch of rules to identify all requests for my non-main virtualhosts (yay static files). Then I dumped anything that didn't match those rules into my main virtualhost's log (including all the generic GET / entries).
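
The exact rules depend on your sites, but the general shape was something like the following (the pattern shown is a made-up example, not one of my actual rules):

# requests that can only belong to a secondary virtualhost go to its log
grep -E 'GET /some-path-unique-to-web/' access_log >> access_log.web.chrisirwin.ca
# everything that matched none of the rules falls through to the main virtualhost
grep -vE 'GET /some-path-unique-to-web/' access_log >> access_log.chrisirwin.ca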

All my logs are sorted into per-virtualhost logs, and all lines from the original are accounted for.

I renamed access_log to access_log.old, just so I don't mistakenly review its data again.

Fix your logging

Now that we've got separate access logs, we need to tell our virtualhosts to use them. In each virtualhost I added new CustomLog and ErrorLog definitions, using the domain name of the virtualhost.

CustomLog       "logs/access_log.chrisirwin.ca" combined
ErrorLog        "logs/error_log.chrisirwin.ca"

Then restart httpd

$ sudo systemctl restart httpd

I also disabled logrotate, and un-rotated my logs with zcat. I'll probably need to revisit this in the future, but 1 year worth of logs is only 55MB.
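
Un-rotating looked roughly like this (file names are illustrative; adjust for however logrotate named yours):

# concatenate the rotated (gzipped) logs, then the current log, into one file
zcat access_log.chrisirwin.ca-*.gz > access_log.chrisirwin.ca.combined
cat access_log.chrisirwin.ca >> access_log.chrisirwin.ca.combined
mv access_log.chrisirwin.ca.combined access_log.chrisirwin.ca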

Fetch logs

It goes without saying that awstats needs to be local to the logs. I have four virtual machines. Do I want to manage awstats on all of them? No.

So I wrote a bash script to pull in my logs to a local directory:

$ cat /opt/logs/update-logs
#!/bin/bash

cd "$(dirname "$(readlink -f "$0")")"

# Standard apache/httpd hosts
for host in chrisirwin.ca web.chrisirwin.ca; do
    mkdir -p "$host"
    rsync -avz "$host:/var/log/httpd/*log*" "$host/"
done

# Gitlab omnibus package is weird
host=gitlab.chrisirwin.ca
mkdir -p "$host"
rsync -avz "$host:/var/log/gitlab/nginx/*log*" "$host/"

Now I have a log store with a directory per server, and logs per virtualhost within them.

Configure cron + ssh-keys to acquire that data, or run it manually whenever.
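
For the cron route, a crontab entry along these lines would do it (the time of day is arbitrary):

# pull fresh copies of the logs every night at 03:00
0 3 * * * /opt/logs/update-logs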

Install awstats

Then I picked my internal web host and installed awstats. This is on Fedora 22; CentOS/RHEL users will need to enable EPEL first.

$ sudo dnf install awstats

And, uh, restart apache again

$ sudo systemctl restart httpd

Configure awstats

Now go to /etc/awstats, and make a copy of the config for each domain:

$ sudo cp awstats.model.conf awstats.chrisirwin.ca.conf

You'll probably want to read through all the options, but here's all the values I modified:

LogFile="/opt/logs/chrisirwin.ca/access_log.chrisirwin.ca"
SiteDomain="chrisirwin.ca"
HostAliases="REGEX[^.*chrisirwin\.ca$]"
# DNSLookups is going to make log parsing take a *very* long time.
DNSLookup=1
# My site is entirely https, so tell awstats that
UseHTTPSLinkForUrl="/"

Run the load script

Let's just piggy-back on provided functionality:

$ time sudo /etc/cron.hourly/awstats

Mine took >15 minutes. I think it was primarily DNS related.

Review your logs

By default, awstats figures out which config to use based on the domain name in the URL. However, I've aggregated my logs to a single location. Luckily, the awstats developers thought of this, and you can pass an alternate config in the URL:

https://internal.chrisirwin.ca/awstats/awstats.pl?config=chrisirwin.ca

Tweaks to awconfig

Unless you're running awstats on your localhost, you'll be denied access. You'll likely have to edit /etc/httpd/conf.d/awstats.conf and add Require ip 10.10.10.10/16, or whatever your local ip range is. Note that while you can add hostnames instead of IPs, reverse DNS needs to be configured.

While there, you could also add DirectoryIndex awstats.pl.
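
As a sketch, the relevant bits of /etc/httpd/conf.d/awstats.conf end up looking something like this (the directory path and network are examples; your packaging may differ):

<Directory "/usr/share/awstats/wwwroot">
    DirectoryIndex awstats.pl
    # allow the local network, not just localhost
    Require ip 10.10.0.0/16
</Directory>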

Discard (TRIM) with KVM Virtual Machines

I've got a bunch of KVM virtual machines running at home. They all use sparse qcow2 files as storage, which is nice and space efficient -- at least at the beginning.

Over time, as updates are installed, temp files are written and deleted, and data moves around, the qcow2 files slowly expand. We're not talking about a massive amount of storage, but it would be nice to re-sparsify those images.

In the past, I've made a big empty file with dd and /dev/zero, deleted it, then used fallocate on the host to punch the detected holes. However, this is cumbersome.
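
For reference, that old approach looked roughly like this (file names are examples):

# inside the guest: fill the free space with zeros, then delete the filler
dd if=/dev/zero of=/zerofill bs=1M; sync; rm /zerofill
# on the host, with the guest shut down: punch holes wherever the image is all zeros
fallocate --dig-holes /var/lib/libvirt/images/guest.qcow2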

As it turns out, there is a better way: discard. Discard support was initially added to tell SSDs what data can be cleaned and re-used (SSDs call it 'TRIM'), to preserve performance and extend drive lifetime (allowing better wear levelling). The same mechanism can also be used to let a VM tell its host which parts of its storage are no longer required. This allows the host to actually regain free space when guest machines free it.

I used the following two pages as references. The first is more generically useful for machines with actual SSDs, as well as checking trim works through multiple storage layers (dm, lvm, etc).

Fix fstab

Ensure you're not using any device paths in fstab, like /dev/sda1 or /dev/vda1. These steps may renumber or rename your hard disks, and you don't want to troubleshoot boot problems later on. Switch them to LABEL or UUID entries, depending on your preference/use-case.
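
In fstab terms, that means changing entries like the first line below to something like the second (the UUID is a placeholder; blkid or lsblk -o NAME,UUID will show yours):

# before: tied to a device name that may change
# /dev/vda1  /  ext4  defaults  1 1
# after: identified by filesystem UUID
UUID=1234abcd-0000-0000-0000-000000000000  /  ext4  defaults  1 1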

This also means fixing your initrd and grub, if necessary. Most installs shouldn't require that, though. Typically, it's just lazy manually-added filesystems :)

It goes without saying, but reboot your VMs now to ensure they boot after your changes. That will make troubleshooting easier later.

Shutdown VMs and libvirtd

Since I'm doing some manual munging of the VM definition files, first step is to shut down all VMs and stop libvirtd.

Update machine type

Some of my VMs were quite old, and were using old machine versions, as evidenced by one of the .xml files in /etc/libvirt/qemu:

<type arch='x86_64' machine='pc-i440fx-1.6'>hvm</type>

From what I understand, machine types later than 2.1 include discard support. I wanted to update everything to the current 2.3 machine type:

sed -e "s/pc-i440fx-.../pc-i440fx-2.3/" -i *.xml

Add discard support to hard disks

Your sed line will vary here. I've manually specified writeback caching, so my hard drive driver line looks like the following:

<driver name='qemu' type='qcow2' cache='writeback'/>

It was fairly simple to add discard:

sed -e "s#writeback'/>#writeback' discard='unmap'/>#" -i *.xml

It should now look like this:

<driver name='qemu' type='qcow2' cache='writeback' discard='unmap'/>

You could probably key off the qcow2 bit instead of the writeback bit. The order doesn't matter.

Change each hard drive from virtio to scsi bus

All of my VMs were using virtio disks, which don't pass discard through. The virtio-scsi controller, however, does.

There is probably a pretty easy way to do this with virsh, but I opted to just use virt-manager, since I have a finite number of VMs (and reading the man page for virsh would take longer than just doing it with virt-manager).
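
If you'd rather stay on the command line, virsh edit <domain> opens the same XML; the end result for each disk should look roughly like this (a sketch of the shape, not an exact copy of my configs):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='writeback' discard='unmap'/>
  <source file='/var/lib/libvirt/images/guest.qcow2'/>
  <target dev='sda' bus='scsi'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'/>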

Change Disk Bus to SCSI:

Disk Bus assignment

Change SCSI Controller to VirtIO SCSI:

SCSI Controller Model

The latter step might not be required. The only other option is "hypervisor default", so it might just use virtio-scsi by default. Better safe than sorry.

Boot and check your VMs

After starting your VMs, you should be able to confirm that discard support is enabled:

sudo lsblk -o MOUNTPOINT,DISC-MAX,FSTYPE

If you see 0B under DISC-MAX, then something didn't work:

MOUNTPOINT               DISC-MAX FSTYPE
/                              0B ext4

However, if you see an actual size, then congrats. You support discard:

MOUNTPOINT               DISC-MAX FSTYPE
/                              1G ext4

Configure your VMs themselves to discard unused data

Manually run an fstrim to discard all the currently unused crufty storage you've collected on all applicable filesystems:

sudo fstrim -a

Going forward, you can either add 'discard' to the mount options in fstab, or use fstrim periodically. I opted for fstrim, as it has a systemd timer unit that can be scheduled:

sudo systemctl enable fstrim.timer
sudo systemctl start fstrim.timer

Done! Or am I...

Now, there are additional considerations to be made during backup.

For example, if you use rsync, you'll probably want to add --sparse as an option so it doesn't inflate your backup copy to full size. However, that won't actually punch holes that have been discarded since the last backup, so you still need to use fallocate on your backup copies to actually reclaim discarded space.

Another pain is I back up to a btrfs filesystem, which uses snapper to preserve previous revisions. This should be a great solution, however, there are other considerations:

  • rsync's default behaviour is to do all work in a copy, then replace the original. As far as btrfs is concerned, this is entirely new data, and doesn't share anything with existing snapshots. That means btrfs snapshots become quite bloated.
  • I need to use --inplace to avoid the above snapshot bloat.
  • --inplace and --sparse are mutually-exclusive. Well shit.

My current solution is to use --inplace for backups, then fallocate all files. I try to manually rsync --sparse new VMs ahead of their initial backup to avoid the temporary inflation that --inplace would cause.
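
Concretely, the backup pass looks something like this (paths are examples):

# copy in place so btrfs/snapper snapshots keep sharing unchanged blocks
rsync -av --inplace /var/lib/libvirt/images/ /mnt/backup/images/
# then punch holes in the copies wherever the data is all zeros
for img in /mnt/backup/images/*.qcow2; do
    fallocate --dig-holes "$img"
done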

Multiple Instances of Gnome Terminal

Gnome 3 introduced a very handy feature, grouping multiple application windows (whether they be separate instances or not) into a single desktop icon. This means when <alt+Tab>ing through your windows, you can skip over the dozen firefox windows, then dive into just your terminal windows. Generally, this works great, and I think most users don't have any issues.

However, some people (myself included) use a lot of terminals. Some are temporary short-lived generic terminals. Others are long-lived running mail (mutt), or a main development session. Unfortunately, trying to switch to my email terminal can be cumbersome as I squint at thumbnails of 10+ other terminals.

Luckily, the mechanisms to control this are somewhat accessible. Gnome matches a window to a .desktop file based on some window properties, and all like matches are grouped together. It seems GTK+ exposes functionality to modify these values at the command line. I've latched on to the window class (WM_CLASS), but there are likely other rules that guide a match.

WM_CLASS seems to be set by some combination of both the --class and --name GTK+ flags. The first step is to see if I can make an appropriate change. In gnome-terminal, I ran the following:

yelp --class=foo --name=foo

Bingo! I got a standard yelp window, but it identified itself to gnome as foo in the title bar, <alt+tab>, and dock! It didn't get grouped with the yelp shortcut!

Lets do the same with gnome-terminal:

gnome-terminal --class=foo --name=foo

No luck. I got a new window, but it was grouped with my existing one. The above does work if it's the first instance, however, so gnome-terminal is certainly capable of changing its window properties.

Turns out, gnome-terminal is being clever. It's actually running a background process, gnome-terminal-server, that owns all windows. Running an additional gnome-terminal simply pokes the existing process to create a new child window.

In gnome 3.6 and earlier, this was easy enough to solve. Adding --disable-factory would disable this functionality. However, since gnome 3.8, it's now using some sort of dbus activation. I had given up on gnome-terminal.

I tried using alternative terminals. Ultimately, I was frustrated that they didn't pick up my theme correctly (GTK+-2.0 based), or were weird (terminator), or the fonts just looked better in gnome-terminal (xterm, etc.), or they needed a universe of dependencies (konsole).

Finally, I came across a lead on Stack Exchange: Run true multiple process instances of gnome-terminal. This led me to a page on Gnome's Wiki on Running a separate instance of gnome-terminal.

This workaround involves manually starting a new terminal server.

/usr/libexec/gnome-terminal-server --app-id com.example.terminal --name=foo --class=foo &

gnome-terminal --app-id com.example.terminal

Aha, so it is possible to get my desired functionality. However, while this works, it isn't ideal. This requires running two commands, rather than just one. Additionally, the server dies after 10 seconds if no clients connect, preventing me from spawning it at login.

But this manual server initialization isn't required with the standard backend. So how does that work?

Turns out, it's a dbus service definition. You can review the current one at /usr/share/dbus-1/services/org.gnome.Terminal.service, then make your own.

I've decided to call my session 'PIM', as I'm using it for my mail/calendar terminals.

cat /usr/share/dbus-1/services/org.gnome.Terminal-PIM.service
[D-BUS Service]
Name=org.gnome.Terminal-PIM
Exec=/usr/libexec/gnome-terminal-server --class=org.gnome.Terminal-PIM --app-id org.gnome.Terminal-PIM

Now, I've also created the associated (and like-named) .desktop file, using /usr/share/applications/org.gnome.Terminal.desktop as a template:

$ cat ${HOME}/.local/share/applications/org.gnome.Terminal-PIM.desktop 
[Desktop Entry]
Name=Mail & PIM
Comment=Mutt and Calendar
Keywords=mail;mutt;calendar
Exec=gnome-terminal --app-id org.gnome.Terminal-PIM -e "screen -DR PIM -c .screenrc-ejpim"
Icon=mail_send
Type=Application
StartupNotify=true
X-GNOME-SingleWindow=false

Note: I'm actually launching it straight into a pre-defined screen session, but you could change the -e parameter to whatever you wish.

Both dbus and gnome-shell need to be restarted to pick up their changes. You can tell gnome-shell to restart itself, but the surest method I'm aware of for dbus is to log out and in again.

Now I can run the "Mail & PIM" shortcut, and get a gnome-terminal window that is grouped separately (with a mail icon!).

At this point, it would be worth investigating gnome-terminal profiles, if desired (different colour schemes, etc.).

Unfortunately, while I put my .desktop file in ~/.local/share/applications, I couldn't find a user-specific dbus servicedir. I had to install that in the system itself with sudo. I'd prefer to have it local to my user, so I can move it easily to other systems with my existing vcsh configuration.
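
For reference, installing it system-wide was just a copy (adjust the source path to wherever you saved the file):

$ sudo cp org.gnome.Terminal-PIM.service /usr/share/dbus-1/services/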

Video assembly with ffmpeg

I recently took my car to a racetrack, covered with cameras. I wanted to post these on youtube, but encountered a few issues:

  1. Dashcam saves video in 5-minute chunks

  2. Front and Rear videos need to be combined

  3. I don't know anything about video editing

  4. I didn't have a working video editor

  5. Fedora doesn't seem to ship ffmpeg, and rpmfusion doesn't support Fedora 22 yet

The last point was somewhat resolved by a binary build of ffmpeg.

Dashcam saves video in 5-minute chunks

There is no gap or overlap with the mini0805, which makes clips easy to combine with ffmpeg's concat filter.

I used an input file to list all the components:

$ cat Front-Run2.list

file 'Front/DCIM/100MEDIA/AMBA1299.MOV'
file 'Front/DCIM/100MEDIA/AMBA1300.MOV'
file 'Front/DCIM/100MEDIA/AMBA1301.MOV'
file 'Front/DCIM/100MEDIA/AMBA1302.MOV'
file 'Front/DCIM/100MEDIA/AMBA1303.MOV' 

Then provide that to ffmpeg:

$ ffmpeg -f concat -i Front-Run2.list -c copy Front-Run2.mov

Using the 'copy' codec avoids re-encoding the video, which makes this a quick operation.

This is likely a good time to trim the start and end of the video. I did this on the compiled version:

$ ffmpeg -i Front-Run2.mov -ss 0:45 -t 12:30 -c copy Front-Run2.trim.mov

After reviewing that output, I replaced the previous version:

$ mv Front-Run2.trim.mov Front-Run2.mov

Again, the copy codec makes this a quick operation. -ss is the start offset (so the new video will start 45 seconds in). -t, however, is the duration, so the trimmed clip will end at the 13:15 mark of the original video. There is also a -to option to set a stop time directly, but I missed that in the man page until writing this article.
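
Using -to, the equivalent trim would presumably look something like this (untested, since I only noticed the option later):

$ ffmpeg -i Front-Run2.mov -ss 0:45 -to 13:15 -c copy Front-Run2.trim.mov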

This step was needed for all the videos I wanted to use.

Front and Rear videos need to be combined

This step unfortunately involved a small amount of trial-and-error.

First, I wanted to overlay the rear camera footage in the bottom corner of the front camera. This required some additional information:

  • I'm combining Front-Run2.mov and Rear-Run2.mov
  • Videos are 1920x1080. I want the rear video at 25%. That's 480x270
  • I want the overlay to be inset from the edge. Let's say by 70 pixels.
  • There may be a better overlay method, but this works.

Now, in a perfect world, you can just compile the "finished" version. However, both of my dash cams start recording with a variance of 1-3 seconds, so the videos don't line up exactly (even though I used the same trimming). My video has my car start moving at 1:25, so I made a 30 second video starting at 1:15.

Note that we can't use the 'copy' codec, as we're actually modifying the video at this point. This makes the 30 second clips a massive time saver.

$ ffmpeg -i Front-Run2.mov -vf "movie=Rear-Run2.mov, scale=480:-1 [inner]; [in][inner] overlay=70:740 [out]" -ss 1:15 -t 30 Sample-Run2.mov

I then watched the video to time the difference between when each car started moving. I found the rear video was about 1 second behind. I trimmed it using -ss 0:01, then recreated my 30 second sample. Once I was satisfied, I generated the entire video.
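
That extra trim was just another copy-codec pass, something like this (output name is arbitrary):

$ ffmpeg -i Rear-Run2.mov -ss 0:01 -c copy Rear-Run2.sync.mov
$ mv Rear-Run2.sync.mov Rear-Run2.mov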

$ ffmpeg -i Front-Run2.mov -vf "movie=Rear-Run2.mov, scale=480:-1 [inner]; [in][inner] overlay=70:740 [out]" Combined-Run2.mov

Slight sound censoring

I was discussing a few things I'd rather not have in the video. This means I needed to mute three sections of the audio.

I watched the video, taking note of the ranges I wanted to mute:

  • 7:14-7:19
  • 8:47-9:22
  • 15:02-15:34

Unfortunately, the volume filter I was using seemed to only take seconds... So:

  • 434-439
  • 526-562
  • 902-934
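
The conversion is just minutes*60 + seconds, which is quick to sanity-check in the shell:

$ echo $(( 7*60 + 14 ))
434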

Now that I'll be building my "final" video, I converted it to a "faststart" mp4 based on youtube's video recommendations. Also, we're not modifying the video, so we can use -vcodec copy. The audio will be re-encoded, due to the filters.

$ ffmpeg -i Combined-Run2.mov -af "volume=enable='between(t,434,439)':volume=0, volume=enable='between(t,526,562)':volume=0, volume=enable='between(t,902,934)':volume=0" -vcodec copy -movflags faststart Final-Run2.mp4

Final result

You can see the resulting video on youtube.

Outstanding issues

So all videos were 1920x1080, but youtube only offers "720p". I don't know why.

Living with BTRFS

This is the contents of a presentation I gave to KWLug in April 2015, roughly converted to blog format. The slides are available. I'll update this post with a link to the kwlug.org podcast when it goes online.

I didn't get through all of the points; I barely touched on snapshots (and didn't cover any utilities). I'll post a follow-up with my filesystem corruption demonstration instructions.

Note that I am not a filesystem developer, just a user who isn't afraid to experiment and share the lessons learned. If there are any comments/corrections, email me at chris -at- chrisirwin.ca

With what?

  • "Butter F S"
  • "Better F S"
  • "B-tree F S"
  • "Bee Tee Arr Eff Ess"

Why should I consider a new filesystem?

ext3/ext4/ntfs/etc work fine for me!

My trip to Ottawa!

Parliament

BTRFS Benefits

  • Data & metadata checksums
  • Subvolumes
  • Copy on Write
  • Snapshots
  • Defragmentation support
  • Deduplication support
  • Multi-device support ("RAID")
  • SSD optimizations
  • send/receive support

Data & metadata checksums

All data is checksummed at write, and verified at read.

The files you save are guaranteed* to be the files you get!

So you don't need ECC like ZFS?

Strictly speaking, you don't need ECC memory for either ZFS or BTRFS. ECC memory provides error detection and correction in-memory. It would be ideal to have and use, but not all machines support it.

Not having ECC doesn't mean you shouldn't use a filesystem that provides checksums. It will still protect you from disk errors.

Car analogy: If you don't have airbags, you still wear your seat belt.

Subvolumes

One btrfs filesystem (disk/partition/etc) can contain multiple subvolumes (root, home, etc).

  • Logical separation for data (like partitions or logical volumes)
  • Can be mounted directly (like partitions or logical volumes)
  • No division of free space (unlike partitions or logical volumes)
  • Can be mounted as one (unlike partitions or logical volumes)

While they have special abilities, subvolumes act like directories.
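
As a quick sketch (the device and mount point are examples), creating and mounting a subvolume looks like this:

# with the top-level btrfs filesystem mounted at /mnt/btrfs
$ sudo btrfs subvolume create /mnt/btrfs/home
# mount just that subvolume somewhere else
$ sudo mount -o subvol=home /dev/sdX1 /home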

Copy on Write

Changes don't physically overwrite what they're logically overwriting.

AAAA -> ABBA

AAAA    AAAABB
[1-4]   [1,5-6,4]

Alternative copy on write method

ex: LVM

AAAA -> ABBA

AAAA    AAAAAA   ABBAAA
[1-4]            [1-4]

Due to having to sync the disk to save AA before writing BB, this affects write performance.

Note: BTRFS doesn't do this, but other things do (LVM…)

Copy on Write tricks

File copies can use no additional storage*

$ cp --reflink=always original.jpg copy.jpg

Unlike a hardlink, modifications to copy.jpg don't touch original.jpg

Snapshots

Snapshots can be read-only or read-write.

  • Read-only are ideal for facilitating backups

  • Read-write can be used for testing, etc.

Snapshots don't use any additional disk space, and don't snapshot free space, unlike LVM.

$ btrfs subvolume snapshot [-r] ./home ./home-20150413

What about LVM?

LVM does provide snapshots:

  • Negatively affect performance and throughput -- one write becomes two writes and at least one sync
  • Require a pre-allocated amount of space
  • become corrupt when that space is used
  • even if that space was "free" on the original volume

Fragmentation!

Yes, BTRFS does cause fragmentation. For this reason, BTRFS developers suggest not putting databases (mysql) or virtual machine images on BTRFS filesystems, because they will become fragmented over time.

The bright side is with SSDs, fragmentation is less of an issue. SSDs themselves are already fragmented by design, so file contiguity is not important. However, a very heavily fragmented file can cause some CPU spikes when reading on an SSD.

For files that are not modified (images, music…), or overwritten completely (os updates, atomic file switches), there is no impact.

Defragment support

There is on-line defragmentation support, which can defragment at the filesystem, subvolume, directory, or file level.

$ btrfs filesystem defragment [-r] [-t 5M] /home

Maybe don't defragment an SSD? It could cause unnecessary erase cycles on your SSD, but it doesn't relocate files unnecessarily. It is probably fine with current SSDs.

Deduplication support

BTRFS supports on-line, out-of-band deduplication. This is accessed using third-party tools that look for duplicate blocks, and then tell btrfs to merge them.

I have not used these tools.

Live deduplication support is being worked on, but will require large amounts of RAM. Large amounts of storage is typically cheaper than large amounts of RAM, but your needs may vary.

I will not use that, either.

Defragment and deduplication

Dedup two files, then defragment them. What happens?

Disable COW on a file

You can disable COW on a directory or file. This will avoid fragmentation, but at a cost:

  • Data checksumming is disabled.

  • Snapshots won't work (because they rely on COW)

You very probably do not want to do this.

$ chattr +C /dir/file

Multi-device support ("RAID")

BTRFS implements its own multi-device support.

"RAID1" just means to make sure at least two copies exist for all data. It can work with mismatched drives (1TB, 500GB, 500GB)

"Single" means to keep only one copy of data, ensuring you can use all available space (but with no redundancy/recovery)

Note: RAID5+6 are still in heavy development. Until recently, there was no drive replace support. I've not used, experimented, or investigated these modes.
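
At creation time these are just profile flags to mkfs.btrfs (the devices below are examples):

# two copies of data and metadata ("RAID1")
$ sudo mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
# one copy of data, mirrored metadata, across mismatched drives
$ sudo mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc /dev/sdd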

Versus traditional RAID

From Data Corruption on Wikipedia:

As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum – a database software company specializing in large-scale data warehousing and analytics – faces silent corruption every 15 minutes.[9] As another example, a real-life study performed by NetApp on more than 1.5 million HDDs over 41 months found more than 400,000 silent data corruptions, out of which more than 30,000 were not detected by the hardware RAID controller. Another study, performed by CERN over six months and involving about 97 petabytes of data, found that about 128 megabytes of data became permanently corrupted.

BTRFS will detect a bad read and recover from a second copy, or cause the read to fail, hopefully making you restore your backups (rather than silently accepting data corruption)

Simulating Data Corruption

From raid.wiki.kernel.org

RAID (be it hardware or software), assumes that if a write to a disk doesn't return an error, then the write was successful. Therefore, if your disk corrupts data without returning an error, your data will become corrupted. This is of course very unlikely to happen, but it is possible, and it would result in a corrupt filesystem.

Simulating Data Corruption

RAID cannot, and is not supposed to, guard against data corruption on the media. Therefore, it doesn't make any sense either, to purposely corrupt data (using dd for example) on a disk to see how the RAID system will handle that. It is most likely (unless you corrupt the RAID superblock) that the RAID layer will never find out about the corruption, but your filesystem on the RAID device will be corrupted.

This is the way things are supposed to work. RAID is not a guarantee for data integrity, it just allows you to keep your data if a disk dies (that is, with RAID levels above or equal one, of course).

BTRFS Integrity

BTRFS supports "scrubbing", checking the integrity of all data. This can be done on a path, or a specific device in a multi-device BTRFS.

$ btrfs scrub start [-B] [-c #] [-n #] <path>

It can report any errors found. If another copy is available ("RAID1"), it will automatically fix the bad copy.

DEMO!

Demonstration of corruption on plain ext4, ext4 on mdadm raid 1, ext4 on mdadm raid 5, btrfs (single device), and btrfs "raid1"

SSD optimizations

BTRFS has some useful mount options

autodefrag : Detect random writes and defragments the affected files

degraded : Allow you to mount a "RAID" filesystem missing a device.

ssd : "Avoiding unnecessary optimizations. This results in larger write operations and faster write throughput.

discard : Enables TRIM support. This is disabled by default as it affects generational rollbacks (not covered in this talk), and many drives reserve space for garbage collection on their own. There is also some concern about "queued TRIM support", which didn't exist before SATA 3.1. (Also, queued TRIM support seems buggy, since Windows doesn't use it)
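
These are regular mount options, so they can go on the command line or in fstab (a sketch; device and subvolume are examples):

# mount a subvolume with SSD-friendly options
$ sudo mount -o subvol=home,ssd,autodefrag /dev/sdX1 /home
# or add discard as well, if you've decided it's right for your drive
$ sudo mount -o subvol=home,ssd,discard /dev/sdX1 /home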

Send/Receive support

BTRFS can perform its own backups. Much like the dump utility for ext2/3/4, this has the advantage of working at the filesystem level. It will preserve all data btrfs knows, including the checksum data, and shared blocks.

$ btrfs send ./home-20150413 \
  | btrfs receive /mnt/backups/

My BTRFS-formatted USB backup disk now contains an exact copy of my snapshot -- right down to which blocks are shared between files, and the data checksums of those blocks.

Send incremental

Send/receives can also share space (and save time) as well by providing a 'parent' reference that exists on both drives:

$ btrfs send -p ./home-20150412 ./home-20150413 \
  | btrfs receive /mnt/backups/

Because all data is checksummed, you can be guaranteed to have a valid "full" home-20150413 on your backup drive.

Send/Receive variations

You can send snapshots to another host:

$ btrfs send ./home-20150413 \
  | ssh my-server.com btrfs receive /mnt/backup/laptop/

Or if you're backing up to ext4, S3, or another non-btrfs volume:

$ btrfs send ./home-20150413 \
  | gzip > /mnt/backup-ext4/home-20150413.gz

And restore at a later date:

$ zcat /mnt/backup-ext4/home-20150413.gz \
  | btrfs receive /path/to/restore

Snapshot utilities

snapper

  • Lots of users. Weird (to me) snapshot layout

  • Uses some logic to keep 24 hourly snapshots, then expire those and keep daily snapshots, etc.

  • No support in itself for getting those snapshots to another disk/host

  • Some issues with scheduled snapper-clean + discard

BTRFS Wiki on Backups

There are other tools, or you could roll your own cron-job.
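
A roll-your-own version can be as small as a single /etc/cron.d-style line (a sketch; it does no pruning of old snapshots):

# nightly read-only snapshot of /home, named by date (% must be escaped in crontab)
0 2 * * * root btrfs subvolume snapshot -r /home /home-$(date +\%Y\%m\%d)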

So what about ZFS?

I thought ZFS did all these things?

Yes, it does

Why not just use ZFS?

Go ahead

Why don't you use ZFS?

  • I don't want to use Solaris/FreeBSD
  • I don't want to use a third-party filesystem on Linux
  • I don't want to depend on COPRs/PPAs and dkms/akmods for filesystem drivers
  • BTRFS is built-in

Why doesn't everybody use BTRFS?

  • It is still a young filesystem. Some features are incomplete (deduplication), dangerously incomplete (RAID5/6), or missing (live deduplication).

  • It doesn't support all use-cases (swap files won't work at all, fragmentation with databases/virtual machines/torrents).

  • Some people have experienced problems that resulted in data loss (though not too recently). With extensive backups, this shouldn't be an issue. However, filesystems are expected to be fool-proof.

  • There are some gotchas and quirks that are confusing.

BTRFS "Gotchas"

From Gotchas on BTRFS Wiki

  • deadlock at mount time in 3.19.1 - 3.19.3

  • deadlock on heavy (rsync) workloads in 3.15 - 3.16.1

  • defragment/snapshot problems before 3.9 - 3.14

  • defragmenting with kernels up to 2.6.37 will unlink COW-ed copies of data, don’t use it if you use snapshots, have de-duplicated your data or made copies with cp --reflink.

BTRFS Quirks

df is correct/incorrect/misleading

Assuming a 1TB+500GB "RAID1", how much free space is there?

BTRFS Quirks

Assuming a 1TB+500GB "RAID1", how much free space is there?

  • df reports 1.5TB.

  • There is 1.5TB of free space

  • You can only write 500GB of data to this drive. You will then get an out of space error, while df reports 500GB free space. Not intuitive or obvious.

  • You also can't modify or overwrite files at that point, because it needs to write the new data before the old data is de-referenced and considered "free"

BTRFS Quirks

Why not have df report usable space instead of free space?

That's too hard

BTRFS Quirks

What about 1x2TB + 2x500GB as "RAID1"?

  • You could have 1.0TB of usable space: each 500GB as the "copy" for 1TB, with 1TB "free" but unusable
  • You could have 500GB of usable space: one 500GB master, one 500GB copy, and 2TB "free" but unusable

It depends on how BTRFS decides to allocate its data (prefer fewer drives vs. spread across as many drives as possible). It may depend on whether the filesystem was grown (i.e. a 2TB drive added to an existing 2x500GB array). There may be other factors.

BTRFS Quirks

In theory, developers could figure out the above to list free space… But the BTRFS roadmap has "Object-level mirroring and striping", so free space could depend on where your file was written.

BTRFS Quirks

There is no good solution to this I can think of, other than KIDS™: Keep Your Disks Simple

Should I stay away then?

  • I haven't had a single BTRFS* issue that resulted in data loss*
  • Backups solve all
  • ext4-to-btrfs migrations are easy/hard

What am I using

  • lvm
  • btrfs for /, /home
  • ext4/xfs for /boot, games (steam), torrents, Virtual Machines, etc.


Adding a dependency into upstream-supplied systemd units

I recently restarted all my VMs at once and found that several managed to start services before they finished mounting their NFS shares. Those shares back apache virtualhosts and mysql databases, so several services were ultimately left in non-functional states. Systemd has the ability to depend on services (Requires=, and others) as well as on filesystems (RequiresMountsFor=). However, I don't want to modify or replace the .service files installed from the package: I'd have to manually reconcile changes during updates, and it just generally isn't nice.

Turns out systemd took this into account. From man systemd.unit:

Along with a unit file foo.service, a directory foo.service.d/ may exist. All files with the suffix ".conf" from this directory will be parsed after the file itself is parsed. This is useful to alter or add configuration settings to a unit, without having to modify their unit files. Make sure that the file that is included has the appropriate section headers before any directive. Note that for instanced units this logic will first look for the instance ".d/" subdirectory and read its ".conf" files, followed by the template ".d/" subdirectory and reads its ".conf" files.

Great, so to add directives to the httpd.service, make a .d directory to contain your additions:

$ sudo mkdir -p /etc/systemd/system/httpd.service.d

Then populate it. Since I depend on filesystems, I added a RequiresMountsFor:

$ cat /etc/systemd/system/httpd.service.d/fix-nfs.conf
[Unit]
RequiresMountsFor=/var/www/sites /opt/ssl

You'll have to issue a daemon-reload to systemd for the additions to be detected.
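
That is:

$ sudo systemctl daemon-reload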

You can test by unmounting the directories you indicated, then running systemctl start httpd.service. Systemd will note that it requires those directories, and ensure they're mounted before starting httpd. Nice.
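
A sketch of that test (paths match the drop-in above):

$ sudo systemctl stop httpd.service
$ sudo umount /var/www/sites /opt/ssl
$ sudo systemctl start httpd.service
$ findmnt /var/www/sites    # should show the NFS share mounted again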

Recovering a deleted file (on btrfs)

So you might find yourself saying this one day:

HELP! I just deleted a file on my btrfs! It wasn't old enough to be in a backup, snapshot, or git. It was old enough that I don't want to retype it. How do I get it back? Also, my btrfs is on an SSD, and it might kick off a garbage collection routine at any time!

Well, as luck would have it, I just did that to a blog post I've spent all evening writing.

Don't panic, but don't wait. Since this is an SSD, the former location of the file was probably marked as garbage for the SSD to clean up. That said, the SSD doesn't immediately run its garbage collection routine whenever you delete a random 8K file. That would be just as bad for wear as not having TRIM/discard at all.

Typical recovery steps

So, lets go through all the typical steps, and why I didn't do them.

Umount your drive immediately

Can't. My root and home are subvolumes on that drive.

Shut down the machine, boot from live media

If I was on a hard disk, I would. However, I don't know how the firmware on my SSD works. Will it decide a powercycle is a great time to do a garbage collection sweep?

Do some fancy btrfs generation rollback magic

I don't know how. I found Jörg Walter's btrfs-undelete. Unfortunately, it mentions that if the drive is mounted, it might not work.

Ask on IRC or mail list or reddit

Remember, this file was in my $HOME and is currently in "unallocated" space that could be overwritten or garbage-collected at any time. "Sooner" is really better than "later".

How I got my file back

So I remembered I had referenced the path "/var/www/ssl" in the file. This is important, because:

  • It's short, and I know it's in the file. Trying to remember a longer string potentially means not finding your file because you misremember the grammar or punctuation.
  • This path does not exist on this computer, nor in any other files on this computer, so I should only find copies of this file.

While I'm not umounting the device, it would be stupid to deliberately write data to it. I already had a second drive connected, so I used it for recovery:

$ cd /my/other/drive

Now, I'm going to grep the raw block device for my short, unique string. I'm going to tell grep to output 1000 lines of context (that's 1000 lines before and after each match), and write that to a file. This needs to be done as root, obviously.

$ grep --text -C 1000 "/var/www/ssl" /dev/mapper/vg_w520-btrfs_ssd > pleasework.txt

Once it finishes:

$ ls -lh pleasework.txt 
-rw-r--r--. 1 root root 92M 2015-01-16 03:02 pleasework.txt

Oh boy. 92MB of matches for an 8K file. Lots of matches, since I :w after every sentence. vim isn't going to like such a big file...

That said, I couldn't think of a tool I'd rather use than vim for searching through the file. When all you have is a hammer, every problem looks like a nail, after all. After opening it in vim, I searched for instances of "/var/www/ssl", and found that there were many, many instances. However, many had been partially overwritten, as in the screenshot below:

Screenshot of Corrupted Data

But I know my copy is in there, so now I'll search for a more recent string. One that I typed in one of my most recent edits, just before my mv fiasco. Searching for this string found three copies:

pretty similar to my article covering [[SELINUX and apache (httpd)

They were all very close, but in my case, only one wasn't corrupted, which is the copy I extracted. Otherwise, I'd have saved all copies and compared them.

Lessons Learned

  • Backups, snapshots, and git won't help you if files are too young to be in any of them.
  • You can potentially recover a deleted file, quickly, using just tools you have on-hand (grep & vim).
  • Be more careful when doing file operations. When doing a mv *.mdwn, be sure you type the destination. And if you think you typed ls, double check. Or make sure the file you care about is alphabetically first, at least ;)

Caveats

Lots of things could have prevented this from working. I think I was only able to recover due to:

  • filesystem behaviour
  • vim's atomic file saves (so writing out a whole fresh copy of the file every time)
  • SSD being mostly empty (potentially avoiding aggressive garbage collection cycles)

Or put more simply: I got lucky.

Your mileage may vary.

No warranties.

Best of luck.

This blog is powered by ikiwiki.