Making du report useful output for comparing across systems

You have to be a little careful when using du to compare disk usage across systems. For example, a few weeks ago I backed up a bunch of files from our MacBook Pro to our NAS and then ran du on both to verify that the directories held the same amount of data. There is a pitfall here, though…

Looking at the Mac, I see:

marc@hyperion:~
08:39:21 $ du -sh Pictures
 50G    Pictures

and then the NAS:

nas1:/c/media# du -sh Pictures
51G     Pictures

Hmmmm. The NAS has more than the Mac? Why? Well, maybe because the NAS already had stuff in that directory from a prior backup? A fine theory, but not likely to be correct in this case, because I used rsync with the --delete option, so any files that were on the destination but not on the source should’ve been removed. Perhaps a bug in rsync? Or is it something else…? (Hint: the answer is “something else.”)

Drilling down, we eventually find a low-level discrepancy:

marc@hyperion:~
08:48:47 $ du -k Pictures/iPhoto\ Library/AlbumData.xml
16372   Pictures/iPhoto Library/AlbumData.xml

nas1:/c/media# du -k Pictures/iPhoto\ Library/AlbumData.xml
16400   Pictures/iPhoto Library/AlbumData.xml

(Note that in this case the destination showed more disk usage than the source, but as you’ll understand after reading this post, the situation could be reversed if the filesystems involved were configured differently.)

Ah, there we go. Somehow, this file is bigger on the NAS than it is on the Mac… Except, it isn’t…

marc@hyperion:~
08:49:04 $ ls -l Pictures/iPhoto\ Library/AlbumData.xml
-rw-r--rw-  1 marc  marc  16761978 Feb 12 00:02 Pictures/iPhoto Library/AlbumData.xml

nas1:/c/media# ls -l Pictures/iPhoto\ Library/AlbumData.xml
-rw-r--rw-  1 marc 501 16761978 2011-02-12 00:02 Pictures/iPhoto Library/AlbumData.xml

Same exact size in bytes, according to ls, and yet du indicates that the disk usage is different. What is going on here?

Well, the key thing to be aware of here is that du measures disk usage, whereas ls measures the size of the files. Same thing, right? Not quite.

Disk usage includes the overhead inherent in filesystems with fixed-size blocks: a file occupies a whole number of blocks, so its last block is usually only partly filled.

Said another way, ls measures the logical size of files, whereas du measures the physical disk usage of files.
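One way to see both measurements side by side is ls with the -s option, which prepends the allocated size to the usual long listing. Here is a minimal sketch, assuming a POSIX-style ls where -k reports that column in 1 KB units. The size_vs_alloc helper and the temporary file are made up for illustration:

```shell
# Print "<logical bytes> <allocated KB>" for one file.
# In `ls -lskn` output, column 1 is the allocated size (in KB because of -k)
# and column 6 is the logical size in bytes.
size_vs_alloc() {
    ls -lskn "$1" | awk '{ print $6, $1 }'
}

# Hypothetical demo file holding 10 bytes of content.
demo_file=$(mktemp)
printf '0123456789' > "$demo_file"
size_vs_alloc "$demo_file"
rm -f "$demo_file"
```

The first number should be 10. The second depends on how the filesystem holding the temporary file allocates space.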

To illustrate this even more clearly, we can create a file with exactly 1024 bytes and see what du reports for it on various filesystems.

Here’s our 1024-byte (1 KB) file:

marc@hyperion:/tmp
08:28:07 $ dd if=/dev/zero of=/tmp/1K bs=1 count=1024
1024+0 records in
1024+0 records out
1024 bytes transferred in 0.004783 secs (214095 bytes/sec)

marc@hyperion:/tmp
08:28:49 $ ls -l /tmp/1K
-rw-r--r--  1 marc  wheel  1024 Feb 13 08:28 /tmp/1K

OK, a 1 KB file. What does du have to say about it?

marc@hyperion:/tmp
08:29:08 $ du -k /tmp/1K
4       /tmp/1K

A 1 KB file is actually taking up 4 KB. That’s because the HFS+ filesystem that this file lives on uses 4 KB blocks.
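The rounding behind that number can be sketched as plain arithmetic. This assumes the simplest possible model, in which a file occupies a whole number of data blocks and nothing else; real filesystems may also charge metadata or use other allocation rules. The allocated_kb helper is hypothetical:

```shell
# Round a logical size (in bytes) up to whole blocks and print the result in KB.
# Usage: allocated_kb SIZE_BYTES BLOCK_BYTES
allocated_kb() {
    size_bytes=$1
    block_bytes=$2
    blocks=$(( (size_bytes + block_bytes - 1) / block_bytes ))   # ceiling division
    echo $(( blocks * block_bytes / 1024 ))
}

allocated_kb 1024 4096        # the 1 KB test file on 4 KB blocks
allocated_kb 16761978 4096    # AlbumData.xml on 4 KB blocks
```

The first call prints 4, matching the du output for the 1 KB file. The second prints 16372, which is exactly what du reported for AlbumData.xml on the Mac. The NAS figure of 16400 depends on that filesystem’s block size and bookkeeping, which aren’t shown here.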

If you have the du from GNU coreutils installed (sometimes named gdu on a BSD-based system such as OS X to distinguish it from a default BSD-derived system du command), then you have a nifty command-line option that reports the logical size rather than the physical size:

marc@hyperion:~
07:41:31 $ gdu -k /tmp/1K
4       /tmp/1K

marc@hyperion:~
07:41:33 $ gdu -k --apparent-size /tmp/1K
1       /tmp/1K

Now this is on OS X, which is BSD-based, and I don’t know how to make the system’s BSD du do this, but it’s easy to install GNU coreutils on most systems (e.g. brew install coreutils, port install coreutils, or apt-get install coreutils). Linux systems almost certainly have it out of the box, and there the command is named du rather than gdu, because Linux (or, as some would prefer, GNU/Linux) systems usually use the GNU tools by default.
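If installing GNU coreutils isn’t an option, a rough equivalent of the apparent-size total can be put together from find, ls, and awk, which behave much the same on BSD and GNU systems. This is a minimal sketch: the logical_bytes helper is hypothetical, it counts only regular files, and it assumes file names without embedded newlines.

```shell
# Sum the logical (ls-style) sizes, in bytes, of all regular files under a directory.
# With `ls -ln`, column 5 is the size in bytes; -n keeps owner and group numeric
# so that names containing spaces cannot shift the columns.
logical_bytes() {
    find "$1" -type f -exec ls -ln {} + |
        awk '{ total += $5 } END { printf "%.0f\n", total }'
}

# Hypothetical demo: two files of 1024 and 3072 bytes.
demo_dir=$(mktemp -d)
dd if=/dev/zero of="$demo_dir/1K" bs=1 count=1024 2>/dev/null
dd if=/dev/zero of="$demo_dir/3K" bs=1 count=3072 2>/dev/null
logical_bytes "$demo_dir"    # prints 4096
rm -rf "$demo_dir"
```

Running logical_bytes Pictures on both machines should print the same byte count when the same files with the same sizes are present on each side, regardless of the block size of either filesystem. It does not compare file contents.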
