Recovering my lost data: LVM and RAID

So I upgraded Ubuntu 9.10 to Ubuntu 11.10. When the system boots it says it cannot mount /store, my 960GB RAID+LVM file-system. That's the one that holds over 10 years of personal photographs and such. :-(

About the file-system: “store”

There are many layers of indirection between the file-system and the physical storage when using LVM or RAID. When using both, the number of layers can seem excessive. Here's a rough diagram of the layers involved in my (lost) setup, from the top down:
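
ext4 file-system “store”
 └─ LVM logical volume: store_vg/store_lv
     └─ LVM volume group: store_vg
         └─ LVM physical volume
             └─ RAID5 multi-disk device: /dev/md0
                 └─ whole disks: sdb, sdc, sdd, sde (+ sdf as hot spare)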

Note that the RAID block device, md0, is not partitioned. I believe that was a mistake on my part, and a likely reason why Ubuntu 11.10 cannot auto-detect it.

Problem statement

Since upgrading, system boot is interrupted with an error screen to the effect of “Cannot mount /store” and a prompt to enter a root shell or skip mounting. From what I can see, the RAID array is detected without problems, and is functioning correctly. So the system looks like this:

The RAID (multi-disk) status looks fine to me (note the array has come up as md127 rather than md0):

root@ikari:~# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid5 sdb[1] sde[3] sdc[0] sdd[2] sdf[4](S)
      937713408 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

but the resulting 960.2 GB block device contains a partition of type “Linux RAID autodetect” - which would suggest that it is a *member* of some other multi-disk setup. This, I believe, is human error on my part when I created the thing…

root@ikari:~# fdisk -l /dev/md127

Disk /dev/md127: 960.2 GB, 960218529792 bytes
255 heads, 63 sectors/track, 116739 cylinders, total 1875426816 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 196608 bytes
Disk identifier: 0xd71c877b

      Device Boot      Start         End      Blocks   Id  System
/dev/md127p1              63   625137344   312568641   fd  Linux RAID autodetect
Partition 1 does not start on physical sector boundary.

Recovery strategy

  1. Create a disk-image of md127, partition table and all (which required buying an external 2TB USB drive)
  2. Use LVM snapshots with this disk-image to make (and quickly roll back) experimental changes

Formatting the external USB drive

So I created a single “Linux LVM” partition on the 2TB disk, a single 1.8TB physical volume on it, and a single 1.8TB volume group containing that. On this I created a 1TB logical volume called lv_scratch and copied the contents of md127 to it (e.g. dd if=/dev/md127 of=/dev/vg_scratch/lv_scratch). Once the copy was made, I created a snapshot of lv_scratch, which I imaginatively called snap.
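
Roughly the commands involved (reconstructed after the fact; names and sizes as above):

pvcreate /dev/sdj1                          # 1.8 TB physical volume
vgcreate vg_scratch /dev/sdj1               # 1.8 TB volume group
lvcreate -L 1T -n lv_scratch vg_scratch     # 1 TB logical volume
dd if=/dev/md127 of=/dev/vg_scratch/lv_scratch bs=1M    # image the RAID device
lvcreate -s -L 500G -n snap /dev/vg_scratch/lv_scratch  # read-write snapshot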

LVM snapshots are interesting creatures. As the name suggests, the snapshot (named snap) holds the state of lv_scratch as it was when I created it. I can still read and write to lv_scratch, but the contents of snap will not change. This is ideal for making consistent backups. The snapshot works by preserving overwritten data in a copy-on-write (COW) volume: just before a write to lv_scratch lands, the original blocks are copied into the COW volume, and then the write proceeds as normal. Reads of snap consult the COW volume - when there is a hit the preserved original is returned, otherwise the (unchanged) block is read straight from lv_scratch. When the snapshot is deleted, the preserved blocks are simply discarded, and the changes to lv_scratch remain permanent. Makes sense if you are used to copy-on-write behaviour.

Now here is where things get interesting. The snapshot, snap, does not have to be read-only: you can create it read-write. Doing so gives you a very cheap copy of lv_scratch, and any changes you make to the snapshot are stored in a COW table. You can discard the changes by deleting the snapshot. Ideal for my situation: I want to experiment with the partition table and various file-system recovery tools etc. I let these manipulate the snapshot, and if things go bad I delete and recreate the snapshot and try again.
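
The roll-back cycle is then just (a sketch; sizes as above):

lvremove -f vg_scratch/snap                              # discard all experimental changes
lvcreate -s -L 500G -n snap /dev/vg_scratch/lv_scratch   # fresh, pristine copy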

root@ikari:~# fdisk -l /dev/sdj

Disk /dev/sdj: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000f0222

   Device Boot      Start         End      Blocks   Id  System
/dev/sdj1            2048  3907028991  1953513472   8e  Linux LVM
root@ikari:~# pvdisplay 
  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  --- Physical volume ---
  PV Name               /dev/sdj1
  VG Name               vg_scratch
  PV Size               1.82 TiB / not usable 4.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              476931
  Free PE               86787
  Allocated PE          390144
  PV UUID               nrf9cQ-Asfz-Y2x2-SDoT-3ppu-mpEC-Fnuf8Z
   
root@ikari:~# vgdisplay
  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  --- Volume group ---
  VG Name               vg_scratch
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  13
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.82 TiB
  PE Size               4.00 MiB
  Total PE              476931
  Alloc PE / Size       390144 / 1.49 TiB
  Free  PE / Size       86787 / 339.01 GiB
  VG UUID               Lk7UZP-48xF-vBPi-6g8F-sXlF-qyzy-pQNKgq
   
root@ikari:~# lvdisplay
  /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vg_scratch/lv_scratch
  VG Name                vg_scratch
  LV UUID                aFBpgv-gqcd-jjLU-c7xO-Jyeb-2R0t-HpEF84
  LV Write Access        read/write
  LV snapshot status     source of
                         /dev/vg_scratch/snap [active]
  LV Status              available
  # open                 0
  LV Size                1.00 TiB
  Current LE             262144
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
  --- Logical volume ---
  LV Name                /dev/vg_scratch/snap
  VG Name                vg_scratch
  LV UUID                OvOsQ7-uACi-xJVZ-vseu-fKEc-F73h-CmSalH
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch
  LV Status              available
  # open                 0
  LV Size                1.00 TiB
  Current LE             262144
  COW-table size         500.00 GiB
  COW-table LE           128000
  Allocated to snapshot  0.00% 
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

So here's the goal I'm aiming for on my external storage:
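
Roughly (everything below snap only appears once the nested LVM has been coaxed into view):

/dev/sdj1: “Linux LVM” partition on the 2TB USB disk
 └─ volume group vg_scratch
     ├─ lv_scratch: byte-for-byte image of /dev/md127
     └─ snap: read-write snapshot of lv_scratch (all experiments happen here)
         └─ loopback device at an offset (to skip the leading junk)
             └─ nested volume group store_vg
                 └─ logical volume store_lv
                     └─ ext4 file-system “store”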

Recognising the nested LVM volumes

Since it's been a few days (and a few reboots) since I last worked on this, I'll start by plugging the USB drive in.

root@ikari:~# dmesg
[  479.180019] usb 2-5: new high speed USB device number 7 using ehci_hcd
[  479.313228] scsi13 : usb-storage 2-5:1.0
[  480.312605] scsi 13:0:0:0: Direct-Access     Seagate  Desktop          0130 PQ: 0 ANSI: 4
[  480.336633] sd 13:0:0:0: Attached scsi generic sg10 type 0
[  480.337029] sd 13:0:0:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[  480.337671] sd 13:0:0:0: [sdi] Write Protect is off
[  480.337671] sd 13:0:0:0: [sdi] Mode Sense: 2f 08 00 00
[  480.340027] sd 13:0:0:0: [sdi] No Caching mode page present
[  480.340027] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[  480.341806] sd 13:0:0:0: [sdi] No Caching mode page present
[  480.341811] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[  480.357290]  sdi: sdi1
[  480.359346] sd 13:0:0:0: [sdi] No Caching mode page present
[  480.359350] sd 13:0:0:0: [sdi] Assuming drive cache: write through
[  480.359354] sd 13:0:0:0: [sdi] Attached SCSI disk

The (outer) LVM PV is automatically detected, and its VG + LVs are subsequently detected:

root@ikari:~# pvs
  PV         VG         Fmt  Attr PSize PFree  
  /dev/sdi1  vg_scratch lvm2 a-   1.82t 339.01g

root@ikari:~# vgs
  VG         #PV #LV #SN Attr   VSize VFree  
  vg_scratch   1   2   1 wz--n- 1.82t 339.01g

root@ikari:~# lvs
  LV         VG         Attr   LSize   Origin     Snap%  Move Log Copy%  Convert
  lv_scratch vg_scratch owi-a-   1.00t                                          
  snap       vg_scratch swi-a- 500.00g lv_scratch   0.00   

Somewhere on the snap logical volume is my nested LVM. I used xxd /dev/vg_scratch/snap | less and searched for “LVM2”. The first hit was a false positive (it appeared to have stripes of NULLs written across it), but the second hit looked plausible:

8018600: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018610: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018620: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018630: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018640: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018650: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018660: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018670: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018680: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018690: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80186f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018700: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018710: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018720: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018730: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018740: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018750: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018760: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018770: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018780: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018790: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80187f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018800: 4c41 4245 4c4f 4e45 0100 0000 0000 0000  LABELONE........
8018810: 9148 4053 2000 0000 4c56 4d32 2030 3031  .H@S ...LVM2 001
8018820: 5341 7536 6e32 7578 474c 5148 6743 5351  SAu6n2uxGLQHgCSQ
8018830: 6b56 6b5a 655a 4c78 7874 314b 7652 6a31  kVkZeZLxxt1KvRj1
8018840: 00f8 0391 df00 0000 0000 0300 0000 0000  ................
8018850: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018860: 0000 0000 0000 0000 0010 0000 0000 0000  ................
8018870: 00f0 0200 0000 0000 0000 0000 0000 0000  ................
8018880: 0000 0000 0000 0000 0000 0000 0000 0000  ................
8018890: 0000 0000 0000 0000 0000 0000 0000 0000  ................
80188a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
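
Incidentally, GNU grep can report byte offsets directly, which beats paging through xxd (slow on a ~900GB device, but hands-off):

grep -obUa LABELONE /dev/vg_scratch/snap    # prints offset:LABELONE for each hit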

I know from using xxd to examine a correct and functioning LVM2 partition (the PV behind vg_scratch, as it happens) that the “LVM2” text should appear at 0x210. The label above sits at 0x8018810, so I'll create a loopback device with an offset of 0x8018810 - 0x210 = 0x8018600 to make that happen:

root@ikari:~# losetup /dev/loop0 /dev/vg_scratch/snap --offset $((0x8018600))

root@ikari:~# lvmdiskscan 
  /dev/ram0  [      64.00 MiB] 
  /dev/loop0 [     894.15 GiB] LVM physical volume
  /dev/dm-0  [     186.27 GiB] 
  /dev/ram1  [      64.00 MiB] 
  /dev/sda1  [     294.09 GiB] 
  /dev/dm-1  [     894.27 GiB] 
  /dev/ram2  [      64.00 MiB] 
  /dev/dm-2  [     894.27 GiB] 
  /dev/ram3  [      64.00 MiB] 
  /dev/dm-3  [     894.27 GiB] 
  /dev/ram4  [      64.00 MiB] 
  /dev/dm-4  [     782.47 GiB] 
  /dev/ram5  [      64.00 MiB] 
  /dev/sda5  [       4.00 GiB] 
  /dev/dm-5  [     715.38 GiB] 
  /dev/ram6  [      64.00 MiB] 
  /dev/ram7  [      64.00 MiB] 
  /dev/ram8  [      64.00 MiB] 
  /dev/ram9  [      64.00 MiB] 
  /dev/ram10 [      64.00 MiB] 
  /dev/ram11 [      64.00 MiB] 
  /dev/ram12 [      64.00 MiB] 
  /dev/ram13 [      64.00 MiB] 
  /dev/ram14 [      64.00 MiB] 
  /dev/ram15 [      64.00 MiB] 
  /dev/sdb1  [       1.82 TiB] LVM physical volume
  0 disks
  24 partitions
  0 LVM physical volume whole disks
  2 LVM physical volumes

root@ikari:~# pvs
  PV         VG         Fmt  Attr PSize   PFree  
  /dev/loop0 store_vg   lvm2 a-   894.25g 178.88g
  /dev/sdb1  vg_scratch lvm2 a-     1.82t      0 

If lvmdiskscan doesn't spot it, you could try partprobe to have the kernel re-read partition tables, or pvscan to have LVM rescan every block device for physical volumes:
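
partprobe /dev/loop0    # ask the kernel to re-read this device's partition table
pvscan                  # ask LVM to rescan all block devices for PVs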

root@ikari:~# lvs
  LV         VG         Attr   LSize   Origin     Snap%  Move Log Copy%  Convert
  store_lv   store_vg   -wi-a- 715.38g                                          
  home_zfs   vg_scratch -wi-a- 186.27g                                          
  lv_scratch vg_scratch owi-a- 894.27g                                          
  snap       vg_scratch swi-ao 782.47g lv_scratch   0.00    

Finding the ext4 file-system

OK, so now I can read and write my store LVM volume that housed my ext4 file-system. But the file-system isn't where it's supposed to be, apparently:

root@ikari:~# file -Ls /dev/store_vg/store_lv 
/dev/store_vg/store_lv: data

Time to play with a hex editor again to look at a valid ext4 file-system header (from my current Ubuntu installation, /dev/sda1).

After a bit of searching I find the characteristic 53ef marker (the ext superblock magic 0xEF53, stored little-endian) in the expected column, followed by the name of my file-system: store. Looks good.

# xxd /dev/store_vg/store_lv | less
05c9db0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9dc0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9dd0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9de0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9df0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9e00: 00c0 fc06 c017 f90d c9da b200 0c2a d00a  .............*..
05c9e10: c604 f906 0000 0000 0200 0000 0200 0000  ................
05c9e20: 0080 0000 0080 0000 0040 0000 9d17 aa4d  .........@.....M
05c9e30: 9d17 aa4d 0200 1b00 53ef 0100 0100 0000  ...M....S.......
05c9e40: 4c2c a24d 004e ed00 0000 0000 0100 0000  L,.M.N..........
05c9e50: 0000 0000 0b00 0000 8000 0000 3400 0000  ............4...
05c9e60: 0600 0000 0300 0000 9495 5e9b 7d7e 41da  ..........^.}~A.
05c9e70: a595 6af4 4848 b5c6 7374 6f72 6500 0000  ..j.HH..store...
05c9e80: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9e90: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9ea0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9eb0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9ec0: 0000 0000 0000 0000 0000 0000 0000 c803  ................
05c9ed0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
05c9ee0: 0800 0000 0000 0000 0000 0000 9a4a cd41  .............J.A
05c9ef0: 88ff 4759 ac42 d083 8b3f 3b0d 0201 0000  ..GY.B...?;.....
05c9f00: 0000 0000 0000 0000 cf4a 9f46 0906 0000  .........J.F....
05c9f10: 0a06 0000 0b06 0000 0c06 0000 0d06 0000  ................

By comparing this to my valid file-system, I can see that the 53ef line should be at offset 0x430 - the superblock starts 1024 (0x400) bytes into the volume and the magic sits 0x38 bytes into the superblock, so the 53ef bytes land at 0x438, which xxd prints on the line labelled 0000430:

# xxd /dev/sda1 | less
0000380: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000390: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00003f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000400: 0020 2601 005d 9804 73d1 3a00 8366 bb02  . &..]..s.:..f..
0000410: d8da 2001 0000 0000 0200 0000 0200 0000  .. .............
0000420: 0080 0000 0080 0000 0020 0000 b153 264f  ......... ...S&O
0000430: d9bf f44e 0d00 2200 53ef 0100 0100 0000  ...N..".S.......
0000440: 98b4 f44e 004e ed00 0000 0000 0100 0000  ...N.N..........
0000450: 0000 0000 0b00 0000 0001 0000 3c00 0000  ............<...
0000460: 4602 0000 7b00 0000 89bb eae1 b864 492d  F...{........dI-
0000470: a3d6 2bc9 5336 151e 0000 0000 0000 0000  ..+.S6..........
0000480: 0000 0000 0000 0000 2f00 1eae 7c13 0000  ......../...|...
0000490: 0000 c099 98a1 0188 ffff 985d 698f 0188  ...........]i...
00004a0: ffff 307a 87a2 0188 ffff 307a 87a2 0188  ..0z......0z....
00004b0: ffff 585c 698f 0188 ffff 645d 1681 ffff  ..X\i.....d]....
00004c0: ffff c000 588f 0188 0000 0000 0000 ed03  ....X...........
00004d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00004e0: 0800 0000 0000 0000 b60a 4000 5547 eae9  ..........@.UG..
00004f0: 2f3c 45ee 9913 70fb 20b7 7395 0101 0000  /<E...p. .s.....
0000500: 0000 0000 0000 0000 98b4 f44e 0af3 0200  ...........N....
0000510: 0400 0000 0000 0000 0000 0000 ff7f 0000  ................
0000520: 0080 4802 ff7f 0000 0100 0000 ffff 4802  ..H...........H.
0000530: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000540: 0000 0000 0000 0000 0000 0000 0000 0008  ................
0000550: 0000 0000 0000 0000 0000 0000 1c00 1c00  ................
0000560: 0100 0000 0000 0000 0000 0000 0000 0000  ................
0000570: 0000 0000 0400 0000 b6b0 300b 0000 0000  ..........0.....
0000580: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000590: 0000 0000 0000 0000 0000 0000 0000 0000  ................

So let's create another loopback device with an offset (0x5C9E38 - 0x438 = 0x5C9A00):

# losetup /dev/loop1 /dev/store_vg/store_lv --offset $((0x5C9A00))

For the record, that makes the current overall loopback settings:

# losetup -a
/dev/loop0: [0005]:12783 (/dev/mapper/vg_scratch-snap), offset 134317568
/dev/loop1: [0005]:40851 (/dev/mapper/store_vg-store_lv), offset 6068736

So is the ext4 file-system visible?

# file -Ls /dev/loop1
/dev/loop1: Linux rev 1.0 ext3 filesystem data, UUID=94955e9b-7d7e-41da-a595-6af44848b5c6, volume name "store" (needs journal recovery) (large files)

Success!

Update: I ended up writing a Python script, find_ext3.py, to help me locate ext3 superblocks and automatically check their validity using dumpe2fs. This showed that the most common file-system origin is actually _before_ the start of the LVM logical volume:

root@skuld:/home/meermanr/projects/find_ext3# cut -d' ' -f9- store_vg-store_lv.find_ext3.log | sort | uniq -c | sort -rn | head
     17 origin -134282240
     16 origin -134514176
      1 origin 8382976
      1 origin 8268288
      1 origin 8256000
      1 origin 8215040
      1 origin 8145408
      1 origin 8133120
      1 origin 8043008
      1 origin 8030720

From this I conclude that my original (working) installation was not actually using the Logical Volume! This may explain why updating Ubuntu to a version which has LVM support by default made my system unable to find the file-system.

So from this point on I'll ignore the Logical Volumes in my disk image (store_vg/store_lv), and instead look for a file-system in the raw image.
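
The gist of find_ext3.py is easy to sketch in shell (a hypothetical reconstruction, not the actual script): hunt for the little-endian ext magic 0xEF53 at superblock offset 0x438, and let dumpe2fs vet each candidate. /dev/loop9 is assumed to be free.

DEV=/dev/vg_scratch/snap
LC_ALL=C grep -obUaP '\x53\xef' "$DEV" | cut -d: -f1 |
while read -r off; do
    origin=$(( off - 0x438 ))       # would-be start of the file-system
    [ "$origin" -lt 0 ] && continue
    losetup /dev/loop9 "$DEV" --offset "$origin"
    if dumpe2fs -h /dev/loop9 >/dev/null 2>&1; then
        echo "OK  origin $origin"   # a superblock dumpe2fs believes in
    else
        echo "BAD origin $origin"   # every 53ef pair is a candidate; dumpe2fs weeds out the noise
    fi
    losetup -d /dev/loop9
done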

Getting my data back

Obviously I tried mounting it first, to no avail:

root@ikari:/tmp# mount /dev/loop1 /tmp/store
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Since this is all sitting on top of the LVM snapshot I made earlier (/dev/vg_scratch/snap), I am happy to try various tools that modify the drive contents, such as fsck.

First attempt, no joy:

# fsck.ext4 -y /dev/loop1
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Group descriptors look bad... trying backup blocks...
fsck.ext4: Bad magic number in super-block when using the backup blocks
fsck.ext4: going back to original superblock
Error reading block 3226742528 (Invalid argument).  Ignore error? yes

Force rewrite? yes

Superblock has an invalid journal (inode 8).
Clear? yes

*** ext3 journal has been deleted - filesystem is now ext2 only ***

The filesystem size (according to the superblock) is 234428352 blocks
The physical size of the device is 187529782 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort? yes

Error writing block 3226742528 (Invalid argument).  Ignore error? yes

Using my find_ext3.py script I mapped out the locations of all superblocks on my disk, producing a log file:

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# head vg_scratch-snap.log 
OK  /dev/vg_scratch/snap 1024 store #0 4096 kB 32768 bpg, origin 0
BAD /dev/vg_scratch/snap 6358016 store #0 4096 kB 32768 bpg, origin 6356992
BAD /dev/vg_scratch/snap 6403072 store #0 4096 kB 32768 bpg, origin 6402048
BAD /dev/vg_scratch/snap 6444032 store #0 4096 kB 32768 bpg, origin 6443008
BAD /dev/vg_scratch/snap 6456320 store #0 4096 kB 32768 bpg, origin 6455296
BAD /dev/vg_scratch/snap 6525952 store #0 4096 kB 32768 bpg, origin 6524928
BAD /dev/vg_scratch/snap 6562816 store #0 4096 kB 32768 bpg, origin 6561792
BAD /dev/vg_scratch/snap 6640640 store #0 4096 kB 32768 bpg, origin 6639616
BAD /dev/vg_scratch/snap 6652928 store #0 4096 kB 32768 bpg, origin 6651904
BAD /dev/vg_scratch/snap 6722560 store #0 4096 kB 32768 bpg, origin 6721536

Obviously the “OK” superblock at origin 0 isn't actually valid, according to fsck. I suspect I've reformatted this drive a number of times, and unfortunately used the same label (store) each time, which doesn't help. So let's take a statistical approach: tally how often each origin offset is mentioned, and investigate the most frequent:

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# cut -d' ' -f9- ./vg_scratch-snap.log | sort | uniq -c | sort -rn | head
     22 origin 231936
     19 origin 0
      1 origin 880602744320
      1 origin 8802304
      1 origin 8790016
      1 origin 8679424
      1 origin 8667136
      1 origin 8556544
      1 origin 8544256
      1 origin 8433664

As above, I mounted this on a loopback device and ran fsck on it. This got a lot further than previous attempts: more text scrolled past, and it sat there for a while writing out a new journal. Ultimately, though, it gave up with the same error as last time.
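
Something like this, using the most frequent origin from the tally above (loop0 having been freed up first):

losetup /dev/loop0 /dev/vg_scratch/snap --offset 231936
fsck -y /dev/loop0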

On the advice of a Gentoo forum post I ran

mke2fs -S /dev/loop0

followed by fsck. (mke2fs -S writes fresh superblocks and group descriptors only, leaving the inode tables and data alone.) That was last night. This morning it's still running, pinning one of my CPUs at 100% and consuming so much memory that it has nearly exhausted my swap file and nearly consumed all of the snapshot volume's copy-on-write table!

Lowering IO and CPU scheduling priority of fsck

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# ps fo pid,pmem,pcpu,cmd -t 13
  PID %MEM %CPU CMD
19540  0.0  0.0 -bash
19666  0.0  0.0  \_ sudo su
19667  0.0  0.0      \_ su
19675  0.0  0.0          \_ bash
20181  0.0  0.0              \_ fsck /dev/loop0 -y
20182 49.4 86.8                  \_ fsck.ext2 -y /dev/loop0

To keep my system usable I lowered the IO and CPU priority of fsck. First change the IO scheduling class to “idle” (3) for the hungry process:

root@Ikari:~# ionice -c3 -p 20182

Then raise the “niceness” of the process. Higher values make processes nicer, which means they are more likely to “give way” to other processes. Really it just means the kernel will pre-empt nice processes more often:

root@Ikari:~# renice 10 20182
20182 (process ID) old priority 0, new priority 10

Adding more swap to my system

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# free -m
             total       used       free     shared    buffers     cached
Mem:          5969       5523        445          0        325        117
-/+ buffers/cache:       5080        888
Swap:         6234       3477       2757

I was concerned to see that over 50% of my swap was in use. I don't know how long fsck will take, so adding more swap seems prudent. As luck would have it, I recently added a solid state drive (SSD) to my system, so I have an unused spinning disk which I'm pretty sure has a swap partition on it that isn't doing anything.

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# lsblk
NAME                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                                   7:0    0 894.3G  0 loop 
sda                                     8:0    0  93.2G  0 disk 
├─sda1                                  8:1    0  87.2G  0 part /
├─sda2                                  8:2    0     1K  0 part 
└─sda5                                  8:5    0     6G  0 part [SWAP]
sdb                                     8:16   0 298.1G  0 disk 
├─sdb1                                  8:17   0 294.1G  0 part 
├─sdb2                                  8:18   0     1K  0 part 
└─sdb5                                  8:21   0     4G  0 part
sr0                                    11:0    1  1024M  0 rom  
sr1                                    11:1    1  1024M  0 rom  
sde                                     8:64   0   1.8T  0 disk 
└─sde1                                  8:65   0   1.8T  0 part 
  ├─vg_scratch-home_zfs (dm-0)        252:0    0 186.3G  0 lvm  
  ├─vg_scratch-lv_scratch-real (dm-3) 252:3    0 894.3G  0 lvm  
  │ ├─vg_scratch-lv_scratch (dm-2)    252:2    0 894.3G  0 lvm  
  │ └─vg_scratch-snap (dm-1)          252:1    0 894.3G  0 lvm  
  ├─vg_scratch-snap-cow (dm-4)        252:4    0  18.6G  0 lvm  
  │ └─vg_scratch-snap (dm-1)          252:1    0 894.3G  0 lvm  
  └─vg_scratch-photorec (dm-5)        252:5    0 763.9G  0 lvm  

It's probably /dev/sdb5; let's check with file:

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# file -Ls /dev/sdb5
/dev/sdb5: no read permission

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# sudo !!
sudo file -Ls /dev/sdb5
[sudo] password for meermanr: 
/dev/sdb5: Linux/i386 swap file (new style), version 1 (4K pages), size 1048063 pages, no label, UUID=d0bbff73-a09a-47f6-8387-e27268cdc9fc
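
Had file reported anything else, I would have needed to lay down a swap signature first (destroying whatever was on the partition):

mkswap /dev/sdb5    # create a swap signature (destructive!)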

Great! Let's enable it!

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# sudo swapon /dev/sdb5

And verify:

# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# lsblk | grep SWAP
└─sda5                                  8:5    0     6G  0 part [SWAP]
└─sdb5                                  8:21   0     4G  0 part [SWAP]
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master)
# free -m
             total       used       free     shared    buffers     cached
Mem:          5969       5672        297          0        435        138
-/+ buffers/cache:       5098        871
Swap:        10234       3476       6758

Extending the snapshot volume

# lvdisplay /dev/vg_scratch/snap

  --- Logical volume ---
  LV Name                /dev/vg_scratch/snap
  VG Name                vg_scratch
  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch
  LV Status              available
  # open                 1
  LV Size                894.27 GiB
  Current LE             228934
  COW-table size         9.63 GiB
  COW-table LE           2335
  Allocated to snapshot  74.40%       <-- Do not want!
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1

As it happens, I had not allocated all of the volume group:

root@Ikari:~# vgs
  VG         #PV #LV #SN Attr   VSize VFree
  vg_scratch   1   4   1 wz--n- 1.73t    9.3g

So extending the snapshot is easy:

root@Ikari:~# lvextend /dev/vg_scratch/snap --extents +100%FREE

Verify:

# lvdisplay /dev/vg_scratch/snap
  --- Logical volume ---
  LV Name                /dev/vg_scratch/snap
  VG Name                vg_scratch
  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch
  LV Status              available
  # open                 1
  LV Size                894.27 GiB
  Current LE             228934
  COW-table size         18.63 GiB      <-- Has increased
  COW-table LE           4769           <-- (Same thing, but measured in logical extents)
  Allocated to snapshot  39.18%         <-- Much better!
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1
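
To avoid being caught out by a full COW table again, I left a watch running in a spare terminal (it shows up in the process listings later):

watch lvdisplay /dev/vg_scratch/snap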

... 2 weeks later

It has been two weeks since I started fsck, and it is still running. During this time I've not been able to use my desktop PC for gaming, and so I've decided to hit ^C and move the recovery to another machine. Here are the forensics:

Now:

# meermanr@Ikari:/home/meermanr (master *)
# date
Mon May 27 14:53:34 BST 2013

Size of block device:

root@Ikari:/home/meermanr/projects/find_ext3# python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open("/dev/loop0")
>>> f.seek(0, 2)
>>> f.tell()
960218560000
>>> hex(f.tell())
'0xdf917c7600'
>>> 
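
For future reference, blockdev would have told me the same thing with less typing:

blockdev --getsize64 /dev/loop0    # size in bytes - 960218560000 here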

Offsets of fsck and python:

Every 2.0s: lsof /dev/loop0                                                       Mon May 27 14:56:18 2013

COMMAND     PID USER   FD   TYPE DEVICE     SIZE/OFF NODE NAME
fsck.ext2 20182 root    4u   BLK    7,0 0x3f903ef000 5941 /dev/loop0
python    23598 root    3r   BLK    7,0 0xdf917c7600 5941 /dev/loop0

That's approximately 28% (offset 0x3f903ef000 of 0xdf917c7600 bytes). :-(

How long everything has been running:

 STARTED %CPU %MEM   RSS CMD
  May 13  0.0  0.0   364 su
  May 13  0.0  0.0   528  \_ bash
  May 13  0.0  0.0  1320      \_ watch lvdisplay /dev/vg_scratch/snap
  May 13  0.0  0.0   364 su
  May 13  0.0  0.0   536  \_ bash
  May 13  0.0  0.0   464      \_ fsck /dev/loop0 -y
  May 13 99.1 24.3 1488468          \_ fsck.ext2 -y /dev/loop0
  May 14  0.0  0.0  1456 watch lsof /dev/loop0

So fsck has about 1.4GiB of resident memory (and 4696MiB of virtual memory according to top):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                     
20182 root      30  10 4696m 1.4g  784 R   94 24.3  19564:51 fsck.ext2 -y /dev/loop0  

Total system memory:

# meermanr@Ikari:/home/meermanr (master *)
# free -m
             total       used       free     shared    buffers     cached
Mem:          5969       5036        933          0         87        157
-/+ buffers/cache:       4791       1178
Swap:        10234       4764       5470
And the state of the snapshot's COW table:

Every 2.0s: lvdisplay /dev/vg_scratch/snap                                       Mon May 27 14:54:06 2013

File descriptor 4 (pipe:[72594953]) leaked on lvdisplay invocation. Parent PID 23517: sh
  --- Logical volume ---
  LV Name                /dev/vg_scratch/snap
  VG Name                vg_scratch
  LV UUID                4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/vg_scratch/lv_scratch
  LV Status              available
  # open                 1
  LV Size                894.27 GiB
  Current LE             228934
  COW-table size         84.89 GiB
  COW-table LE           21733
  Allocated to snapshot  43.48%
  Snapshot chunk size    4.00 KiB
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1

Output from fsck itself:

File ... (inode #9791282, mod time Thu Oct  5 01:40:26 2006) 
  has 11143 multiply-claimed block(s), shared with 5 file(s):
	<filesystem metadata>
	... (inode #9791794, mod time Thu Oct  5 01:40:26 2006)
	... (inode #4115835, mod time Thu Aug 20 03:31:06 2009)
	... (inode #4130006, mod time Mon Nov 29 16:38:10 2010)
	... (inode #4784754, mod time Tue Jul 26 06:01:10 2005)
Clone multiply-claimed blocks? yes