So I upgraded Ubuntu 9.10 to Ubuntu 11.10. When the system boots it says it cannot mount /store
, my 960GB RAID+LVM file-system. That the one that holds over 10 years of personal photographs and such.
There are many layers of indirection between the file-system and the physical storage when using LVM or RAID. When using both, the number of layers can seem excessive. Here's a diagram of the layers involved in my (lost) setup:
Note that the RAID block device, md0
, is not partitioned. I believe that was a mistake on my part, and a likely reason why Ubuntu 11.10 cannot auto-detect it.
Since upgrading system boot is interrupted with an error screen to the affect of “Cannot mount /store” and a prompt to enter a root shell or skip mounting. From what I can see, the RAID array is detected without problems, and is functioning correctly. So the system looks like this:
The RAID (multi-disk) status looks fine to me:
root@ikari:~# cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md127 : active raid5 sdb[1] sde[3] sdc[0] sdd[2] sdf[4](S) 937713408 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU] unused devices: <none>
but the resulting 960.2 GB block device is partitioned as a “Linux RAID autodetect” - which would suggest that it is a *member* of some other multi-disk setup. This, I believe, is human error on my part when I created the thing…
root@ikari:~# fdisk -l /dev/md127 Disk /dev/md127: 960.2 GB, 960218529792 bytes 255 heads, 63 sectors/track, 116739 cylinders, total 1875426816 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 65536 bytes / 196608 bytes Disk identifier: 0xd71c877b Device Boot Start End Blocks Id System /dev/md127p1 63 625137344 312568641 fd Linux RAID autodetect Partition 1 does not start on physical sector boundary.
md127
, partition table and all (required buying an external 2TB USB drive)
So I created a single “Linux LVM” partition on the 2TB disk, created a single 1.8TB physical volume and a single 1.8TB volume group containing it. On this I created a 1TB logical volume called lv_scratch
and copied the contents of md127
to it (e.g. dd if=/dev/md127 of=/dev/mapper/lv_scratch
). Once the copy was made, I created a snapshot of lv_scratch
which I imaginatively called snap
.
LVM snapshots are interesting creatures. As the name suggests, the snapshot (named snap
) holds the state of lv_scratch
as it was when I created it. I can still read and write to lv_scratch
, but the contents of snap
will not change. This is ideal for making consistent backups. The snapshot works by deferring any writes to lv_scratch
and placing them instead in some temporary copy-on-write (COW) volume. All access to lv_scratch
consults the COW volume - when there is a hit it is returned, otherwise the original (unchanged) lv_scratch
is read. When the snapshot is deleted, the deferred changes stored in snap
are written to lv_scratch
and become permanent. Makes sense if you are used to copy-on-write behaviour.
Now here is where things get interesting. The snapshot, snap
, does not have to be read-only: you can create it read-write. Doing so gives you a very cheap copy of lv_scratch
, and any changes you make to the snapshot are stored in a COW table. You can discard the changes by deleting the snapshot. Ideal for my situation: I want to experiment with the partition table and various file-system recovery tools etc. I let these manipulate the snapshot, and if things go bad I delete and recreate the snapshot and try over.
root@ikari:~# fdisk -l /dev/sdj Disk /dev/sdj: 2000.4 GB, 2000398934016 bytes 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x000f0222 Device Boot Start End Blocks Id System /dev/sdj1 2048 3907028991 1953513472 8e Linux LVM
root@ikari:~# pvdisplay /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error --- Physical volume --- PV Name /dev/sdj1 VG Name vg_scratch PV Size 1.82 TiB / not usable 4.00 MiB Allocatable yes PE Size 4.00 MiB Total PE 476931 Free PE 86787 Allocated PE 390144 PV UUID nrf9cQ-Asfz-Y2x2-SDoT-3ppu-mpEC-Fnuf8Z root@ikari:~# vgdisplay /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error --- Volume group --- VG Name vg_scratch System ID Format lvm2 Metadata Areas 1 Metadata Sequence No 13 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 0 Max PV 0 Cur PV 1 Act PV 1 VG Size 1.82 TiB PE Size 4.00 MiB Total PE 476931 Alloc PE / Size 390144 / 1.49 TiB Free PE / Size 86787 / 339.01 GiB VG UUID Lk7UZP-48xF-vBPi-6g8F-sXlF-qyzy-pQNKgq root@ikari:~# lvdisplay /dev/dm-0: read failed after 0 of 4096 at 0: Input/output error /dev/dm-1: read failed after 0 of 4096 at 0: Input/output error /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error --- Logical volume --- LV Name /dev/vg_scratch/lv_scratch VG Name vg_scratch LV UUID aFBpgv-gqcd-jjLU-c7xO-Jyeb-2R0t-HpEF84 LV Write Access read/write LV snapshot status source of /dev/vg_scratch/snap [active] LV Status available # open 0 LV Size 1.00 TiB Current LE 262144 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:1 --- Logical volume --- LV Name /dev/vg_scratch/snap VG Name vg_scratch LV UUID OvOsQ7-uACi-xJVZ-vseu-fKEc-F73h-CmSalH LV Write Access read/write LV snapshot status active destination for /dev/vg_scratch/lv_scratch LV Status available # open 0 LV Size 1.00 TiB Current LE 262144 COW-table size 500.00 GiB COW-table LE 128000 Allocated to snapshot 0.00% Snapshot chunk size 4.00 KiB Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3
So here's the goal I'm aiming for on my external storage:
Since it's been a few days and reboots since I last worked on this, I'll start by plugging the USB drive it.
root@ikari:~# dmesg [ 479.180019] usb 2-5: new high speed USB device number 7 using ehci_hcd [ 479.313228] scsi13 : usb-storage 2-5:1.0 [ 480.312605] scsi 13:0:0:0: Direct-Access Seagate Desktop 0130 PQ: 0 ANSI: 4 [ 480.336633] sd 13:0:0:0: Attached scsi generic sg10 type 0 [ 480.337029] sd 13:0:0:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB) [ 480.337671] sd 13:0:0:0: [sdi] Write Protect is off [ 480.337671] sd 13:0:0:0: [sdi] Mode Sense: 2f 08 00 00 [ 480.340027] sd 13:0:0:0: [sdi] No Caching mode page present [ 480.340027] sd 13:0:0:0: [sdi] Assuming drive cache: write through [ 480.341806] sd 13:0:0:0: [sdi] No Caching mode page present [ 480.341811] sd 13:0:0:0: [sdi] Assuming drive cache: write through [ 480.357290] sdi: sdi1 [ 480.359346] sd 13:0:0:0: [sdi] No Caching mode page present [ 480.359350] sd 13:0:0:0: [sdi] Assuming drive cache: write through [ 480.359354] sd 13:0:0:0: [sdi] Attached SCSI disk
The (outer) LVM PVs are automatically detects, and their VGs + LVs are subsequently detected:
root@ikari:~# pvs PV VG Fmt Attr PSize PFree /dev/sdi1 vg_scratch lvm2 a- 1.82t 339.01g root@ikari:~# vgs VG #PV #LV #SN Attr VSize VFree vg_scratch 1 2 1 wz--n- 1.82t 339.01g root@ikari:~# lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert lv_scratch vg_scratch owi-a- 1.00t snap vg_scratch swi-a- 500.00g lv_scratch 0.00
Somewhere on the snap
logcial volume is my nested LVM. I used xxd /dev/vg_scratch/snap | less
and searched for LVM2
. The first hit was a false-positive (appeared to have stripes of NULLs written across it), but the second hit looked plausible:
8018600: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018610: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018620: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018630: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018640: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018650: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018660: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018670: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018680: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018690: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80186f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018700: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018710: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018720: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018730: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018740: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018750: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018760: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018770: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018780: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018790: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80187f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018800: 4c41 4245 4c4f 4e45 0100 0000 0000 0000 LABELONE........ 8018810: 9148 4053 2000 0000 4c56 4d32 2030 3031 .H@S ...LVM2 001 8018820: 5341 7536 6e32 7578 474c 5148 6743 5351 SAu6n2uxGLQHgCSQ 8018830: 6b56 6b5a 655a 4c78 7874 314b 7652 6a31 kVkZeZLxxt1KvRj1 8018840: 00f8 0391 df00 0000 0000 0300 0000 0000 ................ 8018850: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018860: 0000 0000 0000 0000 0010 0000 0000 0000 ................ 8018870: 00f0 0200 0000 0000 0000 0000 0000 0000 ................ 8018880: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 8018890: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 80188a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
I know from using xxd
to examine a correct and functioning LVM2 partition (the PV behind vg_scratch
as it happens) that the “LVM2” text should appear at 0x210. So I'll create a loopback device with an appropriate offset to make that happen:
root@ikari:~# losetup /dev/loop0 /dev/vg_scratch/snap --offset $((0x8018600)) root@ikari:~# lvmdiskscan /dev/ram0 [ 64.00 MiB] /dev/loop0 [ 894.15 GiB] LVM physical volume /dev/dm-0 [ 186.27 GiB] /dev/ram1 [ 64.00 MiB] /dev/sda1 [ 294.09 GiB] /dev/dm-1 [ 894.27 GiB] /dev/ram2 [ 64.00 MiB] /dev/dm-2 [ 894.27 GiB] /dev/ram3 [ 64.00 MiB] /dev/dm-3 [ 894.27 GiB] /dev/ram4 [ 64.00 MiB] /dev/dm-4 [ 782.47 GiB] /dev/ram5 [ 64.00 MiB] /dev/sda5 [ 4.00 GiB] /dev/dm-5 [ 715.38 GiB] /dev/ram6 [ 64.00 MiB] /dev/ram7 [ 64.00 MiB] /dev/ram8 [ 64.00 MiB] /dev/ram9 [ 64.00 MiB] /dev/ram10 [ 64.00 MiB] /dev/ram11 [ 64.00 MiB] /dev/ram12 [ 64.00 MiB] /dev/ram13 [ 64.00 MiB] /dev/ram14 [ 64.00 MiB] /dev/ram15 [ 64.00 MiB] /dev/sdb1 [ 1.82 TiB] LVM physical volume 0 disks 24 partitions 0 LVM physical volume whole disks 2 LVM physical volumes root@ikari:~# pvs PV VG Fmt Attr PSize PFree /dev/loop0 store_vg lvm2 a- 894.25g 178.88g /dev/sdb1 vg_scratch lvm2 a- 1.82t 0
If lvmdiskscan
doesn work you could try using partprobe
to tell the Kernel to rescan partition tables and do what it does.
root@ikari:~# lvs LV VG Attr LSize Origin Snap% Move Log Copy% Convert store_lv store_vg -wi-a- 715.38g home_zfs vg_scratch -wi-a- 186.27g lv_scratch vg_scratch owi-a- 894.27g snap vg_scratch swi-ao 782.47g lv_scratch 0.00
OK, so now I can read/write my store
LVM that housed my ext4
file-system. But the file-system isn't where it's supposed to be, apparently:
root@ikari:~# file -Ls /dev/store_vg/store_lv /dev/store_vg/store_lv: data
Time to play with a hex editor again to look at a valid ext4
file-system header (from my current Ubuntu installation, /dev/sda1
).
After a bit of searching I find the characteristic 53ef
marker in the right column, followed by the name of my file-system: store
. Looks good.
# xxd /dev/store_vg/store_lv | less 05c9db0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9dc0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9dd0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9de0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9df0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9e00: 00c0 fc06 c017 f90d c9da b200 0c2a d00a .............*.. 05c9e10: c604 f906 0000 0000 0200 0000 0200 0000 ................ 05c9e20: 0080 0000 0080 0000 0040 0000 9d17 aa4d .........@.....M 05c9e30: 9d17 aa4d 0200 1b00 53ef 0100 0100 0000 ...M....S....... 05c9e40: 4c2c a24d 004e ed00 0000 0000 0100 0000 L,.M.N.......... 05c9e50: 0000 0000 0b00 0000 8000 0000 3400 0000 ............4... 05c9e60: 0600 0000 0300 0000 9495 5e9b 7d7e 41da ..........^.}~A. 05c9e70: a595 6af4 4848 b5c6 7374 6f72 6500 0000 ..j.HH..store... 05c9e80: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9e90: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9ea0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9eb0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9ec0: 0000 0000 0000 0000 0000 0000 0000 c803 ................ 05c9ed0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 05c9ee0: 0800 0000 0000 0000 0000 0000 9a4a cd41 .............J.A 05c9ef0: 88ff 4759 ac42 d083 8b3f 3b0d 0201 0000 ..GY.B...?;..... 05c9f00: 0000 0000 0000 0000 cf4a 9f46 0906 0000 .........J.F.... 05c9f10: 0a06 0000 0b06 0000 0c06 0000 0d06 0000 ................
By comparing this to my valid file-system, I can see that the 53ef
line should be at offset 0x430:
# xxd /dev/sda1 | less 0000380: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000390: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00003f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000400: 0020 2601 005d 9804 73d1 3a00 8366 bb02 . &..]..s.:..f.. 0000410: d8da 2001 0000 0000 0200 0000 0200 0000 .. ............. 0000420: 0080 0000 0080 0000 0020 0000 b153 264f ......... ...S&O 0000430: d9bf f44e 0d00 2200 53ef 0100 0100 0000 ...N..".S....... 0000440: 98b4 f44e 004e ed00 0000 0000 0100 0000 ...N.N.......... 0000450: 0000 0000 0b00 0000 0001 0000 3c00 0000 ............<... 0000460: 4602 0000 7b00 0000 89bb eae1 b864 492d F...{........dI- 0000470: a3d6 2bc9 5336 151e 0000 0000 0000 0000 ..+.S6.......... 0000480: 0000 0000 0000 0000 2f00 1eae 7c13 0000 ......../...|... 0000490: 0000 c099 98a1 0188 ffff 985d 698f 0188 ...........]i... 00004a0: ffff 307a 87a2 0188 ffff 307a 87a2 0188 ..0z......0z.... 00004b0: ffff 585c 698f 0188 ffff 645d 1681 ffff ..X\i.....d].... 00004c0: ffff c000 588f 0188 0000 0000 0000 ed03 ....X........... 00004d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00004e0: 0800 0000 0000 0000 b60a 4000 5547 eae9 ..........@.UG.. 00004f0: 2f3c 45ee 9913 70fb 20b7 7395 0101 0000 /<E...p. .s..... 0000500: 0000 0000 0000 0000 98b4 f44e 0af3 0200 ...........N.... 0000510: 0400 0000 0000 0000 0000 0000 ff7f 0000 ................ 0000520: 0080 4802 ff7f 0000 0100 0000 ffff 4802 ..H...........H. 0000530: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000540: 0000 0000 0000 0000 0000 0000 0000 0008 ................ 0000550: 0000 0000 0000 0000 0000 0000 1c00 1c00 ................ 0000560: 0100 0000 0000 0000 0000 0000 0000 0000 ................ 0000570: 0000 0000 0400 0000 b6b0 300b 0000 0000 ..........0..... 0000580: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 0000590: 0000 0000 0000 0000 0000 0000 0000 0000 ................
So let's create another loopback device with an offset:
# losetup /dev/loop1 /dev/store_vg/store_lv --offset $((0x5C9A00))
For the record, that makes the current overall loopback settings:
# losetup -a /dev/loop0: [0005]:12783 (/dev/mapper/vg_scratch-snap), offset 134317568 /dev/loop1: [0005]:40851 (/dev/mapper/store_vg-store_lv), offset 6068736
So is the ext4
file-system visible?
# file -Ls /dev/loop1 /dev/loop1: Linux rev 1.0 ext3 filesystem data, UUID=94955e9b-7d7e-41da-a595-6af44848b5c6, volume name "store" (needs journal recovery) (large files)
Success!
Update: I ended up writing a Python script, find_ext3.py, to help me locate ext3
superblocks, and automatically check their validity using dumpe2fs
. This showed that the most common file-system origin is actually _before_ the start of the LVM logical volume:
root@skuld:/home/meermanr/projects/find_ext3# cut -d' ' -f9- store_vg-store_lv.find_ext3.log | sort | uniq -c | sort -rn | head 17 origin -134282240 16 origin -134514176 1 origin 8382976 1 origin 8268288 1 origin 8256000 1 origin 8215040 1 origin 8145408 1 origin 8133120 1 origin 8043008 1 origin 8030720
From this I conclude that my original (working) installation was not actually using the Logical Volume! This may explain why updating Ubuntu to a version which has LVM support by default made my system unable to find the file-system.
So from this point on I'll ignore the Logical Volumes in my disk image (store_vg/store_lv
), and instead look for a file-system in the raw image.
Obviously I tried mounting it first, to no avail:
root@ikari:/tmp# mount /dev/loop1 /tmp/store mount: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so
Since this is all sitting on top of the LVM snapshot I made earlier (/dev/vg_scratch/snap
) I am happy to try various tools that modify the drive contents, such as fsck
.
First attempt, no joy:
# fsck.ext4 -y /dev/loop1 e2fsck 1.41.14 (22-Dec-2010) fsck.ext4: Group descriptors look bad... trying backup blocks... fsck.ext4: Bad magic number in super-block when using the backup blocks fsck.ext4: going back to original superblock Error reading block 3226742528 (Invalid argument). Ignore error? yes Force rewrite? yes Superblock has an invalid journal (inode 8). Clear? yes *** ext3 journal has been deleted - filesystem is now ext2 only *** The filesystem size (according to the superblock) is 234428352 blocks The physical size of the device is 187529782 blocks Either the superblock or the partition table is likely to be corrupt! Abort? yes Error writing block 3226742528 (Invalid argument). Ignore error? yes
Using my find_ext3.py script I mapped out the location of all superblocks on my disk in the form of a log file as follows:
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # head vg_scratch-snap.log OK /dev/vg_scratch/snap 1024 store #0 4096 kB 32768 bpg, origin 0 BAD /dev/vg_scratch/snap 6358016 store #0 4096 kB 32768 bpg, origin 6356992 BAD /dev/vg_scratch/snap 6403072 store #0 4096 kB 32768 bpg, origin 6402048 BAD /dev/vg_scratch/snap 6444032 store #0 4096 kB 32768 bpg, origin 6443008 BAD /dev/vg_scratch/snap 6456320 store #0 4096 kB 32768 bpg, origin 6455296 BAD /dev/vg_scratch/snap 6525952 store #0 4096 kB 32768 bpg, origin 6524928 BAD /dev/vg_scratch/snap 6562816 store #0 4096 kB 32768 bpg, origin 6561792 BAD /dev/vg_scratch/snap 6640640 store #0 4096 kB 32768 bpg, origin 6639616 BAD /dev/vg_scratch/snap 6652928 store #0 4096 kB 32768 bpg, origin 6651904 BAD /dev/vg_scratch/snap 6722560 store #0 4096 kB 32768 bpg, origin 6721536
Obviously the “OK” superblock at origin 0 isn't actually valid, according to fsck
. I suspect I've reformatted this drive a number of times, and unfortunately used the same label (store
) each time, which doesn't help. So let's take a statistical approach: tally how often each origin offset is mentioned, and investigate the most frequent:
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # cut -d' ' -f9- ./vg_scratch-snap.log | sort | uniq -c | sort -rn | head 22 origin 231936 19 origin 0 1 origin 880602744320 1 origin 8802304 1 origin 8790016 1 origin 8679424 1 origin 8667136 1 origin 8556544 1 origin 8544256 1 origin 8433664
As above, I mounted this on a loop back device and ran fsck
on it. This got a lot further than previous attempts: more text scrolled past me, and it sat there for a while writing out a new journal. Ultimately it gave up with the same error as last time.
On the advice of Gentoo forum post I ran
mke2fs -S /dev/loop0
followed by fsck
. That was last night. This morning it's still running, pinning one of my CPUs at 100%, consuming so much memory it has caused my system to nearly exhaust its swap file and nearly consume all of the snapshot volumes Copy-On-Write table!
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # ps fo pid,pmem,pcpu,cmd -t 13 PID %MEM %CPU CMD 19540 0.0 0.0 -bash 19666 0.0 0.0 \_ sudo su 19667 0.0 0.0 \_ su 19675 0.0 0.0 \_ bash 20181 0.0 0.0 \_ fsck /dev/loop0 -y 20182 49.4 86.8 \_ fsck.ext2 -y /dev/loop0
To keep my system usable I lowered the IO and CPU priority of fsck
. First change the IO scheduling class to “idle” (3) for the hungry process:
root@Ikari:~# ionice -c3 -p 20182
Then raise the “niceness” of the process. Higher values make processes nicer, which means they are more likely to “give way” to other processes. Really it just means the kernel will pre-empty nice processes more often:
root@Ikari:~# renice 10 20182 20182 (process ID) old priority 0, new priority 10
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # free -m total used free shared buffers cached Mem: 5969 5523 445 0 325 117 -/+ buffers/cache: 5080 888 Swap: 6234 3477 2757
I was concerned to see that 50% of my swap was in use. I don't know how long fsck
will take, so adding more swap seems prudent. As luck would have it, I recently added an solid state drive (SSD) to my system, so I have an unused spinning disk which I'm pretty sure has a swap partition on it that isn't doing anything.
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT loop0 7:0 0 894.3G 0 loop sda 8:0 0 93.2G 0 disk ├─sda1 8:1 0 87.2G 0 part / ├─sda2 8:2 0 1K 0 part └─sda5 8:5 0 6G 0 part [SWAP] sdb 8:16 0 298.1G 0 disk ├─sdb1 8:17 0 294.1G 0 part ├─sdb2 8:18 0 1K 0 part └─sdb5 8:21 0 4G 0 part sr0 11:0 1 1024M 0 rom sr1 11:1 1 1024M 0 rom sde 8:64 0 1.8T 0 disk └─sde1 8:65 0 1.8T 0 part ├─vg_scratch-home_zfs (dm-0) 252:0 0 186.3G 0 lvm ├─vg_scratch-lv_scratch-real (dm-3) 252:3 0 894.3G 0 lvm │ ├─vg_scratch-lv_scratch (dm-2) 252:2 0 894.3G 0 lvm │ └─vg_scratch-snap (dm-1) 252:1 0 894.3G 0 lvm ├─vg_scratch-snap-cow (dm-4) 252:4 0 18.6G 0 lvm │ └─vg_scratch-snap (dm-1) 252:1 0 894.3G 0 lvm └─vg_scratch-photorec (dm-5) 252:5 0 763.9G 0 lvm
It's probably /dev/sdb5
, let's check with file
:
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # file -Ls /dev/sdb5 /dev/sdb5: no read permission # meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # sudo !! sudo file -Ls /dev/sdb5 [sudo] password for meermanr: /dev/sdb5: Linux/i386 swap file (new style), version 1 (4K pages), size 1048063 pages, no label, UUID=d0bbff73-a09a-47f6-8387-e27268cdc9fc
Great! Let's enable it!
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # sudo swapon /dev/sdb5
And verify:
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # lsblk | grep SWAP └─sda5 8:5 0 6G 0 part [SWAP] └─sdb5 8:21 0 4G 0 part [SWAP]
# meermanr@Ikari:/home/meermanr/projects/find_ext3 (master) # free -m total used free shared buffers cached Mem: 5969 5672 297 0 435 138 -/+ buffers/cache: 5098 871 Swap: 10234 3476 6758
# lvdisplay /dev/vg_scratch/snap --- Logical volume --- LV Name /dev/vg_scratch/snap VG Name vg_scratch LV UUID 4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 LV Write Access read/write LV snapshot status active destination for /dev/vg_scratch/lv_scratch LV Status available # open 1 LV Size 894.27 GiB Current LE 228934 COW-table size 9.63 GiB COW-table LE 2335 Allocated to snapshot 74.40% <-- Do not want! Snapshot chunk size 4.00 KiB Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 252:1
As it happens, I had not allocated all of the volume group:
root@Ikari:~# vgs VG #PV #LV #SN Attr VSize VFree vg_scratch 1 4 1 wz--n- 1.73t 9.3g
So extending the snapshot is easy:
root@Ikari:~# lvextend /dev/vg_scratch/snap --extents +100%FREE
Verify:
# lvdisplay /dev/vg_scratch/snap --- Logical volume --- LV Name /dev/vg_scratch/snap VG Name vg_scratch LV UUID 4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 LV Write Access read/write LV snapshot status active destination for /dev/vg_scratch/lv_scratch LV Status available # open 1 LV Size 894.27 GiB Current LE 228934 COW-table size 18.63 GiB <-- Has increased COW-table LE 4769 <-- (Same thing, but measured in logical extents) Allocated to snapshot 39.18% <-- Much better! Snapshot chunk size 4.00 KiB Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 252:1
It has been two weeks since I started fsck
, and it is still running. During this time I've not been able to use my desktop PC for gaming, and so I've decided to hit ^C and move it to another machine. Here are the forensics:
Now:
# meermanr@Ikari:/home/meermanr (master *) # date Mon May 27 14:53:34 BST 2013
Size of block device:
root@Ikari:/home/meermanr/projects/find_ext3# python Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> f = open("/dev/loop0") >>> f.seek(0, 2) >>> f.tell() 960218560000 >>> hex(f.tell()) '0xdf917c7600' >>>
Offsets of fsck
and python
:
Every 2.0s: lsof /dev/loop0 Mon May 27 14:56:18 2013 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME fsck.ext2 20182 root 4u BLK 7,0 0x3f903ef000 5941 /dev/loop0 python 23598 root 3r BLK 7,0 0xdf917c7600 5941 /dev/loop0
That's approximately 28%.
STARTED %CPU %MEM RSS CMD May 13 0.0 0.0 364 su May 13 0.0 0.0 528 \_ bash May 13 0.0 0.0 1320 \_ watch lvdisplay /dev/vg_scratch/snap May 13 0.0 0.0 364 su May 13 0.0 0.0 536 \_ bash May 13 0.0 0.0 464 \_ fsck /dev/loop0 -y May 13 99.1 24.3 1488468 \_ fsck.ext2 -y /dev/loop0 May 14 0.0 0.0 1456 watch lsof /dev/loop0
So fsck
has about 1.4GiB of resident memory (and 4696MiB of virtual memory according to top):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20182 root 30 10 4696m 1.4g 784 R 94 24.3 19564:51 fsck.ext2 -y /dev/loop0
Total system memory:
# meermanr@Ikari:/home/meermanr (master *) # free -m total used free shared buffers cached Mem: 5969 5036 933 0 87 157 -/+ buffers/cache: 4791 1178 Swap: 10234 4764 5470
Every 2.0s: lvdisplay /dev/vg_scratch/snap Mon May 27 14:54:06 2013 File descriptor 4 (pipe:[72594953]) leaked on lvdisplay invocation. Parent PID 23517: sh --- Logical volume --- LV Name /dev/vg_scratch/snap VG Name vg_scratch LV UUID 4EFJ8Y-bzWT-aif4-MlT9-4234-aS1d-qcipq0 LV Write Access read/write LV snapshot status active destination for /dev/vg_scratch/lv_scratch LV Status available # open 1 LV Size 894.27 GiB Current LE 228934 COW-table size 84.89 GiB COW-table LE 21733 Allocated to snapshot 43.48% Snapshot chunk size 4.00 KiB Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 252:1
Output from fsck
itself:
File ... (inode #9791282, mod time Thu Oct 5 01:40:26 2006) has 11143 multiply-claimed block(s), shared with 5 file(s): <filesystem metadata> ... (inode #9791794, mod time Thu Oct 5 01:40:26 2006) ... (inode #4115835, mod time Thu Aug 20 03:31:06 2009) ... (inode #4130006, mod time Mon Nov 29 16:38:10 2010) ... (inode #4784754, mod time Tue Jul 26 06:01:10 2005) Clone multiply-claimed blocks? yes