If the backup FS is full the router should not die

TheQL
This happens if a backup fails:

PANIC, EXITING: java.io.IOException: No space left on device
java.io.IOException: No space left on device
        at java.io.RandomAccessFile.writeBytes(Native Method)
        at java.io.RandomAccessFile.write(RandomAccessFile.java:469)
        at com.swiftmq.impl.store.standard.cache.StableStore.writePage(Unknown Source)
        at com.swiftmq.impl.store.standard.cache.StableStore.copy(Unknown Source)
        at com.swiftmq.impl.store.standard.backup.BackupProcessor.checkpointFinished(Unknown Source)
        at com.swiftmq.impl.store.standard.cache.CacheManager.flush(Unknown Source)
        at com.swiftmq.impl.store.standard.transaction.TransactionManager.performCheckPoint(Unknown Source)
        at com.swiftmq.impl.store.standard.log.LogManager.process(Unknown Source)
        at com.swiftmq.tools.queue.SingleProcessorQueue.dequeue(Unknown Source)
        at com.swiftmq.impl.store.standard.log.b.run(Unknown Source)
        at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)

Our backup FS is separate from the store FS, so I don't think the router needs to die in this case. I have no perfect idea of how to let the user know there is a problem, but a panic is not the best way. We did in fact notice the problem with the backup FS; we just didn't act on it in time.
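
For context, the underlying IOException itself is easy to provoke standalone on any full filesystem; a minimal sketch (the /mnt/full mount point is just an example of a nearly full test filesystem):

import java.io.IOException;
import java.io.RandomAccessFile;

// Minimal reproduction of the exception in the trace above: writing
// 4 KB pages through a RandomAccessFile until the filesystem is full
// fails with "java.io.IOException: No space left on device".
// /mnt/full is just an example test mount, not our real setup.
public class DiskFullRepro
{
  public static void main(String[] args) throws IOException
  {
    try (RandomAccessFile f = new RandomAccessFile("/mnt/full/page.dat", "rw"))
    {
      byte[] page = new byte[4096];
      while (true)
        f.write(page); // throws once the filesystem runs out of space
    }
  }
}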

Re: If the backup FS is full the router should not die

IIT Software
Administrator
The backup is performed during a checkpoint by the Log Manager. You are right, we should differentiate here...
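
Something along these lines, as a rough sketch only. This is not the actual SwiftMQ code; the helper methods are placeholders for whatever the Store Swiftlet really does:

import java.io.IOException;

// Sketch: an I/O failure while flushing the store itself is fatal,
// but an I/O failure while copying to the backup save set should
// only mark the backup as failed and let the router keep running.
public class CheckpointSketch
{
  void performCheckpoint()
  {
    try
    {
      flushStoreToDisk(); // failure here leaves the store inconsistent
    } catch (IOException e)
    {
      panic(e); // only this path should take the router down
      return;
    }
    try
    {
      copyStoreToBackupSaveSet(); // failure here only spoils the backup
    } catch (IOException e)
    {
      logError("Backup failed, save set discarded: " + e);
    }
  }

  void flushStoreToDisk() throws IOException { /* write dirty pages */ }
  void copyStoreToBackupSaveSet() throws IOException { /* copy to backup FS */ }
  void panic(Exception e) { System.err.println("PANIC: " + e); System.exit(-1); }
  void logError(String msg) { System.err.println(msg); }
}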

Re: If the backup FS is full the router should not die

IIT Software
Administrator
In reply to this post by TheQL
I cannot reproduce it here. When I force an IOException at exactly that position during the backup, the backup completes with an entry in error.log, as it should, but with no stack trace and no router panic. A panic is actually not possible from that code, as the block above is covered by a try {...} catch (Exception e) {...}.

Which release do you use?

Re: If the backup FS is full the router should not die

TheQL
It's 9.6.0, but we ran into this at least once in the past, probably with an older version.

Re: If the backup FS is full the router should not die

IIT Software
Administrator
Anyway, I consider this fixed. I can't find any older version that lacks this try/catch.

Re: If the backup FS is full the router should not die

TheQL
But then where did the panic come from? The store filesystem was nowhere near full, which is the one case I would understand to be a critical problem.

Re: If the backup FS is full the router should not die

IIT Software
Administrator
I have no idea. This is the code where the exception is thrown. As you can see, it is caught; there is no way to get into a PANIC:

  // Called from the LogManager after a Checkpoint has been performed and before the Transaction Manager
  // is restarted
  public void checkpointFinished()
  {
    if (ctx.traceSpace.enabled)
      ctx.traceSpace.trace(ctx.storeSwiftlet.getName(), toString() + "/checkpointFinished ...");
    BackupCompleted nextPO = new BackupCompleted();
    try
    {
      if (ctx.preparedLog.backupRequired())
        ctx.preparedLog.backup(currentSaveSet);
      ctx.stableStore.copy(currentSaveSet);
      ctx.durableStore.copy(currentSaveSet);
      new File(currentSaveSet + File.separatorChar + COMPLETED_FILE).createNewFile();
      nextPO.setSuccess(true);
      // This was my test:
      if (true) throw new IOException("Disk full!");
    } catch (Exception e)
    {
      nextPO.setSuccess(false);
      nextPO.setException(e.toString());
    }
    enqueue(new ScanSaveSets(nextPO));
    if (ctx.traceSpace.enabled)
      ctx.traceSpace.trace(ctx.storeSwiftlet.getName(), toString() + "/checkpointFinished done");
  }

Re: If the backup FS is full the router should not die

TheQL
Is a store snapshot/copy first generated on the main FS and then moved to the backup target, so that the main FS might have filled up without being noticed? That could go unnoticed if the required space was freed again after the exception, although I believe more than double the currently used space was available there.

Re: If the backup FS is full the router should not die

IIT Software
Administrator
It has nothing to do with the store FS. The exception was thrown during the copy of the store to the backup save set, and a panic from that path is actually not possible from the code, as verified by my test:

        at com.swiftmq.impl.store.standard.cache.StableStore.writePage(Unknown Source)
        at com.swiftmq.impl.store.standard.cache.StableStore.copy(Unknown Source)
        at com.swiftmq.impl.store.standard.backup.BackupProcessor.checkpointFinished(Unknown Source)
 

Re: If the backup FS is full the router should not die

TheQL
Sadly I'm no expert in this, but since the exception was thrown and the panic occurred, it must be reproducible and we must be missing a part of the puzzle. Also, I really do recall this happening before with a full backup FS.

Maybe this additional info helps:

mount options:
/dev/mapper/SystemVG-backupfs on /opt/backup type ext4 (rw,nodev,nobarrier)

fs options:
$ tune4fs -l /dev/mapper/SystemVG-backupfs
tune4fs 1.41.12 (17-May-2010)
Filesystem volume name:   <none>
Last mounted on:          /opt/backup/opt/backup
Filesystem UUID:          4a5e43e6-de20-456d-8bfe-ac64d95f42d0
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              655360
Block count:              2621440
Reserved block count:     131049
Free blocks:              2521765
Free inodes:              655313
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      191
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue Jul 31 13:50:18 2012
Last mount time:          Sun Sep 20 15:18:04 2015
Last write time:          Sun Sep 20 15:18:04 2015
Mount count:              2
Maximum mount count:      28
Last checked:             Tue Sep  8 13:23:20 2015
Check interval:           15552000 (6 months)
Next check after:         Sun Mar  6 12:23:20 2016
Lifetime writes:          135 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      ee3285a6-5f73-44b2-aa05-b28433e55843
Journal backup:           inode blocks

Could it maybe be an issue with the blocks that are reserved for the root user only?
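
For reference: with the values above, the reserved blocks are 131049 * 4096 bytes, i.e. about 512 MB that only root may use, so a router running as a non-root user could hit "No space left on device" while df still reports free space. In Java the difference shows up as getFreeSpace() vs. getUsableSpace(); a small sketch (the default path is just an example):

import java.io.File;

// Prints the total free space versus the space actually usable by this
// (possibly non-root) process; on ext4 the gap is roughly the blocks
// reserved for root. The default path below is just an example.
public class FreeSpaceCheck
{
  public static void main(String[] args)
  {
    File backupFs = new File(args.length > 0 ? args[0] : "/opt/backup");
    System.out.println("free:   " + backupFs.getFreeSpace() + " bytes");
    System.out.println("usable: " + backupFs.getUsableSpace() + " bytes");
  }
}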