page.db shrink behaviour

TheQL
Hi,

Searching the forum, I found out that our page.db is not shrinking very much because a shrink only truncates the file down to the highest occupied page. As we have some persistent messages in the page.db that I will not be deleting any time soon, this seems to be an unresolvable issue.

This is what it looks like:
File Size: -8543474
Free Pages: 1631254
Used Pages: 39161

As you can see, the file size is (still) displayed as a negative value (the file is >3.5GB) and we have a huge number of free pages. Is there any chance of improving the shrink process?
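A minimal sketch of the arithmetic behind these figures. It assumes a page size of 2048 bytes and that the negative display comes from a 32-bit signed overflow; neither is stated in the thread, so treat both as assumptions.

```python
# Assumption: 2048 bytes per page (not stated in the thread).
PAGE_SIZE = 2048

# Page counts quoted in the post.
FREE_PAGES = 1631254
USED_PAGES = 39161


def as_int32(value):
    """Wrap an integer the way a 32-bit signed int would."""
    return (value + 2**31) % 2**32 - 2**31


total_pages = FREE_PAGES + USED_PAGES
size_bytes = total_pages * PAGE_SIZE
wrapped = as_int32(size_bytes)
free_share = FREE_PAGES / total_pages

print(f"total pages : {total_pages}")
print(f"file size   : {size_bytes} bytes")
print(f"as int32    : {wrapped}")
print(f"free share  : {free_share:.1%}")
```

Under these assumptions the file is about 3.4 GB, almost all of it free pages, and a byte count of that size no longer fits into a signed 32-bit value. This only illustrates the mechanism; it does not reproduce the exact figure shown in the post, whose unit is not stated.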

What's most annoying in our HA setup is the initial sync, as we are using a replicated store. It seems SwiftMQ does not sync just the used pages but the complete file, probably even uncompressed, as I would assume there are a lot of zeros in the file (maybe not, unsure). Anyway, if there is no help for me at the moment, I would highly appreciate some improvement in this matter as a suggestion for the future. We experience huge timeouts in our HA setup when restarting one of the nodes, which is exactly what the HA setup should prevent from happening.

Thanks,

Dennis

Re: page.db shrink behaviour

IIT Software
Administrator
What you could try is to move your messages to another queue and back to the original queue(s). This will occupy pages at the start of page.db and free those at the end. Then issue the shrink command.
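Why moving messages helps can be shown with a toy model. This is only a sketch: it assumes the store hands out the lowest free page first (which matches the statement that the move "will occupy pages at the start of page.db") and it is not SwiftMQ's actual implementation. All function names are hypothetical.

```python
def file_size_after_shrink(used):
    """Shrink can only cut off pages above the highest occupied one."""
    return max(used) + 1 if used else 0


def move_messages(used, pages_to_move):
    """Copy each page to the lowest free slot, then free the original."""
    used = set(used)
    for page in sorted(pages_to_move, reverse=True):
        target = 0
        while target in used:
            target += 1
        if target < page:
            used.add(target)
            used.discard(page)
    return used


# Three pages at the start, one persistent message far out at the end.
before = {0, 1, 2, 1_000_000}
after = move_messages(before, {1_000_000})

print(file_size_after_shrink(before))  # 1000001
print(file_size_after_shrink(after))   # 4
```

A single occupied page near the end is enough to keep the whole file large, which is why relocating just the right queue can make the shrink effective.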

Re: page.db shrink behaviour

IIT Software
Administrator
In reply to this post by TheQL
A better way might be to figure out who's occupying the high pages. Do the following:

1) Perform an online backup.

2) Transfer the backup saveset to your local machine.

3) Use a local SwiftMQ router and modify the smqr1 script by adding -Dswiftmq.store.analyze=true.

4) Copy the page.db from the backup saveset into the local router's store.

5) Start the router then stop it.

6) For each queue and the rootindex you'll find a .analyze file.

7) Check each file for the page numbers to figure out which queue to move.
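
Step 7 can be scripted once you know what the files look like. The layout of the .analyze files is not shown in this thread, so the sketch below simply assumes that page numbers appear as integers right after the word "page"; adjust the pattern to the real format. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: page numbers show up as "page <number>" in the file text.
PAGE_PATTERN = re.compile(r"page\D{0,3}(\d+)", re.IGNORECASE)


def highest_page(text):
    """Return the largest page number mentioned in the text, or None."""
    numbers = [int(m) for m in PAGE_PATTERN.findall(text)]
    return max(numbers) if numbers else None


def rank_analyze_files(directory):
    """List (file name, highest page) pairs, highest page first."""
    results = []
    for path in Path(directory).glob("*.analyze"):
        top = highest_page(path.read_text(errors="replace"))
        if top is not None:
            results.append((path.name, top))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The file at the top of the ranking belongs to the queue (or root index) that occupies the highest pages and is therefore the first candidate to move.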

Re: page.db shrink behaviour

TheQL
In reply to this post by IIT Software
Thanks, might try this...

Re: page.db shrink behaviour

TheQL
In reply to this post by IIT Software
Hm,

now I am a little confused...

Just wanted to create a temp queue to move some messages and I receive the following error.

com.swiftmq.swiftlet.store.StoreException: java.lang.ArrayIndexOutOfBoundsException
        at com.swiftmq.impl.store.standard.StoreSwiftletImpl.getPersistentStore(Unknown Source)
        at com.swiftmq.impl.queue.standard.RegularQueueFactory.createQueue(Unknown Source)
        at com.swiftmq.impl.queue.standard.QueueManagerImpl.a(Unknown Source)
        at com.swiftmq.impl.queue.standard.QueueManagerImpl.a(Unknown Source)
        at com.swiftmq.impl.queue.standard.QueueManagerImpl.a(Unknown Source)
        at com.swiftmq.impl.queue.standard.a.onEntityAdd(Unknown Source)
        at com.swiftmq.mgmt.Entity.addEntity(Unknown Source)
        at com.swiftmq.mgmt.EntityList$3.execute(Unknown Source)
        at com.swiftmq.mgmt.CommandRegistry.executeCommand(Unknown Source)
        at com.swiftmq.mgmt.RouterConfigInstance$1.execute(Unknown Source)
        at com.swiftmq.mgmt.CommandRegistry.executeCommand(Unknown Source)
        at com.swiftmq.mgmt.RouterConfigInstance.executeCommand(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.v750.DispatcherImpl.visit(Unknown Source)
        at com.swiftmq.mgmt.protocol.v750.CommandRequest.accept(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.v750.DispatcherImpl.visit(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.po.ClientRequest.accept(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.v750.DispatcherImpl.process(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.DispatchQueue.a(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.DispatchQueue.visit(Unknown Source)
        at com.swiftmq.impl.mgmt.standard.po.ClientRequest.accept(Unknown Source)
        at com.swiftmq.tools.pipeline.PipelineQueue.process(Unknown Source)
        at com.swiftmq.tools.queue.SingleProcessorQueue.dequeue(Unknown Source)
        at com.swiftmq.tools.pipeline.PipelineQueue$QueueProcessor.run(Unknown Source)
        at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)

We did have a problem with the store last week. The hard disk ran full and SwiftMQ wasn't really pleased with that. We extended the disk space and restarted, and the system performed a store check, but does this indicate a problem with the store? I am not too keen on deleting our page.db as a fix.

Regards,

Dennis

Re: page.db shrink behaviour

IIT Software
Administrator
Please analyze the store as described above.

Re: page.db shrink behaviour

TheQL
Can I save my store with an online backup? And how do I perform this?
http://www.swiftmq.com/products/router/admin/cli/cmdref/index.html

Just enter backup? Where will the file be saved?

I am confused anyway, as the store swiftlet is configured to keep 3 generations of backups but the folder is empty.

I will try to analyze the store locally, but can I use that (fixed) store and exchange it with my page.db on the HA router? Do I need to copy the config to have the queues available locally or isn't that necessary?

Thanks for your help! It's highly appreciated.

Re: page.db shrink behaviour

IIT Software
Administrator
We have docs which describe that.

Re: page.db shrink behaviour

TheQL
Sorry, I am just a little worried right now and not calm enough, as SwiftMQ really is a central infrastructure component for us and I am a little afraid we could be losing data when pages are addressed that are somehow out of bounds. Performing the backup right now.

I am a little confused that the transaction.log isn't growing while performing the backup. In fact the transaction.log hasn't been touched since Jan 17 13:13.

Still unsure if I can just check the store without copying the config but will try that. It'll all take some time... I'll be back with the results. Thanks again!

Re: page.db shrink behaviour

IIT Software
Administrator
A backup just initiates a checkpoint and copies the store files. That should work. Then use the backup saveset and test locally. You don't need the config files. The only thing you do at your local machine is to analyze the store.

Re: page.db shrink behaviour

TheQL
Hi,

thanks.

Got a page.db in a saveset folder. But only that, no .completed file or anything. Assumed it would be finished as it stopped growing. Just copying to my local machine, will take another half hour though.

EDIT: of course ".completed" was there... My bad...
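
Judging completion by whether page.db has stopped growing is fragile; checking for the marker file is safer. A minimal sketch, assuming the saveset folder contains page.db plus a marker file literally named .completed, as described in this post. The function names and the polling approach are hypothetical.

```python
import time
from pathlib import Path

# Marker file name as mentioned in this post.
MARKER_NAME = ".completed"


def saveset_is_complete(saveset_dir):
    """True once both page.db and the completion marker are present."""
    saveset = Path(saveset_dir)
    return (saveset / MARKER_NAME).is_file() and (saveset / "page.db").is_file()


def wait_for_saveset(saveset_dir, timeout_s=3600.0, poll_s=5.0):
    """Poll until the saveset is complete or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if saveset_is_complete(saveset_dir):
            return True
        time.sleep(poll_s)
    return saveset_is_complete(saveset_dir)
```

Only start copying the saveset to another machine once `wait_for_saveset` returns True.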

Re: page.db shrink behaviour

TheQL
In reply to this post by IIT Software
This is what I didn't want to find:

+++ consistency check in progress ...
    Unrecoverable Error -- have to delete the whole persistent Store!

*** The consistency check of the persistent store has found inconsistent data.
*** This has been corrected (the store is now consistent).
*** However, you might have lost persistent data.
*** To avoid inconsistent data in future,
*** please consider to enable 'force-sync' of the transaction log.

So, what do I do now? Copy all data to another router, stop the router, delete the store and copy my data back?

And would "force-sync" really help in the future? At which cost and where do I enable it?

Thanks once more!

Re: page.db shrink behaviour

IIT Software
Administrator
The whole store is corrupt for whatever reason. If the disk is full and the log manager can't write to it, it immediately shuts down the HA instance and the Standby takes over. Was that the case when your disk was full, or was the HA instance in Standalone mode?

TheQL wrote
So, what do I do now? Copy all data to another router, stop the router, delete the store and copy my data back?
That's the only way if you want to rescue those messages which are still accessible by the move.

You don't need to enable force-sync with HA because you always have a Standby. This is only necessary if you run it in Standalone mode or non-HA.

Re: page.db shrink behaviour

TheQL
My colleague encountered the problem. As far as he remembers the router was still active although the disk was full. He did the failover manually which didn't really work. Both routers were down for a short time and then the former Standby router was started as standalone but still had to perform a consistency check.

That's about what happened. Maybe I could find the logfiles for the time this happened.

Re: page.db shrink behaviour

IIT Software
Administrator
You should check the error.log of the Active HA instance (where the disk was full). You should see related entries there.

Re: page.db shrink behaviour

TheQL
This is what was logged in error.log

2011-01-11 08:05:00.533/sys$routing/ERROR/[RoutingConnection mailbox|routerX.xxx.local:4200]/v400DeliveryStage, recoveryBranchQ=swiftmq/src=router1/dest=mailbox/visited, request=[TransactionRequest [Request, dispatchId=0 requestNumber=0 correlationId=0 timeout=-1 replyRequired=false reply=null], sequenceNo=2211854, xid=[XidImpl, branchQualifier=swiftmq/src=mailbox/dest=router1, formatId=2211854, globalTransactionId=1292335281854-2211854, routing=true], nMessages=1] exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device, disconnecting
2011-01-11 08:05:08.784/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@ef90c2/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:18.797/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@1a3873b/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:18.804/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@8a982f/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:28.782/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@ddc06/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:28.800/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@fcaf84/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:38.777/sys$xa/ERROR/[XALiveContextImpl, xid=[XidImpl, branchQualifier=a0a5101:c324:4d2bd5d8:c0d1, formatId=131075, globalTransactionId=1-a0a5101:c324:4d2bd5d8:c0c7, routing=false], prepared=false]prepare xid=[XidImpl, branchQualifier=a0a5101:c324:4d2bd5d8:c0d1, formatId=131075, globalTransactionId=1-a0a5101:c324:4d2bd5d8:c0c7, routing=false], failed for queue: queue@router1
2011-01-11 08:05:38.793/sys$xa/ERROR/[XALiveContextImpl, xid=[XidImpl, branchQualifier=a0a5101:c324:4d2bd5d8:c0d1, formatId=131075, globalTransactionId=1-a0a5101:c324:4d2bd5d8:c0c7, routing=false], prepared=true]commit (two phase) xid=[XidImpl, branchQualifier=a0a5101:c324:4d2bd5d8:c0d1, formatId=131075, globalTransactionId=1-a0a5101:c324:4d2bd5d8:c0c7, routing=false], failed for queue: queue@router1, exception: com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:38.797/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@1fef86b/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:38.819/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@1d5966/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left on device
2011-01-11 08:05:48.885/xt$javamail/ERROR/com.swiftmq.extension.javamail.inbound.v@101af7f/run, exception=com.swiftmq.swiftlet.queue.QueueException: com.swiftmq.swiftlet.store.StoreException: java.io.IOException: No space left
2011-01-11 08:53:23.768/sys$jms/ERROR/JMSConnection v630/host.xxx.de:39186/exception creating temp queue: com.swiftmq.swiftlet.auth.ResourceLimitException: Resource Limit Group 'public': max temp. queues per connection exceeded. Resource limit is: 50
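
The entries above follow a timestamp/swiftlet/level/message layout, so they are easy to summarize with a small script. This is a sketch based only on the lines shown here; the layout is inferred from them, and the function names are hypothetical.

```python
from collections import Counter


def parse_entry(line):
    """Split 'timestamp/swiftlet/level/message' into its four parts."""
    parts = line.rstrip("\n").split("/", 3)
    if len(parts) != 4:
        return None
    timestamp, swiftlet, level, message = parts
    return {
        "timestamp": timestamp,
        "swiftlet": swiftlet,
        "level": level,
        "message": message,
    }


def count_disk_full(lines):
    """Count 'No space left on device' errors per swiftlet."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry and "No space left on device" in entry["message"]:
            counts[entry["swiftlet"]] += 1
    return counts
```

Usage would be along the lines of `count_disk_full(open("error.log"))`, which shows at a glance which swiftlets hit the full disk and how often.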

Final startup of the second router (remember, both were stopped manually at some point in the process) was this (no prior log entries from failover):

2011-01-11 09:28:17.625/sys$store/ERROR/Queue Store QueueXXX is inconsistent - removed!
2011-01-11 09:33:06.979/sys$store/ERROR/The consistency check of the persistent store has found inconsistent data.
2011-01-11 09:33:06.995/sys$store/ERROR/This has been corrected (the store is now consistent).
2011-01-11 09:33:06.995/sys$store/ERROR/However, you might have lost persistent data.
2011-01-11 09:33:06.995/sys$store/ERROR/To avoid inconsistent data in future,
2011-01-11 09:33:06.995/sys$store/ERROR/please consider to enable 'force-sync' of the transaction log.

From this I would have assumed the store is OK. That doesn't seem to be the case.


Console of the router running out of disk space:

java.io.IOException: No space left on device
        at java.io.RandomAccessFile.writeBytes(Native Method)
        at java.io.RandomAccessFile.write(RandomAccessFile.java:453)
        at com.swiftmq.impl.store.standard.cache.StableStore.writePage(Unknown Source)
        at com.swiftmq.impl.store.standard.cache.StableStore.create(Unknown Source)
        at com.swiftmq.impl.store.standard_ha.db.ReplicatedStableStoreSource.create(Unknown Source)
        at com.swiftmq.impl.store.standard.cache.CacheManager.createAndPin(Unknown Source)
        at com.swiftmq.impl.store.standard.index.PageOutputStream.c(Unknown Source)
        at com.swiftmq.impl.store.standard.index.PageOutputStream.write(Unknown Source)
        at com.swiftmq.jms.ObjectMessageImpl.writeBody(Unknown Source)
        at com.swiftmq.jms.MessageImpl.writeContent(Unknown Source)
        at com.swiftmq.impl.store.standard.index.QueueIndex.add(Unknown Source)
        at com.swiftmq.impl.store.standard.StoreWriteTransactionImpl.insert(Unknown Source)
        at com.swiftmq.impl.queue.standard.MessageQueue.a(Unknown Source)
        at com.swiftmq.impl.queue.standard.MessageQueue.prepare(Unknown Source)
        at com.swiftmq.swiftlet.queue.QueueTransaction.prepare(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.v400.s.a(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.v400.DeliveryStage.a(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.v400.DeliveryStage.a(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.v400.q.visited(Unknown Source)
        at com.swiftmq.impl.routing.single.smqpr.SMQRVisitor.a(Unknown Source)
        at com.swiftmq.impl.routing.single.smqpr.SMQRVisitor.visit(Unknown Source)
        at com.swiftmq.impl.routing.single.smqpr.v400.TransactionRequest.accept(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.v400.DeliveryStage.process(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.stage.StageQueue.process(Unknown Source)
        at com.swiftmq.tools.queue.SingleProcessorQueue.dequeue(Unknown Source)
        at com.swiftmq.impl.routing.single.connection.stage.g.run(Unknown Source)
        at com.swiftmq.imp
        at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)


lots of blank lines


Shutdown SwiftMQ 7.6.0 Production ...
... shutdown: JavaMail Bridge Extension Swiftlet
-bash: line 1:  1782 Killed                  java -server -Xmx1024M -cp ../../jars/swiftmq.jar:../../jars/jndi.jar:../../jars/jms.jar:../../jars/jsse.jar:../../jars/jnet.jar:../../jars/jcert.jar:../../jars/dom4j-full.jar:../../jars/jta-spec1_0_1.jar com.swiftmq.HARouter ../../config/replicated/instance1/routerconfig.xml

Console log of the standby router:

+++ High Availability State: STANDBY/STANDBY
... resume: Authentication Swiftlet
... resume: Store Swiftlet (HA)
+++ RecoveryManager/restarting, processing transaction log...
+++ RecoveryManager/restart, 10% so far...
+++ RecoveryManager/restart, 30% so far...
+++ RecoveryManager/restart, 40% so far...
+++ RecoveryManager/restart, 50% so far...
+++ RecoveryManager/restart, 90% so far...
+++ RecoveryManager/restart, 100% so far...
+++ RecoveryManager/restart done.
-bash: line 1:  3082 Killed                  java -server -Xmx1024M -cp ../../jars/swiftmq.jar:../../jars/jndi.jar:../../jars/jms.jar:../../jars/jsse.jar:../../jars/jnet.jar:../../jars/jcert.jar:../../jars/dom4j-full.jar:../../jars/jta-spec1_0_1.jar com.swiftmq.HARouter ../../config/replicated/instance2/routerconfig.xml

Unfortunately there are no timestamps, so I can only hope I got the right part: the router was in standby and resumed operation when the active HA instance was manually shut down and restarted. Later the router was killed once more.

And this is all I got.



Re: page.db shrink behaviour

TheQL
In reply to this post by IIT Software
I have another question for our recovery operation.

I thought maybe it would be a good idea to copy the queue config from our HA router to my local instance, copy the backup to the store folder, start the router and then copy the persistent queues back to the HA router, which I would stop and whose store I would delete beforehand.

I tried this now, but the router wouldn't start locally because of the ArrayIndexOutOfBoundsException. I then started it with the -Dswiftmq.store.analyze=true parameter, but then the store got deleted. Is there a way to have the router start with the corrupted store? At least it is still running on our HA so it must work somehow...

Regards,

Dennis

Re: page.db shrink behaviour

IIT Software
Administrator
Concerning the "disk full" stuff above: SwiftMQ immediately shuts down if such an exception occurs during a write to the transaction.log. The exceptions you've gotten above were from writing to the page.db, which just passes the exception up the call stack but does not trigger a shutdown. We will change that.

===

You could try as you've described above but you need to set -Dswiftmq.store.analyze=false. Otherwise it will perform a consistency check which deletes the store.

Keep in mind when you do the copy back to the HA:

- shut down *both* HA instances, otherwise you will get the corrupted store replicated from the Standby
- copy the page.db to the HA instance which was Standalone
- start this instance *first*
- then start the other instance

Re: page.db shrink behaviour

TheQL
IIT Software wrote
Concerning the "disk full" stuff above: SwiftMQ immediately shuts down if such an exception occurs during a write to the transaction.log. The exceptions you've gotten above were from writing to the page.db, which just passes the exception up the call stack but does not trigger a shutdown. We will change that.
Ok. But how come the transaction.log hardly ever gets written to? Or is this specific to HA setups?

IIT Software wrote
You could try as you've described above but you need to set -Dswiftmq.store.analyze=false. Otherwise it will perform a consistency check which deletes the store.

Keep in mind when you do the copy back to the HA:

- shut down *both* HA instances, otherwise you will get the corrupted store replicated from the Standby
- copy the page.db to the HA instance which was Standalone
- start this instance *first*
- then start the other instance
Actually I planned on adding my local machine to the router network, deleting the store from our HA, starting it up empty and then copying the data back.

But I fear this won't work as I don't get my local router started with the store:

java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
        at com.swiftmq.impl.store.standard.index.IndexPage.a(Unknown Source)
        at com.swiftmq.impl.store.standard.index.IndexPage.<init>(Unknown Source)
        at com.swiftmq.impl.store.standard.index.RootIndexPage.<init>(Unknown Source)
        at com.swiftmq.impl.store.standard.index.RootIndex.<init>(Unknown Source)
        at com.swiftmq.impl.store.standard.StoreSwiftletImpl.startup(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startUpSwiftlet(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startKernelSwiftlet(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startKernelSwiftlets(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.initSwiftlets(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startRouter(Unknown Source)
        at com.swiftmq.Router.main(Unknown Source)
com.swiftmq.swiftlet.SwiftletException: java.lang.ArrayIndexOutOfBoundsException
        at com.swiftmq.impl.store.standard.StoreSwiftletImpl.startup(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startUpSwiftlet(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startKernelSwiftlet(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startKernelSwiftlets(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.initSwiftlets(Unknown Source)
        at com.swiftmq.swiftlet.SwiftletManager.startRouter(Unknown Source)
        at com.swiftmq.Router.main(Unknown Source)
Exception during startup kernel swiftlet 'sys$store': java.lang.ArrayIndexOutOfBoundsException


Any idea?
Otherwise I'll have to do a pretty time-consuming operation: lock out all consumers, copy all queues with data to another router, stop the router, delete the store, start the router, copy the data back and then re-enable all mover jobs, mail bridges and network connectivity (e.g. via firewall). That sounds like a lot of trouble and more downtime than I'd like.

Thanks!

Re: page.db shrink behaviour

IIT Software
Administrator
The transaction.log gets written on each transaction, but it is important that pages which are referenced in a log record exist in the page.db. This is an "ensure" operation and writes to the page.db. Since "disk full" is thrown here and passed up the call stack, it never gets to the point where the log record is written, which is where the failure would force a shutdown.
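
The ordering described here can be illustrated with a toy model. It is a sketch of the control flow only, not SwiftMQ code; the class and method names are hypothetical.

```python
class DiskFull(OSError):
    """Stands in for java.io.IOException: No space left on device."""


class ToyStore:
    def __init__(self, fail_at=None):
        self.fail_at = fail_at      # "page", "log", or None
        self.shut_down = False
        self.log_records = 0

    def ensure_pages(self):
        # Step 1: make sure the referenced pages exist in page.db.
        if self.fail_at == "page":
            raise DiskFull("page.db write failed")

    def write_log_record(self):
        # Step 2: append the record to transaction.log.
        if self.fail_at == "log":
            self.shut_down = True   # a failure here forces a shutdown
            raise DiskFull("transaction.log write failed")
        self.log_records += 1

    def commit(self):
        self.ensure_pages()         # an exception here skips step 2
        self.write_log_record()


def try_commit(store):
    """Run one commit and report whether the store shut itself down."""
    try:
        store.commit()
    except DiskFull:
        pass
    return store.shut_down


print(try_commit(ToyStore(fail_at="page")))  # False
print(try_commit(ToyStore(fail_at="log")))   # True
```

When the page.db write fails first, the exception leaves `commit` before the log write is attempted, so the shutdown path is never reached and the instance keeps running on a full disk.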

Root cause was a full disk. Proper recovery would be to hard kill -9 the Active instance so that the Standby can take over. This is required, otherwise the replication connection is still alive. If you restart the Standby while the Active struggles with a full disk, it runs through the sync phase and will get the corrupted store replicated.
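
Since the root cause was a full disk, a periodic free-space check on the store volume can flag the condition before writes start failing. A minimal sketch using Python's standard `shutil.disk_usage`; the 10% threshold and the function names are assumptions, not something SwiftMQ provides.

```python
import shutil


def free_fraction(path):
    """Fraction of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_low(path, threshold=0.10):
    """True if less than `threshold` of the volume is free."""
    return free_fraction(path) < threshold
```

Run from cron or a monitoring agent against the directory that holds page.db and transaction.log, `is_low` gives an early warning while there is still room to react.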

The stack trace above shows that the root index is corrupt (that is the index in which the queue indexes are rooted). This explains why your HA still works but you can't start on your local machine.

A corrupted root index may be recoverable, but this is nothing we actually do in a free forum (which is for simple questions and not for 20+ replies). As an exception you may provide us your page.db via FTP and send the access URL and credentials to bugreport@swiftmq.com.

Next time please use a support contract (Gold required for production).
