Yesterday I was speaking to a potential customer who was interested in how email archiving could be integrated with their Exchange replication technology. Which got me thinking...
How do you trade off potential data loss against extended downtime in a replicated / clustered environment?
Simplifying hugely, there are two ways you can handle this: you can either replicate data synchronously, and not report that an action is complete until all replicas are consistent, or you can do it asynchronously, and let the client know it's complete as soon as your active copy has committed it.
Synchronous replication is safest, because you guarantee that all copies are always consistent. If you've told a client (think remote mail server or email client, but it generalises) that you've done something, you know it's hit the disk everywhere, so things would have to go pretty badly wrong for that data to be lost. But with safety comes a performance hit: you need to accept the data, store it locally, push it over to the other copy, maybe over a slow WAN link to a DR site, wait for that server to acknowledge it, and only then can you allow the request to complete.
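To put rough numbers on that performance hit, here's a back-of-envelope sketch in Python. The function names and the millisecond figures are made up purely for illustration; real values depend entirely on your storage and your link.

```python
def sync_write_latency_ms(local_commit_ms, link_round_trip_ms, remote_commit_ms):
    """Time before a synchronous write can be acknowledged to the client.

    The request has to wait for the local commit, the trip to the other
    copy and back, and the remote commit in between.
    """
    return local_commit_ms + link_round_trip_ms + remote_commit_ms


def async_write_latency_ms(local_commit_ms):
    """An asynchronous write is acknowledged after the local commit alone."""
    return local_commit_ms


# Illustrative figures only (assumptions, not measurements):
# 2 ms to commit on each side, 80 ms round trip over a WAN link.
sync_ms = sync_write_latency_ms(2, 80, 2)
async_ms = async_write_latency_ms(2)
print(f"sync: {sync_ms} ms per write, async: {async_ms} ms per write")
```

With those assumed figures almost all of the synchronous cost is the link itself, which is why the distance to the DR site tends to matter more than the speed of the disks.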
Asynchronous replication can be much faster: you tell clients everything's OK as soon as you've committed it locally, then periodically ship those actions over to the other copies, which become "eventually consistent". That might happen once a day, once an hour, or once every few seconds or minutes. Obviously the more often you do it, the less data you stand to lose if your primary server goes up in literal or metaphorical flames. Whatever you do, though, there's a window of vulnerability that exists before it gets copied over.
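Here's a toy model of the two approaches, just to make the window of vulnerability concrete. Everything in it (the `Replica`, `SyncPrimary` and `AsyncPrimary` classes, the `ship` method, the message names) is hypothetical and lives in memory; it isn't how Exchange or any particular product does it.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy in-memory second copy of the data."""
    log: list = field(default_factory=list)


class SyncPrimary:
    """Acknowledges a write only after the replica also has it."""

    def __init__(self, replica):
        self.log = []
        self.replica = replica

    def write(self, item):
        self.log.append(item)          # commit locally
        self.replica.log.append(item)  # push to the other copy and wait
        return "ack"                   # only now tell the client it's done


class AsyncPrimary:
    """Acknowledges after the local commit and ships changes later."""

    def __init__(self, replica):
        self.log = []
        self.replica = replica
        self.pending = []  # acknowledged but not yet replicated

    def write(self, item):
        self.log.append(item)
        self.pending.append(item)
        return "ack"  # the client hears "done" before the replica has it

    def ship(self):
        """One periodic replication run: copy pending writes to the replica."""
        self.replica.log.extend(self.pending)
        self.pending.clear()


def lost_on_crash(primary):
    """Writes that were acknowledged but never reached the replica."""
    return [item for item in primary.log if item not in primary.replica.log]


sync_primary = SyncPrimary(Replica())
for message in ["m1", "m2", "m3"]:
    sync_primary.write(message)

async_primary = AsyncPrimary(Replica())
async_primary.write("m1")
async_primary.ship()        # a replication run happens here
async_primary.write("m2")   # these two arrive after the last run
async_primary.write("m3")

print(lost_on_crash(sync_primary))   # []
print(lost_on_crash(async_primary))  # ['m2', 'm3']
```

If the primary dies at that point, the synchronous setup has lost nothing, while the asynchronous one has lost everything written since the last `ship()` run. Shipping more often shrinks that list, but it never guarantees it's empty.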
And here's the problem: at what point do you decide to fail over to your DR site or server? If you fail over automatically, there's the risk that a temporary glitch causes the failover and loses a few minutes' data that hadn't yet been replicated*. But if you're cautious and require manual intervention to initiate the failover, you extend the amount of downtime users see.
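One middle ground between "fail over instantly" and "wait for a human" is to fail over automatically only after several consecutive failed health checks. The sketch below is my own illustration of that idea, not a feature of any specific clustering product; the function names, the threshold and the check interval are all assumptions.

```python
def should_fail_over(health_checks, threshold):
    """Return True once the most recent `threshold` checks have all failed.

    health_checks is a list of booleans, oldest first (True = primary responded).
    """
    if threshold <= 0:
        raise ValueError("threshold must be at least 1")
    if len(health_checks) < threshold:
        return False
    return not any(health_checks[-threshold:])


def first_failover(health_checks, threshold):
    """Number of checks seen when failover would trigger, or None if never."""
    for seen in range(1, len(health_checks) + 1):
        if should_fail_over(health_checks[:seen], threshold):
            return seen
    return None


def outage_before_failover_s(threshold, check_interval_s):
    """Rough downtime users see while waiting for the threshold to be reached."""
    return threshold * check_interval_s


glitch = [True, False, True, True]        # one missed check, then recovery
real_outage = [True, False, False, False]  # the primary is really gone

print(first_failover(glitch, threshold=1))       # 2: fails over on a blip
print(first_failover(glitch, threshold=3))       # None: rides out the blip
print(first_failover(real_outage, threshold=3))  # 4: fails over, but later
print(outage_before_failover_s(3, 10))           # 30 seconds of waiting
```

The threshold is really just the same trade-off with a dial on it: a low value risks a needless failover (and losing whatever hadn't been replicated), a high value means users wait longer when the outage is real.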
So I'm curious - how do you handle this trade-off? Not just email, but other systems as well...
* Again, this simplifies somewhat: if your replication technology is intelligent, you might be able to merge the two sets of data back again after the fact. I believe Windows DFS does this, for example, but that's not always possible, especially if the changes are conflicting.