
Dear community, as some of you may have noticed, we've suffered from several infrastructure outages over the recent weeks. These caused our website, Minecraft servers and internal tooling to be unavailable. We wish to be transparent with everyone about what is going on, why these outages happened and how we plan to tackle them moving forward. The post below involves some technical details; you can skip to the summary at the bottom for a short, simplified version.

Within the hosting industry, and especially cloud hosting, storage comes in many different shapes and sizes, each with its own pros and cons. Some setups value simplicity, some value scalability, others value performance or integrity. When setting up our infrastructure, we had to choose between these options and decided to go with a storage solution that offered both integrity and scalability. As such, we are running an internal block storage solution called Longhorn, with volume replication across all of our servers in geographically different data centers. This means that all machines are constantly replicating each other's volumes. This is great for data integrity and fault tolerance, and it lets us quickly move software deployments such as databases to new machines without having to wait for file transfers.
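For the curious, this is roughly what that replication looks like from the Kubernetes side. The snippet below is a minimal sketch rather than our actual tooling: it assumes a default Longhorn installation in the longhorn-system namespace with the v1beta2 CRDs, and the field names (volumeName, nodeID, robustness) may differ between Longhorn releases.

```python
# Sketch: list Longhorn volumes, their health ("robustness") and which nodes
# hold their replicas. Assumes a default install in "longhorn-system" with the
# v1beta2 CRDs; field names may vary between Longhorn versions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

volumes = api.list_namespaced_custom_object(
    group="longhorn.io", version="v1beta2",
    namespace="longhorn-system", plural="volumes")
replicas = api.list_namespaced_custom_object(
    group="longhorn.io", version="v1beta2",
    namespace="longhorn-system", plural="replicas")

# Map each volume to the nodes its replicas are scheduled on.
placement = {}
for r in replicas["items"]:
    vol = r["spec"].get("volumeName", "?")
    placement.setdefault(vol, []).append(r["spec"].get("nodeID", "?"))

for v in volumes["items"]:
    name = v["metadata"]["name"]
    health = v.get("status", {}).get("robustness", "unknown")  # healthy / degraded / faulted
    print(f"{name}: {health}, replicas on {sorted(placement.get(name, []))}")
```

As long as everything is connected, each volume reports as healthy, with its replicas spread over several machines.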
That sounds great, so why bring it up? Everything works great as long as everything remains connected. As soon as nodes lose connection with each other, the entire machine will be marked as unhealthy. This by itself is okay, because we can then deploy the software on different machines and recover quickly. The problem, however, is that a node becoming unhealthy results in the storage solution shutting itself down on that machine. This means the replica status is lost and its health can no longer be communicated. As a result, the data replica on the unhealthy node is immediately considered degraded. This doesn't mean data loss, nor does it necessarily cause any damage, but it does cause the storage solution to rebuild the entire replica on that node from a node that was still healthy.

As some of you may know, we replicate data across our American and European servers. Repairing a volume across the Atlantic Ocean comes with a big drop in network capacity. To save budget and reserve extra capacity for alpha and beta development, excess hardware has been removed, so we cannot reliably place the replicas for these volumes on multiple machines within the same continent and are forced to replicate across continents. A full replica rebuild therefore takes a long time, and doing this for dozens of volumes at once puts considerable strain on the connection between the two continents. In some cases, volumes cannot be attached to workloads before they are fully healthy. This is what caused the lengthy outages of the past few weeks.
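To give a feel for the numbers involved: the figures below are illustrative rather than our actual volume sizes or link capacity, but the arithmetic shows why rebuilding dozens of replicas over one constrained link takes hours rather than minutes.

```python
# Back-of-the-envelope rebuild time. All numbers are illustrative assumptions,
# not our actual volume sizes or link capacity.
volume_gib = 50            # size of a single replica that has to be rebuilt
link_mbit_per_s = 200      # effective cross-continent throughput available for rebuilds
volumes_rebuilding = 24    # "dozens of volumes" rebuilding at the same time

total_bits = volumes_rebuilding * volume_gib * 1024**3 * 8
hours = total_bits / (link_mbit_per_s * 1_000_000) / 3600
print(f"roughly {hours:.1f} hours of sustained transfer")  # ~14.3 hours
```

And while those rebuilds are running, everything else that shares the link between the continents gets less bandwidth as well.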
Why machines lost connection in the first place will be disclosed later in this post. The above wouldn't be a concern if a short connection drop didn't immediately kill the replica instance on the machine that loses connection. The developers of said storage solution agree, and a bug/feature ticket has been created on their GitHub as a result. We were aware of this issue and have been tracking it for some time. Unfortunately, the priority wasn't high enough initially and a resolution has been pushed back to the next major update. Since the outage last weekend, we have tweaked several settings in this storage solution and are monitoring the results. We also have an alternative storage plan drawn up in case this solution remains problematic.
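To give an idea of the kind of setting involved (shown purely as an illustration, not necessarily one of the exact settings we changed): recent Longhorn releases have a replica replenishment wait interval, which controls how long Longhorn waits before rebuilding a replacement replica, giving a briefly disconnected replica a chance to come back instead of triggering a full rebuild straight away. Reading such a setting back through the Kubernetes API looks roughly like this; the setting name and CRD version are assumptions, so check the documentation for your Longhorn version.

```python
# Sketch: read one Longhorn setting back after tweaking it. The setting name
# and CRD version are assumptions based on recent Longhorn releases.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

setting = api.get_namespaced_custom_object(
    group="longhorn.io", version="v1beta2",
    namespace="longhorn-system", plural="settings",
    name="replica-replenishment-wait-interval")
print("replica-replenishment-wait-interval:", setting.get("value"), "seconds")
```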
As discussed above, issues occur when nodes lose connection. This by itself has several causes, and we'll be sharing three of those causes here.

First, a worldwide OVH outage on the 13th of October caused a connection loss between all of our servers. Exact details are missing, but the general understanding is that a large-scale BGP routing update at OVH went wrong, causing all routes to disappear.
