Update: Proton worldwide outage caused by Kubernetes migration, software change (link)

  • Kid@sh.itjust.works (OP, mod) · 51 upvotes · 13 days ago

    Apparently it was not related to a cyberattack, according to the status page (https://status.proton.me/):

    We have resolved all service outages, and the situation has been stable for some time. We have identified the root cause of the problem, implemented a fix, and are now monitoring the results. Jan 09, 2025 - 19:27 CET

  • Mango@lemmy.dbzer0.com · 19 upvotes · 13 days ago

    It was honestly just down for about 2 hours. I hate that you can find out Proton services are down faster from their subreddit than from Proton’s own status page…

    Proton’s official response to the outage, via Reddit (wish they would move to Lemmy):

    Earlier today at around 4PM Zurich, the number of new connections to Proton’s database servers increased sharply globally across Proton’s infrastructure.

    This overloaded Proton’s infrastructure, and made it impossible for us to serve all customer connections. While Proton VPN, Proton Pass, Proton Drive/Docs, and Proton Wallet were recovered quickly, issues persisted for longer on Proton Mail and Proton Calendar. For those services, during the incident, approximately 50% of requests failed, leading to intermittent service unavailability for some users (the service would look to be alternating between up and down from minute to minute).

    Normally, Proton would have sufficient extra capacity to absorb this load while we debug the problem, but in recent months, we have been migrating our entire infrastructure to a new one based on Kubernetes. This requires us to run two parallel infrastructures at the same time, without having the ability to easily move load between the two very different infrastructures. While all other services have been migrated to the new infrastructure, Proton Mail is still in the middle of the migration process.

    Because of this, we were not able to automatically scale capacity to handle the massive increase in load. In total, it took us approximately 2 hours to get back to the state where we could service 100% of requests, with users experiencing degraded performance until then. The service was available, but only intermittently, with performance being substantially improved during the second hour of the incident, but requiring an additional hour to fully resolve.

    A parallel investigation by our site reliability engineering team identified a software change that we suspected was responsible for the initial load spike. After this change was rolled back, database load returned to normal. This change was not initially suspected because a long period of time had elapsed between when this change was introduced and when the problem manifested itself, and an initial analysis of the code suggested that it should have no impact on the number of database connections. A deeper analysis will be done as part of our post-mortem process to understand this better.

    The completion of ongoing infrastructure migrations will make Proton’s infrastructure more resilient to unexpected incidents like this by restoring the higher level of redundancy that we typically run, and we are working to complete this work as quickly as possible.
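To make the failure mode in the statement above more concrete (a sharp rise in new database connections that the servers could not absorb), here is a minimal Python sketch of one common mitigation: capping concurrent connections and shedding the excess instead of passing it on to the database. This is purely illustrative and says nothing about how Proton's systems actually work; `BoundedPool`, `PoolExhausted`, and `simulate_spike` are hypothetical names, and the "connection" is a placeholder object.

```python
import threading
from contextlib import ExitStack, contextmanager


class PoolExhausted(Exception):
    """Raised when every connection slot is already in use."""


class BoundedPool:
    """Hypothetical pool that caps how many connections are open at once."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.rejected = 0  # number of requests shed so far

    @contextmanager
    def connection(self):
        # Non-blocking acquire: fail fast rather than queueing more load.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise PoolExhausted("connection limit reached")
        try:
            yield object()  # stand-in for a real database connection
        finally:
            self._slots.release()


def simulate_spike(pool, concurrent_requests):
    """Hold `concurrent_requests` connections open at once; return (served, shed)."""
    served = shed = 0
    with ExitStack() as stack:
        for _ in range(concurrent_requests):
            try:
                stack.enter_context(pool.connection())
                served += 1
            except PoolExhausted:
                shed += 1
    return served, shed


pool = BoundedPool(max_connections=3)
print(simulate_spike(pool, concurrent_requests=5))  # prints (3, 2)
```

With a cap of 3 and 5 simultaneous requests, 2 are rejected at the application layer, which resembles the "some requests fail, the service flaps" behaviour described in the statement but keeps the database itself within its limits.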

    • ChapulinColorado@lemmy.world · 8 upvotes · 13 days ago

      I don’t use them, but I do work in tech, and oopsies do happen even with properly configured k8s clusters or well-managed bare-metal infrastructure and well-trained engineers. For example, a developer might not be fully aware that something as simple as writing logs to a file can reduce capacity on k8s by getting pods evicted.
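A small illustration of the log-file point, as a sketch under assumptions rather than anything specific to Proton: on Kubernetes, files written inside a container count toward its ephemeral storage, and the kubelet can evict pods that exceed an ephemeral-storage limit or that run on a node under disk pressure. One way to keep file logging from growing without bound is Python's standard `RotatingFileHandler`; the helper names `make_bounded_logger` and `disk_usage` below are hypothetical.

```python
import logging
import logging.handlers
import os


def make_bounded_logger(log_dir, name="bounded-demo", max_bytes=10_000, backups=2):
    """Return (logger, handler) writing to log_dir with capped disk usage.

    With short messages, total usage stays at or below
    max_bytes * (backups + 1), because the handler rotates files
    and deletes the oldest backup.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "app.log"),
        maxBytes=max_bytes,
        backupCount=backups,
    )
    logger.addHandler(handler)
    return logger, handler


def disk_usage(log_dir):
    """Total size in bytes of all files directly inside log_dir."""
    return sum(
        os.path.getsize(os.path.join(log_dir, entry))
        for entry in os.listdir(log_dir)
    )
```

Called in a loop, `logger.info(...)` keeps at most `backups + 1` files on disk. Logging to stdout and letting the cluster's log collection handle retention is another common choice; which one fits depends on the deployment.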

      It does sound like the post is beating around the bush in terms of what caused the outage, but if their post-mortem fully acknowledges what it was and lays out decent steps to mitigate it, short and long term, it could still be a lesson learned. Generally, it’s not possible to correct something that quickly in complex systems, or in environments that are used to a certain workflow, as much as customers and users would like (developers, like anyone else, make mistakes).

      Whether it was a newbie mistake in the code review process or something else, if they are honest and clear about it, it can still impress people willing to migrate. Using MS Teams and O365 at work, it feels like there is an intermittent outage every other month.