Hits stats wrong (Resolved)
  • Priority - Low
  • Affecting System - ElasticSearch
  • Our statistics from ElasticSearch can be wrong due to a failure in data allocation. We've corrected the error on the cluster, but had to throw some log data away. Because it's only temporary data, we won't refill it into the cluster; we'll just let it recover over the next 7 days.

  • Date - 21/10/2018 22:05 - 21/10/2018 22:05
  • Last Updated - 21/10/2018 22:06
Disk replacement (Resolved)
  • Priority - Critical
  • Affecting Server - GRA5
  • We have a failing disk in server5 (GRA5), and we have to replace the disk during the evening.
    There will be downtime involved in the replacement.

    We're scheduling the replacement for around 10 pm; the disk will be replaced shortly after, or during the night.

    We're expecting the downtime to be roughly 30 minutes or less.
    After the replacement, we'll rebuild the RAID array.

    We'll perform an additional dump of the MySQL databases to our backup server prior to the disk replacement, as a safety measure.

    We're sorry for the inconvenience caused by this, but we need to ensure the availability of the RAID array.

    Update 20.15: We had a short lockup again, lasting for roughly 1 minute.

    Update 21.00: We'll request a disk replacement in a few minutes.

    Update 21.57: The server has been turned off, to get the disk replaced.

    Update 22.09: The server is back online, services are stabilizing, and the RAID rebuild is running.

  • Date - 20/09/2018 16:00 - 21/09/2018 13:00
  • Last Updated - 09/10/2018 13:02
Device upgrade GRA (Resolved)
  • Priority - Low
  • Affecting System - Network
  • The data center will upgrade the top of rack switches in the Gravelines data center.

    This will affect server10 (GRA10).

    The maintenance will take place starting 11 pm the 18th of September and last until 6 am the 19th.

    There will be a loss of network for up to 10 minutes.

    Server5 and server9 were completed during the night between September 12 and September 13.

  • Date - 18/09/2018 23:00 - 19/09/2018 06:00
  • Last Updated - 19/09/2018 08:04
Device upgrade RBX (Resolved)
  • Priority - Low
  • Affecting System - Network
  • The data center will upgrade the top of rack switches in the Roubaix data center.

    This will affect server6 (RBX6), server7 (RBX7) and server8 (RBX8).

    The maintenance will take place starting 11 pm the 13th of September and last until 6 am the 14th.

    There will be a loss of network for up to 10 minutes.

    Update 07.05: Maintenance completed as of 03.42 am.

  • Date - 13/09/2018 23:00 - 14/09/2018 06:00
  • Last Updated - 14/09/2018 07:06
Device upgrade RBX (Resolved)
  • Priority - Low
  • Affecting System - Network
  • The data center will upgrade the top of rack switches in the Roubaix data center.

    This will affect server6 (RBX6).

    The maintenance will take place starting 11 pm the 19th of September and last until 6 am the 20th.

    There will be a loss of network for up to 10 minutes.

  • Date - 19/09/2018 23:00 - 20/09/2018 06:00
  • Last Updated - 13/09/2018 08:00
Top of Rack switch upgrades (Resolved)
  • Priority - Low
  • Affecting System - Network
  • The data center is performing top of rack switch upgrades across the RBX and GRA data centers. This means that server5, server6, server7, server8, server9 and server10 might be affected for up to 10 minutes at random times during the night.

    We're sorry for the inconvenience caused by this.

    Update 07.55 am:
    The top of rack switches for server5 and server9 have been upgraded.

    server10 is planned for the night between September 18 and September 19.
    server6, server7 and server8 are planned for the night between September 13 and September 14.

  • Date - 12/09/2018 23:00 - 19/09/2018 06:00
  • Last Updated - 13/09/2018 07:58
MultiPHP enabled (Resolved)
  • Priority - Medium
  • Affecting Server - GRA4
  • We'll migrate this server to a MultiPHP setup to support future versions of PHP (7.0 and 7.1).

    Currently the server runs something called "EasyApache 3" (provided by cPanel); we'll be upgrading to the new version, EasyApache 4, in our CloudLinux environment.

    This also means that PHP Selector will be deprecated, so custom module support won't be available.

    Since this means removing the old PHP versions (which were previously compiled from source) in favour of a new set based on yum, a short downtime is expected.

    As with any other (new) server we have, we're also switching from FastCGI to mod_lsapi, first of all to allow user.ini files and php_value settings, but more importantly because mod_lsapi isn't as buggy as FastCGI is known to be.

    We've set a maintenance window of 2 hours; it shouldn't be needed, but it should be sufficient in case any problems arise.
    We're doing our best to keep the downtime as short as possible.

    After this we'll be offering PHP versions 5.6 (the version currently in use), 7.0 and 7.1.

    We'll enable PHP 5.6 as the default on all sites after we've upgraded.

    We do advise upgrading to 7.0 in case your software supports it.

    Update 9.01pm: We're starting the update in a few minutes.

    Update 9.23pm: We've completed the maintenance, with a total downtime of 3-4 minutes while reinstalling the different versions.

    We're doing some small modifications which won't impact services.

  • Date - 21/01/2017 21:00 - 21/01/2017 21:23
  • Last Updated - 14/08/2018 11:37
Downtime (Resolved)
  • Priority - Low
  • Affecting System - backup server
  • The backup server cdp03 will be moved to another data center, which means the server will be unavailable from 9.15 am on the 16th of July.
    The server will come online again within 7 hours.

    The server came online again at 11.49 this morning.

  • Date - 16/07/2018 09:15 - 16/07/2018 11:49
  • Last Updated - 16/07/2018 17:22
Server unavailable (Resolved)
  • Priority - Critical
  • Affecting Server - GRA5
  • 09.45: Server5 is currently unavailable; we're investigating.

    09.54: About 30 racks in the data center seem to be affected by this outage; we're waiting for an update from the data center.

    09.59: gra1-sd4b-n9 is experiencing network issues and the data center is working on restoring connectivity; server5 (GRA5) routes traffic via this linecard, which caused the unavailability.

    10.03: Services are returning to normal; a total of 17 minutes of downtime was experienced. The data center moved traffic to gra1-sd4a-n9.

    10.26: There's packet loss on the server, which can result in slower response times and possibly intermittent failing requests.

    10.38: The high packet loss only affects the primary IP of the server; all customers are located on secondary IP addresses, meaning connectivity to websites will keep working.

    Because the primary IP has packet loss, outgoing DNS resolution and email delivery might also be temporarily unavailable until the loss returns to an acceptable level.

    11.44: Connectivity to the primary IP has returned with 0% packet loss; email sending and delivery, DNS, etc. are once again working.

    15.53: The outage was caused by a software bug in the Cisco IOS version used on the routers. When the bug triggered, it caused all active sessions on the router to drop, killing traffic. The data center switched traffic to the standby router (gra1-sd4a-n9) to restore service, then upgraded gra1-sd4b-n9 to fix the bug, switched traffic back, and performed the same update on gra1-sd4a-n9.

  • Date - 16/07/2018 09:45 - 16/07/2018 11:44
  • Last Updated - 16/07/2018 15:58
Server3a migration (Resolved)
  • Priority - Low
  • Affecting Server - GRA3
  • We'll migrate customers from server3a to new infrastructure.

    16/06/2018 we'll migrate a batch of customers to server9 - starting at 8.45pm
    17/06/2018 we'll migrate a batch of customers to server9 - starting at 8.45pm
    18/06/2018 we'll migrate the remaining batch of customers to server10 - starting at 8.45pm

    Update 16/06/2018 8.36pm: We'll start migration in about 10 minutes.
    Update 16/06/2018 9.49pm: migration for today has been completed.

    Update 17/06/2018 8.36pm: We'll start migration in about 10 minutes.
    Update 17/06/2018 9.35pm: migration for today has been completed.

    Update 18/06/2018 8.41pm: We'll start migration in about 5 minutes.
    Update 18/06/2018 9.35pm: migration for today has been completed. This means all accounts have been migrated from server3a.

  • Date - 16/06/2018 20:45 - 18/06/2018 23:59
  • Last Updated - 18/06/2018 21:35
Account creation (Resolved)
  • Priority - Low
  • Affecting Other - cPanel
  • We currently have a slowdown in the acceptance of orders due to capacity constraints on existing servers.
    We're in the process of setting up new infrastructure to accommodate the orders.

    We expect orders to be accepted again by the end of today (Saturday 2nd June)

    Update 9.24 pm: A new server has been put into production.

  • Date - 02/06/2018 09:41
  • Last Updated - 02/06/2018 21:25
Replacement of Backup server (Resolved)
  • Priority - Medium
  • Affecting Server - Backup Server
  • We'll replace our backup server, which means we'll have to redo backups.

    As a result, older backups won't be restorable directly from cPanel; however, we can manually restore these if you create a ticket at support@hosting4real.net.


    16/04/2018:
    We've started backups for all servers on the new backup system - we will keep the old backup server alive for another 14 days, after which we will start scrubbing the server for data.

    03/05/2018:
    We've decommissioned the old system.

  • Date - 16/04/2018 12:45 - 03/05/2018 10:32
  • Last Updated - 03/05/2018 10:32
Patching of linux kernels (Resolved)
  • Priority - Critical
  • Affecting System - All servers
  • Original:
    Due to recently discovered security vulnerabilities in many x86 CPUs, we'll have to upgrade kernels across our infrastructure and reboot our systems.

    We've already patched a few systems where the software update is available; we're waiting a bit with our hosting infrastructure until the kernel has gone to "production" and has been in production for roughly 48 hours, to ensure stability.

    We'll reboot systems one by one during the evenings. We have no specific date yet for when we'll start, but downtime is to be expected, hopefully only 5-10 minutes per server if no issues occur.

    Servers might be down for longer depending on how the system behaves during the reboot, but we'll do everything we can to prevent boot issues like we had with server3a recently.

    This post will be updated as we patch our webservers; other infrastructure gets patched in the background where there's no direct customer impact.

    The patching does bring a slight performance degradation to the kernel; the actual degradation varies depending on the workload of the server, so we're unsure what effect it will have for individual customers. It's something we will monitor post-patching.

    Update 05/01/2018 5.23pm:
    We'll update a few servers this evening. 2 of the 3 vulnerabilities will be fixed by this update, so we'll have to perform another reboot of the servers next week as well, when the remaining update is available.

    We do try to keep downtime to an absolute minimum, but given the impact these vulnerabilities have, we'd rather perform the additional reboot of our infrastructure to keep the systems secure.

    We're sorry for the inconvenience caused by this.

    Update 05/01/2018 6.25pm:
    We'll do as many servers as possible this evening. If we get no surprises (e.g. non-bootable servers), everything should be patched fairly quickly. We start from the highest number and work towards the lowest, as follows:

    server8.hosting4real.net
    server7.hosting4real.net
    server6.hosting4real.net
    server5.hosting4real.net
    server4.hosting4real.net
    server3a.hosting4real.net

    These 6 servers are the only ones directly impacting customers; for the same reason, these restarts are performed during the evening (after 10pm) to minimize the impact on visitors.

    Other services such as the support system, mail relays, statistics and backups will be rebooted as well; where possible, we redirect traffic to other systems.

    Expected downtime per host should be roughly 5 minutes if the kernel upgrades go as planned; longer downtime can occur in case a system enters a state where we have to manually recover it afterwards.

    Update 05/01/2018 8.34pm:
    server4.hosting4real.net will be postponed until tomorrow (06/01/2018) at the earliest, since the kernel is still in "beta" state from CloudLinux; depending on the outcome we'll decide to either perform the upgrade tomorrow, or postpone it to Sunday.

    For the other servers, we plan to start today at 10pm with server8, and after that proceed with server7 and so on.

    Update 05/01/2018 9.57pm: We start with server8 in a few minutes.
    Update 05/01/2018 10.07pm: Server8 done, with 4 minutes downtime - we proceed with server7.
    Update 05/01/2018 10.15pm: Server7 done, with 3-4 minutes downtime - we proceed with server6.
    Update 05/01/2018 10.39pm: Server6 done, with 9 minutes downtime (high php/apache load) - we have to redo server7 since the microcode didn't get applied.
    Update 05/01/2018 11.00pm: Server5 done, with 3 minutes downtime - proceeding with server3a.
    Update 05/01/2018 11.13pm: Server3a done with 5 minutes of downtime - we'll proceed with server4 tomorrow when the CloudLinux 6 patch should be available.
    Update 05/01/2018 11.49pm: Server5 experienced an issue with MySQL. The issue was caused by LVE mounts being mounted before the MySQL partition (/var/lib/mysql) was mounted as it should be; this left MySQL in a state where sites connecting via a socket (as most sites do) could not connect, while sites connecting via 127.0.0.1 could connect just fine.

    The monitoring page we run on every server does not check that both TCP and socket connections towards MySQL are available; as a result, the monitoring system didn't see this error directly and thus didn't trigger an alarm.

    We'll change our monitoring page to perform an additional check, connecting both via TCP and via socket; we expect this change to be completed by noon tomorrow.
    We're sorry for the inconvenience caused by the extended downtime on server5.

    Update 08/01/2018 8.47pm: We'll patch server4 today, starting at 10pm. We'll try to keep downtime as short as possible; however, the change required here is slightly more complicated, which increases the risk.

    We're still waiting for some microcode updates that we'll have to apply to all servers once they're available; we're hoping for them to arrive by the end of the week.

    Update 08/01/2018 9.58pm: We'll start the update of server4 in a few minutes.
    Update 08/01/2018 10.15pm: We're reverting to the old kernel, since the new kernel has issues booting. Current status: loading the rescue image to boot the old kernel.
    Update 08/01/2018 11.02pm: While working to get server4 back online, we've initialized our backup procedure and started restoring accounts from the latest backup onto another server, to get customers back online as fast as possible.

    Update 08/01/2018 11.22pm: We've restored about 10% of the accounts on a new server.
    Update 09/01/2018 00.56am: Information about the outage of server4 can be found here: https://shop.hosting4real.net/serverstatus.php?view=resolved - with title "Outage of server4 (Resolved)"

    Update 10/01/2018 06.31am: A new version of the microcode will soon be released to fix more vulnerabilities; when the version is ready, we'll update a single server (server3a) to verify that it enables the new features.

    If the features get enabled, we'll upgrade the remaining servers (excluding server4) 24 hours later.

    Update 16/01/2018 8.32am: We will perform a microcode update on server3a today to implement a fix for the CPU. This means we'll have to reboot the server, with an expected ~3-5 minutes of downtime. We will do the reboot after today's hosting account migrations, which start at 9pm, so the server3a update will happen around 9.30 or 10pm.

    Update 16/01/2018 9.28pm: We'll reboot server3a.
    Update 16/01/2018 9.33pm: The server has been rebooted - total downtime of 2 minutes.

  • Date - 05/01/2018 07:00 - 19/01/2018 23:59
  • Last Updated - 16/04/2018 17:21
Outgoing port 25/26 blocked (Resolved)
  • Priority - Low
  • Affecting Server - RBX8
  • We've blocked outgoing port 25/26 on RBX8 / server8 to prevent some spam sending.

    Normal email flow won't be affected, since it goes via another port towards our outgoing mail relay.

    If you're connecting to external SMTP servers - please use port 587.
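    As a reference for customers scripting their own delivery, connecting to an external SMTP server over the submission port looks roughly like this. This is a Python sketch; the hostname, addresses and credentials are placeholders, not real accounts.

```python
import smtplib
from email.message import EmailMessage

def send_via_submission(host: str, user: str, password: str,
                        msg: EmailMessage) -> None:
    """Deliver a message through an external SMTP server on port 587."""
    # Port 587 (submission) stays open; outgoing port 25/26 is blocked.
    with smtplib.SMTP(host, 587, timeout=10) as smtp:
        smtp.starttls()              # upgrade the connection to TLS
        smtp.login(user, password)   # submission normally requires auth
        smtp.send_message(msg)

msg = EmailMessage()
msg["From"] = "sender@example.com"       # placeholder addresses
msg["To"] = "recipient@example.com"
msg["Subject"] = "Delivery via port 587"
msg.set_content("Sent through the SMTP submission port.")
# send_via_submission("smtp.example.com", "user", "password", msg)
```

    Most mail clients expose the same choice in their account settings: pick port 587 with STARTTLS instead of port 25.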

  • Date - 13/03/2018 10:59
  • Last Updated - 16/04/2018 17:21
Mail routing (Resolved)
  • Priority - High
  • Affecting System - Mail cluster
  • We experienced a mail routing issue for certain domains towards our mail platform, which caused some incoming email to bounce (soft or hard, depending on configuration).
    This was caused by one of our mailservers rejecting email due to an issue, without the mailserver properly marking itself as down.

    We've removed the mailserver temporarily while we investigate the issue, and we will develop a check for this mail-routing issue to prevent it in the future.

  • Date - 28/01/2018 13:54 - 28/01/2018 14:55
  • Last Updated - 28/01/2018 15:30
Migration (Resolved)
  • Priority - Low
  • Affecting Server - GRA4
  • We'll migrate the remaining customers off server4 after the recent outage on the server.

    We'll move the customers in batches:

    Customers migrated on following days, will be migrated from server4 to server8:
    15/01/2018
    16/01/2018
    17/01/2018
    18/01/2018

    Customers migrated on following days, will be migrated from server4 to server9:
    22/01/2018
    23/01/2018
    24/01/2018
    25/01/2018
    26/01/2018

    All migrations start at 9pm in the evening. Accounts containing domains with external DNS will be migrated at specific times communicated by email.
    Remaining customers will be migrated after 9pm and before midnight each day.

    Update 15/01/2018 9.00pm: We start the migration.
    Update 15/01/2018 9.37pm: Migration for today has been completed.

    Update 16/01/2018 9.00pm: We start today's migrations.
    Update 16/01/2018 9.24pm: Migrations for today have finished.

    Update 17/01/2018 9.00pm: We start today's migrations.
    Update 17/01/2018 10.04pm: Migrations for today have finished.

    Update 18/01/2018 9.00pm: We start today's migrations.
    Update 18/01/2018 9.19pm: Migrations for today have finished.

    Update 22/01/2018 9.00pm: We start today's migrations.
    Update 22/01/2018 9.45pm: Migrations for today have finished.

    Update 23/01/2018 9.00pm: We start today's migrations.
    Update 23/01/2018 9.34pm: Migrations for today have finished.

    Update 24/01/2018 9.00pm: We start today's migrations.
    Update 24/01/2018 9.27pm: Migrations for today have finished.

    Update 25/01/2018 9.00pm: We start today's migrations.
    Update 25/01/2018 9.32pm: Migrations for today have finished.

    Update 26/01/2018 9.00pm: We start today's migrations.
    Update 26/01/2018 9.02pm: Migrations for today have finished.

    We'll shut down server4 (GRA4) tomorrow (28/01/2018) at 2pm Europe/Amsterdam time and start wiping the disks Monday morning.

  • Date - 15/01/2018 21:00 - 26/02/2018 12:50
  • Last Updated - 27/01/2018 20:32
Outage of server4 (Resolved)
  • Priority - High
  • Affecting Server - GRA4
  • Between 08/01/2018 10.04pm and 09/01/2018 00.13am we experienced lengthy downtime on server4.

    Below you'll find a detailed description of the outage and the actions we took during the downtime.

    Background:

    In the beginning of January it became known to the public that CPUs contain a big vulnerability that puts the security of all computer systems at risk. Hardware and operating system vendors have been working 24/7 to provide mitigation techniques to prevent these security holes from being exploited.

    The exploits are called "Meltdown" and "Spectre" if you want to read more about them.

    As a service provider, we're committed to providing a secure hosting platform, which means we had to apply these patches as well.

    Overall we've been able to mitigate the vulnerabilities on the majority of our platform; however, we had a single machine (server4) which ran an old version of CloudLinux 6.

    Fixing the vulnerabilities requires a few things: updating the kernel (the brain of the operating system) and updating something called "microcode". Microcode consists of small pieces of code that allow the CPU to talk to the hardware, and allow hardware and CPU to enable or disable specific features or technologies.

    Both of these can be very risky to upgrade, and there's always a chance of a system not booting correctly after such an upgrade, which is usually why we use software such as KernelCare to patch kernels without rebooting systems and thus mitigate downtime.

    However, fixing these specific security vulnerabilities requires extensive changes to how the kernel of the operating system works, so patching a running system with such big changes can result in a lot of problems, such as crashing infrastructure, instability or serious bugs.

    For the same reason, we, like many others, decided to do an actual reboot of our infrastructure, since that is in general the safer way to perform the change.

    Today's events (08/01/2018):

    10.00pm: We started upgrading the software packages on server4; more specifically, we had to upgrade the kernel packages of the system as well as install new microcode for the CPU.
    10.04pm: We verified that the grub loader was present in the MBR of all 4 hard drives in the server.
    10.09pm: We had an issue booting the server, so we did another restart and tried to manually boot from the drives one by one, to see if the cause was a corrupt grub loader on one of the disks.
    10.12pm: We decided to revert to the old version of the kernel that we started out with, basically cancelling the maintenance.
    10.15pm: Booting into CloudLinux rescue mode took longer than expected due to slow transfer speeds to the IPMI device on the server.
    10.23pm: We started reverting to the old kernel; however, the specific kernel we wanted to revert to wasn't available on the install media.
    10.32pm: We reinitialized the install media with the specific kernel available, to revert the system.
    10.44pm: While the install media loaded, we started preparing account restores to another server, to get people back online faster.
    10.50pm: The install media contained a bug in the rescue image that prevented us from continuing the rollback of the kernel. We opened a ticket with CloudLinux; they're currently investigating the cause of the issue we saw.
    10.54pm: We started restoring accounts from server4 to server8, prioritizing smaller (diskspace-wise) accounts first to get the highest percentage online.
    11.22pm: 10% of the accounts had been restored; we continued the work on server4.
    11.42pm: 30% of the accounts had been restored.
    11.58pm: 43% of the accounts had been restored.
    12.13am: Server4 came back online and we cancelled all remaining account restorations, with 49% restored on server8.

    The root cause of the system not booting was corruption of the stage1 and stage2 files during the upgrade. It was resolved by booting the rescue image from the data center, manually regenerating the stage1 and stage2 files, and downloading the kernel files we originally used.

    This allowed us to get the system back online afterwards.

    Current status:
    Currently about half of the accounts have been restored to server8. These accounts will remain on this server, since its hardware and software are newer as well.
    We're in the process of checking for inconsistencies in the accounts that were migrated to the other server.

    We'll also update the records in our billing system to reflect the server change, and email all customers moved to the new server about their new IPs.

    Upcoming changes:
    We'll leave the current server "as is", and we'll plan to migrate the remaining accounts to a new server in the following weeks, to ensure the security of our customers' data.

  • Date - 08/01/2018 22:04 - 09/01/2018 00:13
  • Last Updated - 09/01/2018 00:55
Support / billing system migration (Resolved)
  • Priority - Low
  • Affecting System - Support / billing system
  • During the upcoming weekend (22-24 December), we'll migrate our support and billing system to new infrastructure.

    We're doing this to consolidate a few systems, but also to move our support system out of our general network, due to the recent outage we had that also affected our support system.

    The migration will happen during the daytime. There will be some unavailability of our support and billing system during the migration; we're expecting about 5-15 minutes of total downtime for the system as a final step, which we do to ensure consistency of our data during the migration.

    In the period just around moving the actual database of the system to the new database server, we will put the system into maintenance mode, stop any possible import of tickets into our system, and then enable it again shortly after migration has been completed.

    This also means that ticket import might be delayed by up to 15 minutes.
    We monitor our support email during the event, so if anything urgent comes up during the migration, we will see those emails.

    We expect to complete the migration on the 23rd of December, with the possibility of postponing it until the 24th of December.

    Our email support@hosting4real.net will continue to work during the whole process.

    Update 23/12/2017 11.15: We start the migration.
    Update 23/12/2017 11.49: Migration has been completed; we're doing additional checks to ensure everything works.
    Update 23/12/2017 12.49: We've tested that payments work, as well as single sign-on directly to cPanel.

  • Date - 23/12/2017 07:00 - 24/12/2017 17:00
  • Last Updated - 05/01/2018 07:05
Migration of infrastructure (Resolved)
  • Priority - Low
  • Affecting System - CDN Dashboard / management infra
  • We'll be migrating our CDN dashboard and management infrastructure to a new setup.

    The migration will be made to speed up the dashboard and to simplify scaling the backend system in the future.

    During the migration we'll disable the old system completely to ensure integrity of data and prevent duplicate systems from doing updates.

    CDN traffic won't be affected by this migration; however, the dashboard, API and purging functionality won't be available for a big part of the actual migration.
    Statistics will be delayed and reprocessed afterwards, also to avoid any duplicate or missing data.

    [Update 15/12/2017 20:05]: We start the migration
    [Update 15/12/2017 21:12]: Migration has been completed and DNS switched to the new infrastructure.

  • Date - 15/12/2017 20:00 - 15/12/2017 21:12
  • Last Updated - 15/12/2017 21:13
Network maintenance (Resolved)
  • Priority - Low
  • Affecting Server - GRA4
  • On December 12, starting at 10pm and lasting until 6am on the 13th of December, the data center we use will perform a network equipment upgrade on the switch which server4 (GRA4) uses. The expected downtime for this maintenance is about 1 minute (moving the ethernet cable from one port to another). The downtime will happen during the night, so it shouldn't impact customer traffic too much.

    The equipment upgrade is to support growth and network offerings at the data center.

    At 22.59 the system went offline (confirmed by monitoring).
    At 23.01 the system came back online and most services returned to normal.

    A few IP addresses still didn't "ping" as they should, so we issued manual pings from both ends to ensure the ARP entries were renewed on the routers; all services were confirmed working at 23.05.

    We're closing the maintenance as of 23.11.

  • Date - 12/12/2017 22:00 - 12/12/2017 23:11
  • Last Updated - 12/12/2017 23:11
Extended downtime on server3a (Resolved)
  • Priority - Critical
  • Affecting Server - GRA3
  • We experienced extended downtime on server3a (GRA3) yesterday between 3.37pm and 5.14pm, lasting a total of 97 minutes.

    The issue was initially caused by a kernel panic (basically, the core of the operating system gets confused, causing a crash). The kernel panic itself was caused by a kernel update provided by KernelCare, which we use to apply kernel security patches without having to reboot the systems.

    The update that KernelCare issued contained a bug for very specific kernel versions, which affected one of our servers.

    Normally when a kernel panic happens, the system automatically reboots and comes online again a few minutes later; in our case, however, the downtime got rather lengthy because the system didn't come back online afterwards.

    The boot issue was related to UEFI: the system couldn't find the information it needed to actually boot into the operating system. After trying multiple solutions, we found a working one and got the system back online.

    The specific error message we got usually has multiple solutions, because the error can be caused by multiple things, such as a missing boot loader, corrupt kernel files or missing/misplaced EFI configuration, and that's what caused the downtime to last as long as it did.

  • Date - 07/12/2017 15:37 - 07/12/2017 17:14
  • Last Updated - 08/12/2017 08:36
Multiple servers down (Resolved)
  • Priority - Critical
  • Affecting System - RBX6,RBX7,RBX8,shop.hosting4real.net
  • Today between 08.07 and 10.38 Europe/Amsterdam timezone we experienced a complete outage on servers: RBX6 (Server6), RBX7 (Server7), RBX8 (Server8) as well as our client area/support system.

    Timeline:

    07.20: We receive an alert that ns4.dk, one of our four nameservers, is down. Nothing extraordinary: we run our nameservers at multiple data centers and providers, so downtime is acceptable in this case.

    08.07: We receive alerts about a single IP on RBX7 being down and immediately start investigating.

    08.09: We receive alerts about the complete server RBX7 being down.

    08.12: We receive alerts about shop.hosting4real.net, RBX6 and RBX8 being down, and realize it's a complete data center location outage (Roubaix, France), since it affects RBX6, RBX7, RBX8 and shop.hosting4real.net, which are entirely separate environments.

    08.15: We see that connectivity towards all servers in RBX isn't reaching the data center location as it should, and meanwhile verify whether it affects our other location, GRA, where we have the other half of our servers.

    08.27: We're informed that the outage affecting RBX is due to a fiber optics issue causing routing problems towards the RBX data centers (7 data centers in total, with a capacity of roughly 150,000-170,000 servers).

    08.50: It's confirmed that all 100-gigabit fiber links towards RBX from TH2 (Telehouse 2, Paris), GSW (Globalswitch Paris), AMS, FRA, LDN and BRU are affected.

    10.18: The ETA for bringing up the RBX network is 30 minutes. The cause was corruption/data loss on the optical nodes within RBX, which cleared their configuration; the configuration is being restored from backups.

    10.25: Restoration of connectivity is in progress.

    10.29: All connectivity to RBX has been restored; BGP is recalculating to bring the network back up across the world.

    10.33: We confirm that RBX7, RBX8 and shop.hosting4real.net once again have connectivity.

    10.38: RBX6 comes back online with connectivity.

    What we know so far:

    - The downtime for ns4.dk was caused by a power outage in the SBG datacenter - generators usually kick in, but two generators didn't work, causing the routing room to lose power.
    - The downtimes in SBG and RBX were not related - it just happened to be Murphy's law coming into effect.
    - The outage in RBX was caused by a software bug that made the optical nodes in RBX lose their configuration - the DC provider is working together with the vendor of the optical nodes to fix the bug.

    We're still awaiting a full postmortem with the details from our provider, once we have it - an email will be sent to all customers that were affected by the outage.

    There wasn't much we could do as a shared hosting provider to prevent this - we do try to keep downtime as minimal as possible for all customers. However, issues do happen: networks or parts of them can die, power loss can happen, and data centers can go down.

    In this specific case, the problem was out of our hands and something we have no influence on; the software bug at the provider could hardly have been prevented.

    We're sorry for the issues caused by the downtime today, and we do hope no similar case will happen in the future.

    One additional step we will work on is moving shop.hosting4real.net completely away from our provider and locating it in another datacenter - this ensures that our ticketing system stays online during complete outages. Our support emails did work during the outage, but we'd like to make it easier for customers to get in contact with us.

    This post will be updated as we have more information available.

    Postmortem from us and partially from the provider:

    This morning (9/11/2017) there was an incident with the optical network that interconnects the datacenter (RBX) with 6 of the 33 points of presence that power the network: Paris (TH2, GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).

    The data centers are connected via six optical fibers, and those six fibers are connected to optical node systems (DWDM) that allow 80 wavelengths of 100 gigabits per second on each fiber.

    Each 100 gigabit link is connected to the routers via two geographically distinct optical paths; in case of a fiber cut, the system does a failover within 50 milliseconds. RBX is connected with a total of 4.4 Tbps, being 44x100 gigabit: 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the Gravelines datacenters and 2x100G to the SBG datacenters.

    At 08.01 all 100G links (44x 100G) were lost. Given the redundancy in place, it could not be a physical cut of the 6 optical fibers, and remote diagnostics weren't possible because the management links were lost as well - so manual intervention in the routing rooms had to take place: disconnect cables and reboot the systems to run diagnostics with the equipment manufacturer.

    Each chassis takes roughly 10-12 minutes to boot which is why the incident took so long.

    The diagnostics:

    All the transponder cards in use (ncs2k-400g-lk9, ncs2k-200g-cklc) were in "standby" state. One of the possible origins for this is the loss of the configuration. The configuration was recovered from a backup, which allowed the systems to reconfigure all the transponder cards. The 100G links came back naturally, and the connection from RBX to all 6 POPs was restored at 10.34.

    The issue lies in a software bug on the optical equipment. The database with the configuration is saved 3 times and copied to 2 supervision cards. Despite all the security measures the configuration disappeared. The provider will work with the vendor to find the source of the problem and to help fix the bug. The provider does not question the equipment manufacturer even though the bug is particularly critical. The uptime is a matter of design that has to be taken into account, including when nothing else works. The provider promises to be even more paranoid in terms of network design.

    Bugs can exist, but they shouldn't impact customers. Despite investments in network, fibers, and technologies, 2 hours of downtime isn't acceptable.

    One of the two solutions being worked on is to create a secondary setup for the optical nodes; this means two independent databases, so in case of configuration loss only one system will be down, and only half the capacity will be affected. This project was started one month ago; the hardware has been ordered and will arrive in the coming days. The configuration and migration work will take roughly two weeks. Given today's incident, this will be handled at a higher priority for all infrastructure in all data centers.

    We are sorry for the 2 hours and 33 minutes of downtime in RBX.

    The root cause in RBX:

    1: Node controller CPU overload in the master frame
    Each optical node has a master frame that allows exchanging information between nodes. On the master frame, the database is saved on two controller cards.

    At 7.50 am, communication problems were detected with nodes connected to the master frame, which caused a CPU overload.

    2: Cascade switchover

    Following the CPU overload of the node, the master frame performed a switchover of the controller boards - the switchover combined with the CPU overload triggered a bug in the Cisco software; it happens on large nodes and results in a switchover every 30 seconds. The bug has been fixed in Cisco software release 10.8, which will be available at the end of November.

    3: Loss of the database

    At 8 am, following the cascade switchover events, another software bug was hit that de-synchronized timing between the two controller cards of the master frame. This caused a command to be sent to the controller cards to set the database to 0, which effectively wiped out the database.

    The action plan is as follows:

    - Replace the controllers with TNCS instead of TNCE - this doubles CPU and RAM capacity - the replacement will be done for Strasbourg and Frankfurt as well.
    - Prepare to upgrade all equipment to Cisco software release 10.8
    - Intermediate upgrade to 10.5.2.7 and then upgrade to 10.8
    - Split large nodes to have two separate nodes

    Compensation:

    10/11/2017 20.24: Accounts on Server8 (RBX8) have been compensated
    10/11/2017 20.51: Accounts on Server7 (RBX7) have been compensated
    10/11/2017 22.43: Accounts on Server6 (RBX6) have been compensated

  • Date - 09/11/2017 08:07 - 09/11/2017 10:38
  • Last Updated - 13/11/2017 11:26
Cooling check - possible downtime (Resolved)
  • Priority - Critical
  • Affecting Server - RBX6
  • We detected an issue with the water cooling of our server, meaning that the CPU runs a lot hotter than it should (currently at 90C), which also causes performance throttling. We've scheduled an intervention with the datacenter engineers around midnight today (between the 3rd and 4th of November) - depending on the outcome, we might have to temporarily shut down the server for the cooling block to be replaced.

    In case the issue lies in the cabling going from the cooling loop to the server itself, then this can often be replaced without shutting down the server.

    We're sorry about the inconvenience and the short notice.

    Update 03/11/2017 11.53pm: Intervention will start shortly

    Update 04/11/2017 12.08am: The machine has been shut down for the datacenter to perform the intervention.

    Update 04/11/2017 12.25am: Machine back online

    Update 04/11/2017 12.31am: Load on the system returned to normal; we verified that temperatures are now correct.

  • Date - 04/11/2017 00:00 - 04/11/2017 00:31
  • Last Updated - 04/11/2017 00:32
Downtime on server4 (Resolved)
  • Priority - High
  • Affecting Server - GRA4
  • Today we experienced a total of 13 minutes of downtime on server4.

    At 10.28 the number of open/waiting connections on Apache on server4 increased from the average of about 20 to a bit above 300.
    We've been aware of the issue since it has happened before, but we haven't had the chance to find the root cause.
    However, today our monitoring notifications were delayed, meaning we first got informed about the downtime at 10.40 - 12 minutes after the webserver stopped responding to traffic.

    We resolved the issue by forcing an Apache restart. We found out that the issue occurs when some crawlers open connections to the server but never close them again - this usually only happens with rogue crawlers, but it turns out it also happens in some cases when people use legitimate SEO crawlers such as Screaming Frog SEO Spider.

    Our immediate fix is to block anything containing the Screaming Frog user-agent; this agent can easily be changed when you're paying for the software, so the solution isn't exactly bullet-proof.
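
    As an illustration, such a user-agent block can be expressed in Apache 2.4 along these lines (a sketch, not our exact production rule; the location path is an assumption):

    ```apache
    # Tag requests whose User-Agent mentions Screaming Frog, then deny them.
    # A determined user can change the agent string, so this is best-effort only.
    BrowserMatchNoCase "screaming frog" bad_crawler
    <Location "/">
        <RequireAll>
            Require all granted
            Require not env bad_crawler
        </RequireAll>
    </Location>
    ```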

    The issue happens because the application opens a connection for every request it makes and doesn't close the connection correctly again.
    This results in all workers on the webserver being used up by a single crawler.

    We've informed the company behind Screaming Frog SEO Spider about the issue and what causes this exact result, and advised them to implement keep-alive support or to actually close connections correctly.
    During further investigation, we saw that this only happens with very specific settings in the software.

    We're sorry for the long downtime caused by this. We have additional measures we can use against it, such as rate-limiting IPs to a set number of connections; however, we'd prefer not to do this, since it can cause issues with legitimate traffic as well.

  • Date - 06/09/2017 10:28 - 06/09/2017 10:41
  • Last Updated - 06/09/2017 16:10
Disk replacement (Resolved)
  • Priority - Medium
  • Affecting Server - GRA3
  • We have a failed disk in server3a (GRA3), we'll ask the datacenter for a replacement.

    There will be downtime with this intervention, since the provider will have to reboot the system.

    We're performing a backup of all customer data and databases before requesting the replacement.

    Update 00:20: The server is getting a disk replacement now.

    Update 01:00: We experienced some problems so intervention is still ongoing.

    Update 01:25: The system is back up using another kernel; we're going to reboot again after some investigation to bring all sites up.

    Update 01:35: The system is back online serving traffic as it should. The raid will rebuild for the next 40-60 minutes.


    Complete history below:

    - 11.37 PM on the 27th of August we received an alert from our monitoring system that a drive had started to report errors, meaning a drive failure could happen within minutes, days or sometimes even months. Since we want to avoid any possibility of data loss, we decided to schedule an immediate replacement of the drive.
    - 12.01 AM on the 28th of August we finished a complete backup of the system; we perform this backup since there's always a risk that a disk replacement goes wrong or that a raid rebuild will fail.
    - 12.15 AM We receive a notification from the datacenter they'll start disk replacement 15 minutes later (Automated email)
    - 12.22 AM We see the server go offline for the intervention, disk gets replaced.
    - 12.40 AM We receive an email from the datacenter that the intervention has been completed and our system is booted into rescue mode
    - 12.42 AM We change the system to boot from the normal disk to get services back online; however, due to a fault with the IPMI device (basically a remote console to access the server), we couldn't bring the service back online.
    - 12.44 AM We call the datacenter to request a new intervention, which is already being taken care of.
    - 12.50 AM to 01.23 AM The engineer intervenes and spots that there's a fault on the IPMI interface, requiring a hard reset from the motherboard; at the same time, the engineer realizes that the boot order has to be changed in the BIOS to boot from the correct disk, which wasn't done during the first intervention.
    - 01.25 AM The server comes back online with a default CentOS kernel to ensure the system would boot, this action was performed by the datacenter engineer.
    - 01.34 AM We restart server to boot from the correct CloudLinux LVE kernel.
    - 01.35 AM All services restored.
    - 01.36 AM We start the rebuild of the raid array.
    - 01.52 AM Rebuild currently at 39%
    - 01.56 AM Rebuild currently at 51%
    - 02.08 AM Rebuild currently at 79.5%
    - 02.16 AM Rebuild completed
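
    Progress figures like the ones above can be read straight from the kernel's md status file. A small illustrative parse follows - the /proc/mdstat sample is made up to match the numbers in the timeline, not captured from the actual server:

    ```shell
    # Hypothetical /proc/mdstat content during a RAID1 rebuild (illustrative):
    mdstat='md2 : active raid1 sdb2[2] sda2[0]
          [===============>.....]  recovery = 79.5% (155G/195G) finish=12.4min'

    # Extract just the rebuild progress percentage
    echo "$mdstat" | grep -o 'recovery = [0-9.]*%'
    ```

    On a live system, `cat /proc/mdstat` or `mdadm --detail /dev/md2` shows the same information.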


    We're sorry about the issues caused and the extended downtime. Usually these hardware interventions are performed with minimal downtime (about 15 minutes start to finish); however, due to a mistake on the engineer's side and the fact that the system had an issue with the IPMI interface, the downtime sadly became 1 hour and 15 minutes.

  • Date - 27/08/2017 23:22 - 28/08/2017 01:52
  • Last Updated - 28/08/2017 02:16
server8 downtime (Resolved)
  • Priority - Medium
  • Affecting Server - RBX8
  • The downtime on server8 (RBX8) was caused by an automatic software update:

    Usually software updates happen smoothly, only causing a few seconds of unavailability in certain cases. However, today a new version of cPanel got released - version 66.
    Together with this version, CloudLinux released an update to their Apache module mod_lsapi, which offers PHP litespeed support for Apache.

    The update of this module changes how PHP handlers are configured: the handler used to live in a specific config file, and it has now moved to the MultiPHP handler setup within cPanel itself.
    The update removed the handler from the lsapi.conf file (lsapi_engine on) but did not update the handlers within the MultiPHP handler configuration in cPanel.
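
    For context, the old-style handler setup looked roughly like this. Only the `lsapi_engine on` directive is confirmed above; the handler-mapping line is an assumption for illustration:

    ```apache
    # Sketch of the pre-update lsapi.conf handler configuration.
    # lsapi_engine on is confirmed; the AddType mapping is assumed.
    lsapi_engine on
    AddType application/x-httpd-lsphp .php
    ```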

    This caused the system to stop serving PHP files after an Apache restart. The quick fix we applied was to enable PHP-FPM for all 4 accounts on the server, which brought sites back online while we investigated on a test domain on the same server what caused the downtime.

    We did manage to fix it in an odd way without really knowing what fixed it. We continued with another server, but enabled PHP-FPM on all accounts before performing the update to ensure sites wouldn't go down.
    We were able to reproduce the problem, and therefore created a ticket with CloudLinux to investigate further; that's how the above findings were discovered.

    We've updated about half of our servers to the new version; the rest of them currently have automatic updates disabled, and we'll continue to update these servers on Thursday evening.

    We're sorry for the issues caused by the downtime; we've sadly had some issues lately with CloudLinux pushing out updates containing bugs.

  • Date - 22/08/2017 22:14 - 22/08/2017 22:20
  • Last Updated - 23/08/2017 01:26
Software update (Resolved)
  • Priority - Medium
  • Affecting Server - GRA5
  • We'll perform a major software maintenance on server5 (GRA5) on July 29th, starting at 10 pm.

    This server currently uses the PHP Selector offered by CloudLinux, as well as EasyApache 3, to maintain software versions such as PHP and Apache.

    Since EasyApache some time ago got a major upgrade to "EasyApache 4", making EasyApache 3 obsolete within a short period of time, we find it necessary to perform this software upgrade.

    The upgrade brings a number of possibilities which are already available on all other servers we have:

    - Different PHP versions (5.6, 7.0 and 7.1) per website and not only per account - this means it will be possible to run PHP version 5.6 on one website and 7.0 on another.
    - Possibility to use http2 - this can greatly improve website performance for the delivery of static files.

    At the same time, EasyApache 4 makes use of "yum" to install packages, whereas EasyApache 3 had to compile newer versions of PHP and Apache manually every time - the change will ensure faster updates to newer Apache and PHP versions, and we can offer better functionality.

    We're sending this email since there will be a minor downtime during the migration to the new software - generally speaking about 1 minute for all customers. However, we'll manually switch the PHP version for all customers using a version that isn't the server default, so these accounts might experience a slightly longer downtime in case your website doesn't support PHP 5.6 (most software does).

    In case you have any questions, please do not hesitate to contact us at support@hosting4real.net 

    [Update 21.51]:
    We start the update shortly

    [Update 22:07]:
    The software upgrade has been completed - continuing with finalizing configurations

    [Update 22:34]:
    PHP versions on hosting accounts have been reset to what they were originally configured to

    PHP 7.0 has been set as the default

    [Update 22:55]:

    All modules have been configured as they should be.
    Maintenance has been completed.

    [Update 02:20]:

    We experienced a short downtime shortly after the migration:

    Due to some race conditions in the way the systems get converted, changing the default PHP handler caused PHP handling to stop working as it should.

    At the same time, we experienced that certain PHP versions would load from the new system while other versions would load from the old system. This in itself doesn't do any harm, but it would mean we'd have to maintain two different configurations, and due to a rather complex version matrix for how PHP versions are selected, we wanted to correct this to simplify management.

    After consulting with CloudLinux we resolved both issues, and at the same time kindly asked CloudLinux to further improve their documentation on the subject, since there are plenty of pitfalls which can lead to strange results.

    We're sorry for the additional downtime caused after the update.

  • Date - 29/07/2017 22:00 - 29/07/2017 23:59
  • Last Updated - 30/07/2017 02:27
Network outage (Resolved)
  • Priority - High
  • Affecting System - server3a,server4,server5
  • Between 11.15 am and 11.20 am we experienced a network outage on server3a, server4 and server5 - all located in the Gravelines datacenter - resulting in a 90% drop in traffic (some traffic could still pass through). The issue was quickly resolved by our datacenter provider. We're waiting for an update from the provider on what happened, and we'll update this post accordingly.

  • Date - 26/07/2017 11:15 - 26/07/2017 11:20
  • Last Updated - 26/07/2017 11:31
Customer migrations (Resolved)
  • Priority - Low
  • Affecting Server - RBX2
  • Over the next few days we'll be migrating customers from RBX2 to RBX7

    01/06/2017 21.00: We're starting migrations for today
    01/06/2017 22.28: Migrations have been completed

    02/06/2017 20.55: We will begin migrations at 21.00
    02/06/2017 22.23: Migrations have been completed

    03/06/2017 20.50: We will begin migrations at 21.00
    03/06/2017 22.17: Migrations have been completed

    04/06/2017 20.56: We'll begin migrations shortly
    04/06/2017 21.51: Migrations have been completed

    05/06/2017 20.55: We will begin migrations shortly
    05/06/2017 21.31: Migrations for today have been completed

    06/06/2017 20.55: We will begin today's migration shortly
    06/06/2017 21.14: Migrations have been completed

    09/06/2017 21.00: We start the migration
    09/06/2017 21.48: All migrations have been completed.

  • Date - 01/06/2017 21:00 - 09/06/2017 22:00
  • Last Updated - 09/06/2017 21:48
Downtime on server7 (Resolved)
  • Priority - Critical
  • Affecting Server - RBX7
  • Server7 went down this morning; it rebooted unexpectedly and dropped into a grub console.
    After investigating, we managed to boot the server.

    We're currently performing a reboot test to see if the same issue would happen again.

    We'll keep you updated.

    Update:
    The system still drops into a grub console. We'll migrate the very few customers on the server (it's completely new) to one of our older servers and get it fully repaired.
    When the job is done we'll move the customers back again.

    Update 2:
    After further investigation, the issue turned out to be caused by an update of the operating system. Normally, when kernels are updated, the system rewrites a file called grub.cfg to include the new kernel. This happens as it should, but since we're using EFI boot on the new server, a bug in the operating system caused it not to write grub.cfg correctly to the location where EFI looks for the config.

    We happened to have a kernel crash this morning, which caused the system to restart in the first place, exposing the error.
    After creating the grub.cfg manually and doing a few reboots, we confirmed that this was the cause.
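
    The manual fix amounts to regenerating the config in the location the EFI firmware actually reads. A sketch with CentOS-style paths assumed (the command is echoed rather than executed here, since the real invocation requires root on the affected host):

    ```shell
    # CentOS-style EFI path is an assumption; on EFI systems grub.cfg must
    # live under the EFI system partition, not only in /boot/grub2.
    efi_cfg=/boot/efi/EFI/centos/grub.cfg
    echo "grub2-mkconfig -o $efi_cfg"
    ```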

    The vendor has been contacted and has acknowledged the problem.

    On the bright side, this crash happened just days before we actually started our migration, meaning it only caused downtime for a few customers and not everyone from server2a.

    We will continue with the migration as normal.

  • Date - 28/05/2017 08:09
  • Last Updated - 28/05/2017 11:28
New site migrations (Resolved)
  • Priority - Low
  • Affecting Other - General
  • From May 5 until May 22, no new customer site migrations will be performed.

    This means in case you want to move to us, and want to get migrated by our staff, you have to wait until the 22nd of May.

  • Date - 05/05/2017 17:00 - 22/05/2017 06:00
  • Last Updated - 28/05/2017 08:51
Server6 downtime (Resolved)
  • Priority - High
  • Affecting Server - RBX6
  • Around noon we experienced two short outages on server6, right after each other.
    Total downtime registered was 2 minutes and 42 seconds.

    At 12.12 a large number of requests started to come in, which quickly filled up the pool of Apache workers.
    To try to resolve it, we increased the number of workers allowed to process requests, but due to the backlog of incoming requests, the load was already too high to gracefully reload Apache.

    As a result, we decided to stop all Apache processes - basically killing all incoming traffic (yes, we know it's not nice).
    We switched back to mpm_event in Apache, which is known to handle a large number of requests better; right after this, traffic came back with no issues whatsoever.

    On all our servers we usually run mpm_event by default, but due to a bug between mod_lsphp and mpm_event discovered a few weeks back, we switched from mpm_event to mpm_worker on all servers to prevent random crashes.

    This bug is still present but has been fixed upstream; we're just waiting for the packages to be pushed to general availability, which is why we haven't yet officially planned to switch back to mpm_event.

    Using mpm_worker generally isn't an issue as long as the traffic pattern is quite predictable - it can handle a large amount of traffic if the traffic ramps up in a normal way.
    mpm_worker is known to get overloaded when you go from a small number of requests to a high number in a matter of seconds (which is what happened today) - this caused Apache to simply stop accepting more traffic.
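
    On stock RHEL/CentOS-style Apache builds, switching MPMs comes down to which module gets loaded - a sketch, with the file path assumed (cPanel/EasyApache builds may manage this differently):

    ```apache
    # conf.modules.d/00-mpm.conf (path assumed) - exactly one MPM may be active:
    #LoadModule mpm_worker_module modules/mod_mpm_worker.so
    LoadModule mpm_event_module modules/mod_mpm_event.so
    ```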

    Since the bugfix has not officially been released for mpm_event, we won't change other servers back to mpm_event until it has been released.
    But in this specific case, we needed the change to cope with the traffic spikes that happen from time to time.

    We've made a few changes to our systems to prevent the bug from happening, but it's by no means a permanent solution until the fix gets released. However, if we do hit the bug, it will cause about 1 minute of downtime until we manage to access the machine and manually kill and start the webserver again.

    We're sorry about the issues caused by this outage.

  • Date - 02/05/2017 12:12 - 02/05/2017 12:16
  • Last Updated - 02/05/2017 12:45
Planned network maintenance (Resolved)
  • Priority - High
  • Affecting Server - RBX6
  • On March 29th, there will be a planned network maintenance that will impact connectivity towards server6 (RBX6).

    Our datacenter provider is performing the network maintenance to ensure a good quality of service in terms of networking. For this to be carried out, they will have to upgrade some network equipment located above our server.

    The equipment upgrade is being performed to follow the innovation happening in the network, and to allow our datacenter provider to introduce new features in the future.

    To be more exact, the "FEX" (Fabric Extender) will have to be upgraded, meaning they'll have to move cables from one FEX to another, which does involve a short downtime.

    There will be a total of two network "drops":

    - The first one is expected to last for about 1-2 minutes, since the datacenter engineers have to physically connect the server to the new FEX.
    - The second one happens 45 minutes after the first and is expected to last for only a few seconds.

    The maintenance has a window of 5 hours, since multiple upgrades are being performed during the night and it's unclear how long each upgrade will take; therefore we cannot give more specific timings than the announced maintenance window from our datacenter provider.

    The maintenance performed by the provider can be followed at their network status page: http://travaux.ovh.net/?do=details&id=23831 

    Update 29/03/2017 22:00: OVH has started the maintenance.

    Update 30/03/2017 02:41: They managed to finish the first rack, but due to an issue with their monitoring systems and extended maintenance time, the second rack (where we're located) has been postponed. When we know the new date, it will be published.

    Update 30/03/2017 12:53: The maintenance for our rack has been replanned for April 5th - between 10pm and 3am (+1 day)

    Update 05/04/2017 21:56: The maintenance for our rack has been postponed again - we're waiting for a new date for the FEX replacement.

    Update 21/04/2017 19:03: Info from DC: The first intervention, will impact the public network and will take place on 26th of April 2017 between 10pm and 6am.

    During the time as informed earlier, there will be two small drops in the networking during that time.

    Update 27/04/2017 22:22: Maintenance on our rack has begun

    Update 28/04/2017 00:05: Maintenance has been completed


    Best Regards,
    Hosting4Real

  • Date - 26/04/2017 22:00 - 26/04/2017 00:05
  • Last Updated - 27/04/2017 00:52
Downtime on server6 (Resolved)
  • Priority - Medium
  • Affecting Server - RBX6
  • Between 00:50:34 and 00:54:00 we experienced an outage on server6.

    During these 3.5 minutes, Apache wasn't running, meaning sites hosted weren't accessible.

    After investigating the issue, we saw that the system in cPanel that maintains the mod_security rulesets was updating the ruleset we use - this also schedules a graceful restart of Apache in the process, but due to a race condition in the system, the restart of Apache happened while it was rewriting some of the configuration files.
    This means that during the restart the Apache configuration was invalid - which resulted in Apache not being able to start again, thus causing the downtime.

    The issue is known by cPanel, and they're working on fixing it.

    As a temporary workaround, we've disabled auto-updates of mod_security rules across our servers to prevent this from happening again.
    We will enable the automatic update of the rulesets when the bug has been fixed.

    We're sorry about the issues caused by this outage.

  • Date - 14/04/2017 00:50 - 14/04/2017 00:54
  • Last Updated - 14/04/2017 14:55
DDoS on server3a (Resolved)
  • Priority - Medium
  • Affecting Server - GRA3
  • We're currently experiencing a DDoS on server3a. Our DDoS filtering works, but it might cause a bit of packet loss and/or false positives for a few monitoring systems.

    Response times on websites might be a bit higher, since all traffic has to be filtered; other than that, traffic flows, and we do not see any general drop in traffic at the webserver level.

    Update 19:10: Attack has stopped - we move the IPs out of mitigation again.

  • Date - 10/04/2017 14:30 - 10/04/2017 19:10
  • Last Updated - 10/04/2017 21:49
Change of MySQL settings (Resolved)
  • Priority - Medium
  • Affecting System - All systems
  • Over the past few days we experienced short outages on server4 (GRA4), and this morning a short outage on server2a (RBX2)

    After further investigation, we narrowed the problem down to an issue with the MySQL configuration.
    Basically, the maximum query result size we allow in the query cache was set to a rather large value, which generally doesn't cause any issues, except in a very few edge cases.

    For the same reason we mostly saw this happening on a single server, because it happens to execute these very extensive queries, affecting performance and causing some rather bad locks on the system.
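
    The relevant knob here is MySQL's `query_cache_limit`, which caps the size of any single result stored in the query cache. A hedged my.cnf sketch follows - the values are illustrative, not our actual production settings:

    ```ini
    # my.cnf sketch - illustrative values only:
    [mysqld]
    query_cache_type  = 1     # query cache enabled
    query_cache_limit = 1M    # don't cache any single result larger than this
    query_cache_size  = 64M   # total memory reserved for the cache
    ```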

    We applied the changes to both server2a and server4.

    We're currently applying the same settings to every other server. This involves restarting MySQL on each server - it only takes a few seconds, so no downtime is expected during the restart.
    Be aware that if you monitor your site and the monitoring system happens to run a check at the moment we restart, it might trigger a false positive.

    Update 14.00: Server5 has been updated.
    Update 14:03: Server6 has been updated.
    Update 14:04: Server3a has been updated.

  • Date - 04/04/2017 13:52 - 04/04/2017 14:04
  • Last Updated - 04/04/2017 14:04
Emergency maintenance (Resolved)
  • Priority - High
  • Affecting Server - GRA4
  • We have to do an emergency maintenance on server4/GRA4 tonight starting at 9PM.

    The maintenance carried out will fix a critical issue we've detected on the system.
    During the maintenance, there might be a short period of website unavailability; we'll try to keep the downtime as low as possible.

    We're sorry to give such short notice.

    Update 21:00: Maintenance starting.

    Update 21:17: We've finished the maintenance; there was no impact during the fix.

  • Date - 21/03/2017 21:00 - 21/03/2017 21:17
  • Last Updated - 21/03/2017 22:03
System migration (Resolved)
  • Priority - Medium
  • Affecting Server - PAR3
  • As informed by email - all customers on PAR3 (server3) will be migrated to GRA3 (server3a).

    We'll migrate customers over a period of about 2 weeks:

    - 17/02/2017: Starting at 21.00
    - 18/02/2017: Starting at 21.00
    - 19/02/2017: Starting at 21.00
    - 24/02/2017: Starting at 21.00
    - 25/02/2017: Starting at 21.00
    - 26/02/2017: Starting at 21.00
    - 03/03/2017: Starting at 21.00

    Starting tonight we'll migrate a number of customers with external DNS at very specific times. All other customers will be informed shortly before their migration, and again shortly after the account has been moved to the new server.

    If you're using server3.hosting4real.net as your incoming/outgoing mail server or FTP server, please make sure to update this to server3a.hosting4real.net - or even better, use your own domain, such as mail.<domain.com> and ftp.<domain.com>.
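    As a sketch of what "using your own domain" looks like in DNS (example.com is a hypothetical zone; server3a.hosting4real.net is the new hostname mentioned above), the records would be:

    ```
    ; hypothetical BIND-style zone snippet for example.com
    mail  IN  CNAME  server3a.hosting4real.net.
    ftp   IN  CNAME  server3a.hosting4real.net.
    ```

    With records like these, a future server move only requires updating the CNAME targets in one place, instead of every user reconfiguring their mail and FTP clients.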

    Update 17/02/2017 21.00: We start the migration

    Update 17/02/2017 22.13: We've migrated the first batch of accounts, we'll continue tomorrow evening. We haven't experienced any issues during the migration.

    Update 18/02/2017 20.58: We start the migration in a few minutes.

    Update 18/02/2017 21:30: We're done with migrations for today.

    Update 19/02/2017 20:56: We'll start the migration in a few minutes.

    Update 19/02/2017 21:07: We're done with migrations for today.

    Update 24/02/2017 20:58: We'll start the migration in a few minutes.

    Update 24/02/2017 21:28: We're done with migrations for today.

    Update 25/02/2017 20.57: We'll start the migration in a few minutes.

    Update 25/02/2017 22:06: We're done with migrations for today.

    Update 26/02/2017 20.58: We'll start the migration in a few minutes.

    Update 26/02/2017 21:06: We're done with migrations for today.

    Update 03/03/2017 20.55: We'll start the migration in a few minutes.

    Update 03/03/2017 22:03: We're done with all migrations.

  • Date - 17/02/2017 21:00 - 03/03/2017 23:30
  • Last Updated - 03/03/2017 23:19
.dk registrations (Resolved)
  • Priority - Medium
  • Affecting System - Registrar system
  • The service we use to register .dk domains is currently having an outage affecting only this specific TLD.

    We're waiting for the system to come back online to process .dk registrations again.

    Update 14:35: The issue was caused by a routing problem between the registrar's network and the registry.

  • Date - 22/02/2017 12:00
  • Last Updated - 22/02/2017 14:42
Network outage (Resolved)
  • Priority - Critical
  • Affecting Other - All servers
  • At 09:19 we received alerts from our monitoring about multiple servers in multiple datacenters being unreachable.
    After a quick investigation we saw between 40 and 90% packet loss to multiple machines - meaning the majority of traffic would still go through, but with very high response times.
    Our network traces showed that traffic was being routed via Poland (which it usually isn't), so we knew it was a network fault.

    According to updates from the datacenter, it appears to be caused by an issue with GRE tunnels. We're waiting for further updates from the datacenter.

    Some servers were affected more than others, and some customers might not have experienced issues at all - but the incident started at 09:19 and recovered fully at 09:30, so the maximum downtime was 11 minutes.

    We'll update this accordingly when we have more information.

  • Date - 08/02/2017 09:19 - 08/02/2017 09:30
  • Last Updated - 08/02/2017 10:06
Global outage (Resolved)
  • Priority - Critical
  • Affecting Other - All systems
  • We experienced an outage across multiple systems.
    The outage was caused by a faulty IP announcement (64.0.0.0/2) on the global network.
    This caused a BGP routing issue, which in turn caused the outage.
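    To put that announcement in perspective, a /2 prefix covers a quarter of the entire IPv4 address space, which is why a bogus announcement of it disrupts routing globally. A quick sanity check with Python's ipaddress module:

    ```python
    import ipaddress

    # The faulty prefix that was announced on the global network.
    net = ipaddress.ip_network("64.0.0.0/2")

    # Range covered: 64.0.0.0 through 127.255.255.255.
    print(net[0], "-", net[-1])

    # One quarter of all 2**32 IPv4 addresses.
    print(net.num_addresses == 2**30)  # True
    ```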

    A small subset of customers was still able to connect to the servers, depending on which subnets they were located in.

    The incident lasted from 9:22 am to 9:37 am.

    Our apologies for the issues caused on behalf of our supplier.

  • Date - 06/01/2017 09:22 - 06/01/2017 09:37
  • Last Updated - 06/01/2017 10:07
Server downtime (Resolved)
  • Priority - High
  • Affecting Server - RBX2
  • Between 09:58 and 10:48 today we experienced a total of 13 minutes of downtime on server2a.

    This was spread over 3-4 separate outages, the longest lasting 10 minutes.

    The root cause was an unusually high amount of traffic to a specific site, which exhausted all Apache processes and CPU.
    This caused high load on our systems, which resulted in connections being blocked.

    We've worked on a solution with the customer to minimize the load on the system.

    We know this downtime, especially on this day, is not acceptable. The box is one of our older ones, which we will soon replace.

    The new box will be like our others, which include software to stabilize the system in case of high load.

    We're sorry for the problems and downtime caused today.

  • Date - 25/11/2016 09:58
  • Last Updated - 25/11/2016 11:33
Possible network maintenance in Gravelines (Resolved)
  • Priority - Low
  • Affecting System - GRA-1
  • We've received notification of a possible network maintenance tonight at 22:00 Europe/Amsterdam time in one of the locations.
    This might affect the availability of GRA4 (server4) and GRA5 (server5) for up to 15 minutes.

    The maintenance may still be cancelled, since it depends on another maintenance.

    Updates are happening in the network that require a firmware upgrade on a device called a fabric extender. The extender has to be reloaded, which can cause a minor outage towards the outside world.

    We'll update this page as soon as we know whether the maintenance will happen tonight, and what the exact impact will be.

    Update: We've received confirmation that our IP subnets won't be affected by this maintenance after all.

  • Date - 23/06/2016 22:00 - 23/06/2016 22:30
  • Last Updated - 23/06/2016 19:32
Server unavailable for 15 minutes (Resolved)
  • Priority - High
  • Affecting Server - GRA4
  • At 17/06/2016 01:44 GMT+2 server4 (GRA4) experienced a network outage.
    At 17/06/2016 01:59 GMT+2 server4 (GRA4) came back online after 15 minutes downtime.

    Due to a critical software upgrade on two switches at the datacenter provider we use, we experienced an outage of 15 minutes.
    The update was critical to fix multiple bugs in the switches. The expected downtime was less than 5 minutes, because these updates happen in a rolling manner (one switch at a time) and are usually so short that most people don't even notice.

    Sadly the upgrade caused minor issues and extended the downtime to a total of 15 minutes, which affected the availability of the services behind these switches.

    We're sorry about any issues you've experienced during the upgrade - but even with a fully redundant network, network issues sometimes happen.

  • Date - 17/06/2016 01:44 - 17/06/2016 01:59
  • Last Updated - 17/06/2016 02:13
DDoS mitigation test (Resolved)
  • Priority - High
  • Affecting Server - PAR3
  • After deploying our Arbor DDoS mitigation solution for this server, we have to perform a test to make sure that every service works as expected.
    This test might cause small outages of up to 5 minutes at a time.

    To minimize the impact we'll perform this test starting at 10 pm today.

    We'll enable the mitigation for a period of 5 minutes and test whether services such as email and HTTP traffic work as expected.
    If anything doesn't, the mitigation will disable itself after 5 minutes.


    Update 21:57:

    We'll start the mitigation test in a few minutes.

    Update 22:19:

    Since services seem to be working fine, we'll be disabling the mitigation test in a few minutes.
    For a few minutes after the mitigation is turned off, services may reset their connections.


    -----

    All timestamps are in Europe/Amsterdam

  • Date - 08/06/2016 22:00 - 08/06/2016 22:05
  • Last Updated - 08/06/2016 22:24
Server under ddos (Resolved)
  • Priority - Medium
  • Affecting Server - PAR3
  • Server3 / PAR3 is currently under a DDoS attack. The attack is being filtered by an anti-DDoS solution.

    Be aware that this can affect normal traffic as well, sometimes rejecting connections or responding slowly.

    We're working together with the datacenter to make as much clean traffic pass through to the server as possible.

    update 20.10:

    The server is still under attack. We've made a few changes which should result in more clean traffic reaching the server, but sadly about 20% of the traffic is still being blocked.
    At times traffic reaches the server as expected; at other times it's extremely slow.

    We've arranged with the datacenter to deploy additional protection measures tomorrow morning. We hope the attacks will stop as soon as possible.

    update 21.05:

    The attack is still ongoing, but we've managed to restore full traffic to the server, every site should be working now.

    update 03.43:

    The attack stopped

    update 07.48:

    We've reverted our fix, and will continue to deploy Arbor protection today.

  • Date - 07/06/2016 17:59
  • Last Updated - 08/06/2016 07:49
New backup server (Resolved)
  • Priority - Medium
  • Affecting System - Backup
  • We're setting up a new backup server to hold backups for every server we have. Because of this, there will be a period where backups are taken on the new system but not on the old one.

    We'll switch to the new system as soon as the first initial backup has completed. If you need a backup restored from before today, please create a support ticket - we'll be able to restore from the old system for the next 14 days.

    Update April 16 at 4 pm: All servers have been moved to the new backup server. The old server will keep running for another month, after which we'll turn it off.

  • Date - 15/04/2016 17:08 - 16/04/2016 16:00
  • Last Updated - 17/04/2016 17:56
PHP upgrade (Resolved)
  • Priority - Low
  • Affecting Other - AMS1, RBX2, PAR3
  • We'll be upgrading PHP versions to 5.6 on the servers, AMS1 (server1), RBX2 (server2a), and PAR3 (server3).

    This upgrade will start around 10pm on march 12.

    The plan for the maintenance is as follows:

    Server1 (Starting at 10pm):
     - Upgrade PHP 5.4 to PHP 5.6
     - Recompile a few PHP modules to be compatible with 5.6

    Server2a:
     - Upgrade PHP 5.4 to PHP 5.6
     - Recompile a few PHP modules to be compatible with 5.6

    Server3:
     - Upgrade PHP 5.5 to PHP 5.6
     - Recompile a few PHP modules to be compatible with 5.6

    All times are in Europe/Amsterdam timezone.

    Since we take one box at a time, we can't give specifics about when we'll proceed with Server2a and Server3, but expect to be affected between 10 pm and 2 am.
    We'll try to keep the downtime as low as possible.

    In the worst case, if the upgrade of a specific server takes too long, we might postpone one or more servers until Sunday or another date, which would be announced if needed.

    Best Regards,
    Hosting4Real


    UPDATE 22:09:
    We're starting the upgrade of server1.

    UPDATE 22:22:
    Server1 completed.
    We'll proceed with server2a and server3.

    UPDATE 22:36:
    Server3 has been completed.
    We're still waiting for server2a to complete downloading some updates.

    UPDATE 23:12:
    We had to temporarily stop the webserver on server2a, since it was overloading the system while doing the upgrade. This means your site is currently not accessible.

    UPDATE 23:19:
    Upgrade completed - we experienced 8 minutes of downtime on server2a.


  • Date - 12/03/2016 22:00 - 12/03/2016 23:19
  • Last Updated - 12/03/2016 23:20
DC Network Upgrade - Server1 (Resolved)
  • Priority - High
  • Affecting Server - AMS1
  • The data center will do a network maintenance on February 17th from 4AM to 8AM CET.
    This maintenance will affect the connectivity to server1.hosting4real.net (AMS1) for up to 4 hours, though they'll try to keep the impact minimal.

    This means that your websites may be affected during this period. The datacenter will try to keep the downtime as low as possible; the NOC post from the data center can be found below:

    Dear Customer,

    As part of our commitment to continually improve the level of service you receive from LeaseWeb we are informing you about this maintenance window.

    We will upgrade the software on one of our AMS-01 distribution routers on Thursday 17-02-2016 between 04:00 and 08:00 CET. There will be an interruption in your services, however we will do our utmost best to keep it as low as possible.

    Affected prefixes:
    37.48.98.0/23
    95.211.94.0/23
    95.211.168.0/21
    95.211.184.0/21
    95.211.192.0/21
    95.211.224.0/21
    95.211.240.0/22
    185.17.184.0/23
    185.17.186.0/23
    37.48.100.0/23
    89.144.18.0/24
    103.41.176.0/23
    104.200.78.0/24
    179.43.174.128/26
    191.101.16.0/24
    193.151.91.0/24
    212.7.198.0/24
    212.114.56.0/23
    2001:1af8:4020::/44


    We're working on a solution to prevent this in the future by moving the server to another data center, to increase the stability and uptime of the network.
    The new server won't arrive for about 2 months, so we won't be able to complete the move before this maintenance takes place.

    We're sorry for the inconvenience and downtime caused by this.

    Be aware that this maintenance will also result in emails not being delivered to any hosting4real.net email address - if you want to create tickets during this timeframe, please create it directly at https://shop.hosting4real.net/

    [UPDATE 04.00]
    The maintenance has started. This will cause downtime on server1.
    We'll update this when we have more news.

    [UPDATE 07:37]
    Server offline - this is due to a reboot of the distribution router.

    [UPDATE 07.55]
    The maintenance has been completed - there should be no more downtime due to this.

    [UPDATE 09.06]
    There's an issue after the maintenance which has caused the connectivity to disappear; a network engineer is on site to work on the problem.
    We'll keep you updated.

    [UPDATE 09:30]
    The issue has been resolved - it was due to a hardware problem after the maintenance which was fixed.

  • Date - 17/02/2016 04:00 - 17/02/2016 09:30
  • Last Updated - 17/02/2016 10:21
Reboot of server2a (Resolved)
  • Priority - High
  • Affecting Server - RBX2
  • We have to reboot server2a due to a bug in the backup software we use, which prevents the server from being backed up.
    We've investigated together with R1Soft how to resolve this, but since we haven't found a solution yet and there's no ETA for when it will be resolved, we have to reboot the system to be able to back it up again.

    We're sorry about the short notice. We'll do it at 3 AM, since that's when the server serves the least traffic - this affects the fewest customers and also lets us get the server back online faster.

    The reboot is expected to take 1-5 minutes, but in case of a failure to boot we've set a timeframe of 30 minutes, which should be enough to resolve possible boot errors.

    We're sorry for the inconvenience.

    UPDATE 03:04:
    We start the reboot of the server

    UPDATE 03:08:
    Server back online

  • Date - 07/01/2016 03:00 - 07/01/2016 03:30
  • Last Updated - 07/01/2016 03:10
Network connectivity (Resolved)
  • Priority - Critical
  • Affecting Server - RBX2
  • We're currently experiencing full or heavy packet loss on server2a. This is due to two routers crashing; the datacenter has found the cause to be related to an IPv6 bug killing the routers.
    They're working on resolving this as we speak, but it might result in connectivity returning and disappearing a few times.

    We're sorry for the issues caused, and we'll update when we know more.

    So far the routers run at 100% CPU usage. There's 75% connectivity again for both routers; the last 25% will be back online shortly.

    UPDATE 5:55 PM:
    All network has been stabilized after monitoring for around 10-15 minutes.
    The cause was an overload in the IPv6 configuration of the routers. A quick fix was made by disabling IPv6 fully on the routers, which resulted in traffic slowly returning. After things stabilized, IPv6 traffic was enabled again, except for a specific segment which seems to be the cause of the problem.

    Since we do not run IPv6 on our servers, the IPv6 workaround itself does not affect us.
    We're sorry for the issues caused by the downtime you experienced. We do our best to ensure all our providers run a redundant setup, which is also the case here - it usually prevents a lot of issues, but in this case the bug caused both routers to crash at the same time.

  • Date - 04/01/2016 17:16
  • Last Updated - 04/01/2016 18:01
server2a blacklisted by Microsoft (Resolved)
  • Priority - High
  • Affecting Server - RBX2
  • Hello,

    We had an account that was defaced, which triggered a so-called "darkmailer". Usually we block these kinds of attacks at the mail server level, but since this one bypassed the mail server, it allowed the server to send out around 400 emails before we stopped the attack.
    This sadly triggered a few blacklists, some of which have already delisted us again at our request.

    We tested sending to major email service providers and saw that Microsoft has temporarily blocked email sending to their service.
    This means all email towards Microsoft (including outlook.com, hotmail.com, live.com) is currently being blocked by Microsoft.
    We've opened a case with Microsoft to get our IPs delisted from their service and are awaiting a response from their abuse team.

    We do not expect this to be an issue for too long.

    We're sorry for the issues caused by this block, and we're working on alternative solutions meanwhile.

    INFO 20.00:
     We're currently ensuring that no emails are being sent out that could cause another block.

    UPDATE 21.00:
     Request sent to Microsoft to request delisting after confirming that the issue was resolved.

    UPDATE 21:15:
     Microsoft has removed the block from our IP; emails to Microsoft should be delivered as usual, possibly with some delays.

  • Date - 25/09/2015 20:00 - 25/09/2015 21:15
  • Last Updated - 25/09/2015 21:24
Server2a Outage (Resolved)
  • Priority - High
  • Affecting Server - RBX2
  • Today we had an outage on server2a, and we wanted to explain what happened.

    One of our datacenter providers made a human error in an OSPF configuration, which cut off a router.
    This usually isn't a problem, apart from some minor routing problems or, in the worst case, a small amount of traffic being impacted.
    But the issue here was that some of the route reflectors didn't communicate that the router was actually down, so the network still saw the router as active.

    This resulted in bad routing in the network, which was later fixed by taking down all BGP sessions for that specific segment of the network.

    Later they found a bug in one of the route reflectors, which is why the failure wasn't communicated in the first place.

    They reset the broken route reflector, which solved the communication problem.

    After this, traffic was enabled again and started to flow as normal.

    The reason this resulted in connectivity issues to the server is that it impacted connectivity from the datacenter to the networks of Cogent, Tata, Level3 and Telia.

  • Date - 29/07/2015 16:00 - 29/07/2015 16:10
  • Last Updated - 29/07/2015 20:49
Server2a Outage (Resolved)
  • Priority - Critical
  • Affecting Server - RBX2
  • We're currently experiencing an outage on server2a due to a webserver crash. It has caused high load on the system, and we're working as fast as possible to resolve the issue.

    Update 00:15:
    The machine is back online after forcing a reboot of the system.
    The system we use to force the reboot seemed to have some issues, so we needed to call the technical support department in the datacenter to force the reboot from their end - which is why it took a bit longer.

    We're currently investigating what went wrong and how we can prevent it in the future.
    Until then we'll leave this ticket open.

    For the next few hours the server will have longer response times, because we need to ensure all data on the system is intact.

    We're sorry for the issues caused. 

  • Date - 04/06/2015 23:56 - 05/06/2015 00:15
  • Last Updated - 19/06/2015 11:20
[Cloud] Frankfurt-A issues (Resolved)
  • Priority - High
  • We're currently experiencing issues with our Frankfurt-A POP for some Cloud servers.

    The issue is getting worked on, there's no current ETA when this will be fixed.

    The issue was resolved at 13:00

  • Date - 09/06/2015 12:10 - 09/06/2015 13:00
  • Last Updated - 19/06/2015 11:20
Updating SSL certificates (Resolved)
  • Priority - Low
  • Affecting System - All servers
  • We're currently updating all services using the *.hosting4real.net SSL certificate since it's about to expire. This means you might need to accept the new certificate if you force TLS over your own domain.
    [UPDATE 07:42PM]
    All certificates have been updated.

  • Date - 14/05/2015 19:07 - 14/05/2015 19:42
  • Last Updated - 14/05/2015 19:45
Downtime of server2a (Resolved)
  • Priority - Critical
  • Affecting Server - RBX2
  • This morning at 6AM we had downtime on Server2a (RBX2), this was due to a kernel panic of the machine.

    We're working on finding the root cause of this kernel panic.
    We rebooted the system to get back online, and after around 10 minutes of downtime, all our checks reported up again.

    The server is currently resyncing the whole raid on the server to ensure data integrity, this causes a bit higher load on the system than usual. Due to the size of our disks, this will take quite some time, but we'll try speeding up the process by putting higher priority without affecting performance too much.

    If you should experience any problems with your website. Please contact us at support@hosting4real.net.

    [UPDATE 1:30PM]
    We had a permission issue with a few files that prevented some customers from uploading files in a range of CMSes, depending on how the CMS handles file uploads.
    This was reported by a customer and has been fixed.

    [UPDATE 9:19PM]
    There's still around 70-110 minutes left of the raid synchronization. We decided to let the syncing run at low speed for most of the day to affect performance the least; but since traffic on the server is quite low at this point, we've increased the synchronization speed to around 70-90 megabytes per second whenever possible. We still have another 400 gigabytes of disk to verify.
    The process still runs at low priority, meaning all other processes get higher priority, which may affect the synchronization time.
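    The 70-110 minute estimate is consistent with the figures given: 400 gigabytes left at 70-90 megabytes per second. A quick back-of-the-envelope check (using decimal units):

    ```python
    # Remaining RAID sync time from the figures in the update above.
    remaining_mb = 400 * 1000  # 400 GB left to verify, in decimal MB

    for speed_mb_s in (90, 70):  # best and worst sustained sync speed
        minutes = remaining_mb / speed_mb_s / 60
        print(f"at {speed_mb_s} MB/s: ~{minutes:.0f} minutes")
    ```

    That works out to roughly 74-95 minutes; the wider 70-110 minute window presumably allows for the low priority the sync runs at.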

    We'll update as when the synchronization finishes.

    [UPDATE 11.47PM]
    The raid synchronization has now finished, and performance is 100% back to normal.

    The last step is that our backup system will need to check the backup integrity as well, meaning it will also need to scan the data on the disks. This takes a few hours and will start at 4 am.

    [UPDATE 7:49AM]
    Backup finished as normal.

    This means everything is back to normal.

    Best Regards,
    Hosting4Real


  • Date - 13/05/2015 06:00 - 15/05/2015 07:49
  • Last Updated - 14/05/2015 08:00
TDC having connection issues. (Resolved)
  • Priority - High
  • Affecting System - Network
  • Some Danish customers might experience connectivity issues to our servers and other hosting providers throughout Denmark and the rest of the world, due to an outage at TDC.

    ISPs are doing their best to route traffic via other providers, but be aware that customers on the TDC network might experience a complete outage towards a lot of websites.

    We're sorry for the inconvenience.

    UPDATE 16:13:

    Seems like TDC has resolved their issues.

  • Date - 22/04/2015 15:07 - 22/04/2015 16:13
  • Last Updated - 22/04/2015 16:14
Reboot of all servers (Resolved)
  • Priority - Critical
  • Affecting System - All servers
  • Due to a major security issue with glibc on Linux, we're required to reboot all servers in our infrastructure.
    This is expected to only take 5-10 minutes per server, but we've put a maintenance window of 2 hours in case of problems.

    We're sorry for the short notice, but the update was first available this morning, and it requires rebooting all systems.

    We'll start with server1 at 09:30. When it's back up, we will proceed to server2a, and after that the remaining nameservers and backup servers, which don't affect customers directly.

    We're very sorry for the inconvenience.

    - Hosting4Real

    UPDATE 09:35:
    server1 was rebooted, had 239 seconds of downtime.

    Proceeding with server2a.

    UPDATE 09:41:
    Server2 was rebooted, and had 173 seconds of downtime.

    We will proceed with nameservers, backup servers, etc.
    These won't have direct impact on customers. We will keep you updated.

    UPDATE 09:53:
    mx2 (ns3) was rebooted, and had 92 seconds of downtime.

    UPDATE 09:54:
    ns4 was rebooted and had 14 seconds of downtime.
    backup server was rebooted and had 191 seconds of downtime.

    Updating is done for today.

  • Date - 28/01/2015 09:30 - 28/01/2015 09:54
  • Last Updated - 28/01/2015 10:25
POODLE SSL attack (Resolved)
  • Priority - High
  • Affecting System - All servers
  • You may have heard about the POODLE SSL attack, which affects SSLv3.

    The CVE attached to this attack is CVE-2014-3566. We've decided to disable SSLv3 on all our servers, and only allow TLS 1.0, 1.1 and 1.2, which don't have this issue.

    Dropping SSLv3 also means that IE6 users and other very old browsers can no longer visit sites using SSL on our servers. But given the age of these browsers and the amount of traffic we see from them, we don't consider this a problem.
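    For site owners curious what such a change looks like, here's a sketch of the relevant directive (nginx syntax shown for illustration; the exact web server and file locations on our systems may differ):

    ```nginx
    # Offer only TLS; omitting SSLv3 blocks the POODLE downgrade attack.
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

    # The Apache equivalent would be:
    #   SSLProtocol all -SSLv2 -SSLv3
    ```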

    If you have any questions, please contact support@hosting4real.net.

    Best Regards,
    Hosting4Real

  • Date - 15/10/2014 22:00 - 16/10/2014 22:30
  • Last Updated - 09/11/2014 00:48
Recreation of Backups for server2a (Resolved)
  • Priority - High
  • Affecting System - Backup server2a
  • Due to the storage consumption of the disk safe for server2a, we need to recreate this disk safe.
    This means we'll delete the old disk safe, including its backups, create a new one and start it immediately. We also store one weekly backup on other servers, so if any backup recovery is needed, please contact our support.

    Sorry for the inconvenience.

    [UPDATE 08:40]
    The disk safe has been recreated, and we've now queued the server for backup.

    [UPDATE 12:04]
    The first few backups are now done, and backups will continue to be taken as usual.

  • Date - 08/10/2014 07:47 - 08/10/2014 12:04
  • Last Updated - 08/10/2014 19:40
NL network maintenance (Resolved)
  • Priority - High
  • Affecting Server - AMS1
  • Our datacenter provider will do maintenance on their network in the AMS-01 datacenter. This maintenance will impact IP connectivity; they'll be working on multiple routers and expect that each segment can be affected for up to two hours.
    There will be a period of service unavailability. While connectivity is being restored you might also experience higher latency and/or packet loss.

    The IPs affected by this maintenance are:

    91.215.156.0/23
    91.215.158.0/23
    95.211.94.0/23
    95.211.168.0/21
    95.211.184.0/21
    95.211.192.0/21
    95.211.224.0/21
    95.211.240.0/22
    179.43.174.128/26
    185.17.184.0/23
    185.17.186.0/23
    188.0.225.0/24
    192.162.136.0/23
    192.162.139.0/24
    193.151.91.0/24
    212.7.198.0/24
    212.114.56.0/23
    
    2001:1af8:4020::/44

    This will also affect our connectivity for Server1.

    Sorry for the inconvenience.

    UPDATE 03-09-2014 18:01:

    We received updates from the datacenter and have been informed that the expected downtime should be no longer than 30 minutes.
    All IPs are undergoing an AS change, and therefore the BGP AS number needs to be changed on the routers.
    Due to the nature of how the internet works, multiple routers need to pick up these changes, resulting in connectivity issues.

    UPDATE 16-09-2014 08:05:

    The maintenance is still ongoing; we're awaiting an update from LeaseWeb with more information. Sorry for the inconvenience.
    We see pings going through from time to time, so the network is being restored at this very moment.

    Sincerely,
    Hosting4Real

  • Date - 16/09/2014 06:00 - 16/09/2014 08:07
  • Last Updated - 17/09/2014 08:30
IPs on server1 blacklisted in spamhaus (Resolved)
  • Priority - Critical
  • Affecting Server - AMS1
  • The IPs 95.211.199.40, .41 and .42 - which host our main site and server1 - are currently blacklisted in the Spamhaus SBL.

    This is because the IPs are part of the subnet 95.211.192.0/20, owned by LeaseWeb, which has been blacklisted.
    We're awaiting a reply from LeaseWeb with a status update and an estimated resolution time.

    You can find more about the blacklisting here: http://www.spamhaus.org/sbl/query/SBL230484

    Due to the good reputation of the IPs assigned to us, this blacklisting will only affect our customers if an ISP relies entirely on Spamhaus for its spam filtering.
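    For reference, listing on an SBL-style blacklist can be checked with a DNS lookup: the IP's octets are reversed and prepended to the DNSBL zone, and an A-record answer means the IP is listed. A minimal sketch (the live lookup is commented out, since it needs network access):

    ```python
    def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
        """Build the DNSBL lookup name: reversed octets, then the zone."""
        return ".".join(reversed(ip.split("."))) + "." + zone

    print(dnsbl_query_name("95.211.199.40"))  # 40.199.211.95.zen.spamhaus.org

    # To actually test it (requires network access):
    #   import socket
    #   try:
    #       socket.gethostbyname(dnsbl_query_name("95.211.199.40"))  # listed
    #   except socket.gaierror:
    #       pass  # NXDOMAIN: not listed
    ```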

    We're truly sorry for the inconvenience caused by this issue.

    UPDATE 13/08/2014 22:16
    We've received the notification from LeaseWeb that their abuse team is aware of this issue and is awaiting reply from Spamhaus.

    UPDATE 15/08/2014 09:47
    The IP subnet has now been removed from the Spamhaus blacklist.
    We haven't received any email bounces, so no problems were caused for any of our customers.

  • Date - 13/08/2014 21:23 - 15/08/2014 09:47
  • Last Updated - 15/08/2014 16:32
VPS management down (Resolved)
  • Priority - High
  • Affecting System - WHMCS
  • We're currently experiencing some issues getting the correct data for customer VPSes, which means you might currently not see your VPS in the control panel.
    We're working on a fix.

    UPDATE 17/07/2014 19:47:
    All issues should be resolved, meaning all nodes should be visible to customers.

    One thing you might notice is that there have been some changes to how your VMs are shown in your product list. This change was made because some customers found it confusing that all existing VMs were stored under the same product.
    You'll now see your 'cloud nodes' - this is the main product that you pay for on a monthly basis.

    The other products you'll see are the actual VPS servers you've deployed. These will have a price of 0, since you already pay for the actual resources through the main product.

    From there you'll be able to manage your VPS, as well as see the amount of storage, compute and memory resources you've allocated, along with the location and operating system.

  • Date - 16/07/2014 11:32 - 17/07/2014 19:47
  • Last Updated - 17/07/2014 19:52
Customer migration part 1 and 2 (Resolved)
  • Priority - Medium
  • We're moving a number of customers from Server2 (RBX1) to a new machine to improve performance even further with new hardware and networking.

    During the migration there may be short outages, so we ask the notified customers not to make too many changes to their web hosting accounts between 21.00 and 07.00.
    An email will be sent out for each web hosting account being moved, and again once the move is complete, so you're kept continuously updated on the status of your migration.

    UPDATE 23.17:
    We've moved the majority of customers from the old server to the new one and are done for this weekend.

    We'll continue with more migrations next weekend, possibly already during the week; we'll inform the affected customers before they are moved.

    UPDATE 4 July 21.23:
    All customers have been moved from the old server2 to the new server2a. The old server will be shut down on 20 July, after which we'll change the hostname from server2a back to server2.

  • Date - 27/06/2014 21:00 - 28/06/2014 07:00
  • Last Updated - 06/07/2014 22:38
Server2 down (Resolved)
  • Priority - Critical
  • Server2 is currently down due to a hardware failure. The defective part is being replaced.
    Sorry for the inconvenience.

    UPDATE 09.07:
    Server2 is back online. Due to the way MySQL crashed, some customer databases became corrupt; MySQL recovered them automatically from the InnoDB log files.
    The system is slower than usual because we're resyncing all data in the RAID array to ensure no data is lost.

    UPDATE 11.00:
    The server was down again; we rebooted it and it was back online after about 4 minutes.
    We've investigated the issue and the problem should be resolved.
    We'll monitor the server closely for the rest of the day.
    Sorry for the inconvenience.
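
    For context, a RAID resync like the one above can be monitored from `/proc/mdstat` on Linux software RAID. A minimal, illustrative parser (an assumption about a typical md setup, not our actual tooling):

```python
import re

def resync_progress(mdstat_text: str):
    """Return the resync/recovery percentage from /proc/mdstat text, or None."""
    m = re.search(r"(?:resync|recovery)\s*=\s*([\d.]+)%", mdstat_text)
    return float(m.group(1)) if m else None
```

    Reading the live status would be `resync_progress(open("/proc/mdstat").read())`.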

  • Date - 03/03/2014 07:20 - 03/03/2014 11:12
  • Last Updated - 06/03/2014 10:21
Server updates (Resolved)
  • Priority - High
  • Affecting System - All servers
  • Back in January we were performing upgrades, which were cancelled due to issues with one of our servers.

    We'll now be performing these upgrades, which will require a restart of all servers; the estimated downtime per server will be approx. 5-10 minutes.
    We apologize for any inconvenience.

    We'll do our best to keep the downtime as low as possible.

    Due to unforeseen circumstances, we've decided to postpone the upgrade until next week. We're sorry about the short notice.

  • Date - 07/02/2014 22:00 - 07/02/2014 23:59
  • Last Updated - 07/02/2014 13:36
Server updates (Resolved)
  • Priority - High
  • Affecting System - All servers
  • During this time we'll be upgrading all our servers with the latest security patches for all services.

    This requires a restart of all servers; the estimated downtime for each server is around 5 minutes if everything goes as planned.

    This affects all of our infrastructure, except VPS customers.
    We'll keep all clients informed on Twitter when we begin the update and when we finish, as well as about any issues that might occur during this 2-hour timeframe.
    We'll try to keep the downtime as low as possible.

    UPDATE 1:
    During the update of server2, an error occurred while loading the kernel.
    It is rebooting again.

    UPDATE 2 (00.59):
    Due to the extended downtime on server2, we've chosen to stop the update of the remaining servers and postpone the upgrade for all our services. Because of the long downtime, we're extending all web hosting accounts on server2 with 1 month free of charge.

    We'll write a blog post tomorrow about the problems we experienced this evening.

    We sincerely apologize for the extra downtime this caused.

  • Date - 27/12/2013 21:59 - 28/12/2013 00:58
  • Last Updated - 05/02/2014 21:01
Upgrade of nginx (Resolved)
  • Priority - Critical
  • We need to perform an upgrade of nginx on server 2. We've found a small issue in the current version that is fixed in the new one.
    We're sorry for the short notice.


    - Hosting4Real

  • Date - 02/10/2013 22:30 - 02/10/2013 22:59
  • Last Updated - 03/10/2013 00:41
Upgrade of mysql from 5.1 to 5.5 (Resolved)
  • Priority - High
  • We'll be upgrading our MySQL version from 5.1 to 5.5.
    This upgrade will improve the performance of MySQL on server2.

    We've planned 2 hours for the upgrade, which should be enough to complete it.
    We'll back up all databases before we start. Some websites may see short timeouts or downtime during the upgrade, but we'll try to keep this to a minimum.

    This is part of our normal service upgrades, and it's a must that we do this.

    Best regards,
    Hosting4Real


    UPDATE 22.11:
    Everything went as planned.
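
    As an aside, a pre-upgrade backup like the one described above is typically done with `mysqldump`. The sketch below only builds the command line; the flags shown are a common approach for InnoDB, not our exact procedure:

```python
def dump_command(database: str, outfile: str, user: str = "root") -> list:
    """Build a mysqldump invocation for a consistent InnoDB backup."""
    return [
        "mysqldump",
        "--user", user,
        "--single-transaction",  # consistent snapshot for InnoDB without locking tables
        "--routines",            # include stored procedures and functions
        database,
        "--result-file", outfile,
    ]
```

    Running it would look like `subprocess.run(dump_command("shop", "/backup/shop.sql"), check=True)`, where `shop` is a hypothetical database name.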

  • Date - 07/09/2013 22:00 - 07/09/2013 22:11
  • Last Updated - 07/09/2013 22:16
Server2 network (Resolved)
  • Priority - High
  • Tonight at 01:09 CEST, we saw a small outage of 7-10 minutes on server2 (RBX1). The problem was that all route reflectors crashed at the same time. The data center is working on finding a solution; as a test, some of the reflectors were downgraded to an earlier version of the software, but this didn't solve the problem.
    These outages are very rare, since the route reflectors are redundant: one reflector can go down and another will still handle the routing without causing any downtime.

    This problem affected the global network, a total of 7 data centers (in 3 different locations). Those 7 data centers have a total of 10 route reflectors, which all went down at almost the same time. The data center is working with Cisco to find the root cause of this problem and will fix it as soon as possible.

    - Hosting4Real

  • Date - 18/07/2013 01:09 - 01/08/2013 00:00
  • Last Updated - 05/08/2013 07:36
Upgrade of nginx (Resolved)
  • Priority - Critical
  • We'll be changing the webserver of RBX1 (Server2) this evening due to an upcoming release of cPanel.
    We will remove nginx until the new cPanel version has been released and we've tested the webserver properly.

    When we do the uninstall, all websites will become unavailable for a very short time, since we need to stop nginx and start Apache again.

    Sorry for the short notice, but it was only announced this morning that cPanel 11.38 will be released within a few days, which could be as soon as today. We want to make sure that everything works.

    - Hosting4Real



    UPDATE 1:
    We've successfully made the change, and everything is working. We'll get nginx back on the server when we've tested it properly with the new release.

    UPDATE 2:
    Nginx is up and running again. It caused 1-2 minutes of downtime for a few people due to a small mistake in the log-format section of the nginx configuration file.

  • Date - 24/05/2013 21:00 - 24/05/2013 21:02
  • Last Updated - 15/06/2013 15:16
Upgrade of nginx and php (Resolved)
  • Priority - Medium
  • Affecting System - All servers
  • We'll be performing software upgrades affecting PHP and nginx.
    This upgrade has been tested in our development environment and resulted in 3 seconds of downtime; we'll try to keep it at that level, if we can't avoid downtime entirely.

    During the upgrade of PHP, the memcache, htscanner and New Relic modules will be removed for a short amount of time. This means that if you use any of these modules, your code might return some errors during the upgrade.
    If you're making use of htscanner, note that sites will return internal server errors until it's recompiled (2-3 minutes).

    We've set the maintenance window to two hours. Downtime will be minimal.

    - Hosting4Real

    UPDATE 1:
    Server1 is now fully updated; the next server will be server2.

    UPDATE 2:
    All servers are now up to date, and everything went as it should.

    Thank you for your patience.

  • Date - 17/05/2013 22:59 - 18/05/2013 00:01
  • Last Updated - 18/05/2013 00:05
High packet loss on RBX1 (Server2) (Resolved)
  • Priority - Critical
  • We're currently seeing high packet loss on server2, affecting all IPs: 94.23.28.169, 94.23.147.217 and 94.23.148.5.

    We've narrowed the problem down to the parts of the network called vss-1-6k and eur-1-1c. The packet loss is currently between 53% and 98%, which means sites might only be reachable intermittently.

    Sorry for the downtime this causes; we'll try to get things back to normal as fast as possible.

    UPDATE:

    The network is back to normal.
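
    Packet loss of this kind is usually measured with `ping`; extracting the loss percentage from a ping summary line can be done with a small helper (illustrative only, not our monitoring stack):

```python
import re

def packet_loss(ping_summary: str):
    """Return the packet-loss percentage from a ping summary line, or None."""
    m = re.search(r"([\d.]+)% packet loss", ping_summary)
    return float(m.group(1)) if m else None
```

    For example, feeding it the line `10 packets transmitted, 3 received, 70% packet loss, time 9012ms` yields 70.0.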

  • Date - 17/05/2013 04:35 - 17/06/2013 06:19
  • Last Updated - 17/05/2013 06:35
Outage of 6-7 minutes on RBX1 (Resolved)
  • Priority - Critical
  • We had a small outage on RBX1 for a short period of 6-7 minutes. The error occurred while generating vhosts; we're looking into why this happened and will do our best to prevent it in the future. We're sorry for the outage.

  • Date - 13/02/2013 17:24 - 13/02/2013 17:52
  • Last Updated - 13/02/2013 17:51
Software update (Resolved)
  • Priority - Low
  • We're making a small update to some software on AMS1, which means there may be a few minutes of downtime. We'll do our best to avoid this.


    ---------
    Hosting4Real

  • Date - 08/02/2013 23:50 - 08/02/2013 23:59
  • Last Updated - 09/02/2013 00:08
Cpanel not accessible (Resolved)
  • Priority - Medium
  • All customers located on AMS1 (Server1) are currently unable to log into cPanel. We found that the license renewal didn't work as expected; this should be working again soon.

    Sorry for the inconvenience of not being able to log in.

    Email and websites will keep functioning as normal.

    UPDATE:
    The problem is fixed. Everything should now function normally.

  • Date - 02/02/2013 12:16 - 02/02/2013 14:24
  • Last Updated - 02/02/2013 14:24
Server Maintenance (Resolved)
  • Priority - Low
  • The night between December 14 and December 15 we'll have server maintenance - this requires that we power off the affected server. Our estimated time is 2 hours, but we've reserved 4 hours for this maintenance.

    During this time, services will not be available.

    We're sorry for the downtime it may cause, but we'll do our best to keep it as short as possible!

  • Date - 14/12/2012 23:59 - 15/12/2012 03:29
  • Last Updated - 04/12/2012 15:53
Expecting 10 minutes of downtime for VPSs (Resolved)
  • Priority - Medium
  • Affecting Other - Network
  • The core routers in front of the VPSs will be updated with new software due to a new security release. This means that VPSs in Amsterdam are expected to be down for 10 minutes.

  • Date - 14/11/2012 20:00 - 14/11/2012 20:10
  • Last Updated - 17/11/2012 16:12
Expecting 10 minutes of downtime for VPSs (Resolved)
  • Priority - Medium
  • Affecting Other - Network
  • The core routers in front of the VPSs will be updated with new software due to a new security release. This means that VPSs in Frankfurt are expected to be down for 10 minutes.

  • Date - 13/11/2012 20:00 - 13/11/2012 20:10
  • Last Updated - 17/11/2012 16:12
Expecting 10 minutes of downtime for VPSs (Resolved)
  • Priority - Medium
  • Affecting Other - Network
  • The core routers in front of the VPSs will be updated with new software due to a new security release. This means that VPSs in Paris are expected to be down for 10 minutes.

  • Date - 12/11/2012 20:00 - 12/11/2012 20:10
  • Last Updated - 17/11/2012 16:12