> traceroute a.ns.facebook.com
traceroute to a.ns.facebook.com (18.104.22.168), 30 hops max, 60 byte packets
1 dsldevice.attlocal.net (192.168.1.254) 0.484 ms 0.474 ms 0.422 ms
2 107-131-124-1.lightspeed.sntcca.sbcglobal.net (22.214.171.124) 1.592 ms 1.657 ms 1.607 ms
3 126.96.36.199 (188.8.131.52) 1.676 ms 1.697 ms 1.705 ms
4 184.108.40.206 (220.127.116.11) 11.446 ms 11.482 ms 11.328 ms
5 18.104.22.168 (22.214.171.124) 7.641 ms 7.668 ms 11.438 ms
6 cr83.sj2ca.ip.att.net (126.96.36.199) 4.025 ms 3.368 ms 3.394 ms
7 * * *
I'm not sure of all the implications of those circular dependencies, but they probably make it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.
"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.
That's what you have to do to really own a domain.
Wow, I had no idea it was so cheap once you're a registrar. The implication is that anyone who wants to be a domain squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a greatly inflated price. Does ICANN have rules against this? Surely this is being done?
 "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...
Personally saw this kind of thing as early as 2001.
Never search for free domains on the registrar site unless you are going to register it immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.
I searched for a domain that I couldn't immediately grab (one of the more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(
Emailed the site's new owner D: but fairly predictably got no reply.
Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.
I now exclusively do all my queries via the WHOIS protocol directly. Welp.
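If anyone wants to do the same, WHOIS really is just a plain TCP exchange on port 43 (RFC 3912): send the query, read until the server hangs up. A minimal Python sketch, assuming you already know which server is authoritative for the TLD (whois.verisign-grs.com handles .com; whois.iana.org can tell you where to look for others):

import socket

def whois(domain: str, server: str = "whois.verisign-grs.com", port: int = 43) -> str:
    # RFC 3912: open a TCP connection, send the query followed by CRLF,
    # then read until the server closes the connection.
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(whois("facebook.com"))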
Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.
You are off by a factor of almost 50.
They want you to have $70k liquid.
To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.
In any case, no, they will not need to send an email to fix this issue.
So yes, the registrar that is to blame is themselves.
Source: I know someone within the company that works in this capacity.
That’s not how it works. The info on whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar-to-registrar technical communication to occur.
Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
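A rough sketch of that shortcut and of where it breaks, using dnspython (an assumption on my part). The exception handling is the point: a registered name whose nameservers have gone dark, like facebook.com today, fails with SERVFAIL or timeouts rather than NXDOMAIN, and treating that as "available" is exactly the bug.

import dns.exception
import dns.resolver  # pip install dnspython

def looks_unregistered(domain: str):
    try:
        dns.resolver.resolve(domain, "NS")
        return False     # delegation answers: clearly registered
    except dns.resolver.NXDOMAIN:
        return True      # no such name: probably available
    except (dns.resolver.NoNameservers, dns.resolver.NoAnswer, dns.exception.Timeout):
        # Registered but broken (or just unlucky): the only authoritative
        # answer is the registry's EPP domain:check, not the DNS.
        return None

print(looks_unregistered("facebook.com"))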
% traceroute -q1 -I a.ns.facebook.com
traceroute to a.ns.facebook.com (188.8.131.52), 64 hops max, 48 byte packets
1 torix-core1-10G (184.108.40.206) 0.133 ms
2 facebook-a.ip4.torontointernetxchange.net (220.127.116.11) 1.317 ms
3 18.104.22.168 (22.214.171.124) 1.209 ms
4 126.96.36.199 (188.8.131.52) 15.604 ms
5 184.108.40.206 (220.127.116.11) 21.716 ms
% traceroute6 -q1 -I a.ns.facebook.com
traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
1 toronto-torix-6 0.146 ms
2 facebook-a.ip6.torontointernetxchange.net 17.860 ms
3 2620:0:1cff:dead:beef::2154 9.237 ms
4 2620:0:1cff:dead:beef::d7c 16.721 ms
5 2620:0:1cff:dead:beef::3b4 17.067 ms
»The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background = everybody is being slammed who runs large scale DNS, so knock-on impacts elsewhere the longer this goes on.«
Well at least it will in 2036, when IPv6 goes mainstream.
Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).
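A toy sketch of that last idea, keeping the last N badges the server accepted so the reader can keep working offline. Everything here (the class, check_with_server, the cache size) is made up for illustration; a real system would also want expiry and signed badge payloads rather than bare IDs.

from collections import OrderedDict

def check_with_server(badge_id: str) -> bool:
    # Stand-in for the normal online validation call; here it just
    # simulates the backend being unreachable, as it was today.
    raise ConnectionError("access-control backend unreachable")

class BadgeReader:
    def __init__(self, max_cached: int = 500):
        self.recently_valid = OrderedDict()  # badge_id -> True, oldest entries first
        self.max_cached = max_cached

    def validate(self, badge_id: str) -> bool:
        try:
            ok = check_with_server(badge_id)
        except ConnectionError:
            # Offline fallback: accept only badges the server has
            # recently accepted through this reader.
            return badge_id in self.recently_valid
        if ok:
            self.recently_valid[badge_id] = True
            self.recently_valid.move_to_end(badge_id)
            while len(self.recently_valid) > self.max_cached:
                self.recently_valid.popitem(last=False)
        return ok

reader = BadgeReader()
reader.recently_valid["badge-1234"] = True  # seeded during normal operation
print(reader.validate("badge-1234"))        # True even with the backend down
print(reader.validate("badge-9999"))        # False: never seen this badge succeed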
...the doors are glass right?
And I guess beyond that point, walls are glass. Or you need explosives.
A few hundred bucks of glass vs a billion wiped off the share price if the service is down for a day and all the users go find alternatives.
I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.
A small company has to keep all of its customers happy (or at least be responsive when issues arise, at a bare minimum).
Massive companies deal in error budgets, where a fraction of a percent can still represent millions of users.
I guess good decentralized public communication services could solve those issues for everybody.
At the lowest level, in case of a severe outage, we resort to IRC, Plain Old Telephone Service and, sometimes, sticky notes taped to windows...
I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.
Backup tapes and production servers are kept at different colocation sites to protect data from fire and other catastrophes of that level
Using colo sites on separate tectonic plates would protect you from catastrophes on a geological cataclysm level
Last time I used tape, we used Iron Mountain to haul the tapes 60 miles away, which was determined to be far enough for seismic safety, but that was over a decade ago.
What use is it if it runs on the same stack as what you might be trying to fix?
IRC does use DNS at least to get hostnames during connection. I'd be surprised if it didn't use it at other points.
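Right, and the resolution step is the part that breaks first. A tiny illustration of the dependency (the hostname and the fallback address are placeholders; the point is simply that a literal IP written down in the runbook keeps working when the resolver doesn't):

import socket
from typing import Optional

def connect_irc(host: str, port: int = 6667, fallback_ip: Optional[str] = None) -> socket.socket:
    try:
        # Normal path: needs a working DNS resolver to turn host into an address.
        return socket.create_connection((host, port), timeout=10)
    except socket.gaierror:
        if fallback_ip is None:
            raise
        # DNS is down: fall back to the address recorded ahead of time.
        return socket.create_connection((fallback_ip, port), timeout=10)

# sock = connect_irc("irc.example.net", fallback_ip="192.0.2.10")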
My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...
yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...
The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)
Safes typically have the instructions on how to change the combination glued to the inside of the door, ending with something like "store the combination securely. Not inside the safe!"
But as they say: make something foolproof and nature will create a better fool.
Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.
I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.
While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?
Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.
While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams are doing periodically.
If you are interested in reading more about Google's approach to disaster planning and preparedness, see the "DiRT, or how to get dirty" section of Shrinking the time to mitigate production incidents—CRE life lessons (https://cloud.google.com/blog/products/management-tools/shri...)
and Weathering the Unexpected (https://queue.acm.org/detail.cfm?id=2371516).
At some point, they must run out of names, right?
Google has more than 1 L8 SRE.
I was not trying to establish a trust chain.
Take from it what you will.
No shit Google has plans in place for outages.
But what are these plans, are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.
No one knows or cares who made the statement; it may as well have been 'water is wet'. It was useless and adds nothing but noise.
I have replied to my initial comment to provide some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.
Disclaimer: Ex-Googler who used to work on disaster response. Opinions are my own.
It's not loading for me. Could you say what it said?
The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.
- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once
Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.
I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.
I can imagine this affects many other sites that use FB for authentication and tracking.
If people pay proper attention to it, this is not just an average, run-of-the-mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...
Hopefully law makers all study up and pay close attention.
What transpires next may prove to be very interesting.
>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
"anyone have a Cisco console cable lying around?"
If this issue is even to do with BGP it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.
(and yes, fb.com resolves)
Note that resiliency and efficiency are often working against each other.
Or when trying IPs directly:
I would have expected a DNS issue to not affect either of these.
I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first-class citizen.