Pingdom stats are a bit strange sometimes. For our services (check my profile if you want to know which), a normal month is simply 100% uptime. But sometimes your website is marked as down because Pingdom itself has an issue. Several times we recorded downtime caused by errors between their servers in California and ours in Europe, which I was able to trace back to a network issue deep in California.
Do note that Pingdom, unless specifically configured otherwise, counts a successful HTTP request as uptime. If that HTTP request is returning a page saying "our database servers are down, try again later", it still shows as uptime.
You can configure it to send POST data and expect a particular response, but there's no way of telling (other than "yep, that's what we're doing") whether someone's uptime stats are based on that sort of check.
We configured all the benchmark apps to expect a specific response. So on GitHub we go and check out a repo, on FreshBooks a known invoice, on Assistly the agents index, etc.
This is "the app is functional" checking. Not just 200 OK.
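That kind of check can be sketched in a few lines. Here's a minimal sketch in Python; the marker string and URL handling are illustrative assumptions, not anything 37signals or Pingdom actually runs:

```python
# Sketch of a content-based "app is functional" check, as opposed to a
# bare HTTP ping. The marker string is a hypothetical placeholder.
from urllib.request import urlopen
from urllib.error import URLError

def looks_functional(status: int, body: str, marker: str) -> bool:
    """A page counts as up only if it returns 200 AND contains the marker.

    A 200 that renders "our database servers are down, try again later"
    fails this check, even though a plain ping would count it as uptime.
    """
    return status == 200 and marker in body

def check_url(url: str, marker: str, timeout: float = 5.0) -> bool:
    """Fetch the page and apply the content check; any error counts as down."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return looks_functional(resp.status, body, marker)
    except URLError:
        return False
```

The point is the second condition: a 200 with the wrong body is still "down".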
The most reliable measure of true uptime would be transaction monitoring, e.g. simulating a login into each of these web apps. It also catches those issues where the server itself is up but the site doesn't work for some other reason (a database issue, a JavaScript issue, ...).
Or do you know that all your downtime was "total downtime", i.e. the server not reachable at all?
This is one of those posts that drives me a little crazy :P
We recently started a web monitoring company (in beta), http://www.verelo.com, to tackle situations just like this. Pingdom monitors every 60 seconds; that's just not enough if you want to make claims like "we were only down for 6 minutes". If you want to make claims at the single-digit-minute level, you need to check more regularly, and your checks need to be robust. We offer monitors that check as often as every 5 seconds.
We're going to monitor those same 6 sites for the next 30 days using our service at Verelo and provide comparative results when the next 37signals blog post comes out.
I use Pingdom, but wish they could monitor more frequently, for exactly the reasons you mention. In theory my site can be unavailable for several seconds each minute without showing any downtime in the stats. I signed up for your service, but it looks like it's invite-only right now?
I think most people's invites have been approved at the moment.
We have a big release scheduled for Feb. Some cool new features like custom user-agent strings and PagerDuty integration are going to be pushed out. We're excited :)
Sounds good! Remember to set up transaction tracking, though. Just doing pings or an HTTP GET and expecting 200 OK on a page is not enough for this to be interesting.
For Basecamp, we log in, go to a project, and post a message, or something like that, so you catch as much as possible.
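A scripted flow like that boils down to an ordered list of steps that fails fast. A minimal sketch, where the step names and actions are illustrative assumptions rather than Basecamp's actual monitoring code:

```python
# Hypothetical transaction-monitoring skeleton: run a scripted user flow
# (log in -> open project -> post message) and report the first step that
# fails. The step callables are placeholders for real HTTP interactions.
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]

def run_transaction(steps: List[Step]) -> Tuple[bool, str]:
    """Execute steps in order; return (ok, name of the failed step or '')."""
    for name, action in steps:
        try:
            if not action():
                return False, name  # the step reported failure
        except Exception:
            return False, name      # the step blew up entirely
    return True, ""
```

The payoff is that a database outage or a broken form handler fails at a specific named step, instead of hiding behind a 200 on the login page.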
Looks useful. I've got a partly written long-poll-based monitor that I must get into a sensible state and chuck on GitHub at some point. I was intending for it to monitor a few bits of mine in a semi-distributed manner (each of my little bits of the net monitoring the others). The long poll would mean I could "tick" every ten seconds (or even every second, though ten was the plan) using very little bandwidth. It wouldn't have to be polling a web server, either, but in all my cases it would be, with the web server doing other local checks (Is the DB up? Can I log in? No new errors in the critical-stuff log? ...).
I might have to give your service a look until I finish my little toy, if I ever do.
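That "local checks behind the web server" idea can be sketched as one function the polled endpoint would call; the probe names and logic here are illustrative assumptions:

```python
# Sketch of bundling cheap local probes (Is the DB up? Can I log in?
# New errors in the critical-stuff log?) into one result the remote
# long-poller reads. Probe names are hypothetical placeholders.
from typing import Callable, Dict

def run_local_checks(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every probe; an exception inside a probe counts as a failure."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results
```

The remote poller then only needs one request per tick to see every local check's status at once.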
It is called a one-time password for a reason. Since it has been sent out via some medium in plain text, it is good practice to force a password change on next login.
We certainly have! I would be embarrassed if we hadn't.
Also, that's why we're comparing uptime against big, established web applications. I wouldn't hold these standards up to a one-year-old, newly started business. You have far more important things to worry about than getting an extra nine on your uptime.
The newer organizations in your blog post don't seem to have problems matching your uptime. Doesn't sound like this holds true anymore.
The stats are only for a 6-week period, which is in no way a great indicator of a product's uptime. Curious to know how they fared over the whole of the previous year.
I loved the second comment from DHH replying to GB:
I didn’t think it was a very long article? Your first comment missed the first paragraph and your second comment missed the second paragraph. There’s only two more paragraphs to go, so please take a swing at them :)
So, like any stats, take them with a grain of salt.