Posted: 19 Jul 2017, 20:16
by Wodan
webwit wrote: I am not, it's just how the hoster set it up.
Maybe get in touch with their support. Since we're on the default config, maybe they can do the mdadm magic as well when replacing that HDD!
My experience with Hetzner support has been very good so far.
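For reference, the mdadm magic is roughly this (a sketch only; the array and partition names are assumptions, check /proc/mdstat for the real ones):
Code:
# Drop the failed member from the array (before the physical disk swap).
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# Once the new disk is in: copy the partition layout over from the
# healthy disk (MBR assumed; use sgdisk for GPT), then re-add.
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm /dev/md0 --add /dev/sdb1     # triggers the RAID1 rebuild
cat /proc/mdstat                   # watch the resync progress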
Posted: 19 Jul 2017, 20:41
by webwit
Ok, I sent them a support request explaining the issues.
Posted: 19 Jul 2017, 20:48
by webwit
OK, they are quick. What time is convenient?
Dear Client,
We would like to check the hard drives briefly from our side. Please tell us when we may turn off the server for approx 30-45 minutes in order to perform the test.
Kind regards
xxxxxx
Posted: 19 Jul 2017, 21:00
by webwit
I told them you are all just silly people, so any time is convenient: better sooner than later, but not at night, so I can check after the reboot that everything is running well. They should let me know one hour in advance, so I can announce the downtime.
Posted: 19 Jul 2017, 21:06
by Wodan
Thanks for taking care of this. Really hoping Hetzner doesn't drop the ball!
Posted: 19 Jul 2017, 22:11
by webwit
The server and thus deskthority will be down from 22:25 UTC July 19th (00:25 CEST July 20th, 18:25 EDT July 19th, 15:25 PDT July 19th) for an estimated 30 to 45 minutes, for a health check of our hard drives. This is 2 hours and 15 minutes from now. See you on the other side of the event horizon!
P.S. I just completed another off-site backup, just in case.
Posted: 19 Jul 2017, 23:22
by webwit
matt3o wrote: and that's where RAID on just two drives is a little pointless, since the machine can't always tell which data is actually bad and which is good (50-50).
I am just talking out of my ass here, since the last time I got deep into drive technology was Amiga floppy disks. If I remember correctly, such a disk was divided into a bunch of tracks, which were divided into a bunch of sectors. Each sector had a checksum, so when you read data from the sector and compared it against the checksum, you knew whether the data was healthy or corrupt. I presume technology hasn't deteriorated, and modern HDDs and SSDs also checksum or otherwise validate sectors, so in a RAID1 setup you know which disk has the right data and which the broken one?
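For illustration, the comparison I have in mind would be something like this (device names and the block offset are made up, this needs root, and the array should be idle for it to mean anything):
Code:
# Read the same 4 KiB block from both RAID1 members and checksum it.
# A mismatch proves the two copies diverged, but says nothing by itself
# about which copy is the healthy one.
for dev in /dev/sda2 /dev/sdb2; do
    echo -n "$dev: "
    dd if="$dev" bs=4096 skip=123456 count=1 2>/dev/null | sha256sum
done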
Posted: 20 Jul 2017, 07:16
by Wodan
Any update?
And yeah, RAID1 would be pretty pointless if the bad drive couldn't be told apart from the good one.
Posted: 20 Jul 2017, 07:43
by webwit
Yeah, they took 2 hours 15 minutes instead of 30-45 minutes, and then told me this:
Dear Client
Both hard drives are fine. We have started your server back into your installed system. But note there is currently a rebuild of one device running.
Kind regards
xxxxxxx
However, I just checked smartctl -a again, and the numbers seem significantly worse than yesterday.
Code:
root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
1 Raw_Read_Error_Rate 0x000f 070 063 044 Pre-fail Always - 12163138
Posted: 20 Jul 2017, 07:53
by Wodan
Aw rats. Did they take note of the SMART readouts?
Maybe we should get our shit together and move to a new server. I've really learned to appreciate AWS lately. Depending on the performance we need, it might even be cheaper than a small Hetzner root EX server.
Posted: 20 Jul 2017, 08:16
by webwit
I did send them yesterday's readouts. It's probably best to keep it simple right now and just hop to another Hetzner server. Not the right time for a bigger move.
Posted: 20 Jul 2017, 08:26
by matt3o
Unfortunately it seems that Hetzner only checks the smartctl self-test errors and not the individual attribute values (the two commands below show the difference). Yesterday the Raw_Read_Error_Rate value was 78, today it's already 70. Basically you have to wait until the HDD fails; at that point they will change it in a few minutes. At this rate of roughly 10 points per day, we have maybe 4-5 days of autonomy.
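For reference, these are two different reports (the first is presumably what their support looks at, the second is where Raw_Read_Error_Rate lives):
Code:
smartctl -l selftest /dev/sdb   # the self-test log: pass/fail entries only
smartctl -A /dev/sdb            # the attribute table with the raw values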
webwit wrote: I presume technology hasn't deteriorated, and modern HDDs and SSDs also checksum or otherwise validate sectors, so in a RAID1 setup you know which disk has the right data and which the broken one?
RAID is not a backup system; it's just a way to have some redundancy (or a nice way to add disk space to an array).
Without the RAID, after yesterday's failure we would probably have a dead server, so hurrah for us! But if it worked the way you are describing, we wouldn't have corrupted data: the good bits would have been synced over from the healthy HDD, yet we had data loss anyway. RAID1 is fine and dandy, but it doesn't save you from data loss. In fact, since the failure rate of an HDD is around 1.5-3%, having two HDDs doubles our chances of a broken HDD. In that sense, having just one new HDD is better than having two old ones... but Hetzner uses hard drives that have been running non-stop for ages, so RAID even with just two drives makes sense.
But if data loss is your concern, a backup is the only solution.
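For what it's worth, md can at least count how far the two copies have diverged; roughly like this (md0 is an assumption, check /proc/mdstat for the real array name):
Code:
# Trigger a consistency check of the RAID1 array, then read the result
# once it finishes. A mismatch_cnt above 0 means the copies differ, but
# md cannot arbitrate which copy is the correct one.
cat /proc/mdstat                            # find the array name
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt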
Posted: 20 Jul 2017, 08:31
by webwit
Weirdly, the raw value keeps going up, but the normalized value is up to 72 now.
Code:
root@server [~]# while true; do smartctl -a /dev/sdb |grep Raw_Read_Error_Rate; sleep 300; done
1 Raw_Read_Error_Rate 0x000f 070 063 044 Pre-fail Always - 12163138
1 Raw_Read_Error_Rate 0x000f 070 063 044 Pre-fail Always - 12518172
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 12762654
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 13082807
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 13765149
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 14005397
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 14182096
1 Raw_Read_Error_Rate 0x000f 071 063 044 Pre-fail Always - 14432541
1 Raw_Read_Error_Rate 0x000f 072 063 044 Pre-fail Always - 14697695
1 Raw_Read_Error_Rate 0x000f 072 063 044 Pre-fail Always - 14840703
Posted: 20 Jul 2017, 08:39
by matt3o
yeah, the values fluctuate. In that output, 07x is the current normalized value, 063 is the worst value that has ever been registered, and 044 is the threshold we should never reach.
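If you only want those three numbers, something like this does it (a sketch; field positions as in the output above):
Code:
# Print the current normalized value, the worst value ever registered,
# and the failure threshold for Raw_Read_Error_Rate (fields 4-6).
smartctl -A /dev/sdb | awk '$2 == "Raw_Read_Error_Rate" {print "value=" $4, "worst=" $5, "thresh=" $6}'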
Posted: 20 Jul 2017, 09:10
by webwit
I ordered a new server.
Posted: 20 Jul 2017, 09:47
by matt3o
check the hard drives before installing anything
Posted: 20 Jul 2017, 10:14
by Wodan
matt3o wrote: check the hard drives before installing anything
Very good point. They re-use servers, and considering their policy with worn-out HDDs we should request a brand-new one!
Posted: 20 Jul 2017, 11:05
by webwit
Both sda and sdb on the new server have a fluctuating Raw_Read_Error_Rate, which after a few queries stabilizes at 080.
I'll run some longer tests.
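For reference, the longer test is the extended SMART self-test; it runs in the background on the drive itself and can take hours on a big HDD:
Code:
smartctl -t long /dev/sda        # queue the extended self-test
smartctl -t long /dev/sdb
# Later, read the results from the self-test log:
smartctl -l selftest /dev/sda
smartctl -l selftest /dev/sdb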
Posted: 20 Jul 2017, 11:09
by matt3o
80 is fine if the raw value is more or less stable
Posted: 20 Jul 2017, 11:38
by seebart
matt3o wrote: 80 is fine if the raw value is more or less stable
Great work webwit and matt3o, I promise I'll refrain from posting memes excessively if that helps.

Posted: 20 Jul 2017, 17:22
by XMIT
So, not SSD time, yet?
If not as the primary, I'd love to see an SSD being used in a write-through cache.
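With LVM that could look roughly like this; a sketch only, assuming the data already sits in a volume group vg0 with a logical volume named data, and the SSD shows up as /dev/sdc (all names made up):
Code:
pvcreate /dev/sdc                 # add the SSD to LVM
vgextend vg0 /dev/sdc             # ...and to the existing volume group
lvcreate -L 200G -n cache0 vg0 /dev/sdc     # cache data LV on the SSD
lvcreate -L 1G -n cache0meta vg0 /dev/sdc   # cache metadata LV
lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
# Attach the pool in write-through mode: writes hit the HDDs before they
# are acknowledged, so losing the SSD loses no data.
lvconvert --type cache --cachepool vg0/cache0 --cachemode writethrough vg0/data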
Posted: 20 Jul 2017, 19:21
by webwit
This one:
https://www.hetzner.de/dedicated-rootserver/ex41
When you order you can pick options such as an extra SSD drive (cheapest one: 250 GB at 11.90 EUR per month), but the real question is: do we need it? In any case, that's a different discussion; the priority now is to get a stable environment ASAP. I'm planning the move for Saturday or Sunday.
Posted: 20 Jul 2017, 19:26
by tobsn
What do you think about hitting up AWS and getting Elastic Beanstalk hosting for free? I think that's a possibility... then you wouldn't ever have to bother with server hardware.
Posted: 20 Jul 2017, 19:27
by wobbled
If you don't want to go SSD, at least get 15k SAS.
Posted: 20 Jul 2017, 23:20
by Wodan
webwit wrote: When you order you can pick options such as an extra SSD drive (cheapest one: 250 GB at 11.90 EUR per month), but the real question is: do we need it?
Unless we are experiencing HDD performance bottlenecks, I would prefer a good enterprise HDD over an SSD.
Most HDDs die slowly and give you time to react, while some SSDs just stop working, with no way to recover your data.
Maybe set up weekly SMART reports from the server as an early warning.
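smartd can do exactly that; a minimal /etc/smartd.conf sketch (the mail address is a placeholder):
Code:
# Monitor all SMART attributes (-a), mail warnings (-m), and run a long
# self-test every Sunday night (-s, one drive at 02:xx, the other at 03:xx).
/dev/sda -a -m admin@example.org -s (L/../../7/02)
/dev/sdb -a -m admin@example.org -s (L/../../7/03)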

Posted: 20 Jul 2017, 23:56
by Norman_
Wodan wrote: Most HDDs die slowly and give you time to react, while some SSDs just stop working, with no way to recover your data.
Losing your data to a sudden SSD failure, as opposed to catching and replacing a failing HDD, is kind of irrelevant IMO: SSD failure is much less common than HDD failure, by something like an order of magnitude, and I find that early warning measures for HDD failure generally aren't as reliable as one would hope. An HDD death can be just as sudden and unexpected as an SSD death.
Not to mention that if you don't have some sort of redundancy/backup, you might as well just delete everything manually.
And while I'm here, friendly reminder that RAID is not a backup.
Of course, not that this entire conversation matters... because deskthority is actually pretty fast and doesn't even need SSDs at all lol.