Notes on a Cluster Disk Failure
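Run hadoop fsck /
The report shows a large number of blocks with missing replicas. Because I was still holding out hope that the disk could be repaired, I did not run hadoop fsck / -delete.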
/user/root/test.txt: Under replicated BP-1069288-141-1454466502803:blk_1074786045275. Target Replicas is 2 but found 1 replica(s).
/user/root/test1.txt: Under replicated BP-10619288-10.54466502803:blk_1091327_2852203. Target Replicas is 3 but found 2 replica(s).
.........
Status: HEALTHY
 Total size: 6478807583062 B (Total open files size: 12661477 B)
 Total dirs: 74799
 Total files: 1028230
 Total symlinks: 0 (Files currently being written: 169)
 Total blocks (validated): 1045666 (avg. block size 6195867 B) (Total open file blocks (not validated): 166)
 Minimally replicated blocks: 1045666 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 209121 (19.998833 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 1.8725061
 Corrupt blocks: 0
 Missing replicas: 209121 (9.649644 %)
 Number of data-nodes: 7
 Number of racks: 1
FSCK ended at Tue Apr 19 11:19:12 CST 2016 in 50218 milliseconds

The filesystem under path '/' is HEALTHY

Some time later, the ops team got the disk working again; reportedly it was some kind of snapshot problem. The node was then added back to the cluster. Running hadoop fsck / again at this point, each run showed the namenode finding more blocks than the previous one; in other words, the lost data was being recovered. The final run showed no missing blocks at all.
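Comparing successive fsck runs by eye is tedious, so here is a minimal sketch of how the relevant counters could be pulled out of a report. It is not part of the original troubleshooting; the function names parse_fsck_summary, is_fully_replicated and fsck_report are made up for illustration, and the parsing only assumes the summary line format shown in the output above.

```python
import re
import subprocess

# Counters of interest; the names match the text of the fsck summary.
FIELDS = ("Under-replicated blocks", "Corrupt blocks", "Missing replicas")


def parse_fsck_summary(report):
    """Return {field: count} for each FIELDS entry found in an fsck report."""
    counts = {}
    for field in FIELDS:
        match = re.search(re.escape(field) + r":\s*(\d+)", report)
        if match:
            counts[field] = int(match.group(1))
    return counts


def is_fully_replicated(report):
    """True only if every counter in FIELDS is present and equal to zero."""
    counts = parse_fsck_summary(report)
    return all(counts.get(field) == 0 for field in FIELDS)


def fsck_report(path="/"):
    """Run `hadoop fsck <path>` and return its output (needs a Hadoop client)."""
    return subprocess.check_output(["hadoop", "fsck", path], text=True)
```

Calling parse_fsck_summary(fsck_report()) after each run gives three numbers that are easy to log and compare. For the two reports in this post, Under-replicated blocks goes from 209121 in the first run to 0 in the final run, whose output follows.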
.........
Status: HEALTHY
 Total size:
 Total dirs:
 Total files:
 Total symlinks: 0 (Files currently being written: )
 Total blocks (validated):
 Minimally replicated blocks: (99.99999 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 2.072726
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 8
 Number of racks: 1
FSCK ended at Tue Apr 19 11:41:58 CST 2016 in 13683 milliseconds

The filesystem under path '/' is HEALTHY

Summary
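The cluster runs with a replication factor of 2. This greatly reduces the storage space the cluster consumes, but it also carries a real risk of data loss. The next step is therefore to set up a large-capacity server as a cold backup and copy the important data to it on a daily schedule.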
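As a rough sketch of what such a daily job might look like, the snippet below copies a list of HDFS paths into a dated directory on the backup machine with hadoop fs -get. Everything specific here is an assumption: the post does not say which paths count as important, where the backup lives, or which tool will be used, so IMPORTANT_PATHS, BACKUP_ROOT and the function names are placeholders.

```python
import datetime
import os
import subprocess

# Hypothetical values; the post names neither the paths nor the backup location.
IMPORTANT_PATHS = ["/user/root/important"]
BACKUP_ROOT = "/data/cold_backup"


def build_backup_commands(paths, backup_root, day):
    """Return the dated target directory and one `hadoop fs -get` command per path."""
    target_dir = f"{backup_root}/{day.isoformat()}"
    commands = [["hadoop", "fs", "-get", path, target_dir] for path in paths]
    return target_dir, commands


def run_backup(paths=IMPORTANT_PATHS, backup_root=BACKUP_ROOT, day=None):
    """Copy each HDFS path into today's backup directory; stop on the first failure."""
    day = day or datetime.date.today()
    target_dir, commands = build_backup_commands(paths, backup_root, day)
    os.makedirs(target_dir, exist_ok=True)
    for command in commands:
        subprocess.run(command, check=True)
```

run_backup() is meant to be called once a day from a scheduler such as cron, on a machine that has a Hadoop client configured for the cluster. Keeping one directory per day makes it easy to prune old copies, at the cost of storing full copies rather than increments.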
2016-04-19 12:17:00 hzct
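Author: dantezhao | Jianshu | CSDN | GITHUB
Recommended articles: http://dantezhao.com/readme
Homepage: http://dantezhao.com
This article may be reposted, provided the original source and author are credited with a hyperlink.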