Hi all,
I'm running 1.03 with two 500 GB drives in a RAID 1 configuration. I've been using 1.03 for the last week without any issues, until today when I received this:
Hello Administrator,
A Hard Drive Has Failed
Sincerely, Your DNS-323
I haven't powered down the box, as I'm not sure if that's the best thing to do... thought I would consult the forum before I tried anything myself. Here's the situation.
- Both drive lights are lit (green) on the box, which I assume means they are both "up"... no amber lights.
- Status in the web admin monitor displays:
Volume Name: Volume_1
Volume Type: RAID 1
Sync Time Remaining: Completed
Total Hard Drive Capacity: 490402 MB
Used Space: 368830 MB
Unused Space: 121572 MB
- cat'ing /proc/mdstat shows:
Personalities : [linear] [raid0] [raid1]
md0 : active raid1 sdb2[1] sda2[0]
      486544512 blocks [2/2] [UU]
unused devices: <none>
- Trying to mount a Samba share from OS X worked before the e-mail notification, but now I get: "The Finder cannot complete the operation because some data in "smb://pigpen" could not be read or written. (Error code -36)"
My instinct is to reboot the puppy and see what happens... but I'm not sure if there are some other diagnostic tools I should run from telnet to give me more information. For example, if a drive has failed... how can I tell which one? The lights on the front are both lit... so that doesn't tell me anything.
I just finished backing up my life on this little thing and am freaking out a little. If anyone needs any more info to help diagnose this problem, let me know and I'll try to supply it asap.
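(For anyone else hunting a failed mirror from telnet, the usual checks can be sketched roughly as below. This assumes the stock firmware's layout, /dev/md0 built from sda2/sdb2, and is guarded so it degrades gracefully on boxes without an md array.)

```shell
#!/bin/sh
# Hedged sketch: identify which disk of a RAID 1 mirror has failed.
# Assumes a stock DNS-323-style setup (/dev/md0 from sda2/sdb2).
if [ -r /proc/mdstat ]; then
    cat /proc/mdstat    # "[UU]" = both mirrors active; "[U_]" = one dropped
else
    echo "no /proc/mdstat on this system"
fi
# Per-disk state ("active sync" vs "faulty") names the bad drive:
command -v mdadm >/dev/null 2>&1 && mdadm --detail /dev/md0
# Kernel I/O errors also name the failing device (sda/sdb):
dmesg 2>/dev/null | grep -i -E 'ata[0-9]|i/o error' || true
echo "checks done"
```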
Any help/information/pointers would be greatly appreciated... thanx
Last edited by dayre (2007-04-13 05:50:30)
Offline
Some more information... I ran the mdadm --detail command and it's showing:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Apr 12 08:29:48 2007
     Raid Level : raid1
     Array Size : 486544512 (464.01 GiB 498.22 GB)
    Device Size : 486544512 (464.01 GiB 498.22 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Apr 12 20:21:38 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 16470500:1f7c0db4:bfe79507:4a9ee7ce
         Events : 0.4963

    Number   Major   Minor   RaidDevice   State
       0       8       2        0         active sync   /dev/sda2
       1       8      18        1         active sync   /dev/sdb2
Looks like everything is OK? Should I perhaps just try a reboot and see if it comes up? Has anyone else gotten a false drive-failure notification?
Last edited by dayre (2007-04-13 06:21:43)
Offline
This all looks like an intact array. I would trust these tools more than some D-Link monitoring/e-mail interface.
Offline
Thanks for the quick reply zero...
I went ahead and did a reboot and everything looks good; the mdadm --detail output is the same as above.
That was just weird. I'm not sure what happened there. The only non-standard apps I'm running are mt-daapd, ctorrent via a cron job, and utelnetd. The torrent client was running at the time... perhaps the RAID got "overloaded"? Whatever that might mean...
Whatever it was, things look OK. PHEW!
Offline
OK... weird, it happened again this morning. I got the same "A Hard Drive Has Failed" e-mail again at 2:30am. I ran mdadm --detail again and the RAID reports "clean". Samba is screwed up again, as I can't connect to the box without the "The Finder cannot complete the operation because some data in "smb://pigpen" could not be read or written. (Error code -36)" message popping up. Samba is OK after a reboot of the DNS-323.
The only thing the box was actively running was a ctorrent cron job. I am using the ctorrent script someone else had posted, which dumps status lines to a log file every minute, and had a look at the log. It showed something very interesting at 2:30am.
[Sat Apr 14 02:30:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 02:31:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:27:30 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:28:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:29:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:30:01 GMT 2007] tqueue: ctorrent already running
At 2:30am the time was corrected on the box by an hour... which I think is done by the "rtc" cron job:
32 2 * * * /usr/sbin/rtc -s
30 2 2 * * /usr/sbin/rtc -c
59 1 * * * /usr/sbin/daylight&
*/1 * * * * /mnt/HD_a2/lnx_bin/run_torrent
I'm guessing a major shift in time might screw up the RAID monitoring tool? Or perhaps it caused a major file system error from the ctorrent app, which triggered a false hard drive failure?
Any ideas on what's going on here and what, if anything, I should do to fix it? Should I have a look at the "clock drift" wiki page and apply that?
Any suggestions would be greatly appreciated!
Last edited by dayre (2007-04-14 19:59:20)
Offline
dayre wrote:
...It showed something very interesting at 2:30am.
Code:
[Sat Apr 14 02:30:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 02:31:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:27:30 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:28:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:29:01 GMT 2007] tqueue: ctorrent already running
[Sat Apr 14 01:30:01 GMT 2007] tqueue: ctorrent already running
At 2:30am the time was corrected on the box by an hour... which I think is done by the "rtc" cron job:
Code:
32 2 * * * /usr/sbin/rtc -s
30 2 2 * * /usr/sbin/rtc -c
59 1 * * * /usr/sbin/daylight&
*/1 * * * * /mnt/HD_a2/lnx_bin/run_torrent
Looking at the log, I believe the problem is occurring at 2:32am, so the suspect cron job is
32 2 * * * /usr/sbin/rtc -s
From the rtc usage help (rtc /?):
rtc - query and set the hardware clock (RTC)
Usage: rtc [function]
Functions:
  -h   show this help
  -r   read RTC time and print it
  -w   read time from system and write to rtc (SYS -> RTC)
  -s   read time from rtc and write to sys (RTC -> SYS)
  -c   increase RTC 227 sec. each month
  -d   set rtc and system time to default time 2005/01/01 00:00:00
This shows that "rtc -s" resets the system time to whatever time the RTC is holding,
so in your case, at 2:32am, your system time is being reset to 1:27am (the time the RTC is holding).
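You can see the size of that backward step with plain shell arithmetic, using the two timestamps from your log (the last entry before the jump, 02:31:01, and the first entry after it, 01:27:30):

```shell
#!/bin/sh
# Seconds-since-midnight for the log entries either side of the jump.
before=$(( 2*3600 + 31*60 + 1 ))    # 02:31:01
after=$((  1*3600 + 27*60 + 30 ))   # 01:27:30
echo "clock stepped back $(( before - after )) seconds"
# prints "clock stepped back 3811 seconds" -- about 64 minutes
```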
Try this command:
# date ; rtc -r
This will show you the time of your system clock and your hardware clock.
Your system clock is (most likely) drifting fast, and the hardware (RTC) clock is drifting slow. The longer your DNS-323 is running, the further these clocks drift apart. The clock_drift section of the wiki will help minimize this problem.
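To get a feel for how a small daily drift turns into a big step when "rtc -s" fires, here is a rough sketch; the per-day rates are made-up examples, not measured DNS-323 values:

```shell
#!/bin/sh
# Hedged sketch: accumulated gap between system clock and RTC.
# Drift rates below are assumed for illustration only.
sys_gain=2    # system clock gains ~2 s/day (assumed)
rtc_loss=1    # hardware clock loses ~1 s/day (assumed)
days=30       # uptime since the clocks last agreed
gap=$(( (sys_gain + rtc_loss) * days ))
echo "after $days days of uptime, rtc -s would step the clock back ~$gap s"
```

At those rates a box up for a couple of months would see a multi-minute step every night at 2:32am, so an hour-plus jump suggests either much worse drift or a long time since the clocks were last synced.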
//Mig
Offline