DSM-G600, DNS-3xx and NSA-220 Hack Forum

Unfortunately no one can be told what fun_plug is - you have to see it for yourself.


#1 2010-09-28 02:00:11

pnin
Member
Registered: 2010-03-12
Posts: 15

Suggestion: SDFS

While daydreaming about the features I've yet to be able to set up on my DNS-323, I came across this:

http://www.opendedup.org/

(CW article here: www.computerworld.com.au/article/340870/open_source_deduplication_software_released_linux/)

Now, I understand this to be a heavyweight feature for cloud servers, but I couldn't help wishing it were possible on my humble box.

A question to the gurus here: could a script be cooked up to run as a service that would crawl the NAS storage for bit-wise duplicates, deleting the ones at deeper levels and replacing them with symlinks to the older/uppermost-level file? Would this be reasonable?

TiA,

Last edited by pnin (2010-09-28 02:01:28)


#2 2010-09-28 02:28:59

FunFiler
Member
Registered: 2010-05-23
Posts: 577

Re: Suggestion: SDFS

A shell script could do it, but it would also be pretty easy to come up with a Perl (or more advanced) script that could do it. Just run it on a schedule.
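For the scheduling part, a crontab entry along these lines would do it. The script name is a placeholder for whatever you end up writing, and /mnt/HD_a2 is the usual first volume on the DNS-323:

  # run the (hypothetical) dedup script against volume 1 every Sunday at 03:30
  30 3 * * 0 /ffp/bin/dedup.sh /mnt/HD_a2 >> /tmp/dedup.log 2>&1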


3 * (DNS-323 with 2 * 2TB) = 12TB Running FW v1.08 & FFP v0.5
Useful Links: Transmission, Transmission Remote, Automatic


#3 2010-09-29 11:19:38

chriso
Member
Registered: 2009-03-29
Posts: 74

Re: Suggestion: SDFS

Your goals and the goals given in the article are quite different. Your goals are easier to achieve.

I'm not sure why the algorithm would link to the older/uppermost file. There is nothing that says the uppermost copy is any different from the bottommost ones, or from one on the same level for that matter. It is much easier to start a search that builds up checksums for the files, and do the linking when you hit a match. So you are just linking to the first file seen with a given checksum (filtering by size first, of course, so that you don't have to read every file every time). I might worry about the low amount of RAM on the DNS-323, since the best way to do this would be in memory in a Perl hash, and with a large number of files that might exceed what the DNS-323 can handle; pushing it out to a file instead would be slow/complicated.
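Just to make that concrete, here is a very rough shell sketch of the idea -- untested on the DNS-323 itself, it assumes md5sum/sort/ln are available from ffp or Optware, the share path is a placeholder, and unlike the Perl version it skips the size pre-filter and just hashes everything:

  #!/bin/sh
  # Hash every file under the target, sort so identical checksums sit next to
  # each other, then hard-link each duplicate to the first copy seen.
  # Filenames with odd characters (newlines, leading spaces) will break this.
  # Try it on a scratch directory before pointing it at real data.
  TARGET=${1:-/mnt/HD_a2/share}

  find "$TARGET" -type f -exec md5sum {} \; | sort |
  while read -r sum path; do
      if [ "$sum" = "$prev_sum" ]; then
          ln -f "$prev_path" "$path"    # same content: replace with a hard link
      else
          prev_sum=$sum
          prev_path=$path
      fi
  done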

Even though I some times duplicate pictures and put them in different directories to categorize them (don't trust Picasa's database) I don't know that I would ever have enough duplicate files to even justify writing the Perl script.


#4 2010-09-29 11:55:51

FunFiler
Member
Registered: 2010-05-23
Posts: 577

Re: Suggestion: SDFS

Also, not all applications support the use of symbolic links, so it would have to be implemented with caution.


3 * (DNS-323 with 2 * 2TB) = 12TB Running FW v1.08 & FFP v0.5
Useful Links: Transmission, Transmission Remote, Automatic


#5 2010-09-30 21:07:04

chriso
Member
Registered: 2009-03-29
Posts: 74

Re: Suggestion: SDFS

Except for it not being easy to see what is connected to what, hard links would be better anyway, and no application would know the difference. They can't go across file systems, but this kind of operation probably shouldn't cross file systems anyway.
Of course you really would want to do this only for "data" disks, so I'm not sure it would really matter to most applications even if they were soft links. And all of this is on the DNS-323 side if you are using SMB to export: to the OS that mounts the share there are no links, each one just looks like an ordinary copy of the file. If NFS is used for the export, I guess the mounting OS would see the links.

I'm trying to remember whether an rsync to another drive will maintain those hard links properly. I think it does; you wouldn't want it to make separate copies of the file on the backup drive.
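If I remember right, rsync only keeps hard links as hard links if you give it -H (--hard-links); plain -a does not imply it. So something like this, with made-up mount points:

  # -a for the usual archive options, -H so linked files stay linked on the backup
  rsync -aH /mnt/HD_a2/share/ /mnt/HD_b2/backup/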


#6 2010-10-01 02:53:57

pnin
Member
Registered: 2010-03-12
Posts: 15

Re: Suggestion: SDFS

Wow! Thanks for all the input, still digesting it. B)

Chriso, that explanation pretty much sums up what I've been doing by hand and Windows hasn't complained once.


#7 2010-10-01 07:24:41

chriso
Member
Registered: 2009-03-29
Posts: 74

Re: Suggestion: SDFS

pnin wrote:

Wow! Thanks for all the input, still digesting it. B)

Chriso, that explanation pretty much sums up what I've been doing by hand and Windows hasn't complained once.

The warnings about links are about applications running on the DNS-323, not on Windows.

Windows networks (using the SMB protocol) know nothing about links, so as far as Windows is concerned the links on the DNS-323 are just files (that is how SMB presents them); the magic of tracing the link back and returning the real file data all happens on the DNS-323 side.

When you do this manually, how do you determine which files to link?


#8 2010-10-01 07:38:08

jdoering
Member
Registered: 2008-04-10
Posts: 95

Re: Suggestion: SDFS

If you use hard links then I don't think the level in the hierarchy matters at all.

Would one of these do the trick? If so then no need for new scripts, etc. These tools should already handle efficient detection by size/hash with byte-by-byte final verification.

http://linux.die.net/man/1/fdupes
http://linux.die.net/man/1/hardlink

I just came across hardlink, but I have been using fdupes with the -L option to replace all duplicates with hard links (that option is missing from the outdated man page above but is included in the fdupes 1.50-PR2 that Debian packages).

I'm using it on Debian squeeze, but I imagine someone could compile it for FFP pretty easily, or maybe it's available in Optware already.
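For reference the invocation is just this (path made up, and as noted -L needs a build that actually has that option, like the 1.50-PR2 one):

  # recurse into the tree and replace every duplicate found with a hard link
  # to the first copy
  fdupes -rL /mnt/HD_a2/photos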

-Jeff


#9 2010-10-01 10:21:12

chriso
Member
Registered: 2009-03-29
Posts: 74

Re: Suggestion: SDFS

I just played with it  (in Perl) and have decided it is too dangerous for my tastes.

The reason I call it dangerous is because of something I thought about when I ran my Perl program.
If you copy A to B and then run the program, either A or B is going to be linked to the other.
Now if you decide to change B, both A and B will be changed. So heaven forbid you like to copy things and then change them later, and intended A to be a backup copy. I noticed this because I had copied some program code from one directory to another and then changed some of the files. For the files I didn't change before running this program, if I changed them in the future I would be in for a real surprise if I ever wanted the original copy back, because I would have been editing the original and the "copy" at the same time.

You have to be really sure which files you are going to run something like this over.

Also, a few more things. Symbolic links are not going to cut it, because of what happens on a delete.
Link X to real file Y.
From Windows, delete Y. On the Windows side both X and Y disappear; on the DNS-323 side X still exists and points to nothing. They both disappear on the Windows side because if SMB can't trace the link back to a real file it doesn't show X to Windows at all. If instead of deleting Y you delete X from Windows, then only X disappears, from both Windows and the DNS-323.

This is certainly not what you want. You want hard links for sure.
With a hard link, if you delete X or Y, the other will remain.
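If you want to see the difference for yourself, here is a quick test to run directly on the DNS-323 (throwaway file names, do it somewhere harmless):

  echo data > Y
  ln -s Y X_soft   # symbolic link: X_soft just points at the name "Y"
  ln Y X_hard      # hard link: X_hard and Y are two names for the same inode
  rm Y
  cat X_soft       # fails -- dangling symlink, the name it pointed to is gone
  cat X_hard       # still prints "data" -- the data survives as long as one name does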

On the subject of performance and capabilities: don't think that a "regular" C program is going to be more efficient here. I'm an expert in both C and Perl, and I can tell you that in this application Perl is going to be as efficient as or more efficient than a C program, in both memory and speed.

The test directories have about 86,000 files totalling about 14.4 GB.
The DNS-323 processed these in about 20 minutes. The speed is mostly determined by the total size of the files, but also a bit by how many there are, since opening a file is a slower operation than, say, reading X blocks of data.

The program consumed about 25% of the memory on the DNS-323. I have found that the DNS-323 does not handle swap well if its memory is maxed out; you probably need to stay under 90%. The amount of memory consumed is the hash (MD5, 16 bytes) plus the file name and size, times the number of unique files (duplicates are processed but not stored, and soft links and directories are ignored). So a rough guess at the maximum number of files (it depends on path length) is 300,000, unless a file is used to store the hash data instead of keeping it in memory (in which case processing will be MUCH slower).
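(To put numbers on that: the DNS-323 has 64 MB of RAM, so 25% is about 16 MB for those ~86,000 files, i.e. roughly 195 bytes per entry for the 16-byte hash plus path, size and overhead; 90% of 64 MB is about 57 MB, and 57 MB divided by ~195 bytes is where the ~300,000 figure comes from.)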

Well that is my take for what it is worth.

Last edited by chriso (2010-10-01 10:27:26)


#10 2010-10-01 20:03:31

jdoering
Member
Registered: 2008-04-10
Posts: 95

Re: Suggestion: SDFS

Good point on the fact that hard links result in mirrored updates by definition. This is definitely best for read-only content. IMO the best usage is to do this on directory trees that you've forced to Unix read-only permissions. That way you can't accidentally update multiple copies of the file. If you do need to update a particular duplicate file, you'd copy it and reset the permissions (thus breaking the hard link) rather than just changing the permissions directly. That works fine as long as updates are atypical. This is definitely a scenario where a filesystem with copy-on-write semantics would be a lot more powerful than hardlink-based deduplication, but for my simple case it is sufficient for eliminating redundancy in read-only backup files.
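Roughly what that looks like by hand, with made-up paths:

  find /mnt/HD_a2/archive -type f -exec chmod a-w {} \;   # make the files read-only

  # To change one copy later, copy it out, edit the copy, then move it back.
  # The move replaces just that one name and breaks the hard link; the other
  # copies keep the old content.
  cp /mnt/HD_a2/archive/a/doc.txt /tmp/doc.txt
  chmod u+w /tmp/doc.txt
  # ... edit /tmp/doc.txt ...
  mv -f /tmp/doc.txt /mnt/HD_a2/archive/a/doc.txt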

On the memory, I'd expect the optimal consumption to be a fixed function of the number of files, as the inode number would suffice instead of the file name for the matching table, wouldn't it? But based on a quick peek at the fdupes code I don't think it's optimized like that (presumably it was coded with memory-rich PCs in mind).

-Jeff


#11 2010-10-01 23:57:06

chriso
Member
Registered: 2009-03-29
Posts: 74

Re: Suggestion: SDFS

The inode is not sufficient: when you go to do the linking, you can't link from an inode to a new file name. You have to link from file name to file name, so you have to save the file path of the "principal copy". I did keep track of the inode too (I didn't mention it above, but it is small in comparison to the file name), and you really want it for optimizing. For instance, suppose you run across X with inode 7777 and save its path (and read it for its MD5 hash) just in case you have to link to it, and then you encounter Y with an inode of 7777. You know that you don't even have to read Y for its MD5 hash, because you know it is just another link to the same file.
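The shortcut is easy enough to show even in shell (this assumes GNU find's -printf is available; the point is just not to hash the same inode twice):

  TARGET=/mnt/HD_a2/share        # placeholder share path
  find "$TARGET" -type f -printf '%i %p\n' |
  while read -r inode path; do
      case " $seen " in
          *" $inode "*) continue ;;   # another name for a file already hashed
      esac
      seen="$seen $inode"
      md5sum "$path"                  # only the first name of each inode gets read
  done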

On the performance front, the speed I mentioned would only be for the first time through the data.
You could store the date/time you went through the data and only work on modified files. But you still need a copy of all the "principal copy" data (MD5 hash, file path, ...) in memory, just like on the first pass (and you should store this in the data file too).

So the performance would be MUCH faster on passes through the data after the first pass.  Memory requirements are about the same though.
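The "only work on modified files" part is the easy bit: keep a timestamp file around and let find do the filtering (paths are arbitrary):

  TARGET=/mnt/HD_a2/share         # placeholder share path
  STAMP=/mnt/HD_a2/.dedup.stamp   # somewhere to remember when the last pass ran

  # First pass: scan everything, then create the stamp with  touch "$STAMP".
  # Later passes: feed only the files changed since the stamp into the hashing.
  find "$TARGET" -type f -newer "$STAMP"
  touch "$STAMP"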

I should also mention that the original poster was using this for Windows-side data, so I don't think he would want "read-only" on the directories in question.

Last edited by chriso (2010-10-01 23:59:56)


#12 2010-11-05 13:22:29

pnin
Member
Registered: 2010-03-12
Posts: 15

Re: Suggestion: SDFS

Thanks to everyone, and especially to jdoering and chriso, for such big contributions to my understanding of the DNS-323 filesystem and filing management in general.

Right now, I'm so much more aware of the implications of what I'm doing while playing with both the Windows and the DNS CLI side of file management.

Still no one-stop solution, but thank you very much for the enlightenment... (^_^)

