Discussion: [ale] interesting problem
Jim Kinney via Ale
2018-01-11 20:04:33 UTC
Imagine a giant collection of files, several TB, of unknown directory
names and unknown directory depths at any point. From the top of that
tree, you need to cd into EVERY directory, find the symlinks in each
directory and remake them in a parallel tree on the same system but in
a different starting point. Rsync is not happy with the relative links
so that fails as each link looks to be relative to the location of the
process running rsync.

It is possible given the source of this data tree that recursive,
looping symlinks exist. Those must be recreated in the new location.

It looks like a find to list all symlinks in the entire tree then cd to
each final location to recreate is best. That can be sped up with
running multiple processes splitting the link list into sections.
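
A rough sketch of what I mean, assuming GNU find/xargs; SRC, DEST and
the -P 8 parallelism level are placeholders, not the real paths:

# For every symlink under $SRC, recreate the same link text at the same
# relative path under $DEST, even if the link dangles.
SRC=/backup/tree
DEST=/new/tree
export DEST
cd "$SRC" || exit 1
find . -type l -print0 |
  xargs -0 -P 8 -I {} sh -c '
    rel=$1
    target=$(readlink "$rel")            # raw link text, not its resolution
    mkdir -p "$DEST/$(dirname "$rel")"   # build the parallel directory
    ln -sfn "$target" "$DEST/$rel"       # recreate the link as-is
  ' sh {}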

Better ideas?
--
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/
Putnam, James M. via Ale
2018-01-11 20:23:09 UTC
Tar (with some combination of switches) may be able to do all this for you. A
quick test would tell.

Upping the block size to some multiple of the native file system block size may
let the OS DMA directly/from to user space (at least it did in SunOS/Solaris/BSD*,
not sure if Linux does that these days) which would kill some of the tar overhead.
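
Something like this is the rough shape I have in mind (GNU tar syntax,
paths are placeholders; -b is the blocking factor in 512-byte records,
so -b 2048 moves 1 MiB at a time):

# tar pipe with a large blocking factor; -p on the extract side
# preserves permissions.
(cd /backup/tree && tar -b 2048 -cf - .) | (cd /new/tree && tar -b 2048 -xpf -)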

--
James M. Putnam
Visiting Professor of Computer Science

The air was soft, the stars so fine,
the promise of every cobbled alley so great,
that I thought I was in a dream.
________________________________________
From: Ale [ale-***@ale.org] on behalf of Jim Kinney via Ale [***@ale.org]
Sent: Thursday, January 11, 2018 3:04 PM
To: Atlanta User Group (E-mail)
Subject: [ale] interesting problem

Imagine a giant collection of files, several TB, of unknown directory names and unknown directory depths at any point. From the top of that tree, you need to cd into EVERY directory, find the symlinks in each directory and remake them in a parallel tree on the same system but in a different starting point. Rsync is not happy with the relative links so that fails as each link looks to be relative to the location of the process running rsync.

It is possible given the source of this data tree that recursive, looping symlinks exist. That must be recreated in the new location.

It looks like a find to list all symlinks in the entire tree then cd to each final location to recreate is best. That can be sped up with running multiple processes splitting the link list into sections.

Better ideas?

--

James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/

Jim Kinney via Ale
2018-01-11 20:57:56 UTC
James and Ed,
One thing I've found is a HUGE number of bad symlinks - most are simply
pointing back to a non-existent source file (user deletion likely)
while a second batch has a source file tree that appears to have been
moved after the links were made. Additionally, links were made
(attempted) from <old dir>/foo* to ../<new dir>/foo* (literal * in the
names!) in what looks like an attempt to mass link to a collection of
similarly named files in folders. Yeah. _that_ worked :-) NOT!
<sigh>
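
For what it's worth, the dangling ones show up with something like this
(GNU find assumed; the path is a placeholder):

# -xtype l matches links whose target can no longer be resolved;
# find's default -P behaviour means symlink loops aren't followed.
find /backup/tree -xtype l > dangling-links.txt
wc -l < dangling-links.txt
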
find is being quite helpful. I didn't think about the old tar | tar
process. Tar may be a better way to move the actual data and the links
(that are real) as it's blissfully ignorant and low level enough.
Thanks!
Some further background: The destination is a gluster storage cluster
mounted on a machine that also has 4 4TB drives attached (RAID5
backup) as the source. The glusterfs reports a zillion or so issues and
all seem to involve symlinks. The rsync from backup raid to new storage
space reported a zillion issues with symlinks and IO errors. Of course
sysadmin panic sets in with IO errors > 0 and especially > 7000!
As soon as I get the error numbers to 0, I can reconfig to support a
third machine with the checksum bricks and add more storage overall.
This is the last of the repairs from the RAID6 3-drive crash that
trashed the one node (all 100+TB) last August.
Post by Putnam, James M. via Ale
Tar (with some combination of switches) may be able to do all this for you. A
quick test would tell.
Upping the block size to some multiple of the native file system block size may
let the OS DMA directly/from to user space (at least it did in SunOS/Solaris/BSD*,
not sure if Linux does that these days) which would kill some of the tar overhead.
--
James M. Putnam
Visiting Professor of Computer Science
The air was soft, the stars so fine,
the promise of every cobbled alley so great,
that I thought I was in a dream.
________________________________________
From: Ale [ale-***@ale.org] on behalf of Jim Kinney via Ale [***@ale.org]
Sent: Thursday, January 11, 2018 3:04 PM
To: Atlanta User Group (E-mail)
Subject: [ale] interesting problem
Imagine a giant collection of files, several TB, of unknown directory
names and unknown directory depths at any point. From the top of that
tree, you need to cd into EVERY directory, find the symlinks in each
directory and remake them in a parallel tree on the same system but
in a different starting point. Rsync is not happy with the relative
links so that fails as each link looks to be relative to the location
of the process running rsync.
It is possible given the source of this data tree that recursive,
looping symlinks exist. That must be recreated in the new location.
It looks like a find to list all symlinks in the entire tree then cd
to each final location to recreate is best. That can be sped up with
running multiple processes splitting the link list into sections.
Better ideas?
--
James P. Kinney III Every time you stop a school, you will have to
build a jail. What you gain at one end you lose at the other. It's
like feeding a dog on his own tail. It won't fatten the dog. - Speech
11/23/1900 Mark Twain http://heretothereideas.blogspot.com/
--
James P. Kinney III

Every time you stop a school, you will have to build a jail. What you
gain at one end you lose at the other. It's like feeding a dog on his
own tail. It won't fatten the dog.
- Speech 11/23/1900 Mark Twain

http://heretothereideas.blogspot.com/
Ed Cashin via Ale
2018-01-11 20:23:30 UTC
Can you confirm that this doesn't work for some reason? (What's the
reason?)

(cd $SOURCE_DIR && tar cf -) | (cd $DEST_DIR && tar xf -)

Also, cpio is surprisingly useful in situations like this, because you can
use the find command to feed it the names of the things you want to
transfer.
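
Roughly like this, as a sketch (GNU cpio assumed for -0/--null; the
directories are placeholders):

# find picks what to transfer; cpio -p (pass-through) recreates it under
# the new root. -d makes directories, -m keeps mtimes, and symlinks are
# copied as links rather than followed.
cd /backup/tree &&
  find . -depth -print0 | cpio -0 -pdm /new/tree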

Also also, I cannot help but mention that if the stuff being transferred
has tons of huge sparse files, BSD tar is crucial. Contrary to docs, rsync
doesn't handle sparse files the way you'd hope. (Not the versions I tried
last year, anyway.)
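
If it helps, this is the kind of pipe I mean, using bsdtar (paths are
placeholders; per the point above, bsdtar is the one that keeps the holes):

# bsdtar detects the holes when building the stream and recreates the
# files sparse on extraction; -p preserves permissions.
bsdtar -cf - -C /backup/tree . | bsdtar -xpf - -C /new/tree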
Post by Jim Kinney via Ale
Imagine a giant collection of files, several TB, of unknown directory
names and unknown directory depths at any point. From the top of that tree,
you need to cd into EVERY directory, find the symlinks in each directory
and remake them in a parallel tree on the same system but in a different
starting point. Rsync is not happy with the relative links so that fails as
each link looks to be relative to the location of the process running rsync.
It is possible given the source of this data tree that recursive, looping
symlinks exist. That must be recreated in the new location.
It looks like a find to list all symlinks in the entire tree then cd to
each final location to recreate is best. That can be sped up with running
multiple processes splitting the link list into sections.
Better ideas?
--
James P. Kinney III Every time you stop a school, you will have to build a
jail. What you gain at one end you lose at the other. It's like feeding a
dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark
Twain http://heretothereideas.blogspot.com/
--
Ed Cashin <***@noserose.net>
Steve Litt via Ale
2018-01-12 05:21:49 UTC
On Thu, 11 Jan 2018 15:04:33 -0500
Post by Jim Kinney via Ale
Imagine a giant collection of files, several TB, of unknown directory
names and unknown directory depths at any point. From the top of that
tree, you need to cd into EVERY directory, find the symlinks in each
directory and remake them in a parallel tree on the same system but in
a different starting point. Rsync is not happy with the relative links
so that fails as each link looks to be relative to the location of the
process running rsync.
I can't exactly visualize what you want to do with the discovered
symlinks, but my first thought would be to write a treewalker program.
It's pretty easy to write a C tree-walker program that performs a
specific action upon encountering a symlink. Because you wrote it in
C, it will be as fast as its algorithm. Depth-first tree walker
algorithms are pretty darn fast.

HTH,

SteveT

Steve Litt
January 2018 featured book: Troubleshooting: Why Bother?
http://www.troubleshooters.com/twb
Jim Kinney via Ale
2018-01-12 12:25:06 UTC
It's actually looking like a far simpler problem than originally surmised.

Originally I thought the bad symlinks were due to a relative path being outside of the rsync scope of operation. It's looking more like the original source just has tens of thousands of dangling symlinks that were created by mistakes (mostly by students) and resolved/cleaned before the backup occurred.

Find is working. So is the symlinks command.
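
Roughly the invocations involved, as a sketch (the path is a placeholder):

# List the broken links with find, then let the symlinks utility report
# on them; -r recurses, and -d (commented out here) deletes dangling links.
find /backup/tree -xtype l > dangling.txt
symlinks -r /backup/tree
# symlinks -r -d /backup/tree    # destructive: removes the dangling links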
Post by Steve Litt via Ale
On Thu, 11 Jan 2018 15:04:33 -0500
Post by Jim Kinney via Ale
Imagine a giant collection of files, several TB, of unknown directory
names and unknown directory depths at any point. From the top of that
tree, you need to cd into EVERY directory, find the symlinks in each
directory and remake them in a parallel tree on the same system but in
a different starting point. Rsync is not happy with the relative links
so that fails as each link looks to be relative to the location of the
process running rsync.
I can't exactly visualize what you want to do with the discovered
symlinks, but my first thought would be to write a treewalker program.
It's pretty easy to write a C tree-walker program that performs a
specific action upon encountering a symlink. Because your wrote it in
C, it will be as fast as its algorithm. Depth-first tree walker
algorithms are pretty darn fast.
HTH,
SteveT
Steve Litt
January 2018 featured book: Troubleshooting: Why Bother?
http://www.troubleshooters.com/twb
--
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.