Copying files with specific size to other directory - shell

It's an interview question. The interviewer asked this "basic" shell script question after he understood that I don't have experience in shell scripting. Here is the question.
Copy files from one directory which has size greater than 500 K to another directory.
I could do it immediately in C, but it seems difficult in a shell script since I have never tried one. I am familiar with basic Unix commands, so I gave it a go, but all I managed was to extract the matching file names using the command below.
du -sk * | awk '{ if ($1>500) print $2 }'
Also, please recommend a good book of shell script examples.

It can be done in several ways. I'd try and use find:
find $FIRSTDIRECTORY -size +500k -exec cp {} $SECONDDIRECTORY \;
To limit to the current directory, use -maxdepth option.
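Spelled out with quoting and limited to the top level, that might look like this (a sketch; the directory paths are placeholders, and -maxdepth assumes GNU or BSD find):
FIRSTDIRECTORY=/path/to/source    # placeholder paths
SECONDDIRECTORY=/path/to/dest
find "$FIRSTDIRECTORY" -maxdepth 1 -type f -size +500k \
    -exec cp {} "$SECONDDIRECTORY" \;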

du recurses into subdirectories, which is probably not desired (you could have asked for clarification if that point was ambiguous). More likely you were expected to use ls -l or ls -s to get the sizes.
But what you did works to select some files and print their names, so let's build on it. You have a command that outputs a list of names. You need to put the output of that command into the command line of a cp. If your du|awk outputs this:
Makefile
foo.c
bar.h
you want to run this:
cp Makefile foo.c bar.h otherdirectory
The way you do that is with command substitution, which is written as $(...), like this:
cd firstdirectory
cp $(du -sk * | awk '{ if ($1>500) print $2 }') otherdirectory
And that's a functioning script. The du|awk command runs first, and its output is used to build the cp command. There are a lot of subtle drawbacks that would make it unsuitable for general use, but that's how beginner-level shell scripts usually are.
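If you wanted to paper over the most common of those drawbacks, word splitting on names containing spaces, one variant reads the du output line by line instead; a sketch, assuming GNU du, which separates the size and the name with a tab (names containing newlines will still misbehave):
cd firstdirectory
du -sk -- * | awk -F'\t' '$1 > 500 { print $2 }' |
while IFS= read -r name; do
    cp -- "$name" otherdirectory/
done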

find . -mindepth 1 -maxdepth 1 -type f -size +BYTESc -exec cp -t DESTDIR {} +
The c suffix on the size is essential; the size is in bytes. Otherwise, you get probably-unexpected rounding behaviour in determining the result of the -size check. If the copying is meant to be recursive, you will need to take care of creating any destination directory also.
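As a concrete example, taking the interview question's 500 K to mean 500 KiB (512000 bytes), a non-recursive version might look like this; the paths are placeholders, and -t on cp assumes GNU coreutils:
SRC=/path/to/source    # placeholder paths
DEST=/path/to/dest
find "$SRC" -mindepth 1 -maxdepth 1 -type f -size +512000c -exec cp -t "$DEST" {} +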

Related

Copying a file into multiple directories in bash

I have a file I would like to copy into about 300,000 different directories; these are themselves split between two directories, e.g.
DirA/Dir001/
...
DirB/Dir149000/
However when I try:
cp file.txt */*
It returns:
bash: /bin/cp: Argument list too long
What is the best way of copying a file into multiple directories, when you have too many to use cp?
The answer to the question as asked is find.
find . -mindepth 2 -maxdepth 2 -type d -exec cp script.py {} \;
But of course @triplee is right... why make so many copies of a file?
You could, of course, instead create links to the file...
find . -mindepth 2 -maxdepth 2 -type d -exec ln script.py {} \;
The options -mindepth 2 -maxdepth 2 limit the recursive search of find to elements exactly two levels deep from the current directory (.). The -type d matches all directories. -exec then executes the command (up to the closing \;), for each element found, replacing the {} with the name of the element (the two-levels-deep subdirectory).
The links created are hard links. That means if you edit the script in one place, the change will be visible in all the places. The script is, for all intents and purposes, in all the places at once, with none of them being any less "real" than the others. (This concept can be surprising to those not used to it.) Use ln -s if you instead want to create "soft" links, which are mere references to "the one, true" script.py in the original location.
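A quick way to see this for yourself (the file names here are purely illustrative):
echo 'original text' > demo.txt
ln demo.txt demo_link.txt
ls -li demo.txt demo_link.txt   # same inode number: two names, one file
echo 'a change' >> demo.txt
cat demo_link.txt               # the change is visible through the other name too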
The beauty of find ... -exec ... {}, as opposed to many other ways to do it, is that it will work correctly even for filenames with "funny" characters in them, including but not limited to spaces or newlines.
But still, you should really only need one script. You should fix the part of your project where you need that script in every directory; that is the broken part...
Extrapolating from the answer to your other question you seem to have code which looks something like
for TGZ in $(find . -name "file.tar.gz")
do
    mkdir -p work
    cd work
    tar xzf $TGZ
    python script.py
    cd ..
    rm -rf work
done
Of course, the trivial fix is to replace
python script.py
with
python ../script.py
and voilà, you no longer need a copy of the script in each directory at all.
I would further advise refactoring out the cd and changing script.py so you can pass it the directory to operate on as a command-line argument. (Briefly, import sys and examine the value of sys.argv[1], though you'll often want option parsing and support for multiple arguments; argparse from the Python standard library is slightly intimidating, but there are friendly third-party wrappers like click.)
As an aside, many beginners seem to think the location of your executable is going to be the working directory when it executes. This is obviously not the case; otherwise /bin/ls would only list files in /bin.
To get rid of the cd problem mentioned in a comment, a minimal fix is
for tgz in $(find . -name "file.tar.gz")
do
    mkdir -p work
    tar -C work -x -z -f "$tgz"
    (cd work; python ../script.py)
    rm -rf work
done
Again, if you can change the Python script so it doesn't need its input files in the current directory, this can be simplified further. Notice also the preference for lower case for your variables, and the use of quoting around variables which contain file names. The use of find in a command substitution is still slightly broken (it can't work for file names which contain whitespace or shell metacharacters) but maybe that's a topic for a separate question.
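For instance, if script.py were changed to accept the extracted directory as a command-line argument (hypothetical, as suggested above), the loop could drop the cd entirely and become whitespace-safe at the same time; a sketch assuming bash and GNU find:
find . -name "file.tar.gz" -print0 |
while IFS= read -r -d '' tgz; do
    work=$(mktemp -d) || exit 1
    tar -C "$work" -xzf "$tgz"
    python script.py "$work"    # assumes script.py now takes a directory argument
    rm -rf "$work"
done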

shell-script -cd in all subdirecories of a directory, execute command on their files

I am new to bash and I am trying to cd into every subdirectory of a parent directory and execute a command on all the files those subdirectories contain, but it's not working.
for subdir in $parentdirectory
do
    for file in $subdir
    do
        ngram -lm somefilename.lm -ppl $file
    done
done
There's many ways to do this, but one would require you to explicitly change to that directory. Assuming $parentdirectory is correctly initialized, then you could look into something like:
for subdir in ${parentdirectory}
do
    cd ${subdir}   # go into the subdir
    for file in *  # glob expansion
    do
        ngram -lm somefilename.lm -ppl ${file}
    done
    cd ..          # go back up
done
Also have a look at the excellent Advanced Bash-Scripting Guide: http://tldp.org/LDP/abs/html/loops1.html
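If you prefer to keep the explicit loop, a slightly safer variant runs each iteration in a subshell, so a failed cd can't leave you in the wrong directory and there is no need to cd back up; a sketch, assuming $parentdirectory names a single directory whose immediate subdirectories hold the files:
for subdir in "${parentdirectory}"/*/; do
    (
        cd "$subdir" || exit
        for file in *; do
            ngram -lm somefilename.lm -ppl "$file"
        done
    )
done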
If you're wanting to do this with a small amount of space, you could do something using find -exec.
Such as:
# add a file called foo into every subdirectory
find . -type d -exec sh -c 'touch "$0/foo"' {} \;
Or, if you wanted to echo a string into each of those files you just created:
# find all files and append 'ABC' into them
find . -type f -exec sh -c 'echo "ABC" >> "$0"' {} \;
The find -exec combo is an extremely powerful tool that can save you on a bit of directory / file navigation, and allows you to achieve what it sounds like is the desired functionality without having to play descend/ascend through the directory structure.
Also, as you can probably guess, this kind of thing can go horribly wrong if you're not careful, so use with great caution.
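Applied to the command from the question, a find-based version that skips the nested loops entirely might look like this; a sketch, assuming every regular file sitting directly inside the subdirectories should be processed and that somefilename.lm resolves correctly from where you run it:
find "$parentdirectory" -mindepth 2 -maxdepth 2 -type f \
    -exec ngram -lm somefilename.lm -ppl {} \;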

mv Bash Shell Command (on Mac) overwriting files even with a -i?

I am flattening a directory of nested folders/picture files down to a single folder. I want to move all of the nested files up to the root level.
There are 3,381 files (no directories included in the count). I calculate this number using these two commands and subtracting the directory count (the second command):
find ./ | wc -l
find ./ -type d | wc -l
To flatten, I use this command:
find ./ -mindepth 2 -exec mv -i -v '{}' . \;
Problem is that when I get a count after running the flatten command, my count is off by 46. After going through the list of files before and after (I have a backup), I found that the mv command is overwriting files sometimes even though I'm using -i.
Here's details from the log for one of these files being overwritten...
.//Vacation/CIMG1075.JPG -> ./CIMG1075.JPG
..more log
..more log
..more log
.//dog pics/CIMG1075.JPG -> ./CIMG1075.JPG
So I can see that it is overwriting. I thought -i was supposed to stop this. I also tried -n and got the same count. Note, I do have about 150 duplicate filenames; I was going to rename those manually after I flattened everything I could.
Is it a timing issue?
Is there a way to resolve?
NOTE: it is prompting me that some of the files are overwrites. On those prompts I just press Enter so as not to overwrite. In the case above, there is no prompt. It just overwrites.
Apparently the manual entry clearly states:
The -n and -v options are non-standard and their use in scripts is not recommended.
In other words, you should mimic the -n option yourself. To do that, just check if the file exists and act accordingly. In a shell script where the file is supplied as the first argument, this could be done as follows:
[ -f "${1##*/}" ]
The first argument is the full path of the file; the leading directories can be stripped with ${1##*/}, leaving just the base name to test for in the current directory. Now simply chain the mv with ||, since we only want it to run when the file doesn't already exist.
[ -f "${1##*/}" ] || mv "$1" .
Using this, you can edit your find command as follows:
find ./ -mindepth 2 -exec bash -c '[ -f "${0##*/}" ] || mv "$0" .' '{}' \;
Note that we now use $0 because of the bash -c usage. Its first argument, $0, can't be the script name because we have no script. This means the argument order is shifted with respect to a usual shell script.
Why not check whether the file exists prior to moving? Then you can leave the file where it is, or you can rename it or do something else. test -f (or [ ]) should do the trick. I am on a tablet and cannot easily include the source.
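Spelled out, that suggestion might look like the sketch below, built on the same find as above. Because the existence check happens immediately before each mv, it also protects two nested files with the same name from silently clobbering each other in a single run:
find ./ -mindepth 2 -type f -exec sh -c '
    for f in "$@"; do
        [ -e "./${f##*/}" ] || mv "$f" .
    done
' sh {} +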

Unix Shell scripting for copying files and creating directory

I have a source directory eg /my/source/directory/ and a destination directory eg /my/dest/directory/, which I want to mirror with some constraints.
I want to copy files which meet certain criteria of the find command, eg -ctime -2 (less than 2 days old) to the dest directory to mirror it
I want to include some of the prefix so I know where it came from, eg /source/directory
I'd like to do all this with absolute paths so it doesn't depend which directory I run from
I'd guess not having cd commands is good practice too.
I want the subdirectories created if they don't exist
So
/my/source/directory/1/foo.txt -> /my/dest/directory/source/directory/1/foo.txt
/my/source/directory/2/3/bar.txt -> /my/dest/directory/source/directory/2/3/bar.txt
I've hacked together the following command line but it seems a bit ugly, can anyone do better?
find /my/source/directory -ctime -2 -type f -printf "%P\n" | xargs -IFILE rsync -avR /my/./source/directory/FILE /my/dest/directory/
Please comment if you think I should add this command line as an answer myself, I didn't want to be greedy for reputation.
This is remarkably similar to a (closed) question: Bash scripting copying files without overwriting. The answer I gave cites the 'find | cpio' solution mentioned in other answers (minus the time criteria, but that's the difference between 'similar' and 'same'), and also outlines a solution using GNU 'tar'.
ctime
When I tested on Solaris, neither GNU tar nor (Solaris) cpio was able to preserve the ctime setting; indeed, I'm not sure that there is any way to do that. For example, the touch command can set the atime or the mtime or both - but not the ctime. The utime() system call also only takes the mtime or atime values; it does not handle ctime. So, I believe that if you find a solution that preserves ctime, that solution is likely to be platform-specific. (Weird example: hack the disk device and edit the data in the inode - not portable, requires elevated privileges.) Rereading the question, though, I see that 'preserving ctime' is not part of the requirements (phew); it is simply the criterion for whether the file is copied or not.
chdir
I think that the 'cd' operations are necessary, but they can be wholly localized to the script or command line, as illustrated in the question cited and in the command lines below, the second of which assumes GNU tar.
(cd /my; find source/directory -ctime -2 | cpio -pvdm /my/dest/directory)
(cd /my; find source/directory -ctime -2 | tar -cf - -T - ) |
(cd /my/dest/directory; tar -xf -)
Without using chdir() (aka cd), you need specialized tools or options to handle the manipulation of the pathnames on the fly.
Names with blanks, newlines, etc
The GNU-specific 'find -print0' and 'xargs -0' are very powerful and effective, as noted by Adam Hawes. Funnily enough, GNU cpio has an option to handle the output from 'find -print0', and that is '--null' or its short form '-0'. So, using GNU find and GNU cpio, the safe command is:
(cd /my; find source/directory -ctime -2 -print0 |
cpio -pvdm0 /my/dest/directory)
Note: this does not overwrite pre-existing files under the backup directory. Add -u to the cpio command for that.
Similarly, GNU tar supports --null (apparently with no -0 short-form), and could also be used:
(cd /my; find source/directory -ctime -2 -print0 | tar -cf - --null -T - ) |
(cd /my/dest/directory; tar -xf -)
The GNU handling of file names with the null terminator is extremely clever and a valuable innovation (though I only became aware of it fairly recently, courtesy of SO; it has been in GNU tar for at least a decade).
You could try cpio using the copy-pass mode, -p. I usually use it with overwrite all (-u), create directories (-d), and maintain modification time (-m).
find myfiles | cpio -pmud target-dir
Keep in mind that find should produce relative path names, which doesn't fit your absolute path criteria. This could of course be 'solved' using cd, which you also don't like (why not?).
(cd mypath; find myfiles | cpio ... )
The parentheses spawn a subshell and keep the state change (i.e. the directory switch) local. You could also define a small procedure to abstract away the 'ugliness'.
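Such a wrapper might look like the sketch below (the function name is made up, and it borrows the -print0/-0 pairing from the earlier answer, so it assumes GNU find and GNU cpio):
copy_recent() {    # usage: copy_recent SRC_ROOT SRC_SUBDIR DEST
    ( cd "$1" && find "$2" -ctime -2 -type f -print0 |
        cpio -pvdm0 "$3" )
}
copy_recent /my source/directory /my/dest/directory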
If you're using find, always use -print0 and pipe the output through xargs -0; well, almost always. The first file with a space in its name will bork the script if you rely on find's default newline-terminated output.
I agree with all the other posters - use cpio or tar if you can. It'll do what you want and save the hassle.
An alternative is to use tar,
(cd $SOURCE; tar cf - .) | (cd $DESTINATION; tar xf -)
EDIT:
Ah, I missed the bit about preserving CTIME. I believe most implementations of tar will preserve mtime, but if preserving ctime is critical, then cpio is indeed the only way.
Also, some tar implementations (GNU tar being one) can select the files to include based on atime and mtime, though seemingly not ctime.
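For example, with GNU tar you could select on mtime directly, no find needed; a sketch only, and note that this keys on mtime rather than ctime, so it is merely an approximation of the original criterion:
(cd /my && tar -cf - --newer-mtime='2 days ago' source/directory) |
    (cd /my/dest/directory && tar -xf -)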
#!/bin/sh
SRC=/my/source/directory
DST=/my/dest/directory
for i in $(find $SRC -ctime -2 -type f) ; do
    SUBDST=$DST$(dirname $i)
    mkdir -p $SUBDST
    cp -p $i $SUBDST
done
And I suppose, since you want to include "where it came from", that you are going to use different source directories. This script can be modified to take the source dir as an argument simply by replacing SRC=/my/source/directory with SRC=$1.
EDIT: Removed redundant if statement.
Does not work when filenames have whitespaces.
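A whitespace-safe variant of the same idea, as a sketch; it assumes bash (for read -d '') and GNU find for -print0:
#!/bin/bash
SRC=/my/source/directory
DST=/my/dest/directory
find "$SRC" -ctime -2 -type f -print0 |
while IFS= read -r -d '' f; do
    subdst=$DST$(dirname "$f")
    mkdir -p "$subdst"
    cp -p "$f" "$subdst"
done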
#!/usr/bin/sh
# script to copy files with same directory structure
echo "Please enter Full Path of Source DIR (Starting with / and ending with /):"
read spath
echo "Please enter Full Path of Destination location (Starting with / and ending with /):"
read dpath
si=$(echo "$spath" | awk -F/ '{print NF-1}')
for fname in $(find $spath -type f -print)
do
    cdir=$(echo $fname | awk -F/ '{ for (i='$si'; i<NF; i++) printf "%s/", $i; printf "\n"; }')
    if [ $cdir ]; then
        if [ ! -d "$dpath$cdir" ]; then
            mkdir -p $dpath$cdir
        fi
    fi
    cp $fname $dpath$cdir
done

Quick ls command

I've got to get a directory listing that contains about 2 million files, but when I do an ls command on it nothing comes back. I've waited 3 hours. I've tried ls | tee directory.txt, but that seems to hang forever.
I assume the server is doing a lot of inode sorting. Is there any way to speed up the ls command to just get a directory listing of filenames? I don't care about size, dates, permission or the like at this time.
ls -U
will do the ls without sorting.
Another source of slowness is --color. On some Linux machines there is a convenience alias which adds --color=auto to the ls call, making it look up file attributes for each file found (slow) in order to color the display. This can be avoided with ls -U --color=never or \ls -U.
I have a directory with 4 million files in it and the only way I got ls to spit out files immediately without a lot of churning first was
ls -1U
Try using:
find . -type f -maxdepth 1
This will only list the files in the directory, leave out the -type f argument if you want to list files and directories.
This question seemed interesting, so I went through the multiple answers that were posted. To gauge the efficiency of the answers, I ran them on 2 million files and got the results below.
$ time tar cvf /dev/null . &> /tmp/file-count
real 37m16.553s
user 0m11.525s
sys 0m41.291s
------------------------------------------------------
$ time echo ./* &> /tmp/file-count
real 0m50.808s
user 0m49.291s
sys 0m1.404s
------------------------------------------------------
$ time ls &> /tmp/file-count
real 0m42.167s
user 0m40.323s
sys 0m1.648s
------------------------------------------------------
$ time find . &> /tmp/file-count
real 0m2.738s
user 0m1.044s
sys 0m1.684s
------------------------------------------------------
$ time ls -U &> /tmp/file-count
real 0m2.494s
user 0m0.848s
sys 0m1.452s
------------------------------------------------------
$ time ls -f &> /tmp/file-count
real 0m2.313s
user 0m0.856s
sys 0m1.448s
------------------------------------------------------
To summarize the results:
The ls -f command ran a bit faster than ls -U. Disabling color might have caused this improvement.
The find command came third, averaging 2.738 seconds.
Running plain ls took 42.16 seconds; on my system ls is an alias for ls --color=auto.
Using the shell expansion feature with echo ./* ran for 50.80 seconds.
And the tar-based solution took about 37 minutes.
All tests were run separately while the system was idle. One important thing to note is that the file lists were not printed to the terminal; they were redirected to a file, and the file count was calculated later with the wc command. Commands ran far too slowly if the output was printed on the screen. Any ideas why this happens?
This would be the fastest option AFAIK: ls -1 -f.
-1 (No columns)
-f (No sorting)
Using
ls -1 -f
is about 10 times faster and it is easy to do (I tested with 1 million files, but my original problem had 6 800 000 000 files)
But in my case I needed to check whether a specific directory contains more than 10 000 files. If there were more than 10 000 files, I am no longer interested in exactly how many there are; I just quit the program so that it runs faster and won't try to read the rest one by one. If there are fewer than 10 000, I print the exact amount. The speed of my program is quite similar to ls -1 -f if you specify a bigger value for the parameter than the number of files.
You can run my program find_if_more.pl in the current directory by typing:
find_if_more.pl 999999999
If you are just interested in whether there are more than n files, the script will finish faster than ls -1 -f does with a very large number of files.
#!/usr/bin/perl
use warnings;
my ($maxcount) = @ARGV;
my $dir = '.';
my $filecount = 0;
if (not defined $maxcount) {
    die "Need maxcount\n";
}
opendir(DIR, $dir) or die $!;
while (my $file = readdir(DIR)) {
    $filecount = $filecount + 1;
    last if $filecount > $maxcount;
}
print $filecount;
closedir(DIR);
exit 0;
You can redirect output and run the ls process in the background.
ls > myls.txt &
This would allow you to go about your business while it's running, and it wouldn't lock up your shell.
I'm not sure what options there are for running ls and getting less data back. You could always run man ls to check.
This is probably not a helpful answer, but if you don't have find you may be able to make do with tar
$ tar cvf /dev/null .
I am told by people older than me that, "back in the day", single-user and recovery environments were a lot more limited than they are nowadays. That's where this trick comes from.
I'm assuming you are using GNU ls?
try
\ls
It will bypass the usual alias (typically ls --color=auto).
If a process "doesn't come back", I recommend strace to analyze how a process is interacting with the operating system.
In case of ls:
$ strace ls
you would see that it reads all the directory entries (getdents(2)) before it actually outputs anything (the sorting already mentioned here).
Things to try:
Check ls isn't aliased?
alias ls
Perhaps try find instead?
find . \( -type d -name . -prune \) -o \( -type f -print \)
Hope this helps.
Some followup:
You don't mention what OS you're running on, which would help indicate which version of ls you're using. This probably isn't a 'bash' question as much as an ls question. My guess is that you're using GNU ls, which has some features that are useful in some contexts, but kill you on big directories.
GNU ls tries to arrange the output into neat columns, which means it has to read and hold all the filenames before printing anything. In a huge directory, this will take some time, and memory.
To 'fix' this, you can try:
ls -1 # no columns at all
Find BSD ls somewhere (e.g. http://www.freebsd.org/cgi/cvsweb.cgi/src/bin/ls/) and use that on your big directories.
Use other tools, such as find
There are several ways to get a list of files:
Use this command to get a list without sorting:
ls -U
or send the list of files to a file by using:
ls /Folder/path > ~/Desktop/List.txt
What partition type are you using?
With millions of small files in one directory, it might be a good idea to use JFS or ReiserFS, which have better performance with many small files.
How about find ./ -type f (which will find all files in the current directory)? Take off the -type f to find everything.
You should provide information about what operating system and the type of filesystem you are using. On certain flavours of UNIX and certain filesystems you might be able to use the commands ff and ncheck as alternatives.
I had a directory with timestamps in the file names. I wanted to check the date of the latest file and found find . -type f -maxdepth 1 | sort | tail -n 1 to be about twice as fast as ls -alh.
Lots of other good solutions here, but in the interest of completeness:
echo *
You can also make use of xargs. Just pipe the output of ls through xargs.
ls | xargs
If that doesn't work and the find examples above aren't working either, try piping them through xargs, as it can help with the memory usage that might be causing your problems.
