using pipes with a find command - bash

I have a series of delimited files, some of which have some bad data and can be recognized by doing a column count on them. I can find them with the following command:
find ./ -name 201201*gz -mtime 12
They are all gzipped and I do not want to un-archive them all. So to check the column counts, I've been running this as a second command on each file:
zcat ./path/to/file.data | awk '{print NF}' | head
I know I can run a command on each file through find with -exec, but how can I also get it to run through the pipes? A couple of things I tried, neither of which I expected to work and neither of which did:
find ./ -name 201201*gz -mtime 12 -print -exec zcat {} \; | awk '{print NF}'| head
find ./ -name 201201*gz -mtime 12 -print -exec "zcat {} | awk '{print NF}'| head" \;

I'd use an explicit loop approach:
find . -name "201201*gz" -mtime 12 | while IFS= read -r file; do
    echo "$file: "
    zcat "$file" | awk '{print NF}' | head
done
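If the file names might contain spaces or other special characters, a null-delimited variant of the same loop is safer (just a sketch of the idea):
find . -name "201201*gz" -mtime 12 -print0 | while IFS= read -r -d '' file; do
    echo "$file: "
    zcat "$file" | awk '{print NF}' | head
done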

More or less, you pipe find's output through other commands like this:
find . -name "foo" -print0 | xargs -0 echo
So your command would look like:
find ./ -name "201201*gz" -mtime 12 -print0 | xargs -0 zcat | awk '{print NF}'| head
-print0 and xargs -0 just help to make sure files with special characters don't break the pipe.
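If you want the awk | head pipeline to run per file (so head limits the output for each file rather than for the combined stream), one option - sketched here, untested against your data - is to have find start a small shell for each file:
find ./ -name "201201*gz" -mtime 12 -exec sh -c 'echo "$1:"; zcat "$1" | awk "{print NF}" | head' _ {} \;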

Related

Bash command: head

I am trying to find all files matching dummy* in the folder named dummy. Then I need to sort them according to time of creation and get the first 10 files. The command I am trying is:
find -L /home/myname/dummy/dummy* -maxdepth 0 -type f -printf '%T# %p\n' | sort -n | cut -d' ' -f 2- | head -n 10 -exec readlink -f {} \;
But this doesn't seem to work with the following error:
head: invalid option -- 'e'
Try 'head --help' for more information.
How do I make bash not read -exec as part of the head command?
UPDATE1:
Tried the following:
find -L /home/myname/dummy/dummy* -maxdepth 0 -type f -exec readlink -f {} \; -printf '%T# %p\n' | sort -n | cut -d' ' -f 2- | head -n 10
But the result is not sorted by timestamp, because both -exec readlink and -printf print file names, and sort processes all of that output together.
Files in dummy are as follows:
dummy1, dummy2, dummy3, etc. This is the order in which they were created.
How do I make bash not read -exec as part of the head command?
The -exec and subsequent arguments appear intended to be directed to find. The find command stops at the first |, so you would need to move those arguments ahead of that:
find -L /home/myname/dummy/dummy* -maxdepth 0 -type f -printf '%T# %p\n' -exec readlink -f {} \; | sort -n | cut -d' ' -f 2- | head -n 10
However, it doesn't make much sense to both -printf file details and -exec readlink the results. Possibly you wanted to run readlink on each filename that makes it past head. In that case, you might want to look into the xargs command, which serves exactly the purpose of converting data read from the standard input into arguments to a command. For example:
find -L /home/myname/dummy/dummy* -maxdepth 0 -type f -printf '%T# %p\n' |
sort -n |
cut -d' ' -f 2- |
head -n 10 |
xargs -rd '\n' readlink -f
I think you are over-complicating things here. Using just ls and head should get you the results you want:
ls -lt /home/myname/dummy/dummy* | head -10
To sort by ctime specifically, use the -c flag for ls:
ls -ltc /home/myname/dummy/dummy* | head -10

What is the correct Linux command using find, grep and sort?

I am writing a command using find, grep and sort to display a sorted list of all files that contain 'some-text'.
I was unable to figure out the command.
Here is my attempt:
$ find . -type f | grep -l "some-text" | sort
but it didn't work.
You need to use something like xargs so that the file names coming through the pipe | are passed as arguments to grep, which then reads their contents.
XARGS: converts input from standard input into arguments to a command
In my case, I have file1, file2 and file3, and they contain the word test. This will do it.
za:tmp za$ find . -type f | xargs grep -l "test" | sort
./file1.txt
./file2.txt
./file3.txt
or
za:tmp za$ find . -type f | xargs grep -i "test" | sort
./file1.txt:some test string
./file2.txt:some test string
./file3.txt:some test string
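If any of the file names contain spaces or other special characters, a null-delimited variant of the same idea (a sketch):
find . -type f -print0 | xargs -0 grep -l "test" | sort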
You can use this in any Unix:
find . -type f -exec sh -c 'grep "some-text" {} /dev/null > /dev/null 2>&1' \; -a -print 2> /dev/null | sort
A more optimized solution that works only with GNU grep:
find . -type f -exec grep -Hq "some-text" {} \; -a -print 2> /dev/null | sort
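For what it's worth, with GNU grep alone you can skip find entirely, assuming you simply want every file under the current directory searched recursively:
grep -rl "some-text" . | sort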

How do I print all the files older than 10 days containing a particular string?

I have tried this, but it is not working.
find . -mtime +10 -print| grep -H -r "test" | cut -d: -f1
You can make use of xargs to process the files found by find, but find alone can do it:
find . -mtime +10 -exec grep -l "test" {} \+
find ... -exec XXX {} \; (or \+, thanks Kevin) performs the XXX command on the files found by find.
grep -l just shows the names of the files, which I think is what you are trying to get with cut -d: -f1.
You may also want to add -type f to match only regular files, not directories.
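Putting those pieces together, the full command might look like this (a sketch):
find . -type f -mtime +10 -exec grep -l "test" {} +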
Alternatively, you can execute it through xargs, like this:
find . -mtime +10 -print0 | xargs -0 grep -H -r "test" | cut -d: -f1
Edit: I inserted options so that you won't have problems with spaces in the filenames.

bash shell script not working as intended using cmp with output redirection

I am trying to write a bash script that remove duplicate files from a folder, keeping only one copy.
The script is the following:
#!/bin/sh
for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then
        for f2 in `find ./ -name "*.txt"`
        do
            if [ -f $f2 ] && [ "$f1" != "$f2" ]
            then
                # if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
                if cmp $f1 $f2
                then
                    rm $f2
                    echo "$f2 purged"
                fi
            fi
        done
    fi
done
I want to redirect the output and stderr to /dev/null to avoid printing them to the screen. But using the commented statement, this script does not work as intended and removes all files but the first.
I'll give more information if needed.
Thanks
A few comments:
First, the:
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
is the same as (find only plain files with the txt extension)
for f1 in `find ./ -type f -name "*.txt"`
Better syntax (bash only) is
for f1 in $(find ./ -type f -name "*.txt")
and finally, the whole thing is wrong, because if the filename contains a space, the f1 variable will not get the full path name. So instead of the for, do:
find ./ -type f -name "*.txt" -print | while read -r f1
and as @Sir Athos pointed out, the filename can contain \n, so the best is to use
find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1
Second:
Use "$f1" instead of $f1 - again, because the $f1 can contain space.
Third:
doing N*N comparisons is not very efficient. You should compute a checksum (md5, or better, sha256) for every txt file. When the checksums are identical, the files are duplicates.
If you don't trust checksums, simply compare byte by byte only the files that have identical checksums. Files with different checksums are definitely not duplicates. ;)
Computing checksums is slow too, so you should first compare only files with the same size. Files of different sizes are not duplicates...
You can skip empty txt files - they are all duplicates of each other :).
So the final command can be:
find -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
commented:
#find all non-empty files with the txt extension and print their size (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\
#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\
#for each size that is duplicated, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\
#make an md5 checksum for them
xargs -0 md5sum |\
#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate
Now you can simply edit the output and decide which files you want to remove and which you want to keep.
&> is bash syntax, you'll need to change the shebang line (first line) to #!/bin/bash (or the appropriate path to bash.
Or if you're really using the Bourne Shell (/bin/sh), then you have to use old-style redirection, i.e.
cmp ... >/dev/null 2>&1
Also, I think the &> was only introduced in bash 4, so if you're using bash, 3.X you'll still need the old-style redirections.
IHTH
Credit to @kobame for this answer: this is really a comment, but I needed the formatting.
You don't need to call find twice; print out the size and the filename in the same find command:
find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 |
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum
An example
$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum
76fd4c1589ef708d9203f3cf09cfd032 ./a.txt
e2d75fd6a1080efb6230d0608b1f9014 ./b.txt
76fd4c1589ef708d9203f3cf09cfd032 ./d.txt
e2d75fd6a1080efb6230d0608b1f9014 ./e.txt
To keep one and delete the rest, I would pipe the output into:
... | awk '++seen[$1] > 1 {print $2}' | xargs echo rm
rm ./d.txt ./e.txt
Remove the echo if your testing is satisfactory.
Like many complex pipelines, filenames containing newlines will break it.
All nice answers, so just one short suggestion: you can install and use the fdupes utility:
fdupes -r .
From the man page:
Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison.
Added by @Francesco:
fdupes -rf . | xargs rm -f
to remove the dupes (the -f option makes fdupes omit the first occurrence in each set, so it lists only the duplicates).
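If your build of fdupes supports them, the -d and -N options can delete the extra copies directly, without the xargs step (a sketch; check your man page first):
fdupes -rdN .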

Use find, wc, and sed to count lines

I was trying to use sed to count all the lines based on a particular extension.
find -name '*.m' -exec wc -l {} \; | sed ...
That is what I was trying to do; how would I include sed in this particular line to get the totals?
You may also get the nice formatting from wc with:
wc `find -name '*.m'`
Most of the answers here won't work well for a large number of files. Some will break if the list of file names is too long for a single command line call; others are inefficient because -exec starts a new process for every file. I believe a robust and efficient solution would be:
find . -type f -name "*.m" -print0 | xargs -0 cat | wc -l
Using cat in this way is fine, as its output is piped straight into wc so only a small amount of the files' content is kept in memory at once. If there are too many files for a single invocation of cat, cat will be called multiple times, but all the output will still be piped into a single wc process.
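A roughly equivalent approach, sketched here, lets find batch the arguments itself with -exec ... + and avoids xargs altogether:
find . -type f -name "*.m" -exec cat {} + | wc -l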
You can cat all files through a single wc instance to get the total number of lines:
find . -name '*.m' -exec cat {} \; | wc -l
On modern GNU platforms, find's -print0 and wc's --files0-from options can be combined into a command that counts the lines in the files, with a total at the end. Example:
find . -name '*.c' -type f -print0 | wc -l --files0-from=-
You could also use sed in place of wc for counting lines:
find . -name '*.m' -exec sed -n '$=' {} \;
where $= makes sed print the line number (=) of the last line ($), i.e. the number of lines
EDIT: you could also try something like sloccount
Hm, the solution with cat may be problematic if you have many files, especially big ones.
The second solution doesn't give a total, just lines per file, as I tested.
I'd prefer something like this:
find . -name '*.m' | xargs wc -l | tail -1
This will do the job fast, no matter how many files you have or how big they are.
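One caveat: if the list of files is long enough that xargs splits it over several wc invocations, tail -1 only reports the total of the last batch. Summing the per-file counts avoids that (a sketch, assuming no file is literally named "total"):
find . -name '*.m' -print0 | xargs -0 wc -l | awk '!/ total$/ {sum += $1} END {print sum}'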
sed is not the proper tool for counting. Use awk instead:
find . -name '*.m' -exec awk 'END {print NR}' {} +
Using + instead of \; makes find call awk once per batch of N files found (like xargs does).
For big directories we should use:
find . -type f -name '*.m' -exec sed -n '$=' '{}' + 2>/dev/null | awk '{ total+=$1 }END{print total}'
# alternative using awk twice
find . -type f -name '*.m' -exec awk 'END {print NR}' '{}' + 2>/dev/null | awk '{ total+=$1 }END{print total}'
