Recursively compare specific files in different directories - bash

Similar posts here:
Diff files present in two different directories
and here:
https://superuser.com/q/602877/520666
But not quite what I'm looking for.
I have 2 directories (containing subdirectories and different file types -- binary, images, html, etc.).
I want to be able to recursively compare files with specific extensions (e.g. .html, .strings, etc.) between the two directories -- they may or may not exist in either (sub)directory.
How can I accomplish this? diff only seems to support exclusions, and I'm not sure how I can leverage find for this.
Advice?

You could exclude all unwanted file endings with find:
(this version only matches against file endings)
diff -r -x `find . -type f -name '*.*' | sed 's|.*\.|.*\.|' | sort -u | grep -v YOURFILETYPE | paste -sd "|"` ...rest of diff command
Or you generate the list of excluded files upfront and pass it to the diff:
(this version also matches against filenames and any other regex you pass to grep)
find /dirA -type f | grep -v YOURFILEENDING > exclude.list
find /dirB -type f | grep -v YOURFILEENDING >> exclude.list
diff -X exclude.list -r /dirA /dirB
If you chain these commands via && you'll get a handy one-liner ;)
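For example, a sketch of that chained one-liner, assuming the extensions of interest are .html and .strings (taken from the question; the grep -vE alternation is my shorthand for filtering both at once):
find /dirA -type f | grep -vE '\.(html|strings)$' > exclude.list && \
find /dirB -type f | grep -vE '\.(html|strings)$' >> exclude.list && \
diff -X exclude.list -r /dirA /dirB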
WITH INCLUDE FILE
If you want to use an include file, you can use this method:
You specify the include file
grep matches against all files in the folders and turns your include file into an exclude file for diff (diff only takes exclude files)
Here is an example:
Complicated inline version:
(this version only matches against file endings)
diff -r -x `find . -type f -name '*.*' | sed 's|.*\.|.*\.|' | sort -u | grep -v -f include.file | paste -sd "|"` /dirA /dirB
Slightly longer simpler version:
(this version also matches against filenames and every other regex you specify in include.file)
find /dirA -type f | grep -v -f include.file > exclude.list
find /dirB -type f | grep -v -f include.file >> exclude.list
diff -X exclude.list -r /dirA /dirB
with each line in include.file being a grep regex/expression:
log
txt
fileending3
whateverfileendingyoulike
fullfilename.txt
someotherregex.*
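For the extensions in the question, include.file would just be (my illustration, anchored so that e.g. .html.bak does not slip through):
\.html$
\.strings$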
NOTE
I did not run these because I'm nowhere near a computer.
I hope I got all syntax correct.

The simplest thing you can do is to compare the whole directories:
diff -r /path/the/first /path/the/second
It will show which files are only in one of the directories, which files differ in a binary fashion, and the full diff for any textual files in both directories.
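If you only want the summary lines (which files differ or exist only on one side) rather than the full text diffs, diff's -q (--brief) flag gives exactly that:
diff -qr /path/the/first /path/the/second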
You can loop over a set of relative paths by simply reading a file with a path per line thusly:
while IFS= read -u 9 relative_path
do
diff "/path/the/first/%{relative_path}" "/path/the/second/%{relative_path}"
done 9< relative_paths.txt
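One way to generate relative_paths.txt for the extensions in the question (a sketch; the subshell cd keeps the paths relative to the first tree):
( cd /path/the/first && find . -type f \( -name '*.html' -o -name '*.strings' \) ) > relative_paths.txt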
Doing this for a specific set of extensions is similarly easy:
shopt -s globstar
while IFS= read -u 9 extension; do
diff "/path/the/first/"**/*."${extension}" "/path/the/second/"**/*."${extension}"
done 9< extensions.txt
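extensions.txt then holds one extension per line; for the file types in the question it could be created like this:
printf '%s\n' html strings > extensions.txt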

Related

Enable wildcards to behave recursively [duplicate]

I'd like to compute statistics of the words from all the .txt files in the current directory and its subdirectories.
In [39]: ls
about.txt distutils/ installing/ whatsnew/
bugs.txt extending/ library/ word.txt
c-api/ faq/ license.txt words_frequency.txt
contents.txt glossary.txt reference/
copyright.txt howto/ tutorial/
distributing/ install/ using
I first tried the command:
In [46]: !grep -Eoh '[a-zA-Z]+' *.txt | nl
The problem is that files in the subdirectories were not found:
In [45]: !echo *.txt
about.txt bugs.txt contents.txt copyright.txt glossary.txt license.txt word.txt words_frequency.txt
I improved it as:
In [48]: ! echo */*.txt | grep "about.txt"
In [49]:
Problem again: it fails to find the files in the top-level directory and cannot traverse directories of arbitrary depth.
It's interesting that Python has a solution to this problem:
In [50]: files = glob.glob("**/*.txt", recursive=True)
In [54]: files.index('about.txt')
Out[54]: 4
It can traverse directories recursively to find all the txt files.
However, Python is cumbersome for moving files around and changing text data compared with grep "pattern" *.txt.
How can I enable the wildcards to behave recursively?
As an alternative, the find command helps:
find . -regex '.*\.txt' -exec grep -Eoh '[a-zA-Z]+' {} + | nl
which is not as handy as a recursive wildcard would be.
globstar could not be activated on macOS:
$ shopt -s globstar
-bash: shopt: globstar: invalid shell option name
$ bash --version
GNU bash, version 4.4.19(1)-release (x86_64-apple-darwin17.3.0)
If I understood the question correctly you may use something like this:
find . -type f -name '*.txt' -exec /bin/grep -hEo '\w+' {} \; \
| sort \
| uniq -c \
| sort -k1,1n
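As a side note on the globstar error: the stock /bin/bash on macOS is 3.2, which predates globstar, and that is most likely the shell that actually ran the shopt even though bash --version reports 4.4. If a bash 4+ really is the one executing, the recursive glob itself works, e.g.:
shopt -s globstar
grep -Eoh '[a-zA-Z]+' **/*.txt | sort | uniq -c | sort -k1,1n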

How to find many files listed in a txt file in a directory and its subdirectories, then copy them all to a new folder

I can't find posts that help with this exact problem:
On Mac Terminal I want to read a txt file (example.txt) containing file names such as:
20130815 144129 865 000000 0172 0780.bmp
20130815 144221 511 000003 1068 0408.bmp
....100 more
And I want to search for them in a certain folder/subfolders (example_folder). After each find, the file should be copied to a new folder x (new_destination).
Your help would be much appreciated!
Cheers,
Mo
You could use a piped command with a combination of ls, grep, xargs and cp.
So basically you start with getting the list of files
ls
then you filter them with egrep -e, grep -e or whatever flavor of grep Mac uses for its terminal. If you want to find all files ending with .txt you can use the regex \.txt$ (which means ends with '.txt')
ls | egrep -e "yourRegexExpression"
After that you get an input stream, but cp doesn't work with input streams and only takes arguments, which is why we use xargs to convert the stream into arguments. The final step is to add the -t flag to signify that the next argument is the target directory.
ls | egrep -e "yourRegexExpression" | xargs cp -t DIRECTORY
I hope this helps!
Edit
Sorry, I didn't read the question well enough; I updated the answer to match your problem. Here you can see that the egrep command compiles a rather large regex string with all the file names in the form (filename1|filename2|...|fileN). The $() evaluates the command inside it and uses tr to translate newlines to "|" for the regex.
ls | egrep -e "("$(cat yourtextfile.txt | tr "\n" "|")")" | xargs cp -t DIRECTORY
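One caveat (my note, not from the original answer): -t is a GNU cp option and the BSD cp that ships with macOS does not have it. On a Mac you can let xargs place each name instead; -I also makes xargs treat every whole line as one argument, which matters here because these file names contain spaces:
ls | egrep -e "("$(cat yourtextfile.txt | tr "\n" "|")")" | xargs -I{} cp {} DIRECTORY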
You could do something like:
$ for i in `cat example.txt`; do
    find /search/path -type f -name "$i" -exec cp "{}" /new/path \;
done
This is how it works: for every line within example.txt:
for i in `cat example.txt`
it will try to find a file matching the line $i in the defined path:
find /search/path -type f -name "$i"
And if found it will copy it to the desired location:
-exec cp "{}" /new/path \;

grep with multiple NOT and AND operators

I am looking for a multiple grep with NOT and AND conditions. I have a directory with txt files and some csv files which have the date included in the filename. I want to delete the csv files that do not include today’s date. The directory does include csv files with previous dates. So I am trying the code below in bash
#!/bin/bash
now=$(date "+%m-%d-%Y")
dir="/var/tmp/"
for f in "$dir"/*; do
ls -1 | grep -v *$now* | grep *csv* | xargs rm -f
done
This is not deleting anything. If I take out the grep csv operator then it deletes the text files. Only the CSV files have dates on them, the text files don’t. Please advise.
I suggest using the find utility for this:
find /var/tmp -maxdepth 1 -name '*.csv' ! -name "*$now*" -delete
If you want to do it with grep,
ls -1 /var/tmp/*.csv | grep -v "$now" | xargs rm -f
should also work.
EDIT: use -delete in the find command instead of -exec rm '{}' \;. Thanks @amphetamachine.
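Putting it together with the variables from the question, the whole script could shrink to something like this (a sketch, untested):
#!/bin/bash
now=$(date "+%m-%d-%Y")
dir="/var/tmp"
# keep today's csv files, delete the rest (txt files are untouched)
find "$dir" -maxdepth 1 -name '*.csv' ! -name "*$now*" -delete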

Copying list of files to a directory

I want to make a search for all .fits files that contain a certain text in their name and then copy them to a directory.
I can use a command called fetchKeys to list the files that contain say 'foo'
The command looks like this : fetchKeys -t 'foo' -F | grep .fits
This returns a list of .fits files that contain 'foo'. Great! Now I want to copy all of these to a directory /path/to/dir. There are too many files to do individually; I need to copy them all using one command.
I'm thinking something like:
fetchKeys -t 'foo' -F | grep .fits > /path/to/dir
or
cp fetchKeys -t 'foo' -F | grep .fits /path/to/dir
but of course neither of these works. Any other ideas?
If this is on Linux/Unix, can you use the find command? That seems very much like fetchKeys.
$ find . -name "*foo*.fit" -type f -print0 | while read -r -d $'\0' file
do
basename=$(basename $file)
cp "$file" "$fits_dir/$basename"
done
The find command will find all files that match *foo*.fits in their name. The -type f says they have to be files and not directories. The -print0 means print out the files found, but separate them with the NUL character. Normally, the find command will simply return a file on each line, but what if the file name contains spaces, tabs, new lines, or even other strange characters?
The -print0 will separate out files with nulls (\0), and the read -d $'\0' file means to read in each file separating by these null characters. If your files don't contain whitespace or strange characters, you could do this:
$ find . -name "*foo*.fit" -type f | while read file
do
basename=$(basename $file)
cp "$file" "$fits_dir/$basename"
done
Basically, you read each file found with your find command into the shell variable file. Then, you can use that to copy that file into your $fits_dir or wherever you want.
Again, maybe there's a reason to use fetchKeys, and it is possible to replace that find with fetchKeys, but I don't know that fetchKeys command.
Copy all files with the name containing foo to a certain directory:
find . -name "*foo*.fit" -type f -exec cp {} "/path/to/dir/" \;
Copy all files themselves containing foo to a certain directory (solution without xargs):
for f in `find . -type f -exec grep -l foo {} \;`; do cp "$f" /path/to/dir/; done
The find command has very useful arguments -exec, -print, -delete. They are very robust and eliminate the need to manually process the file names. The syntax for -exec is: -exec (what to do) \;. The name of the file currently processed will be substituted instead of the placeholder {}.
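If many files match, -exec can also be terminated with + instead of \;, which hands the names to cp in batches rather than one at a time (a sketch, my addition; -t is the GNU cp form that takes the target directory first):
find . -name "*foo*.fits" -type f -exec cp -t /path/to/dir/ {} +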
Other commands that are very useful for such tasks are sed and awk.
The xargs tool can execute a command for every batch of lines it gets from stdin. This time, we execute a cp command:
fetchKeys -t 'foo' -F | grep .fits | xargs -n 500 cp -vfa -t /path/to/dir
xargs is a very useful tool, although its parametrization is not really trivial. This command reads in 500 .fits file names at a time and calls a single cp command for every group (-t is GNU cp's option naming the target directory up front). I haven't tested it in depth; if it doesn't work, leave a comment.

How to create a backup of files' lines containing "foo"

Basically I have a directory and sub-directories that need to be scanned to find .csv files. From there I want to copy all lines containing "foo" from the CSVs found to new files (in the same directory as the original) but with a name reflecting the file it was found in.
So far I have
find -type f -name "*.csv" | xargs egrep -i "foo" > foo.csv
which yields one backup file (foo.csv) with everything in it, and the location it was found in is part of the data. Both of which I don't want.
What I want:
For example if I have:
csv1.csv
csv2.csv
and they both have lines containing "foo", I would like those lines copied to:
csv1_foo.csv
csv2_foo.csv
and I don't want anything extra entered in the backups, other than the full line containing "foo" from the original file. I.e. I don't want the original file name in the backup data, which is what my current code does.
Also, I suppose I should note that I'm using egrep, but my example doesn't use regex. I will be using regex in my search when I apply it to my specific scenario, so this probably needs to be taken into account when naming the new file. If that seems too difficult, an answer that doesn't account for regex would be fine.
Thanks ahead of time!
Try this, hope it helps anyway.
find -type f -name "*.csv" | xargs -I {} sh -c 'filen=`echo {} | sed 's/.csv//' | sed "s/.\///"` && egrep -i "foo" {} > ${filen}_foo.log'
You can try this:
$ find . -type f -exec grep -H foo '{}' \; | perl -ne '`echo $2 >> $1_foo` if /(.*):(.*)/'
It uses:
find to iterate over files
grep to print file path:line tuples (-H switch)
perl to echo those lines to the output files (using backticks, but it could be done prettier).
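An equivalent awk sketch (my variant, not part of the original answer) that splits on the first colon only, so lines that themselves contain colons stay intact; it appends to the same path_foo files as the perl version:
find . -type f -exec grep -H foo '{}' \; | awk '{
    i = index($0, ":")                                # first colon separates path from matched line
    print substr($0, i + 1) >> (substr($0, 1, i - 1) "_foo")
}'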
You can also try:
find -type f -name "*.csv" -a ! -name "*_foo.csv" | while read f; do
grep foo "$f" > "${f%.csv}_foo.csv"
done
