delete files with partial match to another list of names - bash

I have a list of names in file1; each name is a substring of the true filename of a file I want to delete. I want to delete the files that are partially matched by the names in file1. Any idea how to specify the files to delete?
file1:
file123
file313
file355
True files:
file123.sam.out
file313.sam.out
file355.sam.out
file342455.sam.out
file34455.sam.out
Files to keep:
file342455.sam.out
file34455.sam.out

Assuming you don't have any filenames containing newline literals...
printf '%s\n' * | grep -Ff file1 | xargs -d $'\n' rm -f --
Let's walk through this piece-by-piece:
printf '%s\n' * generates a list of files in your current directory.
grep -Ff file1 keeps in that list only the names that contain a line from file1 as a substring
xargs -d $'\n' rm -f -- splits its stdin on newlines, and passes whatever it is left with as arguments to rm -f --
If you have GNU tools (and a shell, like bash, with process substitution support), you can use NUL delimiters and thus work with all possible filenames:
printf '%s\0' * |
grep -zFf <(tr '\n' '\0' <file1) |
xargs -0 rm -f --
printf '%s\0' * puts a NUL, instead of a newline, after each filename.
tr '\n' '\0' <file1 emits the contents of file1 with newlines replaced with NULs
grep -zFf reads both of its inputs as NUL-delimited and writes NUL-delimited output, but otherwise behaves as above.
xargs -0 rm -f -- reads content, splitting on NULs, and passes input as arguments to rm -f --.
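If you want to preview what would be removed before committing, the destructive tail of the pipeline is easy to swap out (a sketch; printf here merely stands in for rm -f --):
printf '%s\0' * |
grep -zFf <(tr '\n' '\0' <file1) |
xargs -0 printf 'would remove: %s\n'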

#!/bin/bash
PATTERN_FILE=file1
FILE_TO_REMOVE_FOLDER=files
while IFS= read -r x
do
    if [ -n "$x" ]
    then
        echo "rm $FILE_TO_REMOVE_FOLDER/$x*"
        rm "$FILE_TO_REMOVE_FOLDER/$x"*
    fi
done < "$PATTERN_FILE"
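A quick sanity check of that script against the question's example data (cleanup.sh is a hypothetical name for the script above):
mkdir -p files
touch files/file123.sam.out files/file342455.sam.out
printf 'file123\n' > file1
bash cleanup.sh    # echoes each rm command as it runs
ls files           # file342455.sam.out should remain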


combining all files that contain the same word into a new text file, leaving newlines between individual files

It is my first question here. I have a folder called "materials", which has 40 text files in it. I am basically trying to combine the text files that contain the word "carbon" (in both capitalized and lowercase form) into a single file, leaving newlines between them. I used "grep -w carbon *" to identify the files that contain the word carbon. I just don't know what to do after this point. I really appreciate all your help!
grep -il carbon materials/*.txt | while read -r line; do
    echo ">> Adding $line"
    cat "$line" >> result.out
    echo >> result.out
done
Explanation
grep searches for the string in the files. -i ignores case for the searched string. -l prints only the name of each file containing the string
while loops over the filenames containing the string
cat with >> appends each file's contents to result.out
echo >> adds a newline to result.out after each file's content is appended
Execution
$ ls -1 materials/*.txt
materials/1.txt
materials/2.txt
materials/3.txt
$ grep -i carbon materials/*.txt
materials/1.txt:carbon
materials/2.txt:CARBON
$ grep -irl carbon materials/*txt | while read line; do echo ">> Adding $line"; cat $line >> result.out; echo >> result.out; done
>> Adding materials/1.txt
>> Adding materials/2.txt
$ cat result.out
carbon
CARBON
Try this (assuming your filenames don't contain newline characters):
grep -iwl carbon ./* |
while IFS= read -r f; do cat "$f"; echo; done > /tmp/combined
If it is possible that your filenames may contain newline characters and your shell is bash, then:
grep -iwlZ carbon ./* |
while IFS= read -r -d '' f; do cat "$f"; echo; done > /tmp/combined
grep is assumed to be GNU grep (for the -w and -Z options). Note that these will leave a trailing newline character in the file /tmp/combined.

Print program path and its symlink using which

Whenever I use which I do this: $ which -a npm
Which results in: /usr/local/bin/npm
Then to find the real path, I run:
ls -l /usr/local/bin/npm
I would like a fast way of doing this. The best I have come up with is defining a function:
which(){
    /usr/bin/which -a "$@" | xargs ls -l | tr -s ' ' | cut -d ' ' -f 9-
}
Now it has a nice output of: /usr/local/bin/npm -> ../lib/node_modules/npm/bin/npm-cli.js
Is there a better way to do this? I don't like using cut like this.
This won't print the -> output ls -l does, but it will resolve symlinks:
which() {
    command which -a "$@" | xargs -d '\n' readlink -m
}
If you want the -> output but want to do it more robustly, you could mimic ls -l with:
which() {
    command which -a "$@" | while IFS= read -r file; do
        if [[ -L $file ]]; then
            echo "$file -> $(readlink -m "$file")"
        else
            echo "$file"
        fi
    done
}
What does command do?
command suppresses function lookup and runs the real which command, as if which() weren't defined. This way you don't have to hardcode /usr/bin/which.
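A tiny illustration of command bypassing a function (output shown is illustrative):
ls() { echo "our ls function"; }
ls            # prints: our ls function
command ls    # runs the real ls binary instead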
awk: if first character is "l" (link), print fields 9,10,11; else only 9.
[ranga@garuda ~]$ ls -l $(which -a firefox)|awk '{print $1 ~ /^l/ ? $9 $10 $11 : $9 }'
/usr/bin/firefox->/usr/lib64/firefox/firefox
$1 is the first field
$1 ~ /^l/ ? tests whether the first field matches the pattern ^l (first character is "l")
if the test passes, print receives $9 $10 $11; else, only $9.
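If you would rather keep spaces around the arrow, the fields can be printed separately (a small variation on the same one-liner, assuming the usual ls -l column layout):
ls -l $(which -a firefox) | awk '{ print $1 ~ /^l/ ? $9 " " $10 " " $11 : $9 }'
/usr/bin/firefox -> /usr/lib64/firefox/firefox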
sed: remove the first 8 bunches of non-space characters, together with the spaces that follow them.
[ranga@garuda ~]$ ls -l $(which firefox) | sed 's/^\([^ ]*[ ]*\)\{8\}//'
/usr/bin/firefox -> /usr/lib/firefox/firefox
[ ]* matches a bunch of contiguous spaces
[^ ]* matches a contiguous bunch of non-space characters
the grouping \([^ ]*[ ]*\) matches a text with one non-space bunch and one space bunch (in that order).
\{8\} matches 8 contiguous instances of this combination. ^ at the beginning pins the match to the beginning of the line.
's/^\([^ ]*[ ]*\)\{8\}//' replaces a match with empty text - effectively removing it.
seems to work so long as you aren't running "which" on an alias.
these commands are not presented inside a function but can be used in one (which you already know how to do).

Optimal way to recursively find files that match one or more patterns

I have to optimize a shell script, but after one week I haven't succeeded in optimizing it enough.
I have to search recursively for .c, .h and .cpp files in a directory, and check whether words like these exist:
"float short unsigned continue for signed void default goto sizeof volatile do if static while"
words=$(echo "$@" | sed 's/ /\\|/g')
files=$(find $dir -name '*.cpp' -o -name '*.c' -o -name '*.h')
for file in $files; do
    (
        test=$(grep -woh "$words" "$file" | sort -u | awk '{print}' ORS=' ')
        if [ "$test" != "" ] ; then
            echo "$(realpath $file) contains : $test"
        fi
    )&
done
wait
I have tried with xargs and -exec, but with no result. I have to keep this result format:
/usr/include/c++/6/bits/stl_set.h contains : default for if void
Maybe you can help me (to optimize it).
EDIT: I have to keep one occurrence of each word
YES: while, for, volatile...
NOPE: while, for, for, volatile...
If you are interested in finding all files that have at least one match of any of your patterns, you can use globstar:
shopt -s globstar
oldIFS=$IFS; IFS='|'; patterns="$*"; IFS=$oldIFS # make a | delimited string from arguments
grep -lwE "$patterns" **/*.c **/*.h **/*.cpp # list files with matching patterns
globstar
If set, the pattern ‘**’ used in a filename expansion context
will match all files and zero or more directories and subdirectories.
If the pattern is followed by a ‘/’, only directories and
subdirectories match.
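For instance, if the snippet above is saved as a script (findmatches.sh is a made-up name), the keywords can be passed as arguments:
./findmatches.sh float short unsigned continue for signed void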
Here is an approach that keeps your desired format while eliminating the use of find and bash looping:
words='float|short|unsigned|continue|for|signed|void|default|goto|sizeof|volatile|do|if|static|while'
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path | awk -F: '$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
How it works
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path
This searches recursively through directories starting with path looking only in files whose names match the globs *.[ch] or *.cpp.
awk -F: '$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
This awk command reformats the output of grep to match your desired output. The script uses a variable last and array a. last keeps track of which file we are on and a contains the list of words seen so far. In more detail:
-F:
This tells awk to use : as the field separator. In this way, the first field is the file name and the second is the word that is found. (limitation: file names that include : are not supported.)
'$1!=last{printf "%s%s contains : %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]}
Every time the file name, $1, does not match the variable last, we start the output for a new file. Then we update last to contain the name of this new file, empty the array a, and record the current word, $2, as a key in a.
$1==last && !($2 in a){printf " %s",$2; a[$2]}
If the current file name is the same as the previous and the current word has not been seen before, we print out the new word found. We also add this word, $2 as a key to array a.
END{print""}
This prints out a final newline (record separator) character.
Multiline version of code
For those who prefer their code spread out over multiple lines:
grep -rwoE \
    --include '*.[ch]' \
    --include '*.cpp' \
    "$words" path |
awk -F: '
    $1!=last{
        printf "%s%s contains : %s",r,$1,$2
        last=$1
        r=ORS
        delete a
        a[$2]
    }
    $1==last && !($2 in a){
        printf " %s",$2
        a[$2]
    }
    END{
        print""
    }'
You should be able to do most of this with a single grep command:
grep -Rw $dir --include \*.c --include \*.h --include \*.cpp -oe "$words"
This will put it in file:word format, so all that's left is to change it to produce the output that you want.
echo $output | sed 's/:/ /g' | awk '{print $1 " contains : " $2}'
Then you can add | sort -u to get only single occurrences for each word in each file.
#!/bin/bash
#dir=.
words=$(echo "$@" | sed 's/ /\\|/g')
grep -Rw $dir --include \*.c --include \*.h --include \*.cpp -oe "$words" \
    | sort -u \
    | sed 's/:/ /g' \
    | awk '{print $1 " contains : " $2}'
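Assuming you uncomment the dir= line or pass the directory through the environment, an invocation could look like this (check_words.sh is a made-up name for the script):
dir=/usr/include ./check_words.sh float short unsigned for if while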

Bash: Filter directory when piping from `ls` to `tee`

(background info)
Writing my first bash pseudo-program. The program downloads a bunch of files from the network, stores them in a sub-directory called ./network-files/, then removes all the files it downloaded. It also logs the result to several log files in ./logs/.
I want to log the filenames of each file deleted.
Currently, I'm doing this:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | tee -a $network_files_log $verbose_log $network_log
($base_directory is a variable defining the base directory for the app, $network_files_log etc are variables defining the location of various log files)
This produces some pretty grody and unreadable output:
Tue Jun 21 04:55:46 UTC 2016 >>> Removing files: /home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png /home/vagrant/load-simulator/network-files/442119100.png /home/vagrant/load-simulator/network-files/464324101.png /home/vagrant/load-simulator/network-files/525787337.png /home/vagrant/load-simulator/network-files/581100197.png /home/vagrant/load-simulator/network-files/640387393.png /home/vagrant/load-simulator/network-files/650797708.png /home/vagrant/load-simulator/network-files/827538696.png /home/vagrant/load-simulator/network-files/833069509.png /home/vagrant/load-simulator/network-files/8580204.png /home/vagrant/load-simulator/network-files/858174053.png /home/vagrant/load-simulator/network-files/998266826.png
Any good way to strip out the /home/vagrant/load-simulator/network-files/ part from each of those file paths? I suspect there's something I should be doing with sed or grep, but haven't had any luck so far.
You might also consider using find. It's perfect for walking directories, removing files, and producing customized output via -printf:
find "$PWD/x" -type f -printf "%f\n" -delete >> "$YourLogFile.log"
Don't use ls at all; use a glob to populate an array with the desired files. You can then use parameter expansion to shorten each array element.
d=$base_directory/network-files
files=( "$d"/* )
printf '%s Removing files: %s' "$(date -u)" "${files[*]#$d/}" | tee ...
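A quick illustration of that ${files[*]#$d/} expansion (paths made up):
files=( /tmp/net/a.png /tmp/net/b.png )
echo "${files[*]#/tmp/net/}"    # prints: a.png b.png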
You could do it a couple of ways. To directly answer the question, you could use sed to do it with the substitution command like:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | sed -e "s,$base_directory/network-files/,," | tee -a $network_files_log $verbose_log $network_log
which adds sed -e "s,$base_directory/network-files/,," to the pipeline. It will substitute the string found in base_directory with the empty string, so long as there are not any commas in base_directory. If there are you could try a different separator for the parts of the sed command, like underscore: sed -e "s_$base_directory/network-files__"
Instead though, you could just have the subshell cd to that directory and then the string wouldn't be there in the first place:
echo -e "$(date -u) >>> Removing files: $(cd "$base_directory/network-files/"; ls -1 | tr '\n' ' ')" | tee -a "$network_files_log" "$verbose_log" "$network_log"
Or you could avoid some potential pitfalls with echo and use printf like
{ printf '%s >>>Removing files: '; printf '%s ' "$(cd "$base_directory/network-files"; ls -1)"; printf '\n'; } | tee -a ...
testdata="/home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png"
echo -e $testdata | sed -e 's/\/[^ ]*\///g'
Pipe your output to sed to replace the matched text with nothing.
The regex: \/[^ ]*\/
It starts at a /, then matches everything that is not a space, up to the last / in that path.
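For the sample $testdata above, that should print:
207822218.png 217311040.png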

awk parse filename and add result to the end of each line

I have a number of files which have similar names, like
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out
etc.
I need to get the number before .csv (1 or 2) from the file name and put it at the end of every line in the file, with a TAB separator.
I have written this code; it finds the number that I need, but I do not know how to put this number into the file. Also, there is a space in the filename, and my script breaks because of it.
Also I am not sure how to send a list of files to the script. Right now I am working with only one file.
My code:
#!/bin/sh
string="DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out"
out=$(echo $string | awk 'BEGIN {FS="_"};{print substr ($7,0,1)}')
awk ' { print $0"\t$out" } ' $string
for file in *
do
    sfx=$(echo "$file" | sed 's/.*_\(.*\).csv.*/\1/')
    sed -i "s/$/\t$sfx/" "$file"
done
Using sed:
$ sed 's/.*_\(.*\).csv.*/&\t\1/' file
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out 1
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out 2
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out 1
To make this for many files:
sed 's/.*_\(.*\).csv.*/&\t\1/' file1 file2 file3
OR
sed 's/.*_\(.*\).csv.*/&\t\1/' file*
To make this changed get saved in the same file(If you have GNU sed):
sed -i 's/.*\(.\).csv.*/&\t\1/' file
Untested, but this should do what you want (extract the number before .csv and append that number to the end of every line in the .out file)
awk 'FNR==1 { split(FILENAME, field, /[_.]/) }
     { print $0 "\t" field[7] > (FILENAME "_aaaa") }' *.out
for file in *_aaaa; do mv "$file" "${file/_aaaa}"; done
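To see which piece of the name lands in field[7], you can try the split by hand (the name is shortened here for readability):
echo 'DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat' |
awk '{ split($0, f, /[_.]/); print f[7] }'    # prints: 1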
If I understood correctly, you want to append the number from the filename to every line in that file - this should do it:
#!/bin/bash
while [[ $# -gt 0 ]]; do
    num=$(echo "$1" | sed -r 's/.*_([0-9]+)\.csv.*/\1/')
    #awk "{ print \$0\"\t${num}\"; }" < "$1" > "$1.new"
    #sed -r "s/$/\t$num/" < "$1" > "$1.new"
    #sed -ri "s/$/\t$num/" "$1"
    shift
done
Run the script and give it the names of the files you want to process. $# is the number of command line arguments for the script; it is decremented at the end of the loop by shift, which drops the first argument and shifts the others down. Extract the number from the filename and pick one of the three commented lines to do the appending: awk gives you more flexibility; the first sed creates new files; the second sed processes them in-place (if you are running GNU sed, that is).
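Invocation would pass the files to process as arguments (append_num.sh being a made-up name for the script):
./append_num.sh DWH_Export_*.out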
Instead of awk, you may want to go with sed or coreutils.
Grab number from filename, with grep for variety:
num=$(<<<filename grep -Eo '[^_]+\.csv' | cut -d. -f1)
<<<filename feeds the word filename to the command's standard input; it is roughly equivalent to echo filename | grep ….
With sed
Append num to each line with GNU sed:
sed "s/\$/\t$num" filename
Use the -i switch to modify filename in-place.
With paste
You also need to know the length of the file for this method:
len=$(<filename wc -l)
Combine filename and num with paste:
paste filename <(seq $len | while read; do echo $num; done)
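If you don't mind using yes and head, the repeated column can be generated a little more directly (my variation, not part of the original answer):
paste filename <(yes "$num" | head -n "$len")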
Complete example
for filename in DWH_Export*; do
    num=$(echo "$filename" | grep -Eo '[^_]+\.csv' | cut -d. -f1)
    sed -i "s/\$/\t$num/" "$filename"
done
