Bash Shell awk/xargs magic - bash

I'm trying to learn a little awk foo. I have a CSV where each line is of the format partial_file_name,file_path. My goal is to find the files (based on the partial name) and move them to their respective new paths. I wanted to combine the forces of find, awk and mv to achieve this, but I'm stuck on the implementation. I wanted to use awk to separate the terms from the CSV file so that I could do something like
find . -name '*$1*' -print | xargs mv {} $2{}
where $1 and $2 are the split terms from the csv file. Anyone have any ideas? -peace

I suggest doing this:
$ cat korv
foo.txt,/hello/
bar.jpg,/mullo/
$ awk -F, '{print $1 " " $2}' korv
foo.txt /hello/
bar.jpg /mullo/
-F sets the delimiter, so the above will split using ",". Next, add * to the filenames:
$ awk -F, '{print "*"$1"*" " " $2}' korv
*foo.txt* /hello/
*bar.jpg* /mullo/
**
This shows I have an empty line. We don't want this match, so we add a rule:
$ awk -F, '/[a-z]/{print "*"$1"*" " " $2}' korv
*foo.txt* /hello/
*bar.jpg* /mullo/
Looks good, so pass all of this to mv using command substitution:
$ mv $(awk -F, '/[a-z]/{print "*"$1"*" " " $2}' korv)
$
Done.
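Note that mv accepts only one destination per invocation, so with more than one line in the CSV it may be safer to feed the pairs to mv one at a time. A minimal sketch reusing the same awk output (it assumes bash and that neither the file names nor the target paths contain whitespace):
while read -r pattern target; do
    mv $pattern "$target"    # $pattern is left unquoted so the shell expands the glob
done < <(awk -F, '/[a-z]/{print "*"$1"* " $2}' korv)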

You don't really need awk for this. There isn't really anything here which awk does better than the shell.
#!/bin/sh
IFS=,
while read file target; do
    find . -name "$file" -print0 | xargs -0 -r -I{} mv {} "$target"
done <path_to_csv_file
If you have special characters in the file names, you may need to tweak the read.
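If you also want the partial-name matching from the question (wildcards around the name) and file names that may contain spaces, here is a minimal sketch along the same lines; it assumes GNU find/xargs and a CSV whose fields contain no commas:
#!/bin/sh
# each CSV line is partial_file_name,target_dir
while IFS=, read -r partial target; do
    # wrap the partial name in wildcards so any file containing it matches
    find . -name "*$partial*" -print0 | xargs -0 -r -I{} mv {} "$target"
done <path_to_csv_file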

What about using awk's system() command:
awk '{ system("find . -name " $1 " -print | xargs -I {} mv {} " $2 "{}"); }'
Example line in the csv file (note that without -F this variant splits on whitespace rather than commas): test.txt ./subdirectory/
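A minimal sketch of how that could be wired up for the comma-separated format from the question (files.csv is a hypothetical name; paths are assumed to contain no spaces, and each match is simply moved into the target directory):
awk -F, '{ system("find . -name \"*" $1 "*\" -print | xargs -I {} mv {} \"" $2 "\"") }' files.csv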

find . -name "*err" -size "+10c" | awk -F.err '{print $1".job"}' | xargs -I {} qsub {}

Related

Get second part of output separated by two spaces

I have this script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 32
It outputs this:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2-2.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2.txt
Now I only want to save the last part (the path) in an array.
When I add this after the sort
| awk -F " " '{ print $1 }'
I get this as output:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change the $1 to $2, I get nothing, but I want to get the path of the file.
How should I do this?
EDIT:
This script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | awk '{ print $1 }' | sort | uniq -D -w 32
Outputs this
parallels@mbp:~/bin$ duper ./dups
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change it to $2 it outputs this
parallels@mbp:~/bin$ duper ./dups
parallels#mbp:~/bin$
Expected Output
./dups/dup1-1.txt
./dups/dup1.txt
./dups/subdups/dup2-2.txt
./dups/subdups/dup2.txt
There are some files in the directory that are not duplicates of each other, such as nodup1.txt and nodup2.txt. That's why they don't show up.
Change your find command to this:
find "$path" -type f -exec sha1sum {} \; | uniq -D -w 41 | awk '{print $2}' | sort
I moved the uniq as the first filter and it is taking into consideration just the first 41 characters, aiming to match just the sha1sum hash.
You can achieve the same result piping to tr and then cut:
echo '3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt' |\
tr -s ' ' | cut -d ' ' -f 2
Outputs:
./dups/dup1-1.txt
-s ' ' on tr is to squeeze spaces
-d ' ' -f 2 on cut is to output the second field delimited by spaces
I like to use cut for stuff like this. With this input:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
I'd do cut -d ' ' -f 2 which should return:
./dups/dup1-1.txt
I haven't tested it though for your case.
EDIT: Gonzalo Matheu's answer is better as he ensured to remove any extra spaces between your outputs before doing the cut.
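Since the original goal was to save the paths in an array, here is a minimal sketch of that last step, assuming bash 4+ (for mapfile) and the sorted sha1sum pipeline from above:
#!/bin/bash
path=$1
# collect the duplicate file paths into the array 'dups'
mapfile -t dups < <(find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 41 | awk '{print $2}')
printf '%s\n' "${dups[@]}"    # one path per line; note that paths containing spaces would be cut at the first space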

Searching for .extension files recursively and printing the number of lines in the files found?

I ran into a problem I am trying to solve but can't think of a way to do it without redoing the whole thing from the beginning. My script gets an extension and searches for every .extension file recursively, then outputs "filename:row #:word #". I would also like to print the total number of rows found in those files. Is there any way to do it using the existing code?
for i in `find . -name "*.$1" | awk -F/ '{print $NF}'`
do
    echo "$i:`wc -l <$i|bc`:`wc -w <$i|bc`" >>temp.txt
done
sort -r -t : -k3 temp.txt
cat temp.txt
I think you're almost there, unless I am missing something in your requirements:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
    lines=`wc -l < $f`
    words=`wc -w < $f`
    total=`echo "$lines+$total" | bc`
    echo "* $f:$lines:$words"
done
echo "# Total: $total"
Edit:
Per the recommendation of @Mark Setchell in the comments, this is a more refined version of the script above:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
    read lines words _ < <(wc -wl "$f")
    total=$(($lines+$total))
    echo "* $f:$lines:$words"
done
echo "# Total: $total"
Cheers
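A possible invocation, assuming the refined script above is saved as countlines.sh (a hypothetical name) and that the extension is passed without the leading dot:
chmod +x countlines.sh
./countlines.sh txt    # reports lines and words for every *.txt below the current directory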
This is a one-liner printing the lines found per file, the path of the file and at the end the sum of all lines found in all the files:
find . -name "*.go" -exec wc -l {} \; | awk '{s+=$1} {print $1, $2} END {print s}'
In this example it finds all files ending in .go, runs wc -l on each to get the number of lines and prints the output to stdout; awk is then used to sum column 1 into the variable s, which is printed only at the end: END {print s}
In case you would also like to get the words and the total sum at the end you could use:
find . -name "*.go" -exec wc {} \; | \
awk '{s+=$1; w+=$2} {print $1, $2, $4} END {print "Total:", s, w}'
Hope this can give you an idea about how to format, sum etc. your data based on the input.
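As a side note, here is a sketch of the same idea with -exec ... {} + so that wc is started far fewer times; the awk filter drops the per-batch "total" lines that wc prints when given several files at once (safe here because find prefixes every path with ./):
find . -name "*.go" -exec wc -l {} + | \
awk '$2 != "total" {s+=$1; print $1, $2} END {print "Total:", s}'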

Optimal way to recursively find files that match one or more patterns

I have to optimize a shell script, but after one week I still haven't managed to optimize it enough.
I have to search recursively for .c, .h and .cpp files in a directory, and check whether words like these exist:
"float short unsigned continue for signed void default goto sizeof volatile do if static while"
words=$(echo $@ | sed 's/ /\\|/g')
files=$(find $dir -name '*.cpp' -o -name '*.c' -o -name '*.h')
for file in $files; do
    (
        test=$(grep -woh "$words" "$file" | sort -u | awk '{print}' ORS=' ')
        if [ "$test" != "" ] ; then
            echo "$(realpath $file) contains : $test"
        fi
    )&
done
wait
I have tried with xargs and -exec, but with no result. I have to keep this result format:
/usr/include/c++/6/bits/stl_set.h contains : default for if void
Maybe you can help me optimize it.
EDIT: I have to keep one occurrence of each word
YES: while, for, volatile...
NOPE: while, for, for, volatile...
If you are interested in finding all files that have at least one match of any of your patterns, you can use globstar:
shopt -s globstar
oldIFS=$IFS; IFS='|'; patterns="$*"; IFS=$oldIFS # make a | delimited string from arguments
grep -lwE "$patterns" **/*.c **/*.h **/*.cpp # list files with matching patterns
globstar
If set, the pattern ‘**’ used in a filename expansion context
will match all files and zero or more directories and subdirectories.
If the pattern is followed by a ‘/’, only directories and
subdirectories match.
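A possible way to use it, assuming the globstar snippet above is saved as a script called findwords.sh (a hypothetical name) and run from the top of the source tree:
# prints only the names of files containing at least one of the words
./findwords.sh float short unsigned continue for signed void \
    default goto sizeof volatile do if static while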
Here is an approach that keeps your desired format while eliminating the use of find and bash looping:
words='float|short|unsigned|continue|for|signed|void|default|goto|sizeof|volatile|do|if|static|while'
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path | awk -F: '$1!=last{printf "%s%s: contains %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
How it works
grep -rwoE --include '*.[ch]' --include '*.cpp' "$words" path
This searches recursively through directories starting with path looking only in files whose names match the globs *.[ch] or *.cpp.
awk -F: '$1!=last{printf "%s%s: contains %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]} $1==last && !($2 in a){printf " %s",$2; a[$2]} END{print""}'
This awk command reformats the output of grep to match your desired output. The script uses a variable last and array a. last keeps track of which file we are on and a contains the list of words seen so far. In more detail:
-F:
This tells awk to use : as the field separator. In this way, the first field is the file name and the second is the word that is found. (limitation: file names that include : are not supported.)
'$1!=last{printf "%s%s: contains %s",r,$1,$2; last=$1; r=ORS; delete a; a[$2]}
Every time that the file name, $1, does not match the variable last, we start the output for a new file. Then, we update last to contain the name of this new file. We then delete array a and then assign key $2 to a new array a.
$1==last && !($2 in a){printf " %s",$2; a[$2]}
If the current file name is the same as the previous and the current word has not been seen before, we print out the new word found. We also add this word, $2 as a key to array a.
END{print""}
This prints out a final newline (record separator) character.
Multiline version of code
For those who prefer their code spread out over multiple lines:
grep -rwoE \
--include '*.[ch]' \
--include '*.cpp' \
"$words" path |
awk -F: '
$1!=last{
    printf "%s%s: contains %s",r,$1,$2
    last=$1
    r=ORS
    delete a
    a[$2]
}
$1==last && !($2 in a){
    printf " %s",$2; a[$2]
}
END{
    print""
}'
You should be able to do most of this with a single grep command:
grep -Rw $dir --include \*.c --include \*.h --include \*.cpp -oe "$words"
This will put it in file:word format, so all that's left is to change it to produce the output that you want.
echo $output | sed 's/:/ /g' | awk '{print $1 " contains : " $2}'
Then you can add | sort -u to get only single occurrences for each word in each file.
#!/bin/bash
#dir=.
words=$(echo $@ | sed 's/ /\\|/g')
grep -Rw $dir --include \*.c --include \*.h --include \*.cpp -oe "$words" \
    | sort -u \
    | sed 's/:/ /g' \
    | awk '{print $1 " contains : " $2}'

Creating directories from list preserving whitespaces

I have a list of names in a file from which I need to create directories. The list looks like
Ada Lovelace
Jean Bartik
Leah Culver
I need the folder names to be exactly the same, preserving the whitespace. But with
awk '{print $0}' myfile | xargs mkdir
I create separate folders for each word
Ada
Lovelace
Jean
Bartik
Leah
Culver
Same happens with
awk '{print $1 " " $2}' myfile | xargs mkdir
Where is the error?
With GNU xargs you can use the -d option to set the delimiter to \n only. This way you can avoid awk as well.
xargs -d '\n' mkdir -p < file
If you don't have GNU xargs, you can use tr to convert all \n to \0 first:
tr '\n' '\0' < file | xargs -0 mkdir
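A plain while-read loop is another option that needs neither GNU xargs nor tr; a minimal sketch, assuming the list is in file:
while IFS= read -r name; do
    mkdir -p -- "$name"    # the quotes preserve the whitespace in each name
done < file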
@birgit: try the following, which is based entirely on your sample Input_file:
awk -vs1="\"" 'BEGIN{printf "mkdir ";}{printf("%s%s%s ",s1,$0,s1);} END{print ""}' Input_file | sh
awk '{ system ( sprintf( "mkdir \"%s\"", $0)) }' YourFile
# OR
awk '{ print"mkdir "\"" $0 "\"" | "/bin/sh" }' YourFile
# OR for 1 subshell
awk '{ Cmd = sprintf( "%s%smkdir \"%s\"", Cmd, (NR==1?"":"\n"), $0) } END { system ( Cmd ) }' YourFile
The last version is better, as it creates only one subshell.
If there is a huge number of folders (shell argument-length limitation), you could loop and build several smaller commands instead, as sketched below.
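A minimal sketch of that batching idea, assuming GNU xargs: -n 100 caps each mkdir invocation at 100 directory names so no single command line grows too large.
xargs -d '\n' -n 100 mkdir -p < file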

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in each file without using any external processes and then renames them as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use the rename command:
for file in *.txt; do
    lines=$(wc ${file} | awk {'print $1'});
    rename s/$/${lines}/ ${file}
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
    lines=$(wc $file | awk {'print $1'});
    extension="${file##*.}"
    filename="${file%.*}"
    mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
You can do it like this as well:
for file in "path_to_file"/'your_filename_pattern'
do
    lines=$(wc $file | awk {'print $1'})
    mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
    lines=$(wc $file | awk {'print $1'})
    mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
    mv "$i" ${i/.txt/}_$(wc $i | awk {'print $1'})_.txt;
done
Result would put the line numbers nicely before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
    [[ ! -f $file ]] && continue # skip over entries that are not regular files
    #
    # move file.txt to file.txt.N where N is the number of lines in file
    #
    # this naming convention has the advantage that if we run the loop again,
    # we will not reprocess the files which were processed earlier
    mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
    linecount[FILENAME] = FNR
    for (file in linecount) {
        newname = gensub(/\.[^\.]*$/, "-" linecount[file] "&", 1, file)
        q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
        print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
    }
}
Save the above awk script in a file, say wcmv.awk, then run it like:
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them, pipe the output to a shell as follows.
awk -f wcmv.awk *.txt | sh
As with all irreversible batch operations, be careful and execute the commands only if they look okay.
awk '
BEGIN { for ( i=1; i<ARGC; i++ ) Files[ARGV[i]]=0 }
{ Files[FILENAME]++ }
END {
    for (file in Files) {
        # if ( file !~ "_" Files[file] ".txt$") {
        fileF=file; gsub( /\047/, "\047\"\047\"\047", fileF)
        fileT=fileF; sub( /\.txt$/, "_" Files[file] ".txt", fileT)
        system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
        # }
    }
}' *.txt
Another way with awk, which makes a second pass easier by allowing more control over the name (for example, skipping a file that already has the count in its name from a previous cycle).
Following a good remark by @gniourf_gniourf:
file names with spaces inside are now possible
the tiny code is now rather heavy for such a small task
