Applying awk pattern to all files with same name, outputting each to a new file - bash

I'm trying to recursively find all files with a given name in a directory, apply an awk pattern to them, and then output an updated version of each file into the directory where that file lives.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but also not sure how to do that using awk.
Also, is there a way to remove "(cor*)" where the * represents a wildcard? Not sure how to do that while keeping the escape sequence for the parentheses.
Thanks!

To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory has whitespace or other shell-active characters in it, the results will be an unpleasant surprise.
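A safer pattern, if you do want a loop, is to have find emit NUL-delimited names and read them in bash (a sketch, assuming bash and a find that supports -print0):
# sketch: loop safely over every FILENAME.txt, writing newFILENAME.txt next to each file
find . -name FILENAME.txt -print0 |
while IFS= read -r -d '' f; do
    awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}' "$f" > "$(dirname "$f")/newFILENAME.txt"
done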
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4

Related

filename group by a pattern and select only one from each group

I have the following files (as an example; 60000+ actually) and all the log files follow this pattern:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008795-84866-201911261249.log
analyse-ABC008795-84867-201911261249.log
analyse-ABC008795-84868-201911261249.log
analyse-ABC008795-84869-201911261249.log
analyse-ABC008796-84870-201911261249.log
analyse-ABC008796-84871-201911261249.log
analyse-ABC008796-84872-201911261249.log
analyse-ABC008796-84873-201911261249.log
Only the numbers change across the log files. I want to take one file from each category, where files are categorized by the ABC... number. So, as you can see, there are only two categories here:
analyse-ABC008795
analyse-ABC008796
So, what I want is one file (let's say the first file) from each category. The output should look like this:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008796-84870-201911261249.log
This should be done in a Bash/Linux environment, so that after I get this list, I can use grep to check whether my "searching string" is contained in those files:
ls -l | <what should I do to group and get one file from each category> | grep "searching string"
With bash and awk.
files=(*.log)
printf '%s\n' "${files[@]}" | awk -F- '!seen[$2]++'
Or use find instead of a bash array for a more portable approach.
find . -type f -name '*.log' | awk -F- '!seen[$2]++'
If your find has the -printf option and you don't want the leading ./ on the filenames, add this before the pipe |:
-printf '%f\n'
The !seen[$2]++ keeps only the first line for each distinct second field, without having to sort the input first. $2 is the second field produced by the -F- separator, i.e. the ABC... number.
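If the end goal is the grep step from the question, the same one-per-category list can be fed straight to grep (a sketch; "searching string" is a placeholder, and xargs is safe here only because these log file names contain no whitespace):
# sketch: take one file per ABC... category, then list which of those files contain the string
printf '%s\n' *.log | awk -F- '!seen[$2]++' | xargs grep -l "searching string"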

Counting number of occurrences in several files

I want to check the number of occurrences of, let's say, the character '[', recursively in all the files of a directory that have the same extension, e.g. *.c. I am working with the Solaris OS (a Unix).
I tried some solutions that are given in other posts, and the only one that works is this one, since with this OS I cannot use the command grep -o:
sed 's/[^x]//g' filename | tr -d '\012' | wc -c
where x is the character I want to count. This one works, but it's not recursive; is there any way to make it recursive?
You can get a recursive listing from find and execute commands with its -exec argument.
I'd suggest like:
find . -name '*.c' -exec cat {} \; | tr -c -d '[' | wc -c
The -c argument to tr means to use the complement of the set supplied -- i.e. in this case, everything but [ -- which -d then deletes, leaving only the [ characters for wc -c to count.
The . in the find command means to search in the current directory, but you can supply any other directory name there as well.
I hope you have nawk installed. Then you can just:
nawk '{a+=gsub(/\[/,"x")}END{print a}' /path/*
You can also write a small awk snippet yourself. I suggest running the following:
awk '{for (i=1; i<=length($0); i++) if (substr($0,i,1)=="[") n++} END{print n+0}' *.c
This counts every occurrence of "[" in all the .c files in the current directory and prints the total.
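To make any of these fully recursive, one option (a sketch, assuming a find that supports -exec ... + and that nawk is available) is to let find feed every .c file through cat and count once:
# sketch: recursive total of '[' across all .c files under the current directory
find . -name '*.c' -exec cat {} + | nawk '{n += gsub(/\[/, "x")} END {print n+0}'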

Find files that contain string match1 but does not contain match2

I am writing a shell script to find files which contain the string "match1" AND do not contain "match2".
I can do this in 2 parts:
grep -lr "match1" * > /tmp/match1
grep -Lr "match2" * > /tmp/match2
comm -12 /tmp/match1 /tmp/match2
Is there a way I can achieve this directly without going through the process of creating temporary files ?
With bash's process substitution:
comm -12 <(grep -lr "match1" *) <(grep -Lr "match2" *)
Using GNU awk for its multi-char RS (RS='^$' slurps the whole file into a single record, so both regex tests apply to the entire file at once):
awk -v RS='^$' '/match1/ && !/match2/ {print FILENAME}' *
I would use find together with awk. awk can check both matches in a single run, meaning you don't need to process all the files twice:
find -maxdepth 1 -type f -exec awk '/match1/{m1=1}/match2/{m2=1} END {if(m1 && !m2){print FILENAME}}' {} \;
Better explained in multiline version:
# Set flag if match1 occurs
/match1/{m1=1}
# Set flag if match2 occurs
/match2/{m2=1}
# After all lines of the file have been processed print the
# filename if match1 has been found and match2 has not been found.
END {if(m1 && !m2){print FILENAME}}
Is there a way I can achieve this directly without going through the process of creating temporary files ?
Yes. You can use pipelines and xargs:
grep -lr "match1" * | xargs grep -Lr "match2"
The first grep prints the names of files containing matches to its standard output, as you know. The xargs command reads those file names from its standard input, and converts them into arguments to the second grep command, appending them after the ones already provided.
You can initially search for the files containing match1 and then, using xargs, pass them to another grep with the -L or --files-without-match option.
grep -lr "match1" * | xargs grep -L "match2"
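If the filenames may contain spaces, a NUL-delimited variant avoids the word-splitting problem (a sketch, assuming GNU grep and xargs):
# sketch: -Z and -0 keep filenames intact even with spaces or newlines in them
grep -lrZ "match1" . | xargs -0 grep -L "match2"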

Pad/Fill missing columns in CSV file (using tabs)

I have some CSV files with TAB as the separator. The lines have a variable number of columns and I want to normalize that.
I need exactly, say, 10 columns, so effectively I want to add empty columns up to the 10th column in case a line has fewer.
Also, I would like to loop over all files in a folder and update each file in place, rather than just printing the output or writing to a new file.
I can manage to do it with commas like this:
awk -F, '{$10=""}1' OFS=',' file.txt
But when I change it to \t it breaks and adds too many columns:
awk -F, '{$10=""}1' OFS='\t' file.txt
Any inputs?
If you have GNU awk (sometimes called gawk), this will make sure that you have ten columns, and it won't erase the tenth if it is already there:
awk -F'\t' -v OFS='\t' '{NF=10}1' file >file.tmp && mv file.tmp file
Awk users value brevity and a further simplification, as suggested by JID, is possible. Since, under awk, NF=10 evaluates to true, we can set NF to 10 at the same time that we cause the line to be printed:
awk -F'\t' -v OFS='\t' 'NF=10' file >file.tmp && mv file.tmp file
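To see the padding at work, here is a quick check (a sketch, assuming GNU awk and using cat -A to make the tabs visible as ^I):
$ printf 'a\tb\n' | awk -F'\t' -v OFS='\t' 'NF=10' | cat -A
a^Ib^I^I^I^I^I^I^I^I$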
MacOS: On a Mac, the default awk is BSD but GNU awk (gawk) can be installed using brew install gawk.
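To also cover the "loop all files in a folder and update the corresponding file" part of the question, the same command can be wrapped in a loop (a sketch; /YourFolder stands in for the target directory and GNU awk is assumed):
# sketch: normalize every .csv in the folder to 10 tab-separated columns, in place
for f in /YourFolder/*.csv; do
    awk -F'\t' -v OFS='\t' '{NF=10}1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done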
find /YourFolder -name "*.csv" -exec sed -i 's/$/\t\t\t\t\t\t\t\t\t/;s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/' {} \;
The find takes all your CSV files.
The sed:
-i edits in place, avoiding a temporary file
it appends 9 tabs to each line, then keeps only the first 10 elements (separated by 9 tabs)
A version that only changes lines that are not compliant:
find /YourFolder -name "*.csv" -exec sed -i '/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
s/$/\t\t\t\t\t\t\t\t\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
}' {} \;
To auto-adapt the column number:
# change the 2 occurrences of "9" to the wanted number of columns minus 1
find /YourFolder -name "*.csv" -exec sed -i ':cycle
/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
# optimize with the number of \t on the line below
s/$/\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
b cycle
}' {} \;
You can optimize for your case by adding several \t instead of 1 per cycle (the best value is roughly the average number of missing columns, assuming a normal distribution).

Bash/Shell - paths with spaces messing things up

I have a bash/shell function that is supposed to find files and then awk/copy the first file it finds to another directory. Unfortunately, if the directory that contains the file has spaces in its name, the whole thing fails, since the path gets truncated for some reason or another. How do I fix it?
If file.txt is in /path/to/search/spaces are bad/ it fails.
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
cp: /path/to/search/spaces: No such file or directory
If file.txt is in /path/to/search/spacesarebad/ it works, but notice there are no spaces. :-/
Awk's default separator is white space. Simply change it to something else by doing:
awk -F"\t" ...
Your script should look like:
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -F"\t" -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
As pointed out in the comments, you don't really need all those steps; you could actually simply do (one-liner):
dir=/path/to/destination/ && path="$(find /path/to/search -name file.txt | head -n 1)" && cp "$path" "$dir"
Formatted code (that may look better, in this case):
dir=/path/to/destination/
path="$(find /path/to/search -name file.txt | head -n 1)"
cp "$path" "$dir"
The "" are used to assign the entire content of the string to the variable, causing the separator IFS, which is a white space by default, not to be considered over the string.
If you think spaces are bad, wait till you get into trouble with newlines. Consider for example:
mkdir spaces\ are\ bad
touch spaces\ are\ bad/file.txt
mkdir newlines$'\n'are$'\n'even$'\n'worse
touch newlines$'\n'are$'\n'even$'\n'worse/file.txt
And:
find . -name file.txt
The head command assumes newline-delimited input. You can get around the space and newline issue with GNU find and GNU grep (other implementations may work too) by using \0 delimiters:
find . -name file.txt -print0 | grep -zm1 . | xargs -0 cp -t "$dir"
You could try this.
awk '{print substr($0, index($0,$9))}'
For example this is the output of ls command:
-rw-r--r--. 1 root root 73834496 Dec 6 10:55 File with spaces 2
If you use simple awk like this
# awk '{print $9}'
It returns only
# File
If used with the full command
# awk '{print substr($0, index($0,$9))}'
I get the whole output
File with spaces 2
Here
substr(s, a, b): returns b characters from string s, starting at position a. The parameter b is optional.
For example if the match is addr:192.168.1.133 and you use substr as follows
# awk '{print substr($2,6)}'
You get the IP, i.e. 192.168.1.133. Note that 6 is the position of the first character after addr:, counting from the a in addr.
So in the actual command, $0 is used in place of $2 (which refers to the whole line), and index($0,$9) finds where field 9 starts, so everything from column 9 to the end of the line is printed. You can change that to index($0,$8) and see that the output changes to
# 10:55 File with spaces 2
index(IN, FIND)
This searches the string IN for the first occurrence of the string FIND, and returns the position in characters where that occurrence begins in the string IN.
I hope it helps. Moreover, if you are assigning this value to a variable in a script, you need to enclose the variable in double quotes. Otherwise you will get errors if you do some other operation on the extracted file name.
