Input folder / output folder for each file in AWK [duplicate] - shell

This question already has answers here:
Redirecting stdout with find -exec and without creating new shell
(3 answers)
Closed last month.
I am trying to run (several) awk scripts over a list of files and would like each input file to produce its own output file in a different folder. I have already tried several ways but cannot find the solution. The output in the output folder is always a single file called {} which contains the content of all files from the input folder.
Here is my code:
input_folder="/path/to/input"
output_folder="/path/to/output"
find $input_folder -type f -exec awk '! /rrsig/ && ! /dskey/ {print $1,";",$5}' {} >> $output_folder/{} \;
Can you please give me a hint as to what I am doing wrong?
The code is called in a .sh script.

I'd probably opt for a (slightly) less complicated find | xargs, eg:
find "${input_folder}" -type f | xargs -r \
awk -v outd="${output_folder}" '
FNR==1 { close(outd "/" outf); outf=FILENAME; sub(/.*\//,"",outf) }
! /rrsig/ && ! /dskey/ { print $1,";",$5 > (outd "/" outf) }'
NOTE: the commas in $1,";",$5 will insert spaces between $1, ; and $5; if the spaces are not desired then use $1 ";" $5 (i.e., remove the commas)
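For context on why the original command misbehaves: the redirection >> $output_folder/{} is processed by the calling shell before find ever runs, so find's -exec never sees it and all output ends up in one literal file named {} inside the output folder. If you would rather stay with -exec than switch to xargs, one possible workaround (a sketch of mine, not from the answer above) is to start a small shell per file so the redirection is evaluated after {} has been substituted:
find "$input_folder" -type f -exec sh -c '
    out="$2/$(basename "$1")"
    awk '\''! /rrsig/ && ! /dskey/ {print $1,";",$5}'\'' "$1" > "$out"
' sh {} "$output_folder" \;
This spawns one shell and one awk per file, so it is slower than the xargs version above, but it keeps the per-file redirection explicit.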

Related

Extracting Number From a File

I'm trying to write a script (with bash) that looks for a word (for example "SOME(X) WORD:") and prints the rest of the line, which is effectively some numbers with "-" in front. To clarify, an example line that I'm looking for in a file is:
SOME(X) WORD: -1.0475392439 ANOTHER.W= -0.0590214433
I want to extract the number after "SOME(X) WORD:", so "-1.0475392439" for this example. I have a similar script to this which extracts the number from the following line (both lines are from the same input file)
A-DESIRED RESULT W( WORD) = -9.68765465413
And the script for this is:
local output="$1"
local ext="log"
local word="W( WORD)"
cd $dir
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0,ptn) {print $NF,FILENAME}' {} +
But when I change the local word variable from "W( WORD)" to "SOME(X) WORD", it captures "-0.0590214433" instead of "-1.0475392439", meaning it takes the last number on the line. How can I find a solution to this? Thanks in advance!
As you have seen, print $NF outputs the last field of the line. Please modify the find line as follows:
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0, ptn) {if (match($0, /-[0-9]+\.[0-9]+/)) print substr($0, RSTART, RLENGTH), FILENAME}' {} +
Then it will output the first number in the line.
Please note it assumes the number always starts with the - sign.
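If the number could also appear without a leading minus sign, a slightly looser regex (my own variation, not part of the original answer) makes the sign optional:
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0, ptn) {if (match($0, /-?[0-9]+\.[0-9]+/)) print substr($0, RSTART, RLENGTH), FILENAME}' {} +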

delete all but the last match

I want to delete all but the last match of a set of files matching file* that are present in each folder within a directory.
For example:
Folder 1
file
file_1-1
file_1-2
file_2-1
stuff.txt
stuff
Folder 2
file_1-1
file_1-2
file_1-3
file_2-1
file_2-2
stuff.txt
Folder 3
...
and so on. Within every subfolder I want to keep only the last of the matched files, so for Folder 1 this would be file_2-1, in Folder 2 it would be file_2-2. The number of files is generally different within each subfolder.
Since I have a deeply nested folder structure, I thought about using the find command, somehow like this:
find . -type f -name "file*" -delete_all_but_last_match
I know how to delete all matches but not how to exclude the last match.
I also found the following piece of code:
https://askubuntu.com/questions/1139051/how-to-delete-all-but-x-last-items-from-find
but when I apply a modified version to a test folder
find . -type f -name "file*" -print0 | head -zn-1 | xargs -0 rm -rf
it deletes all the matches in most cases; only in some is the last file spared. So it does not work for me, presumably because of the different number of files in each folder.
Edit:
The folders contain no further subfolders, but they are generally at the end of several subfolder levels. It would therefore be a benefit if the script could be executed from some levels above as well.
#!/bin/bash
shopt -s globstar                  # let **/ match directories at any depth
for dir in **/; do
    files=("$dir"file*)            # the glob expands in sorted order, so the last element is the last match
    unset 'files[-1]'              # drop the last match (requires bash 4.3+)
    rm -- "${files[@]}"            # delete the rest
done
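One caveat the script above does not cover (my addition, not part of the original answer): in a directory with no file* match, or exactly one match, the loop ends up calling rm with no operands and prints an error. Enabling nullglob and guarding the element count keeps the loop quiet in both cases:
#!/bin/bash
shopt -s globstar nullglob              # nullglob: an unmatched glob expands to nothing
for dir in **/; do
    files=("$dir"file*)
    (( ${#files[@]} > 1 )) || continue  # nothing to delete if there are no matches, or only one
    unset 'files[-1]'                   # keep the lexicographically last match
    rm -- "${files[@]}"
done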
Try the following solution utilising awk and xargs:
find . -type f -name "file*" | awk -F/ '{ map1[$(NF-1)]++;map[$(NF-1)][map1[$(NF-1)]]=$0 }END { for ( i in map ) { for (j=1;j<=(map1[i]-1);j++) { print "\""map[i][j]"\"" } } }' | xargs rm
Explanation:
find . -type f -name "file*" | awk -F/ '{      # Set the field delimiter to "/" in awk
  map1[$(NF-1)]++;                             # Array map1: index is the sub-directory, value is a running count of the files seen in it
  map[$(NF-1)][map1[$(NF-1)]]=$0               # Two-dimensional array map: first index the sub-directory, second the file count; the value is the full path
}
END {
  for ( i in map ) {
    for (j=1;j<=(map1[i]-1);j++) {
      print "\""map[i][j]"\""                  # For every sub-directory, print each file except the last one, quoted
    }
  }
}' | xargs rm                                  # Pass the result through xargs rm to delete those files
Remove the pipe to xargs to verify that the files are listed as expected before adding it back in to actually remove the files.
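For example, the dry-run form is the same pipeline with the final stage dropped (note that the true two-dimensional array map[i][j] needs GNU awk 4.0 or later):
find . -type f -name "file*" | awk -F/ '{ map1[$(NF-1)]++; map[$(NF-1)][map1[$(NF-1)]]=$0 } END { for (i in map) for (j=1; j<map1[i]; j++) print "\"" map[i][j] "\"" }'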

Find two strings in a list of files and get filename [duplicate]

This question already has answers here:
Find files containing multiple strings
(6 answers)
Closed 4 years ago.
I have the following files:
100005.txt 107984.txt 116095.txt 124152.txt 133339.txt 139345.txt 18147.txt 25750.txt 32647.txt 40390.txt 48979.txt 56502.txt 64234.txt 72964.txt 80311.txt 888.txt 95969.txt
100176.txt 108084.txt 116194.txt 124321.txt 133435.txt 139438.txt 18331.txt 25940.txt 32726.txt 40489.txt 49080.txt 56506.txt 64323.txt 73063.txt 80481.txt 88958.txt 9601.txt
100347.txt 108255.txt 116378.txt 124494.txt 133531.txt 139976.txt 18420.txt 26034.txt 32814.txt 40589.txt 49082.txt 56596.txt 64414.txt 73163.txt 80580.txt 89128.txt 96058.txt
100447.txt 108343.txt 116467.txt 124594.txt 133627.txt 140519.txt 18509.txt 26128.txt 32903.txt 40854.txt 49254.txt 56768.txt 64418.txt 73498.txt 80616.txt 89228.txt 96148.txt
100617.txt 108432.txt 11647.txt 124766.txt 133728.txt 14053.txt 1866.txt 26227.txt 32993.txt 41026.txt 49308.txt 56857.txt 6449.txt 73670.txt 80704.txt 89400.txt 96239.txt
10071.txt 108521.txt 116556.txt 124854.txt 133830.txt 141062.txt 18770.txt 26327.txt 33093.txt 41197.txt 49387.txt 57029.txt 64508.txt 7377.txt 80791.txt 89500.txt 96335.txt
100788.txt 10897.txt 116746.txt 124943.txt 133866.txt 141630.txt 18960.txt 2646.txt 33194.txt 41296.txt 4971.txt 57128.txt 64680.txt 73841.txt 80880.txt 89504.txt 96436.txt
Some of the files look like:
spec:
  annotations:
    name: "ubuntu4"
    labels:
      key: "cont_name"
      value: "ubuntuContainer4"
    labels:
      key: "cont_service"
      value: "UbuntuService4"
  task:
    container:
      image: "ubuntu:latest"
      args: "tail"
      args: "-f"
      args: "/dev/null"
      mounts:
        source: "/home/testVolume"
        target: "/opt"
  replicated:
    replicas: 1
I want to get every filename that contains ubuntu AND replicas.
I have tried awk '/ubuntu/ && /replicas/{print FILENAME}' *.txt but it doesn't seem to work for me.
Any ideas on how to fix this?
Grep can return a list of the files that match a string. You can nest that grep call so that you first get a list of files that match ubuntu, then use that list of files to get a list of files that match replicas.
grep -l replicas $( grep -l ubuntu *.txt )
This does assume that at least one file will match ubuntu. To get around that limitation, you can add a test for the existence of one file first, and then do the combined search:
grep -q ubuntu *.txt && grep -l replicas $( grep -l ubuntu *.txt )
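If any of the file names could contain whitespace, the unquoted command substitution would split them; a null-delimited hand-off (a variation of mine, assuming GNU grep for -Z/--null and GNU xargs for -0 and -r) avoids that:
grep -lZ ubuntu -- *.txt | xargs -r0 grep -l replicas --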
Check if both strings appear in a given file by using a counter for each and then checking whether both were incremented. You can do this with BEGINFILE, available in GNU awk:
awk 'BEGINFILE {ub=0; re=0}
/ubuntu/ {ub++}
/replicas/ {re++}
(ub>0 && re>0) {print FILENAME; nextfile}' *.txt
This sets two counters to 0 when it starts to read a file, one for each string. When one of the patterns is found, it increments its corresponding counter. Then it keeps checking whether both counters have been incremented. If so, it prints the current file name, held in the FILENAME variable. Also, it skips the rest of the file using nextfile, since there is no need to continue checking for the patterns.
awk '/ubuntu/ && /replicas/{print FILENAME}' *.txt
looks for both regexps on the same line. To find them both in the same file, but possibly on separate lines, with GNU awk (for ENDFILE):
awk '/ubuntu/{u=1} /replicas/{r=1} ENDFILE{if (u && r) print FILENAME; u=r=0}' *.txt
or, more efficiently, adding gawk's nextfile construct and preferably switching to BEGINFILE (as @fedorqui already showed) instead of ENDFILE, since all that remains to do between file reads is to set the 2 variables:
awk 'BEGINFILE{u=r=0} /ubuntu/{u=1} /replicas/{r=1} u && r{print FILENAME; nextfile}' *.txt
With other awks it'd be:
awk '
FNR==1{prt()} /ubuntu/{u=1} /replicas/{r=1} END{prt()}
function prt() {if (u && r) print fname; fname=FILENAME; u=r=0}
' *.txt
If no subdirs have to be visited:
for f in *.txt
do
    grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && echo "found: $f"
done
or as a one-liner:
for f in *.txt ; do grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && echo "found: $f" ; done
The -q makes grep quiet, so the matches aren't displayed; the -m1 stops after the first match, so grep can report a match quickly.
The && is short circuiting, so if the first grep doesn't find anything, the second isn't tried.
For working on the files further down the pipeline, you will of course eliminate the chatty "found: ".
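For instance, a minimal sketch of that, where the final wc -l is only a placeholder for whatever the next stage really is:
for f in *.txt
do
    grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && printf '%s\n' "$f"
done | xargs -r wc -l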

bash script reading lines in every file copying specific values to newfile

I want to write a script helping me to do my work.
Problem: I have many data files in one directory, and I need specific values from every file copied into a new file.
The data files can look like this:
Name abc $desV0
Start MJD56669 opCMS v2
End MJD56670 opCMS v2
...
valueX 0.0456 RV_gB
...
valueY 12063.23434 RV_gA
...
What the script should do is copy valueX and the value following it, and also valueY and the value following it, into a new file on one line, and then add the name of the source data file to that line. Additionally, the value of valueY should only contain everything before the dot.
The result should look like this:
valueX 0.0456 valueY 12063 name_of_sourcefile
I am so far:
for file in $(find -maxdepth 0 -type f -name *.wt); do
for line in $(cat $file | grep -F vb); do
cp $line >> file_done
done
done
But that doesn't work at all. I also have no idea how to get the data in ONE line in the newfile.
Can anyone help me?
I think you can simplify your script a lot using awk:
awk '/valueX/{x=$2}/valueY/{print "valueX",x,"valueY",$2,FILENAME}' *.wt > file_done
This goes through every file in the current directory. When "valueX" is matched, the value is saved to the variable x. When "valueY" is matched, the line is printed.
This assumes that the line containing "valueX" always comes before the one containing "valueY". If that isn't a valid assumption, the script can easily be changed.
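For instance, one way to drop that ordering assumption (a sketch of my own, not from the answer above) is to buffer both values and print as soon as both have been seen in the current file:
awk 'FNR==1 { x = y = "" }
     /valueX/ { x = $2 }
     /valueY/ { y = $2 }
     x != "" && y != "" { printf "valueX %s valueY %d %s\n", x, y, FILENAME; x = y = "" }' *.wt > file_done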
To print only the integer part of "valueY", you can use printf instead of print:
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' *.wt > file_done
%d is the format specifier for an integer.
If your requirements are more complex and you need to use find, you should use -exec rather than looping through the results, to avoid problems with awkward file names:
find -maxdepth 1 -iname "5*.par" ! -iname "*_*" -exec \
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,"{}"}' '{}' \; > file_done
Don't fight. I'm really thankful for your help and especially the fast answers.
This is my final solution I think:
#!/bin/bash
for file in $(find * -maxdepth 1 -iname "5*.par" ! -iname "*_*"); do
    awk '/TASC/{x=$2}/START/{printf "TASC %s MJD %d %s\n",x,$2,FILENAME}' "$file"
done > mjd_vs_tasc
Many thanks again to you guys.
Try something like below:
egrep "valueX|valueY" *.wt | awk -v ORS=" " -F':| ' '{if (NR%2==0) {print $2, $3, $1} else {print $2, $3}}' > $file.new.txt

Better way of extracting data from file for comparison

Problem: Comparison of files from Pre-check status and Post-check status of a node for specific parameters.
With some help from the community, I have written the following solution, which extracts the information from the files in the pre and post directories based on the "Node-ID" (which happens to be unique and has to be extracted from the files as well). After extracting the data from the Pre/Post folders, I create folders based on the node-id and dump the extracted files into them.
My Code to extract data (The data is extracted from Pre and Post folders)
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ##Generate the node-id
echo "Extracting Post check information for " $NODE
mkdir temp/$NODE-post ## create a temp directory
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck on comparing the $NODE-pre and $NODE-post files.
I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data extraction program very slow. I believe it's not the best possible way (using least resources) to do so. Any suggestions?
Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name '*.log')
do
NODE=$(sed -n '/>/{ s/ .*//; s/>//g; p; q; }' "$f") ##Generate the node-id
echo "Extracting Post check information for $NODE"
mkdir temp/"$NODE"-post
awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
'BEGIN { RS=NODE"> " }
/^param1/ { param1 = $0 }
/^param2/ { param2 = $0 }
/^param3/ { param3 = $0 }
END {
print RS param1 > DIR "/param1.txt"
print RS param2 > DIR "/param2.txt"
print RS param3 > DIR "/param3.txt"
}' "$f"
done
The NODE finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables, NODE and the directory name. The BEGIN is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
then : No difference
else diff temp/$NODE-pre/$file temp/$NODE-post/$file
fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
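A sketch of that outer loop (assuming the node IDs can be recovered from the temp/*-pre directory names created earlier):
for pre in temp/*-pre
do
    node=$(basename "$pre" -pre)        # strip the -pre suffix to get the node ID
    post="temp/$node-post"
    if [ -d "$post" ]
    then diff -r "$pre" "$post"
    else echo "No post-check data for $node" >&2
    fi
done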
