Using awk to manipulate data from two sources - bash

As part of the CI/CD process in my team, I want to generate a dynamic command script from a file containing paths to some resources.
The file paths.txt contains the paths, separated by newlines. For every line in this file a command should be generated, unless the line starts with "JarPath/..."
example:
JarPath/DontTouchMe.jar
path/to/some/resource/View/PutMeInScript.msgflow
path/to/some/resource/Control/MeAlso.map
The file mapping.txt contains key-value pairs. The key is a phrase to be matched against a path from paths.txt, and its value is required for the generated command.
example:
View viewEG.bar
Control controlEG.bar
Lines in paths.txt are not sorted, and several paths can match a single key in mapping.txt.
Only the first key in mapping.txt that matches the earliest possible component of the path should be considered. I don't care if a later line in mapping.txt also matches, nor if a later directory in the path matches another line.
The to-be-matched component of the path is not at a fixed location (e.g. after the 4th "/").
Final result in the script file should be:
mqsicreatebar -data ./ -b viewEG.bar -o /path/to/some/resource/View/PutMeInScript.msgflow
mqsicreatebar -data ./ -b controlEG.bar -o /path/to/some/resource/Control/MeAlso.map
Since the command takes data from two sources (paths.txt and a value pair from mapping.txt), I couldn't wrap it into a single awk command, nor pipe it into a single bash line. I wrote:
pathVar="paths.txt"
touch deltaFile.txt
while IFS= read -r line
do
awk -v var="$line" 'var ~ $1 && var !~ /^JarPath/ {print $2, " ", var; exit}' mapping.txt >> deltaFile.txt
done < "$pathVar"
IFS=$'\n'
awk '{print "mqsicreatebar -data ./ -b", $1, "-o", $2 }' deltaFile.txt > script.sh
Well, it works, but is there a better way to do this?

Given your comment below that only the first key in mapping.txt matching the earliest possible path component should be considered, and that the key directory can appear anywhere in the path, this is what you need:
$ cat tst.awk
NR==FNR {
    keys[++numKeys] = $1
    map[$1] = $2
    next
}
!/^JarPath/ {
    numDirs = split($0,dirs,"/")
    val = ""
    for (dirNr=1; (dirNr<=numDirs) && (val==""); dirNr++) {
        dir = dirs[dirNr]
        for (keyNr=1; (keyNr<=numKeys) && (val==""); keyNr++) {
            key = keys[keyNr]
            if (dir == key) {
                val = map[dir]
            }
        }
    }
    printf "mqsicreatebar -data ./ -b \047%s\047 -o \047%s\047\n", val, $0
}
$ awk -f tst.awk mapping.txt paths.txt
mqsicreatebar -data ./ -b 'viewEG.bar' -o 'path/to/some/resource/View/PutMeInScript.msgflow'
mqsicreatebar -data ./ -b 'controlEG.bar' -o 'path/to/some/resource/Control/MeAlso.map'
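If you want the result in a script file as in your original approach, redirecting the output is all that's needed (a small sketch; script.sh is the name used in your own loop):
$ awk -f tst.awk mapping.txt paths.txt > script.sh
$ chmod +x script.sh    # only needed if you plan to execute it directly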

Related

Bash split command to split line in comma separated values

I have a large file with 2000 hostnames and I want to create multiple files with 25 hosts per file, with the hosts separated by commas and the trailing comma removed.
Large.txt:
host1
host2
host3
.
.
host10000
The split command below creates multiple files (file1, file2, ...); however, the hosts are not comma-separated, and it's not the expected output.
split -d -l 25 large.txt file
The expected output is:
host1,host2,host3
You'll need to perform 2 separate operations ... 1) split the file and 2) reformat the files generated by split.
The first step is already done:
split -d -l 25 large.txt file
For the second step let's work with the results that are dumped into the first file by the basic split command:
$ cat file00
host1
host2
host3
...
host25
We want to pull these lines into a single line using a comma (,) as delimiter. For this example I'll use an awk solution:
$ cat file00 | awk '{ printf "%s%s", sep, $0 ; sep="," } END { print "" }'
host1,host2,host3...,host25
Where:
sep is initially undefined (aka empty string)
on each successive line processed by awk we set sep to a comma
the printf doesn't include a linefeed (\n) so each successive printf will append to the 'first' line of output
in the END block we print a final linefeed to terminate the single line of output (an equivalent paste one-liner is shown below)
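As an aside, if you don't need awk specifically, paste can do the same join in one step (a rough equivalent, assuming a paste that supports -s and -d, which the GNU and BSD versions do):
$ paste -sd, file00
host1,host2,host3...,host25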
It just so happens that split has an option to call a secondary script/code-snippet to allow for custom formatting of the output (generated by split); the option is --filter. A few issues to keep in mind:
the initial output from split is (effectively) piped as input to the command listed in the --filter option
it is necessary to escape (with a backslash) certain characters in the command (e.g., double quotes, dollar sign) so that they are not expanded by the invoking shell before split runs the filter
the --filter option automatically has access to the current split outfile name using the $FILE variable
Pulling everything together gives us:
$ split -d -l 25 --filter="awk '{ printf \"%s%s\", sep, \$0 ; sep=\",\" } END { print \"\" }' > \$FILE" large.txt file
$ cat file00
host1,host2,host3...,host25
Using the --filter option on GNU split:
split -d -l 25 --filter="(perl -ne 'chomp; print \",\" if \$i++; print'; echo) > \$FILE" large.txt file
You can use the bash code snippet below.
INPUT FILE
~$ cat domainlist.txt
domain1.com
domain2.com
domain3.com
domain4.com
domain5.com
domain6.com
domain7.com
domain8.com
Script
#!/usr/bin/env bash
FILE_NAME=domainlist.txt
LIMIT=4
OUTPUT_PREFIX=domain_
CMD="csplit ${FILE_NAME} ${LIMIT} {1} -f ${OUTPUT_PREFIX}"
eval ${CMD}
#=====#
for file in ${OUTPUT_PREFIX}*; do
echo $file
sed -i ':a;N;$!ba;s/\n/,/g' $file
done
OUTPUT
./mysplit.sh
36
48
12
domain_00
domain_01
domain_02
~$ cat domain_00
domain1.com,domain2.com,domain3.com
Change LIMIT, the OUTPUT_PREFIX file-name prefix, and the input file as per your requirements.
Using awk:
awk '
BEGIN { PREFIX = "file"; n = 0; }
{ hosts = hosts sep $0; sep = ","; }
function flush() { print hosts > (PREFIX n++); hosts = ""; sep = ""; }
NR % 25 == 0 { flush(); }
END { if (hosts != "") flush(); }
' large.txt
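Assuming the same large.txt as above, the files produced by this snippet are named file0, file1, ..., each holding one comma-joined line, for example:
$ cat file0
host1,host2,host3...,host25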
edit: improved comma separation handling stealing from markp-fuso's excellent answer :)

Split CSV into two files based on column matching values in an array in bash / posh

I have an input CSV that I would like to split into two CSV files. If the value of column 4 matches any value in the WLTarray it should go to output file 1; if it doesn't, it should go to output file 2.
WLTarray:
"22532" "79994" "18809" "21032"
input CSV file:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file1:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file2:
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
I've been looking at awk to filter this (Python and Perl are not an option in my environment), but I think there is probably a much smarter way:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?
done
With regular awk you can make use of split and substr (to handle double-quote removal for comparison) and split the csv file as you indicate. For example you can use:
awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
             split (s,a," ")  # split s into array a
             for (i in a)     # loop over each index in a
                 b[a[i]]=1    # use value in a as index for b
     }
     FNR == 1 {  # first record, write header to both output files
         print $0 > "output1.csv"
         print $0 > "output2.csv"
         next
     }
     substr($4,2,length($4)-2) in b {  # 4th field w/o quotes in b?
         print $0 > "output1.csv"      # write to output1.csv
         next
     }
     { print $0 > "output2.csv" }      # otherwise write to output2.csv
' input.csv
Where:
in the BEGIN {...} rule you set the field separator (FS) to break on commas, split the string containing your desired output1.csv field-4 values into the array a, and then loop over the values in a, using them as indexes into array b (to allow a simple "in b" membership check);
the first rule is applied to the first record in the file (the header line), which is simply written out to both output files;
the next rule removes the double quotes surrounding field 4 and then checks whether the number in field 4 matches an index in array b. If so, the record is written to output1.csv; otherwise it is written to output2.csv. A quick check of the quote-stripping is shown below.
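For instance, you can test that quote-stripping expression on its own (a quick sketch using one of the field-4 values):
$ echo '"22532"' | awk '{ print substr($1, 2, length($1) - 2) }'
22532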
Example Input File
$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
Resulting Output Files
$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
You can use gawk like this:
test.awk
#!/usr/bin/gawk -f
BEGIN {
    split("22532 79994 18809 21032", a)
    for (i in a) {
        WLTarray[a[i]]
    }
    FPAT="[^\",]+"
}
NR > 1 {
    if ($4 in WLTarray) {
        print >> "output1.csv"
    } else {
        print >> "output2.csv"
    }
}
Make it executable and run it like this:
chmod +x test.awk
./test.awk input.csv
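If you want to see what that FPAT setting extracts from one of your records, here is a quick check using the same field pattern:
$ echo '"83","6344324","585677","22532"' | gawk 'BEGIN { FPAT = "[^\",]+" } { print $4 }'
22532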
Using grep with a filter file as input was the simplest answer:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}"
do
awkstring="'\$4 == "\"\\\"$WLTvalue\\\"\"" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv

different ways to grep for two distinct strings, appearing in any order, or line, in the same file

I want to return all files that have the strings: "main(" as well as "foo".
This is like a multi-pattern OR grep, but with AND instead.
The best I've come up with is:
grep -rl . -e "main(" | while read fname; do grep -rl "$fname" -e "foo"; done
It does the job, but ideally I wouldn't have to write a bash script.
E.g.
text1.txt:
int main()
{
stuff....
}
foo
The grep command would return text1.txt since it contains both of the strings 'main(' and 'foo'.
Just use awk to match both patterns and print filenames:
awk 'FNR == 1 { m = f = 0 }             # reset flags at start of each file
     /main\(/ { ++m } /foo/ { ++f }     # set flags when patterns match
     m && f { print FILENAME; nextfile }' **/*
nextfile is a GNU awk extension which skips to the next file rather than the next line. With globstar enabled, ** expands recursively; globstar is not enabled by default in bash, but you can enable it with shopt -s globstar.
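For example, a minimal script sketch using the same awk program:
#!/usr/bin/env bash
shopt -s globstar   # ** does not recurse without this
awk 'FNR == 1 { m = f = 0 }
     /main\(/ { ++m } /foo/ { ++f }
     m && f { print FILENAME; nextfile }' **/*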
With non-GNU awk, you can use another flag to skip lines and avoid printing the filename multiple times:
awk 'FNR == 1 { m = f = p = 0 }         # reset flags at start of each file
     p { next }                         # skip lines once this filename has been printed
     /main\(/ { ++m } /foo/ { ++f }
     m && f { print FILENAME; ++p }' **/*
Try
grep -rlZ 'main(' | xargs -0 grep -l 'foo'
-Z, --null
Output a zero byte (the ASCII NUL character) instead of the character that normally follows a file
name. For example, grep -lZ outputs a zero byte after each file name instead of the usual newline.
This option makes the output unambiguous, even in the presence of file names containing unusual
characters like newlines. This option can be used with commands like find -print0, perl -0, sort -z,
and xargs -0 to process arbitrary file names, even those that contain newline characters.
The first grep prints all filenames containing main(, separated by the NUL character. xargs then passes those files to the second grep command, which prints the files containing foo.
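If you ever need a third required string (say a hypothetical bar), the same NUL-safe chaining extends naturally:
grep -rlZ 'main(' | xargs -0 grep -lZ 'foo' | xargs -0 grep -l 'bar'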
If the files are small enough and do not contain the NUL character:
grep -rlz 'main(.*foo\|foo.*main('
where -z uses NUL as the line separator, effectively slurping each whole file as a single line.

AWK - Unable to update the files

I have 300 txt files in my directory in the following format
regional_vol_WM_atlas[1-300].txt
651328 651328
553949 553949
307287 307287
2558 2558
The following awk script was supposed to create a new file by performing a calculation on the fourth row of each existing file in my directory.
#!/bin/bash
awk=/usr/bin/awk
awkcommand='
FNR == 1 {
newfilename = FILENAME ; sub(".txt", "_prop.txt", newfilename)
printf "" > newfilename
}
FNR == 4 {
$1=($1/0.824198)*0.8490061
$2=($2/0.824198)*0.8490061
}
{
print >> newfilename
}
'regional_vol_WM_atlas[0-9].txt regional_vol_WM_atlas[0-9][0-9].txt regional_vol_WM_atlas1[0-4][0-9].txt regional_vol_WM_atlas15[02].txt
Unfortunately I could not update any file in the directory; when I run the script, I get the following error:
dev#dev-OptiPlex-780:/media/dev/Daten/Task1/subject1/t1$ '/media/dev/Daten/Task1/subject1/t1/Method'
/media/dev/Daten/Task1/subject1/t1/Method: line 18: regional_vol_WM_atlas10.txt: command not found
Could you please point out where I am wrong?
Your script is not calling awk. It defines a variable named awk and then tries to execute the file regional_vol_WM_atlas10.txt with the variable awkcommand set in its environment. Alas, that file is not in your PATH, so bash cannot find it. You need to instead do:
awk "$awkcommand" file1 file2 ...
(where file1, file2, etc. are the input files you want to use as input.)
Also, note that your current script is appending the literal text regional_vol_WM_atlas[0-9].txt to the end of the awk command (or if a file exists which matches that glob, the name of that file is being appended), which you do not want. Overall, what you were trying to do should have been written:
#!/bin/bash
awkcommand='
FNR == 1 {
newfilename = FILENAME ; sub(".txt", "_prop.txt", newfilename)
printf "" > newfilename
}
FNR == 4 {
$1=($1/0.824198)*0.8490061
$2=($2/0.824198)*0.8490061
}
{
print >> newfilename
}
'
awk "$awkcommand" regional_vol_WM_atlas[0-9].txt \
regional_vol_WM_atlas[0-9][0-9].txt \
regional_vol_WM_atlas1[0-4][0-9].txt \
regional_vol_WM_atlas15[02].txt
The problem is that a variable can be assigned just for the duration of a single command, for example:
x='hello' some_command
That is in effect what bash thinks you are trying to do. The culprit is the whitespace, which acts as a command separator, so just escape (prefix with a \) the whitespace in the list of filenames:
#!/bin/bash
awk=/usr/bin/awk
awkcommand='
FNR == 1 {
newfilename = FILENAME ; sub(".txt", "_prop.txt", newfilename)
printf "" > newfilename
}
FNR == 4 {
$1=($1/0.824198)*0.8490061
$2=($2/0.824198)*0.8490061
}
{
print >> newfilename
}
'\ regional_vol_WM_atlas[0-9].txt\ regional_vol_WM_atlas[0-9][0-9].txt\ regional_vol_WM_atlas1[0-4][0-9].txt\ regional_vol_WM_atlas15[02].txt
The only thing I have altered is the final line.
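To see that one-command assignment mechanism in isolation (a small illustration with a hypothetical variable name, not part of the fix):
$ greeting=hello bash -c 'echo "$greeting"'   # set only in that command's environment
hello
$ echo "${greeting:-not set}"                 # the current shell never saw it
not set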

Better way of extracting data from file for comparison

Problem: Comparison of files from Pre-check status and Post-check status of a node for specific parameters.
With some help from the community, I have written the following solution, which extracts the information from files in the pre and post directories based on the "Node-ID" (which happens to be unique and is to be extracted from the files as well). After extracting the data from the pre/post folders, I create folders based on the node-id and dump the files into them.
My Code to extract data (The data is extracted from Pre and Post folders)
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ##Generate the node-id
echo "Extracting Post check information for " $NODE
mkdir temp/$NODE-post ## create a temp directory
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck comparing the $NODE-pre and $NODE-post files.
I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data extraction program very slow. I believe it's not the best possible way (using the least resources) to do so. Any suggestions?
Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name '*.log')
do
    NODE=$(sed -n '/>/{ s/ .*//; s/>//g; p; q; }' $f) ##Generate the node-id
    echo "Extracting Post check information for $NODE"
    mkdir temp/$NODE-post
    awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
        'BEGIN { RS=NODE"> " }
         /^param1/ { param1 = $0 }
         /^param2/ { param2 = $0 }
         /^param3/ { param3 = $0 }
         END {
             print RS param1 > (DIR "/param1.txt")
             print RS param2 > (DIR "/param2.txt")
             print RS param3 > (DIR "/param3.txt")
         }' $f
done
The NODE finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables, NODE and the directory name. The BEGIN block is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
then : No difference
else diff temp/$NODE-pre/$file temp/$NODE-post/$file
fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
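A rough sketch of that outer loop, assuming the node IDs can be recovered from the temp/*-pre directory names:
for predir in temp/*-pre
do
    NODE=${predir#temp/}    # strip the leading temp/
    NODE=${NODE%-pre}       # strip the trailing -pre
    for file in param1.txt param2.txt param3.txt
    do
        if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
        then : No difference
        else diff temp/$NODE-pre/$file temp/$NODE-post/$file
        fi
    done
done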
