AWK split for multiple delimiter lines - shell

I'm trying to split a file using an AWK one-liner, but the code below that I came up with is not working properly.
awk '
BEGIN { idx=0; file="original_file.split." }
/^REC_DELIMITER.(HIGH|TOP)$/ { idx++ }
/^REC_DELIMITER.TOP$/,/^REC_DELIMITER.(HIGH|TOP)$/ { print > file sprintf("%03d", idx) }
' original_file
Test file is "original_file":
REC_DELIMITER.TOP
lineA1
lineA2
lineA3
REC_DELIMITER.HIGH
lineB1
lineB2
lineB3
REC_DELIMITER.TOP
lineC1
lineC2
lineC3
REC_DELIMITER.HIGH
lineD1
lineD2
lineD3
The AWK code above is set up for REC_DELIMITER.TOP, and it gives me these files:
original_file.split.001:
REC_DELIMITER.TOP
original_file.split.003:
REC_DELIMITER.TOP
however, I'm trying to get this:
original_file.split.001:
REC_DELIMITER.TOP
lineA1
lineA2
lineA3
original_file.split.003:
REC_DELIMITER.TOP
lineC1
lineC2
lineC3
There will be other record delimiters, and when needed we can run the same thing for them, e.g. REC_DELIMITER.HIGH, this way getting files like the ones below:
original_file.split.002:
REC_DELIMITER.HIGH
lineB1
lineB2
lineB3
original_file.split.004:
REC_DELIMITER.HIGH
lineD1
lineD2
lineD3
Any help is very much appreciated. I have been trying to get this working for the past few days, and the AWK code above is the best I was able to come up with. I now need help from the AWK masters. :)
Thank you!

You can try something like this:
awk '
/REC_DELIMITER\.TOP/ {
    a = 1
    b = 0
    file = sprintf(FILENAME ".split.%03d", ++n)
}
/REC_DELIMITER\.HIGH/ {
    b = 1
    a = 0
    file = sprintf(FILENAME ".split.%03d", ++n)
}
a { print $0 > file }
b { print $0 > file }' file
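Since both flags guard the same print, the two delimiter rules can be folded into one; a condensed sketch with identical behavior (lines before the first delimiter are still skipped, because file is empty):
awk '
/REC_DELIMITER\.(TOP|HIGH)/ { file = sprintf(FILENAME ".split.%03d", ++n) }
file { print > file }
' file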

You need something like this (untested):
awk -v dtype="TOP" '
BEGIN { dbase = "^REC_DELIMITER\\."; delim = dbase dtype "$" }
$0 ~ dbase { inBlock=0 }
$0 ~ delim { inBlock=1; idx++ }
inBlock { print > sprintf("original_file.split.%03d", idx) }
' original_file
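To pull the HIGH blocks instead, the same script would presumably just be run with the other delimiter type:
awk -v dtype="HIGH" ' ...same program as above... ' original_file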

awk -vRS=REC_DELIMITER '/^.TOP\n/{print RS $0 > sprintf("original_file.split.%03d",n)};!++n' original_file
(Give or take an extra newline at the end.)
Generally, when input is supposed to be treated as a series of multi-line records with a special line as delimiter, the most direct approach is to set RS (and often ORS) to that delimiter.
Normally you'd want to add newlines to its beginning and/or end, but this case is a little special so it's easier without them.
Edited to add: You need GNU Awk for this. Standard Awk considers only the first character of RS.
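Spelled out over several lines with comments, the one-liner reads like this (same GNU Awk caveat):
awk -v RS=REC_DELIMITER '
/^.TOP\n/ {        # a record whose first line ended in ".TOP"
    print RS $0 > sprintf("original_file.split.%03d", n)   # re-attach the delimiter RS consumed
}
!++n               # bump n once per record; !n is never true, so this prints nothing
' original_file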

I made some changes so that the different delimiters go to their own files, even when they occur later in the file. Make a file like splitter.awk with the contents below, then chmod +x it and run it with ./splitter.awk original_file
#!/usr/bin/awk -f
BEGIN {
    idx = 0
    file = "original_file.split."
    out = ""
}
{
    if ($0 ~ /^REC_DELIMITER.(TOP|HIGH)/) {
        if (!cnt[$0]) {
            cnt[$0] = ++idx   # first occurrence of this delimiter gets a new file index
        }
        out = cnt[$0]
    }
    print > file sprintf("%03d", out)
}
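With the sample input above, this should yield two files: original_file.split.001 containing both REC_DELIMITER.TOP blocks (lineA*, lineC*) and original_file.split.002 containing both REC_DELIMITER.HIGH blocks (lineB*, lineD*).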

I'm not very used to AWK; however, plasticide's answer put me in the right direction and I finally got the AWK script working as required.
In the code below, the first IF turns echo to 0 when any delimiter is found. The second IF turns echo to 1 when a wanted delimiter is found, so only the wanted blocks are split out of the file.
I know the regex could be something like /^REC_(DELIMITER\.(TOP|HIGH|LOW)|NO_CATEGORY)$/, but since the regex is built dynamically by a shell script that reads the list of delimiters from a specific file, it will look more like the AWK below.
awk 'BEGIN {
    idx=0; echo=1; file="original_file.split."
}
{
    # All the delimiters to consider in the given file
    if ($0 ~ /^(REC_DELIMITER.TOP|REC_DELIMITER.HIGH|REC_DELIMITER.LOW|REC_NO_CATEGORY)$/) {
        echo=0
    }
    # Delimiters whose blocks should actually be pulled
    if ($0 ~ /^(REC_DELIMITER.HIGH|REC_DELIMITER.LOW)$/) {
        idx++; echo=1
    }
    # Print to the current file if we are inside a wanted block
    if (echo) {
        print > file idx
    }
}' original_file
Thank you all. I really appreciate it very much.

Related

awk or other shell to convert delimited list into a table

So what I have is a huge csv akin to this:
Pool1,Shard1,Event1,10
Pool1,Shard1,Event2,20
Pool1,Shard2,Event1,30
Pool1,Shard2,Event4,40
Pool2,Shard1,Event3,50
etc
This is not easily readable. With there being only 4 types of events, I'm using spreadsheets to convert it into the following:
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,
Only the events are limited to 4; pools and shards can really be indefinite. And events may be missing from the lines - not all pools/shards have all 4 events every day.
So I tried doing this with awk inside the shell script that gathers the csv in the first place, but I'm failing spectacularly; no working code can even be shown, since it produces zero results.
Basically I tried sorting the CSV, reading the first two fields of a row and comparing them to the previous row; if they match, I compare the third field to a set array of event strings and store the fourth field in a variable respective to the event, and once the first two fields stop matching I finally print the whole line including the variables.
Sorry for the one-liner - I was testing and experimenting directly on the command line. It's embarrassing; it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'
You could simply use an assoc array (note that the res[x][y] arrays-of-arrays syntax requires GNU Awk 4+): awk -F, -f parse.awk input.csv with parse.awk being:
{
    sub(/Event/, "", $3);
    res[$1","$2][$3] = $4;
}
END {
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}
The order of a for (name in res) loop is unspecified in awk, so output order may differ; my test output is:
Pool2,Shard1,,,50,
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
PS: Please use an editor to write your awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try to get it "right"... ;)
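If you need a deterministic order, GNU Awk can impose one on for-in loops via PROCINFO; e.g. the END block above could become:
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate indices in ascending string order
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}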
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
    if ( NR>1 ) {
        print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
        delete f
    }
    prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
$ sort file | awk -f tst.awk
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,

How to collate multiple files in AWK?

I am trying to collate a series of .csv log files that are named by date (e.g., 2019-02-24.csv). There are a bunch of them, so I'm trying to script the process. I've crafted an AWK script that combines individual files:
awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFICE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> usage_history.csv
But I am failing when I try to string the AWK commands together with a control loop in BASH:
for i in {01..28}; do echo "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
When I run this, it prints out the correct commands to the command line, but the awk scripts are not executed (they only get printed). If I run it without echo, I get errors telling me that the file doesn't exist; though all files are present:
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
What am I missing in my loop?
Here is a condensed sample of the command and the error messages:
$ for i in {01..02}; do "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-02.csv >> user_history.csv: No such file or directory
Could you please try the following:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-9]*.csv >> user_history.csv
Here are the reasons for this approach:
1- Using a for loop and invoking awk once per file is overkill; awk can read multiple files itself, so we should use that.
2- Now to the getline part you tried in your code: if we want to skip lines containing a string, we can simply negate the match with !/string_to_be_skipped/, which keeps only the lines that do NOT contain that string.
3- When passing the multiple files to the single awk command I used the glob 2019-01-[0-9]*.csv, because you haven't said whether a file is created every single day; if we list the names loop-style and a specific file is not present, we get an error. For example, say I use the following awk command after intentionally removing the file named 2019-01-02.csv:
awk '........' 2019-01-{01..29}.csv
awk: cannot open 2019-01-02.csv (No such file or directory)
So to avoid this kind of situation I used 2019-01-[0-9]*.csv, which only expands to files that actually exist (names starting 2019-01- followed by digits), rather than running a fixed loop that complains about missing files.
Try this:
for i in {01..28}; do awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-$i.csv >>user_history.csv;done
The commands after do should not be quoted.
And what you were doing essentially amounts to ignoring the header lines.
The {print} after 1 is unnecessary -- a lone 1 already implies {print}.
-- The 1 just supplies a true pattern, and when a pattern has no action block, the action defaults to {print}.
-- And a bare regexp is equivalent to $0 ~ /regex/; here I negated it.
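For example, these three commands are equivalent and print every line of file:
awk '1' file
awk '1 {print}' file
awk '{print}' file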
If there's no other command inside the loop, you can simplify the loop with one awk command:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-{01..28}.csv >>user_history.csv
But this one will throw an error and stop executing when one of the files doesn't exist.
Another way is:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-3][0-9].csv >>user_history.csv
This one only matches existing filenames instead of generating names in a loop.
It won't stop executing or throw an error, so if a file is missing you won't know about it. And it will match extra files if they exist.
For example, it will read 2019-01-34.csv if it exists.
So if you want the warnings (warnings won't affect the results) but don't want the command to stop, use the first, for-loop version.
Pitfalls:
[0-3][1-9] won't match 10, 20 or 30, but will match 32 to 39.
[0-9]* will match numbers of any length, and globs expand in string order, so e.g. 2019-01-29.csv sorts before 2019-01-3.csv.
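You can see the difference with echo: brace expansion generates the names whether or not the files exist, while a glob only expands to files that are actually there:
$ echo 2019-01-{01..03}.csv
2019-01-01.csv 2019-01-02.csv 2019-01-03.csv
$ echo 2019-01-[0-9]*.csv
(whatever matching files exist, in string order)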
Thanks to @Tiw and @RavinderSingh13 for their guidance. Here is the final awk script that is working well for my case, where I have daily files from multiple days, months, and years (only 2018 and 2019 in this case):
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 201[8-9]-[0-1][0-9]-[0-3][0-9].csv >> user_history.csv

Assign the value of awk-for loop variable to a bash variable

Content within the tempfile:
123 sam moore IT_Team
235 Rob Xavir Management
What I'm trying to do is get input from the user, search for it in the tempfile, and have the search output the column number.
The code I have for that:
#!/bin/bash
set -x;
read -p "Enter :" sword6;
awk 'BEGIN{IGNORECASE = 1 }
{
for(i=1;i<=NF;i++) {
if( $i ~ "'$sword6'$" )
print i;
}
} ' /root/scripts/pscripts/tempprint.txt;
This prints exactly the column number.
Output
Enter : sam
2
What I need is for the value of the variable i to be assigned to a bash variable, so I can use it wherever needed in the script.
Any help with this is highly appreciated.
I searched for an existing answer but wasn't able to find one; if there is one, please let me know.
First of all, you should pass your shell variable (e.g. sword6) to awk this way:
awk -v word="$sword6" '{ ... if ($i ~ word) ... }' ...
To assign a shell variable from the output of another command:
shellVar=$(awk '......')
Then you can continue using $shellVar in your script.
Regarding your awk code:
if the user input contains special regex characters, e.g. .*, your script may fail
if more than one column matches the user input, you will get duplicated output lines, which you may want to handle
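Putting those pieces together, a minimal sketch (it uses an exact, case-insensitive tolower() comparison instead of your regex and the gawk-only IGNORECASE, and exits on the first hit so multiple matches can't duplicate the output):
col=$(awk -v word="$sword6" '
    { for (i=1; i<=NF; i++)
          if (tolower($i) == tolower(word)) { print i; exit }
    }' /root/scripts/pscripts/tempprint.txt)
echo "column: $col"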
You just need to capture the output of awk. As an aside, I would pass sword6 as an awk variable, not inject it via string interpolation.
i=$(awk -v w="$sword6" '
    BEGIN { IGNORECASE = 1 }
    { for (i=1; i<=NF; i++) {
          if ($i ~ w"$") { print i; }
      }
    }' /root/scripts/pscripts/tempprint.txt)
The following script may also help with the same:
cat script.ksh
echo "Please enter the user name:"
read var
awk -v val="$var" '{for(i=1;i<=NF;i++){if(tolower($i)==tolower(val)){print i,$i}}}' Input_file
If tempprint.txt is big:
awk -v w="$sword6" '
BEGIN { IGNORECASE = 1 }
$0 ~ ("\\<" w "\\>") {     # gawk word boundaries: only lines containing the whole word
    for (i=1; i<=NF; i++)
        if ($i == w) print i
}' tempprint.txt

awk script error on solaris

I have the following script and am trying to run it:
BEGIN {
    start = 0
}
{
    if (match($0, "<WorkflowProcess ")) {
        startTag++
    }
    if ((startTag < 2) || (endTag == startTag)) {
        print
    }
    if (match($0, "</WorkflowProcess>")) {
        endTag++
    }
}
However I always get this error:
awk: syntax error near line 6
awk: illegal statement near line 6
awk: syntax error near line 10
awk: bailing out near line 10
Any thoughts? I have tried converting it via dos2unix and also with tr -d '\r', but it's still the same issue. The input parameter is, in my opinion, correct: I am passing a full path with file name and extension (/export/home/test/file.txt). All files have mode 0777.
How do you try to run that program?
If you use awk "... all that program ...", then the shell will expand $0 to its own path, which probably has a leading /... Although, now that I look at it, that should fail earlier with the internal ". Still, it would be useful to see the precise command line.
By the way, why are you calling match? It would be much more idiomatic to write:
awk '
/<WorkflowProcess / { ++startTag }
startTag < 2 || startTag == endTag { print }
/<\/WorkflowProcess>/ { ++endTag }
'
which avoids the explicit use of $0 altogether.
On SunOS/Solaris, /usr/bin/awk is the old pre-POSIX awk, so nawk is often the better choice:
nawk -f script.awk /export/home/test/file.txt
Just an idea: in the BEGIN rule you initialize start, not startTag, yet you increment startTag in the next rule. Relying on uninitialized variables works in GNU awk and others, but maybe you should try initializing startTag.

Shell script to combine three files using AWK

I have three files G_P_map.txt, G_S_map.txt and S_P_map.txt. I have to combine these three files using awk. The example contents are the following -
(G_P_map.txt contains)
test21g|A-CZ|1mos
test21g|A-CZ|2mos
...
(G_S_map.txt contains)
nwtestn5|A-CZ
nwtestn6|A-CZ
...
(S_P_map.txt contains)
3mos|nwtestn5
4mos|nwtestn6
Expected Output :
1mos, 3mos
2mos, 4mos
Here is the code I tried. I was able to combine the first two files, but I couldn't bring in the third one.
awk -F"|" 'NR==FNR {file1[$1]=$1; next} {$2=file[$1]; print}' G_S_map.txt S_P_map.txt
Any ideas/help is much appreciated. Thanks in advance!
I would look at a combination of join and cut.
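For instance, something along these lines (just a sketch: it assumes bash process substitution and GNU cut's --output-delimiter, and that each group key pairs one G_P line with one G_S line -- join emits a cross product for repeated keys like A-CZ):
join -t'|' -1 2 -2 2 <(sort -t'|' -k2,2 G_P_map.txt) <(sort -t'|' -k2,2 G_S_map.txt) |
    sort -t'|' -k4,4 |
    join -t'|' -1 4 -2 2 - <(sort -t'|' -k2,2 S_P_map.txt) |
    cut -d'|' -f4,5 --output-delimiter=', '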
GNU AWK (gawk) 4 has BEGINFILE and ENDFILE which would be perfect for this. However, the gawk manual includes a function that will provide this functionality for most versions of AWK.
#!/usr/bin/awk -f
BEGIN {
    FS = "|"
}
function beginfile(ignoreme) {
    files++
}
function endfile(ignoreme) {
    # endfile() would be defined here if we were using it
}
FILENAME != _oldfilename \
{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
}
END { endfile(FILENAME) }
files == 1 { # save all the key, value pairs from file 1
    file1[$2] = $3
    next
}
files == 2 { # save all the key, value pairs from file 2
    file2[$1] = $2
    next
}
files == 3 { # perform the lookup and output
    print file1[file2[$2]], $1
}
# Place the regular END block here, if needed. It would be in addition to the one above (there can be more than one).
Call the script like this:
./scriptname G_P_map.txt G_S_map.txt S_P_map.txt
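For what it's worth, on gawk 4+ the portability shim collapses to BEGINFILE directly; an untested sketch with the same lookup logic:
#!/usr/bin/gawk -f
BEGIN      { FS = "|" }
BEGINFILE  { files++ }                      # gawk 4+: runs before each new input file
files == 1 { file1[$2] = $3; next }         # G_P_map.txt
files == 2 { file2[$1] = $2; next }         # G_S_map.txt
files == 3 { print file1[file2[$2]], $1 }   # S_P_map.txt: look up and print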
