AWK scripting: How to remove field separators using awk - shell

I need the following output:
ONGC044
ONGC043
ONGC042
ONGC041
ONGC046
ONGC047
from this input:
Medium Label Medium ID Free Blocks
===============================================================================
[ONGC044] ECCPRDDB_FS_43 ac100076:4aed9b39:44f0:0001 195311616
[ONGC043] ECCPRDDB_FS_42 ac100076:4aed9b1d:44e8:0001 195311616
[ONGC042] ECCPRDDB_FS_41 ac100076:4aed9af4:4469:0001 195311616
[ONGC041] ECCPRDDB_FS_40 ac100076:4aed9ad3:445e:0001 195311616
[ONGC046] ECCPRDDB_FS_44 ac100076:4aedd04a:68c6:0001 195311616
[ONGC047] ECCPRDDB_FS_45 ac100076:4aedd4a0:6bf5:0001 195311616

awk -F"[][]" 'NR>2{print $2}' file
Here the field separator is set to either [ or ], so the bracketed name becomes $2, and NR>2 skips the two header lines.

I think what you want is:
awk '/^\[.*\]/ {print substr($1,2,7)}' < ~/tmp/awk_test/file
This assumes that your first field is exactly 9 characters each time (7 of which are the ones you want). If this is not the case, then use the following instead to strip the [ and the ] from the first field:
awk '/^\[.*\]/ {gsub(/[\[\]]/,"",$1); print $1 }' < ~/tmp/awk_test/file

If the data is really that uniform, a very simple pipe to:
cut -b 2-8
will do it for you.
(Oh, apart from the first two lines; get rid of them with grep ^\[ in your pipeline, if you need to)
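For example, combining the two (a sketch against the sample file above):
grep '^\[' file | cut -b 2-8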

Here's a solution which only involves awk. The idea is to let awk parse the desired column and remove the undesired square brackets from this token. The only "trick" is to escape the [ character lest it be interpreted as opening a regex bracket expression. (We could also use substr instead, since the brackets are expected to be the first and last characters.)
{
    # skip the column header lines
    if (NR < 3)
        next;
    # $1 is almost what we want: [xxxx]
    # ==> just remove the square brackets
    sub("[[]", "", $1);
    sub("]", "", $1);
    print $1;
}
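Saved to a file (say, strip_brackets.awk; the name is chosen here just for illustration), the script can be run as:
awk -f strip_brackets.awk file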

Related

grep for duration=N where N is greater than X; the position of this field changes between lines

I have a very long file with a format of column_name=column_val, column_name2=column_val2 and so on.
The columns are not in the right order. Let's say, for example, I have this file:
bar=x moshe=foo test=x duration=5
moshe=foo2 test=y duration=0 bar=y
duration=3 moshe=foo3 bar=z test=x
I want to return only lines where duration is greater than 2.
As far as I know, plain awk field splitting is not an option, since I can't tell where the columns are located in each line.
On IRC in the #bash channel someone recommended using gawk's match(). There too I was having a problem seeing how to solve this when the duration is in a different position on each line.
Any ideas?
Thanks
You can use duration= as the field separator:
# showing field content in numeric context
$ awk -F'duration=' '{print +$2}' ip.txt
5
0
3
# use required numeric comparison to get desired output
$ awk -F'duration=' '+$2 > 2' ip.txt
bar=x moshe=foo test=x duration=5
duration=3 moshe=foo3 bar=z test=x
See https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html for conversion details
Unary + works on GNU awk, not sure about other versions. 0+$2 should work everywhere to force numeric context.
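For example, the portable spelling of the same filter:
awk -F'duration=' '0+$2 > 2' ip.txt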
Note that if you have multiple duration= in a line, only the first one will be tested.
Extract the data with a regex and compare:
awk '0+gensub(".*duration=([0-9]*).*", "\\1", "1") > 2'
Edit: as above, the 0+ is needed to convert the string to a number.
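gensub() is GNU-awk-specific; here is a portable sketch of the same idea using POSIX match() and substr():
awk 'match($0, /duration=[0-9]+/) && substr($0, RSTART+9, RLENGTH-9)+0 > 2' ip.txt
(The offset 9 is the length of the string duration=.)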
If you want to use grep:
grep -E 'duration=([3-9]|[0-9]{2,})( |$)' "file"
(The ( |$) also matches a duration at the end of a line, as in the first sample line.)
awk '/duration/ {
    for (counter = 1; counter <= NF; counter++) {
        if ($counter ~ /^duration=/) {
            value = substr($counter, index($counter, "=") + 1)
            if (value + 0 > 2) {
                print $0
            }
        }
    }
}' inputfile

Remove special characters in a CSV (unix) and fix the newlines

Below is my sample data from the CSV:
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the above, one field has good data along with junk data, and the line is split onto new lines.
I want to remove the special characters (due to these special characters and spaces, the line was moved to the next line) as well as merge the split line back into a single line.
Currently I am using something like below, which is taking a lot of time:
tr -cd '\11\12\15\40-\176' < MY_FILE.csv | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' > MY_FILE.csv.tmp
I've attached a screenshot of the original data in the file.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars, and
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (though it would fail to trap an extra line break just after a random quote).
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
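For example, here's a quick illustration of the problem with an embedded comma (hypothetical data):
$ echo '20160711,"M","good, data","R"' | awk -F, '{print NF}'
5
The quoted third field is counted as two fields.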
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                   # Set comma as the field separator...
NF<10 {                     # For any lines that have fewer than 10 fields...
    $0=last $0              # Insert the last "saved" line here,
    last=$0                 # and save the newly joined line for the next round.
}
NF<10 {                     # If we still have fewer than 10 fields,
    next                    # skip to the next input line.
}
{
    last=""                 # Clear the saved line, then substitute an empty
    gsub(/[^[:print:]]/,"") # string for all non-printables,
}
1' inputfile                # And print the current line.
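To rewrite the file in place, something like this sketch might do (the threshold of 10 fields comes from the sample data; adjust it to yours):
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' MY_FILE.csv > MY_FILE.csv.tmp && mv MY_FILE.csv.tmp MY_FILE.csv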

AWK between 2 patterns - first occurrence

I have this example of an INI file. I need to extract the names between 2 patterns, Name_Z1 and OBJ=Name_Z1, and put them each on a line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger bash script and I am stuck on this. I would prefer awk to do the job.
My great appreciation to whoever can help me. Thank you!
For Wintermute's solution: the [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; at the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range. (Note that \n in the replacement of the s command is a GNU sed extension.)
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A (g)awk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. Also part of a condition, so x must be nonzero to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after it up to the |. Stores the capture in a. Again a condition, so it must succeed to continue.
gsub(";","\n",a[1]) - Substitutes all the ; for newlines in the capture group a[1].
{print a[1];exit} - If all conditions are met, then print a[1] and exit.
This way should work in (m)awk:
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
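Applied to the sample file, this prints:
Jhon
Alex
Smith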
An awk solution would be:
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field separator to ;
/Name_Z1/{f++} matches the line with pattern /Name_Z1/; if matched, increment f
f==1 && /Names/{print $2,$3,$4} means: if f == 1 and the line matches the pattern Names, then print columns 2, 3 and 4 (delimited by ;)
OFS="\n" sets the output field separator to \n, a newline
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in groups of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" '        # By setting RS to nothing, one record equals one block; FS set to newline makes each line a field
/^\[Name_Z1\]/ {          # Search for the block starting with [Name_Z1]
    n=split($3,a,";")     # Split field 3, the names, and store the number of fields in variable n
    for (i=2;i<=n;i++)    # Loop from the second to the last field
        print a[i]        # Print the fields
    exit                  # Exit after the first find
}' file
With the updated data:
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all remaining semicolon (;) characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith

Remove ] from string based on whether it is used as an index

I'm trying with sed (in a bash script) to do some substring editing:
string1=randomthing0]
string2=otherthing[15]}]
string3=reallyotherthing[5]]
The aim is to remove each ] that is not used as part of an index (like the [15] in the second one).
The output should be
string1=randomthing0
string2=otherthing[15]}
string3=reallyotherthing[5]
This works for me:
s/\[\([^]]\+\)\]/#B#\1#E#/g
s/\]//g
s/#B#/[/g
s/#E#/]/g
It first replaces all [...] with #B#...#E#, i.e. the only remaining ]'s are the non-balanced ones. Then it just removes them and replaces the #-strings back.
Be careful: your input should never contain the #-strings.
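As a one-liner (a sketch; GNU sed accepts ; between the commands, and \+ is a GNU extension):
sed 's/\[\([^]]\+\)\]/#B#\1#E#/g; s/\]//g; s/#B#/[/g; s/#E#/]/g' file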
If awk is accepted as well, check the awk solution below:
awk 'BEGIN{OFS=FS=""}
{
    for(i=1;i<=NF;i++){
        s+=$i=="["?1:0;
        e+=$i=="]"?1:0;
        if(e>s){$i="";e--}
    }
    s=e=0; print $0;
}' file
Note
The script above is NOT generic enough: it only removes unbalanced ]s, which means foo[a[b[c] won't be modified.
If there are unbalanced ]s, they will be deleted whether or not they are at the end of the line; so foo[x]bar]blah will be changed into foo[x]barblah.
An example explains it better (I added two more lines to your input):
# in my new lines (1, 2), all ]s surrounded with * should be removed
kent$ cat a.txt
stringx=randomthi[foo]bar*]*xx*]*
stringy=random[f]x*]*bar[b]*]*blah
string1=randomthing0]
string2=otherthing[15]}]
string3=reallyotherthing[5]]
kent$ awk 'BEGIN{OFS=FS=""}{ for(i=1;i<=NF;i++){
s+=$i=="["?1:0;
e+=$i=="]"?1:0;
if(e>s){$i="";e--} }
s=e=0; print $0; }' a.txt
stringx=randomthi[foo]bar**xx**
stringy=random[f]x**bar[b]**blah
string1=randomthing0
string2=otherthing[15]}
string3=reallyotherthing[5]
Hope it helps.
sed 's/\([^\[0-9]\)\([0-9\]*\)\]/\1\2/'
This removes any ] which is preceded by something not in [ or 0-9, followed by zero or more 0-9 characters. Applied to the three sample strings, it produces the desired output.
This might work for you (GNU sed):
sed -r 's/([^][]*(\[[^]]*\][^][]*)*)\]/\1/g' file
The capture group matches a run of non-bracket characters and balanced [...] pairs; any ] that follows such a run is unbalanced, so it is dropped.

speed up my awk command? Answer must be awk :)

I have some awk code that is running really slow. The format of my file is tab-delimited, 5-column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I want to not count it, or the following character.
Then I want to find out how many characters are ">", "<" or "*" and remove them from the count. I'm guessing that using a gsub and 3 splits is less than ideal, especially since column 5 can occasionally be a very, very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Well, as I see it, your gsub pattern will not work, as the / was not closed. Anyway, if I get it correctly and you want the character count of $5 without some characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [ and ], and do not start the list with ^!
Do you need to use awk, or will this work instead?
cut -f 5 < "$file" | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Remove all lines starting with a capital letter.
Remove the characters <, >, *, and newlines.
Count the remaining characters.
Here's a guess:
awk '
BEGIN {FS = OFS = "\t"}
{
str = $5
gsub(/\^.|[><*]/, "", str)
l = length(str)
}
'
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
