How can I extract a subset from a column/field using awk? - bash

How can I extract a subset from a column/field using awk?
Here is the input file test.txt:
aaa bbb ccc=0.7707;ddd=0.21
I would like to extract the figure "0.21" from the 3rd column and output it together with the 1st and 2nd columns:
aaa bbb 0.21
I tried the code below, but it failed:
awk 'BEGIN { OFS = "\t" } { $4 = /^ddd=(+\d)/ ; print $1,$2,$4 }' test.txt
Please help!
Many thanks,
TP

You can specify multiple delimiters using the -F flag or setting FS in the BEGIN block. For example:
echo "aaa bbb ccc=0.7707;ddd=0.21" | awk -F "[ =]" '{ print $1, $2, $NF }'
Results:
aaa bbb 0.21

You could use gsub:
awk 'BEGIN { OFS = "\t" } { gsub(/.*=/, "", $3); print $1,$2,$3 }' test.txt
For your input, it'd give:
aaa bbb 0.21

Another awk approach:
awk '{split($3,a,"=");print $1,$2,a[3]}' test.txt
aaa bbb 0.21
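The answers above all rely on the wanted value being the last one in the third field. As a minimal sketch, assuming the third field is always a list of key=value pairs separated by ";", the value can also be looked up by key name. The function name extract_key is hypothetical and only used for illustration:

```shell
# Hypothetical helper: print fields 1 and 2 plus the value stored under a
# given key in the third field (key=value pairs separated by ";").
extract_key() {
    awk -v key="$1" '
    {
        n = split($3, pairs, ";")            # "ccc=0.7707", "ddd=0.21"
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")        # position of the first "="
            if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
                print $1, $2, substr(pairs[i], eq + 1)
        }
    }'
}

printf 'aaa bbb ccc=0.7707;ddd=0.21\n' | extract_key ddd
```

This keeps working if the order of the pairs changes, at the cost of a slightly longer script.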

Related

find unique lines based on one field only [duplicate]

I would like to print unique lines based on the first field only, keeping the first occurrence of each line and removing the later duplicates.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0; } END '{ for (i in a) print a[i]}' input.csv
There is a typo in the syntax there (a stray quote after END). Corrected:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
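Note that an END loop written as for (i in a) visits keys in an unspecified order, so the lines may not come out in input order. A minimal sketch that keeps the first line per key and preserves input order, assuming comma-separated input as in the question (the function name first_by_key and the array names are illustrative):

```shell
# Hypothetical helper: keep the first line seen for each value of field 1,
# printing lines in their original input order.
first_by_key() {
    awk -F, '
    !($1 in first) {          # first time this key appears
        first[$1] = $0
        order[++n] = $1       # remember the order in which keys were first seen
    }
    END {
        for (i = 1; i <= n; i++)
            print first[order[i]]
    }'
}

printf '%s\n' 10,15-10-2014,abc 20,12-10-2014,bcd 10,09-10-2014,def \
    40,06-10-2014,ghi 10,15-10-2014,abc | first_by_key
```

For this particular task the one-liner awk -F, '!seen[$1]++' does the same job; the longer form is only useful when the lines must be held until END.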

How to print columns one after the other in bash?

Is there a better method to print two or more columns as one column? For example:
input.file
AAA 111
BBB 222
CCC 333
output:
AAA
BBB
CCC
111
222
333
I can only think of:
cut -f1 input.file >output.file;cut -f2 input.file >>output.file
But it's not good if there are many columns, or when I want to pipe the output to other commands like sort.
Any other suggestions? Thank you very much!
With awk
awk '{ if (maxc < NF) maxc = NF                             # widest line so far
for (i = 1; i <= NF; i++) a[i] = (i in a) ? a[i] RS $i : $i  # append to column i
}
END {
for (i = 1; i <= maxc; i++) print a[i]
}' input.file
You can use a GNU awk array of arrays to store all the data and print it later on.
If the number of columns is constant, this works for any number of columns:
gawk '{for (i=1; i<=NF; i++) # loop over columns
data[i][NR]=$i # store in data[column][line]
}
END {for (i=1;i<=NF;i++) # loop over columns
for (j=1;j<=NR;j++) # loop over lines
print data[i][j] # print the given field
}' file
Note NR stands for number of records (that is, number of lines here) and NF stands for number of fields (that is, the number of fields in a given line).
If the number of columns changes over rows, then we should use yet another array, in this case to store the number of columns for each row. But in the question I don't see a request for this, so I am leaving it for now.
See a sample with three columns:
$ cat a
AAA 111 123
BBB 222 234
CCC 333 345
$ gawk '{for (i=1; i<=NF; i++) data[i][NR]=$i} END {for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print data[i][j]}' a
AAA
BBB
CCC
111
222
333
123
234
345
If the number of columns is not constant, using an array to store the number of columns for each row helps to keep track of it:
$ cat sc.wk
{for (i=1; i<=NF; i++)
data[i][NR]=$i
columns[NR]=NF
if (NF>max) max=NF # widest line seen so far
}
END {for (i=1;i<=max;i++) # loop over columns
for (j=1;j<=NR;j++) # loop over lines
print (i<=columns[j] ? data[i][j] : "-")
}
$ cat a
AAA 111 123
BBB 222
CCC 333 345
$ gawk -f sc.wk a
AAA
BBB
CCC
111
222
333
123
-
345
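Arrays of arrays are a GNU awk extension. As a minimal sketch of the same idea in plain awk, assuming whitespace-separated input, the cells can be stored under combined "row, column" keys instead. The function name columns_in_sequence and the "-" placeholder for missing fields are choices made for this example:

```shell
# Hypothetical helper: print column 1 of every line, then column 2, and so on.
# Uses plain awk cell[row, col] keys instead of gawk arrays of arrays.
columns_in_sequence() {
    awk '
    {
        if (NF > maxcol) maxcol = NF          # widest row seen so far
        for (c = 1; c <= NF; c++)
            cell[NR, c] = $c
    }
    END {
        for (c = 1; c <= maxcol; c++) {
            for (r = 1; r <= NR; r++) {
                v = ((r, c) in cell) ? cell[r, c] : "-"   # "-" marks a missing field
                print v
            }
        }
    }'
}

printf 'AAA 111 123\nBBB 222\nCCC 333 345\n' | columns_in_sequence
```

Because it reads standard input and writes standard output, the function can sit in the middle of a pipeline, for example before sort.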
awk '{print $1;list[i++]=$2}END{for(j=0;j<i;j++){print list[j];}}' input.file
Output
AAA
BBB
CCC
111
222
333
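A simpler solution, if the values may be printed row by row rather than column by column (a regex RS needs GNU awk or mawk), would be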
awk -v RS="[[:blank:]\t\n]+" '1' input.file
Expects tab as delimiter:
$ cat <(cut -f 1 asd) <(cut -f 2 asd)
AAA
BBB
CCC
111
222
333
Since the order is of no importance:
$ awk 'BEGIN {RS="[ \t\n]+"} 1' file
AAA
111
BBB
222
CCC
333
Ugly, but it works:
for i in {1..2} ; do awk -v p="$i" '{print $p}' input.file ; done
Change {1..2} to {1..n}, where n is the number of columns in the input file.
Explanation:
The awk variable p is set to the shell loop variable i. As i runs from 1 to n, each pass prints the i-th column of the file.
This will work for an arbitrary number of space-separated columns. Note that sort -u sorts the values and removes duplicates, so the original column order is not kept:
awk '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
If space is not the separator, for example if ":" is the separator:
awk -F: '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file

awk replace string with newline

I have never used awk or sed. I am trying to replace
aaa
{
with
aaa
{
bbb
I tried different solutions using sed/awk, but couldn't figure it out.
awk '{gsub("aaa\n{", "aaa\n{\tbbb")}1' file.txt
Could you please help me on how to do it.
With awk:
awk '{print} /^aaa$/{i=NR} /^{$/ && NR==i+1 {print "\tbbb"}' File
Output:
aaa
{
bbb
sdjdhsjdhdsd
ds
ddsdsdsd
aaa
{
bbb
This might work for you (GNU sed):
sed '/^aaa$/!b;n;/^{$/a\\tbbb' file
awk with getline (this adds bbb after every line containing {, which is fine as long as { only ever follows aaa):
awk '/aaa/ {print;getline} /{/ { print ;print "\tbbb";next}1 ' file
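As a minimal sketch without getline, assuming the new line should be added only when a line that is exactly { directly follows a line that is exactly aaa, the previous line can be remembered in a variable. The function name insert_after_brace and the variable prev are illustrative:

```shell
# Hypothetical helper: print every line, and add a tab-indented "bbb" line
# after a "{" line only when the line before it was exactly "aaa".
insert_after_brace() {
    awk '
    { print }
    prev == "aaa" && $0 == "{" { print "\tbbb" }   # only directly after an aaa line
    { prev = $0 }                                    # remember this line for the next one
    '
}

printf 'aaa\n{\nfoo\n{\n' | insert_after_brace
```

Comparing whole lines with == instead of using regular expressions avoids having to escape the brace.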

Splitting content of file and make it in order

I have a file like so:
{A{AAA} B{BBB} test {CCC CCC
}}
{E{EEE} F{FFF} test {GGG GGG
}}
{H{HHH} I{III} test {JJJ -JJJ
}}
{K{KKK} L{LLL} test {MMM
}}
Updated
I want to use linux commands in order to have the following output:
AAA:BBB:CCC CCC
EEE:FFF:GGG GGG
HHH:III:JJJ -JJJ
KKK:LLL:MMM
Using gnu-awk you can do this:
awk -v RS='}}' -v FPAT='{[^{}]+(}|\n)' -v OFS=':' '{for (i=1; i<=NF; i++) {
gsub(/[{}]|\n/, "", $i); printf "%s%s", $i, (i<NF)?OFS:ORS}}' file
AAA:BBB:CCC CCC
EEE:FFF:GGG GGG
HHH:III:JJJ -JJJ
KKK:LLL:MMM
-v RS='}}' will break each record using }} text
-v FPAT='{[^{}]+(}|\n)' will split field using given regex. Regex matches each field that starts with { and matches anything but { and } followed by } or a newline.
-v OFS=':' sets output field separator as :
gsub(/[{}]|\n/, "", $i) removes { or } or newline from each field
Shorter command (thanks to JoseRicardo):
awk -v RS='}}' -v FPAT='{[^{}]+(}|\n)' -v OFS=':' '{$1=$1} gsub(/[{}]|\n/, "")' file
or even this:
awk -v FPAT='{[^{}]{2,}' -v OFS=':' '{$1=$1} gsub(/[{}]/, "")' file
Perl solution
perl -nwe 'print join ":", /{([^{}]{2,})/g' file
The regular expression extracts groups of two or more non-brace characters that follow an opening curly brace; the groups are then printed separated by colons.
for this specific format
sed -n 's/...//;s/}[^{]*//g;s/{/:/gp' YourFile
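FPAT is specific to GNU awk. As a minimal sketch of the same extraction in plain awk, assuming (like the Perl answer) that every wanted value is an opening brace followed by at least two non-brace characters, match() can be applied repeatedly to each line. The function name brace_fields is hypothetical:

```shell
# Hypothetical helper: join the brace-delimited values of each line with ":".
brace_fields() {
    awk '
    {
        line = $0
        out = ""
        # Repeatedly take the next "{" followed by two or more non-brace characters.
        while (match(line, /[{][^{}][^{}]+/)) {
            field = substr(line, RSTART + 1, RLENGTH - 1)   # drop the leading "{"
            out = (out == "" ? field : out ":" field)
            line = substr(line, RSTART + RLENGTH)           # continue after this match
        }
        if (out != "") print out                            # skip lines such as "}}"
    }'
}

printf '{A{AAA} B{BBB} test {CCC CCC\n}}\n{K{KKK} L{LLL} test {MMM\n}}\n' | brace_fields
```

As with the other answers, this depends on the one-character labels (A, B, ...) being too short to count as values.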

Using a command-line utility to perform the following map-updates

I'm a complete newbie to using command-line utilities and am wondering how to process information as follows:
mapping.txt:
80 001 002
81 011 012 013 014
82 021 022
...
input.txt:
81 103823044
80 103823054
81 103823064
...
Desired output.txt:
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
I've done simple mapping where the column numbers are fixed, but I'm unsure how to map a dynamic number of columns to the desired output.
If order is not important, join and awk can do the job easily.
$ join <(sort input.txt) <(sort mapping.txt) | awk -v OFS="|" '{for (i=3;i<=NF;i++) print $2, $i OFS}'
103823054|001|
103823054|002|
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
Here's a GNU awk script that uses multi-dimensional arrays to do what you want:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next }
$1 in a { for(k in a[$1]) print $2, k, "" }
If you save that to a file like script.awk and then chmod +x script.awk you can run it like:
$ ./script.awk mapping.txt input.txt
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|002|
103823054|001|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
Here's a breakdown of the script:
BEGIN - set the output field separator to |
FNR==NR - process the first file (mapping.txt) and store the data in a multi-dimensional array by $1 first, then by the other fields. next to skip any other line processing.
$1 in a - test to see if the line has a mapping. If so, print the corresponding mappings. The for (k in ...) loop visits keys in an unspecified order, which is why 002 comes before 001 in the output above. The commas in the print command are converted to the OFS value.
It could be remade a "one-liner" like:
awk -v OFS="|" 'FNR==NR {for(i=2;i<=NF;i++) a[$1][$i]; next} $1 in a {for(k in a[$1]) print $2, k, ""}' mapping.txt input.txt
Here's a version of the script that uses a single dimensional array to store $0 then split()s it later to preserve order:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { a[$1]=$0; next }
$1 in a { c=split(a[$1], b); for(i=2;i<=c;i++) print $2, b[i], "" }

Resources