How to set field name as file name (bash/awk)

I have a file with 500 columns and I need to split each column into a new file, printing $1 as the common first column in all of the files. Below is a sample file; I managed to do this using the bash/awk solution that follows:
ID F1 F2 F4 F4
aa 1 2 3 4
bb 1 2 3 4
cc 1 2 3 4
dd 1 2 3 4
num=('1' '2' '3' '4')
for i in "${num[@]}"; do
    awk -F '\t' -v col="$i" '{print $1, $col}' OFS='\t' Input.txt > "${i}.txt"
done
which gives the required output as:
1.txt
ID ID
aa aa
bb bb
cc cc
dd dd
2.txt
ID F1
aa 1
bb 1
cc 1
dd 1
....
However, I cannot track which file corresponds to which column, since the output file name is the field number rather than the field name. Would it be possible to write the header of each field as the prefix of the output file name, like this?
ID.txt
ID ID
aa aa
bb bb
cc cc
dd dd
F1.txt
ID F1
aa 1
bb 1
cc 1
dd 1

You can do it all in one awk script. When processing the first line, put all the column headings in an array. Then, as you process each line, loop over the columns and write to the file names taken from that array.
awk -F '\t' -v OFS='\t' '
NR == 1 { split($0, filenames) }      # remember the headers from line 1
{
    for (col = 1; col <= NF; col++) {
        file = filenames[col] ".txt"
        print $1, $col >> file
        close(file)                   # avoid hitting the open-file limit with many columns
    }
}' Input.txt
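For a quick sanity check, with the sample Input.txt above, the first two output files should look like this (and, since the sample header has F4 twice, F4.txt receives both of those columns via the >> append):
$ head ID.txt F1.txt
==> ID.txt <==
ID ID
aa aa
bb bb
cc cc
dd dd
==> F1.txt <==
ID F1
aa 1
bb 1
cc 1
dd 1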

If I understand your requirement correctly, it seems like you're very close. Try
num=('1' '2' '3' '4')
for i in "${num[@]}"; do
    echo "i=$i"
    awk -F "\t" -v col="$i" -v OFS="\t" '
        NR==1{fName=$(col+1)".out";next}
        {print $1,$(col+1) > fName}' data.txt
done
$ cat F1.out
aa 1
bb 1
cc 1
dd 1
. . . .
$ cat F4.out
aa 4
bb 4
cc 4
dd 4
Edit
If you need to keep the headers as shown in your example output, just remove the ;next.
Edit 2
If you have multiple columns with the same name, you can append the data to the same file by using >> fName instead. One word of warning with this technique: with > fName, awk "restarts" each file every time you rerun your script, but with >> you will be appending to each file on every run. That can cause problems for downstream processes ;-) ... So you'd need to add code that cleans up the output of your previous run of the script.
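A minimal cleanup sketch along those lines, assuming the only .out files in the working directory are the ones this script generates:
rm -f -- *.out    # clear the previous run's output before re-running the loop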
Here, we're relying on the fact that awk can also write output to a file, using > fName, where fName has been built from the header in column col+1 (the +1 skips over the first, ID column).
And, if you were going to do this thousands of times a day, it would be worth optimizing per the comments above, to have awk read the file once and create all the outputs from internal loops. But if you only need to do this a couple of times, then using the tools of unix/linux to decompose the task into manageable parts is perfectly appropriate.
IHTH

Related

How to extract mixed/partly absent records to a defined order with awk

I have the following data (it also contains other lines; here is a meaningful extract):
group
bb 1
cc 1
dd 1
end
group
dd 2
bb 2
end
group
aa 3
end
I don't know the values (like "1", "2", etc.) and have to match by the names (generic "group", "aa", etc.)
I want to get the data filtered and sorted in the following order (with empty tabs when the string is absent):
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
I run:
awk 'BEGIN {ORS = "\t"}\
/^group/ {print "\n" $0}; \
/^aa/ {AA = $0}; \
/^bb/ {BB = $0}; \
/^cc/ {CC = $0}; \
/^dd/ {DD = $0}; \
/^end/ {print AA; print BB; print CC; print DD}' test.txt
and get
group bb 1 cc 1 dd 1
group bb 2 **cc 1** dd 2
group aa 3 **bb 2** **cc 1** **dd 2**
which is in the right order, but the data is wrong (marked with asterisks). What is the correct way to do this filtering?
Thanks!
Assumptions:
input lines do not start with any white space
each ^group has a matching ^end
the first line in the file is ^group
the last line in the file is ^end
there are no lines (to ignore) between ^end and the next ^group
The primary issue is that each time group is seen, we need to clear/reset the other variables; otherwise we carry over the values from the previous group.
Other (minor) issues:
ORS vs OFS
multiple print commands vs a single print command
no need for line continuation characters (\)
One idea for an updated awk script:
awk '
BEGIN { OFS="\t" }
/^group/ { AA=BB=CC=DD="" ; next }
/^aa/ { AA=$0 ; next }
/^bb/ { BB=$0 ; next }
/^cc/ { CC=$0 ; next }
/^dd/ { DD=$0 ; next }
/^end/ { print "group",AA,BB,CC,DD }
' test.txt
NOTE: the ; next clauses are optional and are included as a visual reminder that we don't need to worry about the rest of the script (for the current line)
This generates:
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
Here is a simpler awk solution to do the same:
awk '/^group$/{delete m; next} {m[$1]=$0} /^end$/{
printf "group\t%s\t%s\t%s\t%s\n", m["aa"], m["bb"], m["cc"], m["dd"]
}' file
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
With GNU awk, try the following code (written and tested with the shown samples only). A simple explanation: it sets RS to end followed by an optional newline, then substitutes the newlines within each record with spaces and prints the line.
awk -v RS='end\n?' 'RT{gsub(/\n/,OFS);print}' Input_file
OR, in case you want the output tab-delimited, try the following:
awk -v RS='end\n?' -v OFS="\t" 'RT{gsub(/\n/,OFS);print}' Input_file
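For reference, a run on the sample; the RT condition ensures that only records actually terminated by end are printed. Note that this joins the lines in their input order, so unlike the array-based answers above, dd 2 bb 2 in the second group is not reordered:
$ awk -v RS='end\n?' -v OFS="\t" 'RT{gsub(/\n/,OFS);print}' test.txt
group bb 1 cc 1 dd 1
group dd 2 bb 2
group aa 3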

Grep String between pattern and copy to new file

I have data in the below format and I am trying to capture the data between two occurrences of a string and write each block to its own file.
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
create statement
CREATE VIEW `T3` AS SELECT
aa
bb
ff
..
...
..
I want output in the below format:
FileName T1 should contain:
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
FileName T2 should contain:
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
The output filename comes from the value surrounded by backticks.
I tried:
sed -n '/create statement/,/create statement/p'
but that only prints the ranges; it doesn't route each block to its own file.
Could you please try the following and let me know if this helps you.
awk '/create statement/{create=$0;next} /CREATE VIEW/{val=$3;gsub("`","",val);filename=val;if(create){print create ORS $0 > filename};next} {print > filename}' Input_file
It will create output files named T1, T2, T3 and so on, one for every CREATE VIEW occurrence. If this is not your question then please be clear in your question and add more details to it too.
Adding a non-one-liner form of the solution too now:
awk '
/create statement/{
    create=$0
    next
}
/CREATE VIEW/{
    val=$3
    gsub("`","",val)
    filename=val
    if(create){
        print create ORS $0 > filename
    }
    next
}
{
    print > filename
}
' Input_file
Awk solution:
awk '/^create statement/{ s = $0; n = NR + 1; next }
NR == n{ t = $3; gsub("`", "", t); print t ORS s > t }{ print > t }' file
Results (note that this variant also writes the file name itself as the first line of each output file):
$ head T[123]
==> T1 <==
T1
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
==> T2 <==
T2
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
==> T3 <==
T3
create statement
CREATE VIEW `T3` AS SELECT
aa
bb
ff
With GNU awk
awk -F'`' -v ORS= -v RS='create' 'NF{print RS $0 > $2}' ip.txt
-F'`' use backtick as input field separator
-v ORS= empty ORS to avoid extra new line
-v RS='create' use create as input record separator
NF{print RS $0 > $2} for non-empty records, print the value of RS followed by the input record to a file named after the second field (which will be T1, T2, etc., since the backtick-delimited names land in $2). This assumes the lowercase string create appears only in the create statement lines.
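A quick check of one of the generated files, using the sample input above:
$ cat T2
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee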

Add line numbers for duplicate lines in a file

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how to go about numbering the lines based on repeated entries.
All items in the list are duplicated at least once; there are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
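If you want the first output form (counter plus the original line), the same idea works with the comma OFS used above:
awk -v OFS=',' '{counter = ($0 == prev ? counter + 1 : 1); prev = $0; print counter, $0}'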
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the trailing newline from each input line and adds one to every print
++$c{$_} increments the value stored for the contents of the current line ($_) in the hash table %c; print then outputs the incremented count.
Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
(Note that -f7 relies on uniq padding its counts to a fixed width; it works for the single-digit counts here, but a multi-digit count shifts into an earlier field.)
A more robust, shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2

parsing a file with a column of key/value pairs

I am trying to parse a tab-delimited file whose last column has a variable number of key-value pairs separated by semicolons. Here is an example:
ab cd ef as=2;sd=5;df=12.3
gh ij kl sd=23;df=55
mn op qr as=24;df=77
I want to print the 2nd column and the value associated with the key "sd"
The expected output should be
cd 5
ij 23
Can I do this in bash?
The problem here is that the key-value column has a variable number of entries, so the target key sits at a different position in different rows.
I can grep the values of a given key like this
grep -o 'sd=[^;]*' file.txt
but I cannot print the other column values at the same time.
Whenever you have name/value pairs in your data, it's best to create a name/value array from that data so you can just reference the values by name:
$ cat tst.awk
{
    delete n2v                    # start fresh for every line
    split($NF, tmp, /[;=]/)       # tmp[1]=name, tmp[2]=value, tmp[3]=name, ...
    for (i=1; i in tmp; i+=2) {
        n2v[tmp[i]] = tmp[i+1]
    }
}
"sd" in n2v { print $2, n2v["sd"] }
$ awk -f tst.awk file
cd 5
ij 23
awk to the rescue!
$ awk -v k="sd=" '{n=split($NF,a,";")
                   for(i=1;i<=n;i++)
                       if(a[i]~k){
                           sub(k, $2 " ", a[i])
                           print a[i]
                       }}' file
cd 5
ij 23
Since a[i]~k is a substring match, it can also hit keys that merely contain your key (e.g. esd=); anchoring it on the left is a better idea:
change a[i]~k to a[i]~"^"k
I know you asked for awk, but here is the obligatory sed one-liner, which is a bit shorter than the awk examples. After peak's hint, I added a few more test cases with sd in different parts of the line.
cat kv.txt
ab cd ef as=2;sd=5;df=12.3
gh ij kl sd=23;df=55
test1 sd in col2=true;df=55
test2 sd_inFront spacer sd=2;other=5;
test3 sd_inMiddle spacer other1=6;sd=3;other2=8
test4 sd_atEnd spacer other1=7;sd=4;
test5 sd_AtEndWO; spacer other1=8;sd=5
test6 esd in col4=true;esd=6;
test7 esd_inFront spacer esd=7;other=5;
test8 esd_inMiddle spacer other1=6;esd=8;other2=8
test9 esd_atEnd spacer other1=7;esd=9;
test10 esd_AtEndWO; spacer other1=8;esd=10
test11 sd_and_esd spacer other1=6;sd=11;other2;esd=4;other3=8
test12 esd_and_sd spacer other1=6;esd=3;other2;sd=12;other3=8
cat kv.txt | sed -nr "/(.+\w){3} (.*;)?sd=/ {s/.* (.*) .* (.*;)?sd=([^;]+).*/\1 \3/g; p;}"
cd 5
ij 23
sd_inFront 2
sd_inMiddle 3
sd_atEnd 4
sd_AtEndWO; 5
sd_and_esd 11
esd_and_sd 12
The sed command consists of two parts: the first part, /(.+\w){3} (.*;)?sd=/, matches lines with sd= in column four (either as the first key or after a .*;) and applies the part inside the braces to those lines.
The part inside the braces consists of a substitution (s) and a print-the-result command (p). The substitution works like this:
the four .* are your columns, the second column is captured with the parentheses
(.*;)?sd=([^;]+) captures the values after sd= up to the ;
the replacement uses the captured \1 (column two) and \3 (the value after sd=) to create your desired output
Here are gawk/awk solutions that avoid splitting and looping.
$ cat pf.txt
ab cd ef as=2;sd=5;df=12.3
gh ij kl sd=23;df=55
aa bb cc as=24;df=77;sd=15
mn op qr as=24;df=77
With gawk you can use a gensub capture group to isolate the desired value from $4:
$ gawk '/sd=/{print $2, gensub(/.*sd=([^;]*).*/,"\\1","g",$4)}' pf.txt
cd 5
ij 23
bb 15
Or, with a non-gawk awk, you can use two sub calls to remove the parts before and after the desired value:
$ awk '/sd=/{ sub(/.*sd=/, "", $4); sub(/;.*/, "", $4); print $2, $4 }' pf.txt
cd 5
ij 23
bb 15
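One caveat with both of the above, illustrated by the esd test cases in the sed answer: a bare sd= also matches inside keys like esd=. A hedged variant of the two-sub approach that anchors the key on a ; boundary (same pf.txt assumed):
$ awk '(";" $4) ~ /;sd=/ {
      v = ";" $4             # prepend ";" so every key, including the first, follows a ";"
      sub(/.*;sd=/, "", v)   # strip everything up to and including the anchored sd= key
      sub(/;.*/, "", v)      # strip any key/value pairs after the value
      print $2, v
  }' pf.txt
cd 5
ij 23
bb 15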
Given:
$ cat /tmp/file.txt
ab cd ef as=2;sd=5;df=12.3
gh ij kl sd=23;df=55
mn op qr as=24;df=77
mn sd qr as=24;df=77
(Those are tabs, not spaces)
You can set awk to separate fields on either a tab or a ; like so:
$ awk -F "\t|;" '/sd/ {print $2}' /tmp/file.txt
cd
ij
sd
(I realize the last one should not be printed; bear with me)
To then print the field that has 'sd', simply loop through the fields:
$ awk -F "\t|;" '/sd/ { for (x=1;x<=NF;x++) if ($x~"^sd=") print $2 " " $(x) }' /tmp/file.txt
cd sd=5
ij sd=23
You can then split that field on =, keep the anchored $x~"^sd=" match, and print the part to the right of the = to get your precise output:
$ awk -F "\t|;" '/sd/ { for (x=1;x<=NF;x++) if ($x~"^sd=") { split($x, tmp, /=/); print $2 " " tmp[2]}}' /tmp/file.txt
cd 5
ij 23

Creating a mapping count

I have this data with two columns
Id Users
123 2
123 1
234 5
234 6
34 3
I want to create this count mapping from the given data, summing the Users per Id, like this:
123 3
234 11
34 3
How can I do it in bash?
You have to use associative arrays, something like
declare -A newmap
newmap["123"]=2
newmap["123"]=$(( ${newmap["123"]} + 1))
Obviously, you have to iterate through your input, see if the entry exists, then add to it, else initialize it — see the sketch below.
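A minimal sketch of that iteration (assuming the data sits in data.txt with the Id Users header, which is skipped):
declare -A newmap
while read -r id users; do
    [[ $id == Id ]] && continue                   # skip the header row
    newmap[$id]=$(( ${newmap[$id]:-0} + users ))  # initialize on first sight, then add
done < data.txt
for id in "${!newmap[@]}"; do
    echo "$id ${newmap[$id]}"
done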
It will be easier with awk.
Solution 1: Doesn't expect the file to be sorted. Stores entire file in memory
awk '{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
34 3
234 11
123 3
What we are doing here is using the first column as the key and adding the second column as its value. In the END block we iterate over the array and print each key/value pair.
If you have the Id Users line in your input file and want to exclude it from the output, then add NR>1 condition by saying:
awk 'NR>1{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
NR>1 is telling awk to skip the first line. NR contains the line number so we instruct awk to start creating our array from second line onwards.
Solution 2: Expects the file to be sorted. Does not store the file in memory.
awk 'NR>1 && $1!=prev{print prev, sum; sum=0}{prev=$1; sum+=$2}END{print prev, sum}' file
123 3
234 11
34 3
If you have the Id Users line in your input file and want to exclude it from the output, then shift the conditions by one line:
awk '$1!=prev && NR>2{print prev, sum; sum=0}NR>1{prev=$1; sum+=$2}END{print prev, sum}' file
123 3
234 11
34 3
A Bash (4.0+) solution:
declare -Ai count
while read -r a b ; do
    count[$a]+=b
done < "$infile"
for idx in "${!count[@]}"; do
    echo "${idx} ${count[$idx]}"
done
For a sorted output the last line should read
done | sort -n
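For reference, with the sample data (assuming $infile has no Id Users header; with the header, a stray Id 0 entry would also appear, since Users evaluates to 0 under -Ai), the sorted run prints:
34 3
123 3
234 11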
