Break one column into several columns every time you see a pattern - bash

I have a fairly simple question, but I find it hard to solve.
I have two quite long columns of data, and I want to separate them into several columns. The script should start writing data into a new column each time it finds a specific string in the first column:
input:
A B
1 C
2 C
3 C
4 C
A D
1 D
2 D
3 D
4 D
output:
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
(the separating pattern is A)

You can do this using a single awk command (as written, it handles exactly two groups of equal length):
awk 'NR>1 && /^A/{p=1} {if (p) print a[++i], $0; else a[NR]=$0}' OFS='\t' file
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D

awk with paste:
$ awk '$1 == "A" { ++n } { print > ("t.tmp." n) }' input.txt
$ ls t.tmp.*
t.tmp.1 t.tmp.2
$ paste t.tmp.*
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
EDIT
More efficient (only build the file name once for each group) and more robust (avoid the chance of having too many open files by closing them as we go) --- thanks, Ed Morton:
awk '$1 == "A" { close(out); out = "t.tmp." ++n} { print > out }' input.txt
(Above assumes first record contains pattern. If not, can initialize out in a BEGIN block.)
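A minimal sketch of the BEGIN-block variant mentioned above, assuming the input may contain lines before the first "A" record. The names input.txt and t.tmp.N follow the answer; the scratch directory, the sample data and the split_result variable are made up for illustration.

```shell
# Work in a scratch directory so the t.tmp.* files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input (an assumption): one stray line before the first "A" record.
printf '%s\n' 'x y' 'A B' '1 C' '2 C' 'A D' '1 D' '2 D' > input.txt

# Initializing out in BEGIN sends lines that precede the first "A" to t.tmp.0
# instead of redirecting to an empty file name.
awk 'BEGIN { out = "t.tmp.0" }
     $1 == "A" { close(out); out = "t.tmp." (++n) }
     { print > out }' input.txt

# Paste the two real groups side by side (tab separated).
split_result=$(paste t.tmp.1 t.tmp.2)
printf '%s\n' "$split_result"
```

Note that paste t.tmp.* relies on the shell's lexical glob order, so with ten or more groups t.tmp.10 would be pasted before t.tmp.2.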

Using csplit and paste
$ csplit -zsf file infile.txt '/A/' {*}
$ paste file*
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
From man csplit
csplit - split a file into sections determined by context lines
-z, --elide-empty-files
remove empty output files
-s, --quiet, --silent
do not print counts of output file sizes
-f, --prefix=PREFIX
use PREFIX instead of 'xx'
{*} repeat the previous pattern as many times as possible

Using GNU awk multiline records. This works for any number of occurrences of the pattern and assumes equal-length columns:
pat=A
awk -v pat="$pat" -F'\n' '
BEGIN {RS="(^|\n)"pat" "}
NR>1{
nr=NR-2
fld[nr][0]=pat" "$1
for(i=2; i<=NF; ++i)
fld[nr][i-1]=$i
}
END {
for(i=0; i < NF; ++i) {
for(j=0; j < NR-1; ++j)
printf("%s%s", j?"\t":"", fld[j][i])
printf("\n")
}
}
'
input
A B
1 C
2 C
3 C
4 C
A D
1 D
2 D
3 D
4 D
A X
1 X
3 X
5 X
7 X
output
A B A D A X
1 C 1 D 1 X
2 C 2 D 3 X
3 C 3 D 5 X
4 C 4 D 7 X

This is the idiomatic awk solution to this problem:
$ awk -v OFS='\t' '
$1 == "A" { numRows=0; ++numCols }
{ val[++numRows,numCols] = $0 }
END {
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
}
}
}
' file
A B A D
1 C 1 D
2 C 2 D
3 C 3 D
4 C 4 D
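The script above takes the row count from the last group, so it implicitly assumes all groups have the same length. A possible variation, sketched under the assumption that groups may be ragged, tracks the longest group instead (maxRows, ragged_input and ragged_result are names introduced here, not part of the answer):

```shell
# Sample data (an assumption): the second group is one row shorter than the first.
ragged_input='A B
1 C
2 C
3 C
A D
1 D
2 D'

ragged_result=$(printf '%s\n' "$ragged_input" | awk -v OFS='\t' '
    $1 == "A" { numRows = 0; ++numCols }        # a new group starts a new column
    { val[++numRows, numCols] = $0 }
    numRows > maxRows { maxRows = numRows }     # remember the longest group
    END {
        for (rowNr = 1; rowNr <= maxRows; rowNr++)
            for (colNr = 1; colNr <= numCols; colNr++)
                # Missing cells print as empty strings, keeping columns aligned.
                printf "%s%s", val[rowNr, colNr], (colNr < numCols ? OFS : ORS)
    }')
printf '%s\n' "$ragged_result"
```

The last output row then ends with a tab followed by an empty cell for the shorter group.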

Related

Join a specific column for several files and preserve file name as column name

I am trying to merge some tab separated files:
File_A.tsv
probeId BetaVal Annot
a 1 X
b 2 Y
c 3 Z
File_B.tsv
probeId BetaVal Annot
a 4 X
b 5 Y
c 6 Z
File_C.tsv
probeId BetaVal Annot
a 7 X
b 8 Y
c 9 Z
How can I merge these files by the BetaVal column and use the file names as column names (also producing a tab-separated file)?
probeId File_A.tsv File_B.tsv File_C.tsv Annot
a 1 4 7 X
b 2 5 8 Y
c 3 6 9 Z
I was trying something like:
for file in *;
do
join -j 1 File_A file;
done
But this is not correct. Moreover, I am not sure about how to write file names as column names.
You may use this gnu awk:
awk -v OFS='\t' '{
a[$1][ARGIND] = (FNR==1?FILENAME:$2)
b[$1] = $3
}
END {
for (i in a) {
printf "%s", i
for(j in a[i])
printf "%s%s", OFS, a[i][j]
print OFS b[i]
}
}' File_[ABC].tsv | column -t
probeId File_A.tsv File_B.tsv File_C.tsv Annot
a 1 4 7 X
b 2 5 8 Y
c 3 6 9 Z
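The for (i in a) loops in the GNU awk answer visit keys in an unspecified order, so the header row is not guaranteed to come out first. A sketch of an order-preserving variant that should also run in a POSIX awk, assuming every file lists the same probe IDs. The array names row, key, cell and annot, the merged variable and the scratch directory are illustrative only:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three sample files from the question (tab separated).
printf 'probeId\tBetaVal\tAnnot\na\t1\tX\nb\t2\tY\nc\t3\tZ\n' > File_A.tsv
printf 'probeId\tBetaVal\tAnnot\na\t4\tX\nb\t5\tY\nc\t6\tZ\n' > File_B.tsv
printf 'probeId\tBetaVal\tAnnot\na\t7\tX\nb\t8\tY\nc\t9\tZ\n' > File_C.tsv

merged=$(awk -F'\t' -v OFS='\t' '
    FNR == 1 { ++nfiles }                                  # a new input file begins
    !($1 in row) { row[$1] = ++nkeys; key[nkeys] = $1 }    # remember first-seen key order
    { cell[$1, nfiles] = (FNR == 1 ? FILENAME : $2); annot[$1] = $3 }
    END {
        for (r = 1; r <= nkeys; r++) {
            line = key[r]
            for (f = 1; f <= nfiles; f++) line = line OFS cell[key[r], f]
            print line, annot[key[r]]
        }
    }' File_A.tsv File_B.tsv File_C.tsv)
printf '%s\n' "$merged"
```

Because rows are emitted in the order their keys were first seen, the header line stays on top without any extra sorting.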
echo -e "\nprobeId File_A.tsv File_B.tsv File_C.tsv Annot";\
join -o 1.1 1.2 2.2 2.3 -1 1 -2 1 File_A.tsv File_B.tsv|\
join -o 1.1 1.2 1.3 2.2 1.4 -1 1 -2 1 - File_C.tsv |\
awk '{printf(" %-8s %-12s %-12s %-12s %s\n", $1,$2,$3,$4,$5);}'|tail -n +2
probeId File_A.tsv File_B.tsv File_C.tsv Annot
a 1 4 7 X
b 2 5 8 Y
c 3 6 9 Z
I assumed that the first column is the key field. I had to guess at
what you intended, so it would be worth reading these links
to get a better understanding of join:
https://linuxconfig.org/learning-linux-commands-join
https://landoflinux.com/linux_join_command.html
join multiple files

join command leaving out a row of numbers

I have two files, I want to take out the rows which have common data in the third column. But it is leaving out a row which should be matched.
File1
b b b
4 5 3
c c c
File2
1 2 3 4
a b c d
e f g h
i j k l
l m n o
The output is:
c c c a b d
The command used is:
join -1 3 -2 3 --nocheck-order File1.txt File2.txt
It is missing out the row with 3 as the common field, even after placing the --nocheck-order
Edit:
Expected output:
c c c a b d
3 4 5 1 2 4
As an alternative to two sort commands (which can be very expensive for big files) followed by a join, you can use this single awk command to get your output:
awk 'FNR == NR{a[$3]=$1 OFS $2; next} $3 in a{print $3, a[$3], $1, $2, $4}' file1 file2
3 4 5 1 2 4
c c c a b d
Explanation:
FNR == NR { # while processing the first file
a[$3] = $1 OFS $2 # store the non-key fields in array a using $3 as key
next
}
$3 in a { # while processing the 2nd file, when $3 is found in array
print $3,a[$3],$1,$2,$4 # print the key, the remembered fields from the first file,
# and the remaining fields from the second file.
}
You need to sort your inputs (e.g. using process substitution):
$ join -1 3 -2 3 <(sort -k3,3 1.txt) <(sort -k3,3 2.txt)
3 4 5 1 2 4
c c c a b d
This is equivalent to:
$ sort -k3,3 1.txt > 1-sorted.txt
$ sort -k3,3 2.txt > 2-sorted.txt
$ join -1 3 -2 3 1-sorted.txt 2-sorted.txt
3 4 5 1 2 4
c c c a b d

how to use awk to merge files with common fields and print in another file

I have read all the related questions, but I am still quite confused...
I have two files tab separated.
file1 (breaks added for readability):
a 15 bac
g 10 bac
h 11 bac
r 33 arq
t 12 euk
file2 (breaks added for readability):
0 15 h 3 5 2 gf a a g e g s s g g
p 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Output desired (breaks added for readability):
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND g 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Just that. I need to print the complete file2, but with the first column replaced by the third column of file1, only when $2 of file2 is the same as $2 of file1...
file1 is larger than file2, but it could still happen that $2 from file2 is not present in file1; in that case, print ND in the first column.
I'm sure it must be simple, but I have problems handling two files with awk. Please, if someone could help me...
Using this awk command:
awk 'FNR==NR{a[$2]=$3;next} {$1=($2 in a)?a[$2]:"ND"} 1' file1 file2
bac 15 h 3 5 2 gf a a g e g s s g g
arq 33 g 4 5 2 hg 3 1 3 f 5 h 5 h 6
ND 4 r 8 j 9 jk 9 j 9 9 h t 9 k 0
Explanation:
FNR==NR - Execute this block for the first file in the input, i.e. file1
a[$2]=$3 - Populate an associative array a with $2 as key and $3 as value from file1
next - Skip to the next line until EOF on the first file
Now operating on file2
$1=($2 in a)?a[$2]:"ND" - Overwrite $1 with a[$2] if $2 is a key in array a, otherwise with the literal string "ND"
1 - Print the (possibly modified) line
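One subtlety in two-file lookups of this kind: testing the stored value (a[$2]) and testing key membership ($2 in a) give different results when a stored value is 0 or empty. A small sketch with made-up files (lookup.txt, data.txt and both variable names are assumptions for illustration) shows the difference:

```shell
workdir=$(mktemp -d)

# Hypothetical lookup table: key 15 maps to "bac", key 33 maps to the value 0.
printf '%s\n' 'x 15 bac' 'y 33 0' > "$workdir/lookup.txt"
# Hypothetical data file: keys 15 and 33 are known, key 4 is not.
printf '%s\n' 'p 15 q' 'p 33 q' 'p 4 q' > "$workdir/data.txt"

# Truthiness test: a stored value of 0 counts as false, so key 33 is reported as ND.
by_value=$(awk 'FNR == NR { a[$2] = $3; next }
                { $1 = (a[$2]) ? a[$2] : "ND" } 1' "$workdir/lookup.txt" "$workdir/data.txt")

# Membership test: only keys that are really absent get ND.
by_key=$(awk 'FNR == NR { a[$2] = $3; next }
              { $1 = ($2 in a) ? a[$2] : "ND" } 1' "$workdir/lookup.txt" "$workdir/data.txt")

printf 'by value:\n%s\nby key:\n%s\n' "$by_value" "$by_key"
```

The membership test also avoids creating empty array entries as a side effect of merely referencing a[$2].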
You could try with join + awk command as below:
join -t ' ' -a2 -1 2 -2 2 test1.txt test2.txt | awk 'BEGIN { start = 5; end = 18 } { if (NF == 16) { temp = $1; $1 = "ND " $2; $2 = temp; print } else { printf("%s %s ", $3, $1); for (i=start; i<=end; i++) printf ("%s ", $i); printf("\n");}}'

Get n last records and change particular columns on them

I have a file like this:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
* a
0 b
I want to delete a and b from the last two records in the END{} section.
Result:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
How can I get n last lines and change fields on them with awk?
Here's one way using any awk:
awk -v count=$(wc -l <file.txt) 'NR > count - 2 { $2 = "" }1' file.txt
Results:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
Or, to run the awk operation on all records except the last two lines of the input file from a shell script, try ./script.sh file.txt. Contents of script.sh:
command=$(awk -v count="$(wc -l <"$1")" 'NR <= count - 2 { $2 = "" }1' "$1")
echo -e "$command"
Results:
1 "45554323" p b
2 "34534567" f a
3 "76546787" u b
2 "56765435" f a
* a
0 b
If you know the value of n, the line number after which you want to delete the last item on each line (here 4), this will work:
awk '{if (NR>4) NF=NF-1}1' data.txt
will give:
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
NF = NF -1 makes awk think there is one less field on the line than there is, which is how it doesn't display the last column/item on the line once that condition is met. NR refers to the current line number in the file being read.
awk can't know the number of lines in a file unless it goes through it once, or is given that information (e.g., wc -l). An alternative approach would be to save the last n lines in a buffer (sort of a sliding window/tape-delay type analogy, you are always printing n lines behind) and then process the final n lines in the END block.
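The buffering idea described above can be sketched as follows. This is one possible reading of it, not a definitive implementation: it keeps a ring of n slots, prints each line once it is n lines old, and edits whatever is left in the ring in the END block. The variable names n, buf and buffered, and the shortened sample input, are assumptions.

```shell
buffered=$(printf '%s\n' '1 2 "45554323" p b' '2 4 "56765435" f a' '* a' '0 b' |
awk -v n=2 '
    NR > n { print buf[NR % n] }    # the line read n lines ago can be printed unchanged
    { buf[NR % n] = $0 }            # keep the newest line in a ring of n slots
    END {
        # Only the last n lines remain in the ring; emit them oldest first.
        for (i = (NR > n ? NR - n + 1 : 1); i <= NR; i++) {
            line = buf[i % n]
            sub(/[ \t]+[^ \t]+$/, "", line)   # drop the last field and the blanks before it
            print line
        }
    }')
printf '%s\n' "$buffered"
```

This reads the input only once and never holds more than n lines in memory, at the cost of output lagging n lines behind the input.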
This doesn't exactly answer your question but it produces the output you require:
$ gawk '{if (NF < 3) print $1; else print}' input.txt
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
$ cat file
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
* a
0 b
$ awk 'BEGIN{ARGV[ARGC]=ARGV[ARGC-1]; ARGC++} NR==FNR{nr++; next} FNR>(nr-2) {NF--} 1' file
1 2 "45554323" p b
2 2 "34534567" f a
3 3 "76546787" u b
2 4 "56765435" f a
*
0
or if you don't mind manually specifying the file name twice:
awk 'NR==FNR{nr++; next} FNR>(nr-2) {NF--} 1' file file

AWK -- How to do selective multiple column sorting?

In awk, how can I do this:
Input:
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
Desired output, by sorting numerical value at $5:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Note that the sorting should only affect $4, $5, and $6 (based on the value of $5), while the preceding part of the table remains intact.
This could be done in multiple steps with the help of paste:
$ gawk '{print $1, $2, $3}' in.txt > a.txt
$ gawk '{print $4, $5, $6}' in.txt | sort -k2,2n > b.txt
$ paste -d' ' a.txt b.txt
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
Personally, I find using awk to safely sort arrays of columns rather tricky because often you will need to hold and sort on duplicate keys. If you need to selectively sort a group of columns, I would call paste for some assistance:
paste -d ' ' <(awk '{ print $1, $2, $3 }' file.txt) <(awk '{ print $4, $5, $6 | "sort -k2,2n" }' file.txt)
Results:
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
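A note on the sort key for the right-hand columns: the sample values all have two digits, so a text sort and a numeric sort happen to agree on them. A small sketch with made-up values of different widths shows why a numeric key (-k2,2n) is the safer choice. It uses cut and temporary files instead of awk and process substitution, purely to keep the example portable; the file and variable names are assumptions.

```shell
workdir=$(mktemp -d)
# Made-up input: the fifth column holds numbers with different digit counts.
printf '%s\n' '1 a f 1 12 v' '2 b g 2 9 w' '3 c h 3 100 x' > "$workdir/in.txt"

# The left half stays in input order.
cut -d' ' -f1-3 "$workdir/in.txt" > "$workdir/left.txt"

# Right half sorted numerically on its second field: 9, 12, 100.
sorted_numeric=$(cut -d' ' -f4-6 "$workdir/in.txt" | sort -k2,2n |
                 paste -d' ' "$workdir/left.txt" -)

# For comparison, a plain text sort on the same field: 100, 12, 9.
sorted_text=$(cut -d' ' -f4-6 "$workdir/in.txt" | LC_ALL=C sort -k2,2 |
              paste -d' ' "$workdir/left.txt" -)

printf 'numeric:\n%s\ntext:\n%s\n' "$sorted_numeric" "$sorted_text"
```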
This can be done in pure awk, but as @steve said, it's not ideal. gawk has limited sort functions, and awk has no built-in sort at all. That said, here's a (rather hackish) solution using a compare function in gawk:
[ghoti@pc ~/tmp3]$ cat text
1 a f 1 12 v
2 b g 2 10 w
3 c h 3 19 x
4 d i 4 15 y
5 e j 5 11 z
[ghoti@pc ~/tmp3]$ cat doit.gawk
### Function to be called by asort().
function cmp(i1,v1,i2,v2) {
split(v1,a1); split(v2,a2);
if (a1[2]>a2[2]) { return 1; }
else if (a1[2]<a2[2]) { return -1; }
else { return 0; }
}
### Left-hand-side and right-hand-side, are sorted differently.
{
lhs[NR]=sprintf("%s %s %s",$1,$2,$3);
rhs[NR]=sprintf("%s %s %s",$4,$5,$6);
}
END {
asort(rhs,sorted,"cmp"); ### This calls the function we defined, above.
for (i=1;i<=NR;i++) { ### Step through the arrays and reassemble.
printf("%s %s\n",lhs[i],sorted[i]);
}
}
[ghoti@pc ~/tmp3]$ gawk -f doit.gawk text
1 a f 2 10 w
2 b g 5 11 z
3 c h 1 12 v
4 d i 4 15 y
5 e j 3 19 x
[ghoti@pc ~/tmp3]$
This keeps your entire input file in arrays, so that lines can be reassembled after the sort. If your input is millions of lines, this may be problematic.
Note that you might want to play with the printf and sprintf functions to set appropriate output field separators.
You can find documentation on using asort() with functions in the gawk man page; look for PROCINFO["sorted_in"].

Resources