CONCAT columns within a file - bash

I'd like to concatenate column2 until column4.
Example (first.txt):
|ID|column2|column3|column4|
|1 | a | b | c |
|2 | d | e | f |
To this (mynewfile.txt) :
ID|column2
1 | a b c
2 | d e f
This is my script in cygwin : $ awk '{print $2" "$3" "$4 }' first.txt > mynewfile.txt
Of course, it is not working out well.. How do I improve the script?

You need to set the field separator so that a pipe with optional whitespace around it is the field delimiter.
The pipe at the beginning of the line causes an empty field 1 before the pipe, so the ID is field 2, and columns 2-4 are fields 3-5. So it should be:
awk -F' *\\| *' 'NR == 1 {print "ID|column2|"} NR > 1 {printf("%d | %s %s %s |\n", $2, $3, $4, $5)}' first.txt > mynewfile.txt

Not especially general GNU sed method:
sed 's/^[|]//;1s/2.*/2/;1!{s/|/ /g2;s/ */ /2g}' first.txt
Output:
ID|column2
1 | a b c
2 | d e f

Related

How to add Column values based on unique value of a different column

I am trying to add values in Column B based on unique value in Column A. How can i do it using AWK (or) Any other way using bash?
Column_A | Column_B
--------------------
A | 1
A | 2
A | 1
B | 3
B | 8
C | 5
C | 8
Result:
Column_A | Column_B
--------------------
A | 6
B | 11
C | 13
Considering that your Input_file is same as shown, sorted with first field, could you please try following(will edit solution for alignment too soon).
awk '
BEGIN{
OFS=" | "
}
FNR==1 || /^-/{
print
next
}
prev!=$1 && prev{
print prev,sum
prev=sum=""
}
{
sum+=$NF
prev=$1
}
END{
if(prev && sum){
print prev,sum
}
}' Input_file
another awk
$ awk 'NR<3 {print; next}
{a[$1]+=$NF; line[$1]=$0}
END {for(k in a) {sub(/[0-9]+$/,a[k],line[k]); print line[k]}}' file
Column_A | Column_B
--------------------
A | 4
B | 11
C | 13
note that A totals to 4, not 6.
One possible solution (Assuming file is in CSV format):
Input :
$ cat csvtest.csv
A,1
A,2
A,3
B,3
B,8
C,5
C,8
$ cat csvtest.csv | awk -F "," '{arr[$1]+=$2} END {for (i in arr) {print i","arr[i]}}'
A,6
B,11
C,13

Combine multiple grep variables in one column-wise file

I have some grep expressions which count the number of lines matching a string, each one for a group of files with different extension:
Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)
Each of these greps outputs the sample name and the number of occurences, as follows.
The first one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
The second one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
And so on. I need to create a .txt file with these numerical grep outputs as columns taking the sample name as a key column. The sample name is the part of the file name before "_R1" (V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA, V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG...):
Sample | Nreads_ini | Nreads_align |
-----------------------------------------------------------------------
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589 |
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934 |
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981 |
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896 |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617 |
Any idea? Is there another easier solution for my problem?
Thanks!
In this answers the variable names are shortened to ini and align.
First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
Then we join the extracted data into one file. Lines with the same sample name will be combined.
join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
Now we nearly have the expected output. We only have to add the header and draw lines for the table.
join ... | column -to " | " -N Sample,ini,align
This will print
Sample | ini | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617
Adding a horizontal line after the header is left as an exercise for the reader :)
This approach also works with more than two number columns. The join and -N parts have to be extended. join can only work with two files, requiring us to use an unwieldy workaround ...
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN
... so it would be easier to add another helper function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
Alternatively, if all inputs contain the same samples in the same order, join can be replaced with one single paste command.
Assuming you have files containing the data you want parse:
$ cat file1
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
$ cat file2
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
$ cat file3 # This is a copy of file2 but could be different
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
If there is a key like V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT, you could use awk:
$ awk -F'[/.:]' '
BEGINFILE{
col[FILENAME]
}
{
row[$2]
a[FILENAME,$2]=$NF
next
}
END{
for(i in row) {
printf "%s ",substr(i,1,length(i)-3)
for(j in col)
printf "%s ",a[j SUBSEP i]; printf "\n"
}
}' file1 file2 file3
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG 13424 12896 12896
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT 13175 12589 12589
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT 13475 12981 12981
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT 14801 13934 13934
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA 12053 11617 11617
This awk script fills 3 array col, row and a that respectively stores the column name (filename), the row content and the values for all files.
The END statement prints the content of the array a by looping through all rows and columns.
If you need table decoration, use this:
{ printf "Sample Nreads_ini Nreads_align Nreads_align \n"; awk -F'[/.:]' 'BEGINFILE{col[FILENAME]}{row[$2];a[FILENAME,$2]=$NF;next}END{for(i in row) { printf "%s ",substr(i,1,length(i)-3); for(j in col) printf "%s ",a[j SUBSEP i]; printf "\n" }}' file1 file2 file3; } | column -t -s' ' -o ' | '
Could you please try following and let me know if this helps you.
awk --re-interval -F"[/.:]" '
BEGIN{
print "Sample | Nreads_ini | Nreads_align |"
}
FNR==NR{
match($2,/.*[A-Z]{10}/);
array[substr($2,RSTART,RLENGTH)]=$NF;
next
}
match($2,/.*[A-Z]{10}/) && (substr($2,RSTART,RLENGTH) in array){
print substr($2,RSTART,RLENGTH),array[substr($2,RSTART,RLENGTH)],$NF
}
' OFS=" | " first_one second_one | column -t
Output will be as follows.
Sample | Nreads_ini | Nreads_align |
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617

Replace string in Nth array

I have a .txt file with strings in arrays which looks like these:
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 0
And i want to add counts, so i need to replace last digit, but i need to change it only for the one line.
I am getting new needed value by this function:
$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"
And them I want to replace it with new value, which will be bigger for 1.
And im doing it right in there.
counts=(("$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"+1))
Where IdOpen is an id of the array, where i need to replace string.
So i have tried to replace the whole array by these:
counter="$(awk -v i=$idOpen 'BEGIN{FNqR == i}{$7+=1} END{ print $0}' bookmarks)"
N=$idOpen
sed -i "{N}s/.*/${counter}" bookmarks
But it doesn't work!
So is there a way to replace only last string with value which i have got earlier?
As result i need to get:
id | String1 | String2 | Counts
1 | Abc | Abb | 1 # if idOpen was 1 for 1 time
2 | Cde | Cdf | 2 # if idOpen was 2 for 2 times
And the last number will be increased by 1 everytime when i will activate these commands.
awk solution:
setting idOpen variable(for ex. 2):
idOpen=2
awk -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file > tmp && mv tmp file
The output(after executing the above command twice):
cat file
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 2
NR>1 - skipping the header line

replace string in comma delimiter file using nawk

I need to implement the if condition in the below nawk command to process input file if the third column has more that three digit.Pls help with the command what i am doing wrong as it is not working.
inputfile.txt
123 | abc | 321456 | tre
213 | fbc | 342 | poi
outputfile.txt
123 | abc | 321### | tre
213 | fbc | 342 | poi
cat inputfile.txt | nawk 'BEGIN {FS="|"; OFS="|"} {if($3 > 3) $3=substr($3, 1, 3)"###" print}'
Try:
awk 'length($3) > 3 { $3=substr($3, 1, 3)"###" } 1 ' FS=\| OFS=\| test1.txt
This works with gawk:
awk -F '[[:blank:]]*\\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt
It won't preserve the whitespace so you might want to pipe through column -t

retrieve and add two numbers of files

In my file I have following structure :-
A | 12 | 10
B | 90 | 112
C | 54 | 34
What I have to do is I have to add column 2 and column 3 and print the result with column 1.
output:-
A | 22
B | 202
C | 88
I retrieve the two columns but dont know how to add
What I did is :-
cut -d ' | ' -f3,5 myfile.txt
How to add those columns and display.
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk.
awk '{print $1," | ",($3+$5)'} myfile.txt
wil work perhaps.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS =/= OFS in this case due to formatting
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.

Resources