Combine multiple grep variables in one column-wise file - bash

I have some grep expressions which count the number of lines matching a string, each one for a group of files with different extension:
Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)
Each of these greps outputs the sample name and the number of occurrences, as follows.
The first one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
The second one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
And so on. I need to create a .txt file with these numerical grep outputs as columns taking the sample name as a key column. The sample name is the part of the file name before "_R1" (V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA, V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG...):
Sample | Nreads_ini | Nreads_align |
-----------------------------------------------------------------------
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589 |
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934 |
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981 |
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896 |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617 |
Any idea? Is there another easier solution for my problem?
Thanks!

In this answer, the variable names are shortened to ini and align.
First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
Then we join the extracted data into one file. Lines with the same sample name will be combined.
join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
Now we nearly have the expected output. We only have to add the header and draw lines for the table.
join ... | column -to " | " -N Sample,ini,align
This will print
Sample | ini | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617
Adding a horizontal line after the header is left as an exercise for the reader :)
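One way to do that exercise, as a sketch only: a small awk filter that repeats the header line with every character replaced by a dash. The function name add_rule is made up for this example, and it assumes that the first input line is the header.

```shell
# add_rule: print the first line, then a rule of dashes of the same length,
# then every remaining line unchanged.
add_rule() {
    awk 'NR == 1 { print; gsub(/./, "-"); print; next } { print }'
}

# Demo on two lines shaped like the table above.
printf '%s\n' \
    'Sample                                  | ini   | align' \
    'V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589' |
    add_rule
```

In the pipeline above it would go last: join ... | column ... | add_rule.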
This approach also works with more than two number columns. The join and -N parts have to be extended. join can only work with two files, requiring us to use an unwieldy workaround ...
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN
... so it would be easier to add another helper function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
Alternatively, if all inputs contain the same samples in the same order, join can be replaced with one single paste command.
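A minimal sketch of that paste variant, assuming every grep produced the same samples in the same order. The helper names names and counts are made up here, the data is shortened to two samples from the question, and temporary files stand in for process substitution so that the sketch also runs in a plain sh.

```shell
# Stand-ins for the grep -c results (two samples taken from the question).
ini='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801'
align='PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934'

# names: keep only the sample name (basename up to "_R1").
names() { sed -E 's,^.*/(.*)_R1.*$,\1,'; }
# counts: keep only the number after the last colon.
counts() { sed 's/.*://'; }

tmp=$(mktemp -d)
printf '%s\n' "$ini"   | names  > "$tmp/sample"
printf '%s\n' "$ini"   | counts > "$tmp/ini"
printf '%s\n' "$align" | counts > "$tmp/align"

# paste glues the files together line by line, separated by tabs.
table=$(paste "$tmp/sample" "$tmp/ini" "$tmp/align")
rm -r "$tmp"

printf '%s\n' "$table"
```

Unlike join, paste never checks the sample names, so a missing or reordered sample in one input silently shifts the numbers into the wrong row.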

Assuming you have files containing the data you want to parse:
$ cat file1
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
$ cat file2
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
$ cat file3 # This is a copy of file2 but could be different
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
If there is a key like V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT, you could use awk:
$ awk -F'[/.:]' '
BEGINFILE{
col[FILENAME]
}
{
row[$2]
a[FILENAME,$2]=$NF
next
}
END{
for(i in row) {
printf "%s ",substr(i,1,length(i)-3)
for(j in col)
printf "%s ",a[j SUBSEP i]; printf "\n"
}
}' file1 file2 file3
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG 13424 12896 12896
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT 13175 12589 12589
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT 13475 12981 12981
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT 14801 13934 13934
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA 12053 11617 11617
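This awk script (BEGINFILE requires GNU awk) fills three arrays, col, row and a, which respectively store the column names (file names), the row keys (sample names) and the values for all files.
The END block prints the content of the array a by looping through all rows and columns. The order in which for (i in row) visits the keys is unspecified, which is why the rows above are not sorted.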
If you need table decoration, use this:
{ printf "Sample Nreads_ini Nreads_align Nreads_align \n"; awk -F'[/.:]' 'BEGINFILE{col[FILENAME]}{row[$2];a[FILENAME,$2]=$NF;next}END{for(i in row) { printf "%s ",substr(i,1,length(i)-3); for(j in col) printf "%s ",a[j SUBSEP i]; printf "\n" }}' file1 file2 file3; } | column -t -s' ' -o ' | '
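If the row and column order matters, a variant along the following lines might help. It is only a sketch: merge_counts is a made-up name, it avoids BEGINFILE so it should also work in awks other than gawk, and it assumes that no input file is empty and that the path contains no colon.

```shell
# merge_counts FILE...: one output row per sample, in order of first appearance,
# with one count column per input file, in command-line order.
merge_counts() {
    awk -F'[/:]' '
    FNR == 1 { nfiles++ }                      # a new input file starts
    {
        name = $(NF - 1)                       # basename of the counted file
        sub(/_R1.*/, "", name)                 # keep the part before "_R1"
        if (!(name in seen)) { seen[name] = 1; order[++nrows] = name }
        count[name, nfiles] = $NF              # number after the colon
    }
    END {
        for (r = 1; r <= nrows; r++) {
            line = order[r]
            for (c = 1; c <= nfiles; c++) line = line " " count[order[r], c]
            print line
        }
    }' "$@"
}

# Demo with two small files built from the question data.
demo=$(mktemp -d)
printf '%s\n' \
    'PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175' \
    'PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801' > "$demo/file1"
printf '%s\n' \
    'PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589' \
    'PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934' > "$demo/file2"
merged=$(merge_counts "$demo/file1" "$demo/file2")
rm -r "$demo"
printf '%s\n' "$merged"
```

A sample missing from one file gets an empty cell, so its later columns shift left; print a placeholder such as 0 instead if that matters.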

Could you please try the following and let me know if this helps you.
awk --re-interval -F"[/.:]" '
BEGIN{
print "Sample | Nreads_ini | Nreads_align |"
}
FNR==NR{
match($2,/.*[A-Z]{10}/);
array[substr($2,RSTART,RLENGTH)]=$NF;
next
}
match($2,/.*[A-Z]{10}/) && (substr($2,RSTART,RLENGTH) in array){
print substr($2,RSTART,RLENGTH),array[substr($2,RSTART,RLENGTH)],$NF
}
' OFS=" | " first_one second_one | column -t
Output will be as follows.
Sample | Nreads_ini | Nreads_align |
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617

Related

Bash extract strings between two characters

I have the output of query result into a bash variable, stored as a single line.
-------------------------------- | NAME | TEST_DATE | ----------------
--------------------- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | -------------------------------------
I would like to ignore the column names(NAME | TEST_DATE) and store actual values of each name and test_date as a tuple in an array.
So here is the logic I am thinking, I would like to extract third string onwards between two '|' characters. These strings are comma separated and when a space is encountered we start the next tuple in the array.
Expected output:
array=(TESTTT_1,2019-01-15 TEST_2,2018-02-16 TEST_NAME_3,2020-03-17)
Any help is appreciated. Thanks.
Let's say your string is stored in the variable a (or pipe your query output to the command below):
echo "$a"
-------------------------------- | NAME | TEST_DATE | ----------------
--------------------- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | ------------------------------------
Command to obtain desired results is:
array="$(echo "$a" | cut -d '|' -f2,3,5,6,8,9 | tail -n1 | sed 's/ | /,/g')"
The above stores the output in a variable named array, as you expected.
Output of above command is:
echo "$array"
TESTTT_1,2019-01-15,TEST_2,2018-02-16,TEST_NAME_3,2020-03-17
Explanation of the command: the output of echo "$a" is piped into cut, which uses '|' as the delimiter and keeps fields 2,3,5,6,8,9. The result is piped into tail, which drops the line holding the unwanted NAME and TEST_DATE column names and leaves only the values. Finally, to match your expected output, sed converts each | to a comma.
This string contains only three dates. If you have more, add more field numbers to the cut command; given the format of your string, they follow the pattern 2,3,5,6,8,9,11,12,14,15 and so on.
Hope this solves your problem.
echo "$a" | awk -F "|" '{ for(i=2; i<=NF; i++){ print $i }}' | sed -e '1,3d' -e '$d' | tr ' ' '\n' | sed '/^$/d' | sed 's/^/,/g' | sed -e 'N;s/\n/ /' | sed 's/^.//g' | xargs | sed 's/ ,/, /g'
Above is awk based solution
Output:
TESTTT_1, 2019-01-15 TEST_2, 2018-02-16 TEST_NAME_3, 2020-03-17
Is that OK?
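Neither command above produces one NAME,DATE tuple per element, which is what the question asked for. A possible sketch that does: extract_pairs is a made-up name, and it assumes exactly two header columns (NAME and TEST_DATE) and values that contain no '|'.

```shell
# extract_pairs STRING: print one "NAME,DATE" pair per line.
extract_pairs() {
    printf '%s\n' "$1" | awk -F'[|]' '
    {
        for (i = 1; i <= NF; i++) {
            cell = $i
            gsub(/^[ \t]+|[ \t]+$/, "", cell)                # trim the padding
            if (cell == "" || cell ~ /^[- ]+$/) continue     # empty cell or dashed rule
            kept++
            if (kept <= 2) continue                          # skip the two header names
            if (kept % 2 == 1) name = cell                   # odd: a name
            else print name "," cell                         # even: its date
        }
    }'
}

a='-------------------------------- | NAME | TEST_DATE | ---------------- --------------------- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | -------------------------------------'
extract_pairs "$a"
```

In bash 4 or later, mapfile -t array < <(extract_pairs "$a") should then load the pairs into an array, one tuple per element.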

CONCAT columns within a file

I'd like to concatenate column2 through column4.
Example (first.txt):
|ID|column2|column3|column4|
|1 | a | b | c |
|2 | d | e | f |
To this (mynewfile.txt) :
ID|column2
1 | a b c
2 | d e f
This is my script in cygwin : $ awk '{print $2" "$3" "$4 }' first.txt > mynewfile.txt
Of course, it is not working out well.. How do I improve the script?
You need to set the field separator so that a pipe with optional whitespace around it is the field delimiter.
The pipe at the beginning of the line causes an empty field 1 before the pipe, so the ID is field 2, and columns 2-4 are fields 3-5. So it should be:
awk -F' *\\| *' 'NR == 1 {print "ID|column2|"} NR > 1 {printf("%d | %s %s %s |\n", $2, $3, $4, $5)}' first.txt > mynewfile.txt
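To see why the ID ends up in field 2, it can help to print what awk makes of one line with that kind of separator. A small sketch: show_fields is a made-up helper, and the bracket form [|] is used here instead of the escaped pipe, which should match the same thing.

```shell
# show_fields: print the field count and every field of each input line,
# splitting on a pipe with optional spaces around it.
show_fields() {
    awk -F' *[|] *' '{
        printf "NF=%d\n", NF
        for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i
    }'
}

printf '%s\n' '|1 | a | b | c |' | show_fields
```

The leading and trailing pipes each add an empty field, so the line has six fields and the ID is $2.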
Not especially general GNU sed method:
sed 's/^[|]//;1s/2.*/2/;1!{s/|/ /g2;s/ */ /2g}' first.txt
Output:
ID|column2
1 | a b c
2 | d e f

awk query with numbers vs. strings

I am writing a function in R that will generate an awk script to pull in rows from a csv according to conditions that a user selected through a UI.
This is the example of the string generated by the function:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == "20116688") && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
It doesn’t return anything because $3 is a numeric variable. Neither does:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == Disregard) {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
… because $20 is a string.
This returns a portion of the dataset:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
|---------+------------+------+------------|
| 5058.0 | 20116688.0 | 4162 | Disregard |
|---------+------------+------+------------|
| 5060.0 | 20116688.0 | 3622 | Disregard |
| 5060.0 | 20116688.0 | 3619 | Disregard |
| 5061.0 | 20116688.0 | 766 | Disregard |
| 5059.0 | 20116688.0 | 3603 | Disregard |
| 5055.0 | 20116688.0 | 1013 | Disregard |
| 5058.0 | 20116688.0 | 1012 | Disregard |
| 5055.0 | 20116688.0 | 4163 | Disregard |
| 5060.0 | 20116688.0 | 4225 | Disregard |
| 5061.0 | 20116688.0 | 3466 | Disregard |
|---------+------------+------+------------|
Unfortunately, I don’t currently have a way of anticipating which of the variables that the user selects through the UI will be strings and which numeric (I know how to do that, but it would take time that I’d rather not spend if there is a workaround). Is there a way to cast each variable to a string before the comparison, or some other way of dealing with this issue?
Edit This is what the raw data look like:
$ csvcut -c15:20 faults_main_only_dp_1_shopFlag.csv | head
faultActiveLongitude,faultActiveAltitude,faultCode,faultSoftwareVersion,stateID,stateName
-0.8100106,-1.0,3604,25.07.01 11367,2.0,Work Item
-0.81860137,840.0,766,25.07.01 11367,5.0,Disregard
-0.8100140690000001,-1.0,4279,25.07.01 11367,2.0,Work Item
-0.8100509640000001,-2.0,4279,25.07.01 11367,2.0,Work Item
-0.8102342,14.0,3604,25.07.01 11367,2.0,Work Item
-0.8181563620000001,831.0,3604,25.07.01 11367,5.0,Disregard
-0.81022054,11.0,3604,25.07.01 11367,2.0,Work Item
-0.8102272,11.0,4279,25.07.01 11367,2.0,Work Item
-0.8083836999999999,17.0,766,25.07.01 11367,5.0,Disregard
awk can compare a number with a string if the token can be converted. Note that you're using a comma as the field separator, so any spaces will be part of the fields. If your numbers are integers and the problem is not the decimal point, check these three cases:
$ echo "42,42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42, 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42 , 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
does not work
The field that is compared as a string (the first one) must not contain the space!
You can try setting your field separator to " *, *".
UPDATE: If your integers get .0 floating point extensions which you can ignore, convert them to int before the comparison:
$ echo "42.0 , 42" | awk -v FS=" *, *" 'int($1)=="42" && $2=="42"{print "works";next} {print "does not work"}'
works
Here the generic value is quoted, but the field is converted to int before it is converted back to a string for the comparison. You still need to know which fields are numeric and which are strings, though.
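If the types are not known in advance, another thing that may be worth trying is to pass the user's values with -v instead of pasting them into the program text. POSIX awk treats a -v value that looks like a number as a numeric string, so a comparison with a numeric-looking field is done numerically, and as a string comparison otherwise. The sketch below uses made-up three-column data shaped like the question's; the variable names id and state and the function name filter_rows are assumptions.

```shell
# filter_rows ID STATE: keep rows whose 2nd field equals ID and whose 3rd field
# equals STATE, letting awk pick numeric or string comparison per value.
filter_rows() {
    awk -F, -v id="$1" -v state="$2" '$2 == id && $3 == state { print $1 "," $2 "," $3 }'
}

printf '%s\n' \
    '5058.0,20116688.0,Disregard' \
    '5059.0,20116688.0,Work Item' \
    '5060.0,123.0,Disregard' |
    filter_rows 20116688 Disregard
```

Here 20116688 matches the field 20116688.0 because both sides look numeric, while Disregard is compared as text. Note that awk processes backslash escapes in -v values, so user input containing backslashes would need extra care.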

How to format the output based on the maximum column length

Formatting the output based on the maximum column length. How can I achieve this?
Shell script or any tools is fine.
Input
| date | ID | Typ | Actn |
| 11/29/13 | ID660011 | DP | A |
| 11/29/13 | ID6600123 | DP | A |
Output
| date | ID | Typ| Actn|
| 11/29/13| ID660011 | DP | A |
| 11/29/13| ID6600123| DP | A |
EDIT:
If I use column -t, these are the errors:
$ column -t -s'|' -o'|' input_file .feature > input_file _check.feature
column: illegal option -- o
usage: column [-tx] [-c columns] [-s sep] [file ...]
$ echo $SHELL
/usr/local/bin/bash
$ column -t input_file .feature > input_file _check.feature
column: line too long
On your shell terminal, try this one:
$ awk '
{
for(i=1;i<=NF;i++)
printf("%-40s%c", $i, (i==NF) ? ORS : "")
}' FS=, file.txt
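The command above pads every comma-separated field to a fixed 40 characters. For the pipe-delimited input in the question, a sketch along these lines might be closer to what was asked: it reads everything first, records the widest cell of each column, then prints each cell padded to that width. The name align_table is made up, and it assumes every line starts and ends with a '|'.

```shell
# align_table: pad each cell of a pipe-delimited table to its column's maximum width.
align_table() {
    awk -F'[|]' '
    {
        nf[NR] = NF
        for (i = 2; i < NF; i++) {                 # $1 and $NF lie outside the outer pipes
            cell = $i
            gsub(/^[ \t]+|[ \t]+$/, "", cell)      # drop the existing padding
            cells[NR, i] = cell
            if (length(cell) > width[i]) width[i] = length(cell)
        }
    }
    END {
        for (r = 1; r <= NR; r++) {
            line = "|"
            for (i = 2; i < nf[r]; i++)
                line = line sprintf(" %-" width[i] "s |", cells[r, i])
            print line
        }
    }'
}

printf '%s\n' \
    '| date | ID | Typ | Actn |' \
    '| 11/29/13 | ID660011 | DP | A |' \
    '| 11/29/13 | ID6600123 | DP | A |' |
    align_table
```

Because the whole input is held in memory until the END block, this suits tables of moderate size; it does not need the -o option that the column on your system rejects.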

replace string in comma delimiter file using nawk

I need to implement an if condition in the nawk command below so that it changes the third column only when it has more than three digits. Please help with the command; what am I doing wrong? It is not working.
inputfile.txt
123 | abc | 321456 | tre
213 | fbc | 342 | poi
outputfile.txt
123 | abc | 321### | tre
213 | fbc | 342 | poi
cat inputfile.txt | nawk 'BEGIN {FS="|"; OFS="|"} {if($3 > 3) $3=substr($3, 1, 3)"###" print}'
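Try (this assumes the fields have no padding around the |; with the spaces in the sample input, length($3) exceeds 3 on every line):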
awk 'length($3) > 3 { $3=substr($3, 1, 3)"###" } 1 ' FS=\| OFS=\| test1.txt
This works with gawk:
awk -F '[[:blank:]]*\\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt
It won't preserve the whitespace so you might want to pipe through column -t
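If the spacing has to stay exactly as it is, one possible alternative is to split on the bare pipe and edit only the digit run inside the third field. This is a sketch, not a tested replacement for the commands above: mask_third is a made-up name, and the pattern spells out four digits instead of using {4,} so that it does not depend on interval-expression support.

```shell
# mask_third: in the 3rd pipe-separated field, keep the first three digits of a
# run of four or more digits and replace the rest of the run with "###".
mask_third() {
    awk 'BEGIN { FS = OFS = "|" }
    match($3, /[0-9][0-9][0-9][0-9]+/) {
        # text up to the third digit, the mask, then whatever followed the digits
        $3 = substr($3, 1, RSTART + 2) "###" substr($3, RSTART + RLENGTH)
    }
    { print }'
}

printf '%s\n' \
    '123 | abc | 321456 | tre' \
    '213 | fbc | 342 | poi' |
    mask_third
```

The padding inside each field is part of the field here, so it survives the rewrite and no pass through column -t is needed.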
