Read and sum occurrence lines in bash - bash

I have a file that includes lines below separated by comma ;
filename.txt
usernameA,10,10
usernameB,20,20
usernameA,10,10
usernameB,20,20
usernameC,10,10
I just want to parse the file and add numbers by username if occurs multiple times , so the result should be ;
usernameA=40
usernameB=80
usernameC=20
How can i achive this result using Bash script ?
Thank you,

$ awk -F, '{a[$1]+=$2+$3}END{for(x in a)print x "=" a[x]}' file
usernameA=40
usernameB=80
usernameC=20
This works for the given example.

Related

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here my best attempt so far.
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if and else loop aren't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a > The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ will perform a regex test, and ^> means "Starts with >". Making the test: ($1~/^>/)
If so it will set the first field to the output you are looking for (using substr() to get the bits of the string you want. {$1=$1"|population:"substr($1,2,3)}
Finally it will print out the entire record (with the changes if applicable): {}1 which is shorthand for {print $0} or.. print the entire record.

Duplicate and modify words in bash with sed

I'm on Linux OS. I have a file to modify in my bash script.
My original file is like that:
...
ERIC-1898
HELENE-5456
THOMAS-54565
IRON-06516
...
And I'd like to modify this file with duplicate words (and -SYSTEM- word in second field), and add double quotes.
So, the result has to be like that:
...
"ERIC-1898" "ERIC-SYSTEM-1898"
"HELENE-5456" "HELENE-SYSTEM-5456"
"THOMAS-54565" "THOMAS-SYSTEM-54565"
"IRON-06516" "IRON-SYSTEM-06516"
...
How can I do that, for example with sed?
With sed and two capture groups:
$ sed 's/\(.*-\)\(.*\)/"&" "\1SYSTEM-\2"/' infile
"ERIC-1898" "ERIC-SYSTEM-1898"
"HELENE-5456" "HELENE-SYSTEM-5456"
"THOMAS-54565" "THOMAS-SYSTEM-54565"
"IRON-06516" "IRON-SYSTEM-06516"
Assuming that there is exactly one hyphen per input line.
awk solution:
awk -F'-' '{printf("\"%s\" \"%s-SYSTEM-%s\"\n", $1FS$2,$1,$2)}' file
The output would be like:
"ERIC-1898" "ERIC-SYSTEM-1898"
"HELENE-5456" "HELENE-SYSTEM-5456"
"THOMAS-54565" "THOMAS-SYSTEM-54565"
"IRON-06516" "IRON-SYSTEM-06516"
Not using external program:
#!/bin/bash
IFS=$'-'
while read -r first second;do
echo "\"$first-$second\" \"$first-SYSTEM-$second\""
done <infile
awk '{sub(/ /,"\" \"");print "\042" $0 "\042"}' file
"ERIC-1898" "ERIC-SYSTEM-1898"
"HELENE-5456" "HELENE-SYSTEM-5456"
"THOMAS-54565" "THOMAS-SYSTEM-54565"
"IRON-06516" "IRON-SYSTEM-06516"

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimeters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN you can simply match values where the tens place is 1-9, e.g.:
grep ';AF=0.[1-9][0-9];' your_file.csv
You could add a + after the second character group to support additional digits (i.e. 0.NNNNN) but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports alphanumerical comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields by ;. $(NF-1) address the second last field in the line. (NF is the number of fields)

Separating joined columns with awk

I have a data file which looks like the following:
0.00000-130250.92921 28880.20200-159131.13121 301.58706
0.05000-130250.73120 28156.69202-158407.42322 294.03167
0.10000-130250.79137 28237.16138-158487.95275 294.87198
0.15000-130250.81209 28168.63042-158419.44250 294.15634
0.20000-130250.82418 28149.57611-158400.40029 293.95736
0.25000-130250.88438 28069.57135-158320.45573 293.12189
0.30000-130251.06059 28071.30576-158322.36635 293.14000
0.35000-130250.96639 28084.46351-158335.42990 293.27741
as you can see some of the columns which start with "-" sign are
joined to the previous one, for instance: 0.35000-130250.96639
this should be 0.35000 and -130250.96639. I can separate the
columns with VIM but I wanted to know if it is possible to do that
with AWK.
Thanks.
You can use sed: replace each - with a space and -:
sed -e 's/-/ -/g' input > output
The /g means globally, i.e. it replaces all occurrences on each line, not just the first one.
Using just awk
awk '{ gsub("-"," -") ; print }'

convert multiply lines between pattern to a comma separated string

I need help in processing data from STDIN (data is taken from another file with 'tail -f' plus grepped to filter out garbage). There are several lines between patterns:
<DN> 589</DN>
<DD>03.12.2014</DD>
<ST> </ST>
<STC>0</STC>
<STT>0</STT>
<PU>5</PU>
<OT>01</OT>
<DSN></DSN>
<NRA>40807,40820,426,30231,40818,30230</NRA>
<GR>300 000-00
&#10</GR>
then next block with DN/GR starts
I need to convert lines between and to a single line, comma-separated:
<DN> 589</DN>,<DD>03.12.2014</DD>,<ST> </ST>,<STC>0</STC>,<STT>0</STT>,<PU>5</PU>,<OT>01</OT>,<DSN></DSN>,<NRA>40807,40820,426,30231,40818,30230</NRA>,<GR>300 000-00
&#10</GR>
I need a one-liner with awk or sed or perl to do it and put result to STDOUT.
I've tried to do it, but failed due to lack of experience. Also tried to google and didn't find a working solution.
whatever..| awk '{sub(/^\s*/,"");printf "%s%s",$0,(/\/GR>\s*$/?"\n":",")}'
this line does:
remove the leading spaces from each line
join all line with sep , till the block end /GR>
if you have x data blocks, it gives you x long lines.
sed -nr '/<DN>/,/<GR>/{ H; /<GR>/{ g; s%\n%,%g; s%^,%%; p; s%.*%%; h }; }' <<'EOSEQ'
<DN> 589</DN>
<DD>03.12.2014</DD>
<STC>0</STC>
<GR>300 000-00
&#10</GR>
<DN>900</DN>
<DD>20.11.2014</DD>
<OT>01</OT>
<NRA>40807,40820,426,30231,40818,30230</NRA>
<GR>300 000-00
&#10</GR>
EOSEQ
SED one-liner, as you wish :)
Using awk you could do the following:
awk '{printf ("%s,", $NF)}' test.txt ##Will have comma at the end which may/may not be ok for you.
You can use the following one in sed.
sed -r ':loop ;N;s/(.*)\n(.*)/\1,\2/ ; t loop ' file name.

Resources