I have a file with the following content (snippet) -- the text could be anywhere in the file.
More text here
Things-I-DO-NOT-NEED:
name: "Orange"
count: 8
count: 10
Things-I-WANT:
name: "Apple"
count: 3
count: 4
More text here
I would like to replace (including indentation):
Things-I-WANT:
name: "Apple"
count: 3
count: 4
with
Things-I-WANT:
name: "Banana"
count: 7
Any suggestions on achieving it using awk/sed? Thanks!
You can do this in awk:
#!/usr/bin/env awk
# Helper variable
{DEFAULT = 1}
# Matches a line that begins with a letter
/^[[:alpha:]]+/ {
# This matches "Things-I-WANT:"
if ($0 == "Things-I-WANT:") {
flag = 1
}
# Matches a line that begins with a letter and comes
# right after the "Things-I-WANT:" block
else if (flag == 1) {
print "\tname: \"Banana\""
print ""
print "\tcount: 7"
print ""
flag = 0
}
# Matches any other line that begins with a letter
else {
flag = 0
}
print $0
DEFAULT = 0
}
# If the line does not begin with a letter, do this
DEFAULT {
# Print any line that's not within "Things-I-WANT:" block
if (flag == 0) {
print $0
}
}
You can run this in bash using:
$ awk -f test.awk test.txt
The output of this script will be:
More text here
Things-I-DO-NOT-NEED:
name: "Orange"
count: 8
count: 10
Things-I-WANT:
name: "Banana"
count: 7
More text here
As you can see, the Things-I-WANT: block has been replaced.
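The same idea also fits in a single command. Here is a minimal sketch that, like the script above, assumes the lines inside a block are indented, so that only section headers start with a letter:
awk '
/^[[:alpha:]]/ { skip = ($0 == "Things-I-WANT:") }  # toggle at every section header
!skip || /^[[:alpha:]]/ { print }                   # keep headers and everything outside the block
$0 == "Things-I-WANT:" { print "\tname: \"Banana\""; print "\tcount: 7" }
' test.txt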
I have a file containing a pattern with 's' which I need to convert into 'ms' by multiplying by 1000. I am unable to do it. Please help me.
file.txt
First launch 1
App: +1s170ms
First launch 2
App: +186ms
First launch 3
App: +1s171ms
First launch 4
App: +1s484ms
First launch 5
App: +1s227ms
First launch 6
App: +204ms
First launch 7
App: +1s180ms
First launch 8
App: +1s177ms
First launch 9
App: +1s183ms
First launch 10
App: +1s155ms
My code:
awk 'BEGIN { FS="[: ]+"}
/:/ && $2 ~/ms$/{vals[$1]=vals[$1] OFS $2+0;next}
END {
for (key in vals)
print key vals[key]
}' file.txt
Expected output:
App 1170 186 1171 1484 1227 204 1180 1177 1183 1155
Output I am getting:
App 1 186 1 1 1 204 1 1 1 1
How can I convert the above pattern 's' to 'ms' when the second pattern appears?
What I will try to do here is explain it a bit generically and then apply it to your case.
Question: I have a string of the form 123a456b7c8d where the numbers are numeric integral values of any length and the letters are corresponding units. I also have conversion factors to convert from unit a,b,c,d to unit f. How can I convert this to a single quantity of unit f?
Example: from 1s183ms to 1183ms
Strategy:
create, per string, a set of key-value pairs 'a' => 123, 'b' => 456, 'c' => 7 and 'd' => 8
multiply each value by the correct conversion factor
add the numbers together
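Worked through with the example above: 1s183ms splits into the pairs 's' => 1 and 'ms' => 183; with conversion factors 1000 for 's' and 1 for 'ms', the total is 1*1000 + 183*1 = 1183, i.e. 1183ms.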
Assume we use awk and the key-value pairs are stored in array a with the key as an index.
Extract key-value pairs from str:
function extract(str,a, t,k,v) {
delete a; t=str
while(t!="") {
v=t+0                       # the numeric prefix, e.g. 1 in "1s183ms"
match(t,/[a-zA-Z]+/)        # find the unit that follows it
k=substr(t,RSTART,RLENGTH)  # the unit itself, e.g. "s"
t=substr(t,RSTART+RLENGTH)  # drop the processed part, leaving "183ms"
a[k]=v                      # store 's' => 1
}
return
}
Convert and sum: here we assume we have an array f which contains the conversion factors:
function convert(a,f, t,k) {
t=0; for(k in a) t+=a[k] * f[k]
return t
}
The full code (for the OP's example):
# set conversion factors
BEGIN{ f["s"]=1000; f["ms"] = 1 }
# print first word
BEGIN{ printf "App:" }
# extract string and print
/^App/ { extract($2,a); printf OFS "%dms", convert(a,f) }
END { printf ORS }
which outputs:
App: 1170ms 186ms 1171ms 1484ms 1227ms 204ms 1180ms 1177ms 1183ms 1155ms
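To try it, save the functions and the rules above in a single file, e.g. convert.awk (name hypothetical), and run:
$ awk -f convert.awk file.txt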
perl -n -e '$s=0; ($s)=/(\d+)s/; ($ms)=/(\d+)ms/;
s/^(\w+):/push @{$vals{$1}}, $ms+$s*1000/e;
eof && print "$_: @{$vals{$_}}\n" for keys %vals;' file
perl -n doesn't print anything as it loops through the input.
$s and $ms are captured from those fields; $s is reset to zero at the start of each line.
s///e stuffs the %vals hash with a list of numbers in ms for each key (App, in this case).
eof && executes the subsequent code after the end of the file.
print "$_: @{$vals{$_}}\n" for keys %vals prints the %vals hash as the OP wants.
App: 1170 186 1171 1484 1227 204 1180 1177 1183 1155
I'm trying to do the equivalent of a case/if-else statement on a CSV file (e.g., myfile.csv) that analyzes a column, then creates new columns in a new csv (e.g., myfile_new.csv).
The source data (myfile.csv) looks like this:
unique_id,variable1,variable2
1,,C
2,1,
3,,A
4,,B
5,1,
I'm trying to do two transformations:
For the second field, if the input file has any data in the field, have it be 1, otherwise 0.
The third field is flattened into three fields. If the input file has an A in the third field, the third output field has 1, and 0 otherwise; the same for B and C and the fourth/fifth field in the output file.
I want the result (myfile_new.csv) to look like this:
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
I'm trying to do the equivalent of this in SQL:
select unique_id,
case when len(variable1) > 0 then 1 else 0 end as variable1,
case when variable2 = 'A' then 1 else 0 end as variable2_A,
case when variable2 = 'B' then 1 else 0 end as variable2_B,
case when variable2 = 'C' then 1 else 0 end as variable2_C, ...
I'm open to whatever, but CSV files will be 500GB - 1TB in size so it needs to work with that size file.
Here is an awk solution that would do it:
awk 'BEGIN {
FS = ","
OFS = ","
}
NR == 1 {
$3 = "variable2_A"
$4 = "variable2_B"
$5 = "variable2_C"
print
next
}
{
$2 = ($2 == "") ? 0 : 1
$3 = ($3 == "A" ? 1 : 0) "," ($3 == "B" ? 1 : 0) "," ($3 == "C" ? 1 : 0)
print
}' myfile.csv > myfile_new.csv
In the BEGIN block, we set input and output file separator to a comma.
The NR == 1 block creates the header for the output file and then skips the remaining rules with next.
The last block checks whether the second field is empty and stores 0 or 1 in it; the $3 assignment concatenates the results of the ternary operator ?: three times, comma separated.
The output is
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
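Given input files of 500GB-1TB, this single-pass, constant-memory awk program is a reasonable fit. If it's available, mawk is often noticeably faster than gawk for simple field-splitting jobs like this; assuming the program above is saved as transform.awk (name hypothetical), you could run:
$ mawk -f transform.awk myfile.csv > myfile_new.csv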
Quick and dirty solution using a while loop.
#!/bin/bash
#Variables:
line=""
result=""
linearray[0]=0
while read -r line; do
unset linearray #Clean the variables from the previous loop
unset result
IFS=',' read -r -a linearray <<< "$line" #Splits the line into an array, using the comma as the field separator
result="${linearray[0]}""," #column 1, at index 0, is the same in both files.
if [ -z "${linearray[1]}" ]; then #If column 2, at index 1, is empty, then...
result="$result""0""," #Pad empty strings with zero
else #Otherwise...
result="$result""${linearray[1]}""," #Copy the non-zero column 2 from the old line
fi
#The following lines read index 2, for column 3, and append the appropriate text. Only one can ever be true.
if [ "${linearray[2]}" == "A" ]; then result="$result""1,0,0"; fi
if [ "${linearray[2]}" == "B" ]; then result="$result""0,1,0"; fi
if [ "${linearray[2]}" == "C" ]; then result="$result""0,0,1"; fi
if [ "${linearray[2]}" == "" ]; then result="$result""0,0,0"; fi
echo "$result" >> myfile_new.csv #append the resulting line to the new file
done <myfile.csv
I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
for (( X=0; X <= "$SQL"; X++ ))
do
Y=$(expr $X + 1)
Z=$(expr $X + 2)
echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a different file, which is not good. Also, the program is very slow: in 12 hours only 12,000 records out of 0.5 million were processed.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
colNr = NR
rowNr = 0
name[colNr] = $1
lgth[colNr] = length($2)
delete name2nr
for (i=1;i<=(length($2)-2);i++) {
trimer = substr($2,i,3)
if ( !(trimer in name2nr) ) {
name2nr[trimer] = ++rowNr
nr2name[colNr,rowNr] = trimer
}
cnt[colNr,name2nr[trimer]]++
}
numCols = colNr
numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
}
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
}
for (rowNr=1;rowNr<=numRows;rowNr++) {
for (colNr=1;colNr<=numCols;colNr++) {
printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
}
}
}
If instead you want output like in @rogerovo's perl answer, that'd be much simpler than the above, more efficient, and would use far less memory:
$ cat tst2.awk
{
delete cnt
for (i=1;i<=(length($2)-2);i++) {
cnt[substr($2,i,3)]++
}
printf "%s;%s", $1, length($2)
for (trimer in cnt) {
printf ";%s=%s", trimer, cnt[trimer]
}
print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
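Note that for (trimer in cnt) visits keys in an unspecified order, which is why the trimers above are not sorted. If you are using GNU awk (this is a gawk extension, not POSIX awk), you can force a sorted traversal by adding one line to tst2.awk:
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" } # make for-in loops iterate by string key, ascending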
This perl script processes circa 550,000 trimers/sec (random valid test sequences 0-8000 chars long; 100k records (~400MB) produce a 2GB output csv).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
$c++; chomp;
# is it a valid line? has the format and a sequence to process
if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
print join ";",($1,length($2));
my %trimdb;
my $seq=$2;
#split the sequence into chars
my @a=split //,$seq;
my @trimmer;
# while there are unprocessed chars in the sequence...
while (scalar @a) {
# fill up the buffer with a char from the top of the sequence
push @trimmer, shift @a;
# if the buffer is full (has 3 chars), increase the trimer frequency
if (scalar @trimmer == 3 ) {
$trimdb{(join "",@trimmer)}++;
# drop the first letter from buffer, for next loop
shift @trimmer;
}
}
# we're done with the sequence - print the sorted list of trimers
foreach (sort keys %trimdb) {
#print in a csv (;) line
print ";$_=$trimdb{$_}";
}
print"\n";
}
else {
#the input line was not valid.
print STDERR "input error: $_\n";
}
# just a progress counter
printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have perl installed (most Linuxes do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv
The command below is used to read an input file containing 7682 lines.
I use --field-separator to split on ";", convert some fields into what I need, and the grep gets rid of the first 2 lines, which I do not need:
awk --field-separator=";" '($1<15) {print int(a=(($1-1)/480)+1) " " ($1-((int(a)-1)*480)) " " (20*log($6)/log(10))}' 218_DW.txt | grep -v "0 480 -inf"
I used ($1<15) so that I only print 14 lines, which is better for testing. The output I get is exactly what I want, but there is more I need to do with it:
1 1 48.2872
1 2 48.3021
1 3 48.1691
1 4 48.1502
1 5 48.1564
1 6 48.1237
1 7 48.1048
1 8 48.015
1 9 48.0646
1 10 47.9472
1 11 47.8469
1 12 47.8212
1 13 47.8616
1 14 47.8047
From the above, $1 will increment from 1-16 and $2 from 1-480; it's always continuous,
so when it gets to 1 480 47.8616 it restarts from 2 1 47.8616, until the last line, which is 16 480 10.2156.
So I get 16*480=7680 lines.
What I want to do is simple, but I don't get it :)
I want to compare the current line with the next one, but not all fields, only $3; it's a value in dB that decreases when $2 increases.
In example:
The current line is 1 1 48.2872=a
Next line is 1 2 48.3021=b
If [ (a - b) > 6 ] then print $1 $2 $3
Of course (a - b) has to be an absolute value, always > 0.
The best would be to be able to compare the current line's $3 with both the next and the previous line's $3.
Something like this:
1 3 48.1691=a
1 4 48.1502=b
1 5 48.1564=c
If [ ABS(b - a) > 6 ] OR If [ ABS(b - c) > 6 ] then print $1 $2 $3
But of course the first line can only be compared with the next one, and the last one with the previous one. Is it possible?
Try this:
#!/usr/bin/awk -f
function abs(x) {
if (x >= 0)
return x;
else
return -1 * x;
}
function compare(a,b) {
return abs(a - b) > 6;
}
function update() {
before_value = current_value;
current_line = $0;
current_value = $3;
}
BEGIN {
line_n = 1;
}
#Edit: added to skip blank lines and differently formatted lines in
# general. You could add some error message and/or exit function
# here to detect badly formatted data.
NF != 3 {
next;
}
line_n == 1 {
update();
line_n += 1;
next;
}
line_n == 2 {
if (compare(current_value, $3))
print current_line;
update();
line_n += 1;
next;
}
{
if (compare(current_value, before_value) || compare(current_value, $3))
print current_line;
update();
}
END {
if (compare(current_value, before_value)) {
print current_line;
}
}
The funny thing is that I had this code lying around from an old project where I had to do basically the same thing. I adapted it a little for you. I think it solves your problem (as I understood it, at least); if it doesn't, it should point you in the right direction.
Instructions to run the awk script:
Supposing you saved the code with the name "awkscript" and the data file is named "datafile", both in the current folder, first mark the script as executable with chmod +x awkscript, then execute it passing the data file as a parameter with ./awkscript datafile, or use it as part of a sequence of pipes as in cat datafile | ./awkscript.
Comparing the current line to the previous one is trivial, so I think the problem you're having is that you can't figure out how to compare the current line to the next one. Just keep 2 previous lines instead of 1 and always operate on the line before the one that's actually being read as $0, i.e. the line stored in the array p1 in this example (p2 is the line before it and $0 is the line after it):
function abs(val) { return (val > 0 ? val : -val) }
NR==2 {
if ( abs(p1[3] - $3) > 6 ) {
print p1[1], p1[2], p1[3]
}
}
NR>2 {
if ( ( abs(p1[3] - p2[3]) > 6 ) || ( abs(p1[3] - $3) > 6 ) ) {
print p1[1], p1[2], p1[3]
}
}
{ prev2=prev1; prev1=$0; split(prev2,p2); split(prev1,p1) }
END {
if ( ( abs(p1[3] - p2[3]) > 6 ) ) {
print p1[1], p1[2], p1[3]
}
}
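Assuming you save this as compare.awk (name hypothetical), run it like the other scripts above:
$ awk -f compare.awk datafile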
I have a file which contains some data, like this:
2011-01-02 100100 1
2011-01-02 100200 0
2011-01-02 100199 3
2011-01-02 100235 4
and have some "dictionary" in separate file
100100 Event1
100200 Event2
100199 Event3
100235 Event4
and I know that
0 - warning
1 - error
2 - critical
etc...
I need some script with sed/awk/grep, or something else, which helps me produce data like this:
100100 Event1 Error
100200 Event2 Warning
100199 Event3 Critical
etc
I will be grateful for ideas on how to do this in the best way, or for a working example.
Update:
Sometimes I have data like this:
2011-01-02 100100 1
2011-01-02 sometext 100200 0
2011-01-02 100199 3
2011-01-02 sometext 100235 4
where sometext = any 6 characters (maybe this is helpful info)
in this case I need the whole data:
2011-01-02 sometext EventNameFromDictionary Error
or without "sometext"
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
print $2, evt[$2], lvl[$3]
}' dictionary infile
Adding a new answer for the new requirement and because of the limited formatting options inside a comment:
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
if (NF > 3) {
idx = 3; $1 = $1 OFS $2
}
else idx = 2
print $1, $idx in evt ? \
evt[$idx] : $idx, $++idx in lvl ? \
lvl[$idx] : $idx
}' dictionary infile
You won't need to escape the newlines inside the ternary operator if you're using GNU awk.
Some awk implementations may have problems with this part:
$++idx in lvl ? lvl[$idx] : $idx
If you're using one of those,
change it to:
$(idx + 1) in lvl ? lvl[$(idx + 1)] : $(idx + 1)
OK, comments added:
awk 'BEGIN {
lvl[0] = "warning" # map the error levels
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR { # while reading the first
# non-empty input file
evt[$1] = $2 # build the associative array evt,
# keyed by the value of the first column;
# the second column represents the values
next # skip the rest of the program
}
{ # now reading the rest of the input
if (NF > 3) { # if the number of columns is greater than 3
idx = 3 # set idx to 3 (the key in evt)
$1 = $1 OFS $2 # and merge $1 and $2
}
else idx = 2 # else set idx to 2
# print the value of the first column, then the event name: if the
# value of the second (or the third, depending on the value of idx)
# column is an existing key in the evt array, print its value,
# otherwise print the actual column value; the same goes for the
# level, but idx is incremented first, because we're searching the
# lvl array now
print $1, \
($idx in evt ? evt[$idx] : $idx), \
($++idx in lvl ? lvl[$idx] : $idx)
}' dictionary infile
I hope perl is ok too:
#!/usr/bin/perl
use strict;
use warnings;
open(DICT, 'dict.txt') or die;
my %dict = %{{ map { my ($id, $name) = split; $id => $name } (<DICT>) }};
close(DICT);
my %level = ( 0 => "warning",
1 => "error",
2 => "critical" );
open(EVTS, 'events.txt') or die;
while (<EVTS>)
{
my ($d, $i, $l) = split;
$i = $dict{$i} || $i; # lookup
$l = $level{$l} || $l; # lookup
print "$d\t$i\t$l\n";
}
Output:
$ ./script.pl
2011-01-02 Event1 error
2011-01-02 Event2 warning
2011-01-02 Event3 3
2011-01-02 Event4 4