Trim column length to 6 characters - shell

I have a file with over 30 columns and I want to trim the length of let's say 9th column to 6 characters (using shell).
Not able to get a good solution.
Please help!

awk is an excellent choice here. For example, let's say we need to trim the first column ($1) to six characters. With GNU awk (gensub is a gawk extension), we can write something like
awk '{$1 = gensub(/^(......).*$/, "\\1", $1)}1'
Test
$ echo 'helloworld' | awk '{$1 = gensub(/^(......).*$/, "\\1", $1)}1'
hellow
What it does
gensub does a regular expression substitute.
/^(......).*$/ matches the first 6 characters and captures them as \1 (note the six dots inside the parentheses). A column shorter than 6 characters does not match and is left unchanged.
\\1 replaces the entire column content with the captured 6 characters.
1 always evaluates to true, so awk takes the default action and prints the entire record.
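The same trimming can be done without gensub, which may help on systems where awk is not GNU awk. This is a minimal sketch using the standard substr function; the function name trim_col and its two arguments (column number and width) are made up for illustration, and it assumes whitespace-separated columns.

```shell
# trim_col COL WIDTH: truncate column COL of each input line to WIDTH characters.
# Hypothetical helper; assumes whitespace-separated input.
trim_col() {
  awk -v col="$1" -v width="$2" '{ $col = substr($col, 1, width) } 1'
}

# Trim the 9th column to 6 characters, as asked in the question.
printf 'a b c d e f g h helloworld j\n' | trim_col 9 6
```

Assigning to a field makes awk rebuild the line with the output field separator (a single space by default), so for tab-separated files you would likely want to add -F'\t' and set OFS to a tab as well.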

Related

Prepending letter to field value

I have a file 0.txt containing the following value fields contents in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brandc,),
I'm looking in OS and elsewhere for how to prepend the letter x to the beginning of each value between commas so that my output file becomes (using bash unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrand C,),
the only thing I've really tried but not enough is:
awk '{gsub(/,/,",x");print}' 0.txt
for all purposes the prefix should not be applied to the last commas at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiter
BEGIN{
FS=OFS=","
}
# The first field is special, the x has to be inserted
# after the opening (
$1="(x"substr($1,2)
# Prepend 'x' to fields 2 through NF-2. The last two fields are
# the closing ")" and the empty string after the final comma.
for(i=2;i<=NF-2;i++){
$i="x"$i
}
# 1 is always true. awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma (and avoids other “special” characters in the syntax).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt
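To see what the regexp approach does on the sample data, here is a small sketch that feeds the three lines from the question through the same gsub call. The wrapper name prefix_x is only for illustration.

```shell
# prefix_x: prepend "x" to every non-empty, comma-terminated value.
# [^,()]+ needs at least one character that is not a comma or parenthesis,
# so the trailing ")," on each line is left alone.
prefix_x() {
  awk '{ gsub(/[^,()]+,/, "x&") } 1'
}

printf '%s\n' '(bread,milk,),' '(rice,brand B,),' '(pan,eggs,Brandc,),' | prefix_x
```

In the replacement string, & stands for the text that matched, so each value is kept and only gains the prefix.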

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub call parses AF=<number> out of the last field of the input and captures the number in group #1, which is then compared with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9 (note that this also keeps AF=0.10, which equals 0.1 rather than exceeding it), e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
You could add a + after the second character group to support additional digits (i.e. 0.NNNNN) but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5+0>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
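I would use awk. Since awk compares operands as strings when one of them is a string constant, you can simply use this, provided AF is always the second-to-last field and always written in the same 0.NN format:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields on ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields).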
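Several of the answers above depend on AF sitting at a fixed position or having a fixed number of digits. As a hedged alternative, the sketch below looks the key up by name and compares it numerically, using only standard awk. The function name filter_af and its threshold argument are assumptions for illustration, and it assumes the key=value list is the last whitespace-separated field.

```shell
# filter_af MIN: print lines whose AF value is numerically greater than MIN.
# Hypothetical helper; assumes the key=value list is the last field.
filter_af() {
  awk -v min="$1" '{
    n = split($NF, kv, ";")              # key=value pairs of the last field
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^AF=/) {
        # substr(kv[i], 4) is the text after "AF="; +0 forces a number
        if (substr(kv[i], 4) + 0 > min + 0) print
        break
      }
  }'
}

printf '%s\n' \
  '1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV' \
  '4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV' |
  filter_af 0.1
```

Because the comparison is numeric, this should keep working if AF moves within the list or is written with more digits.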

How to add a character end of each variable with awk?

I have a tab-delimited file in which I want to add "$" to the end of each variable. Can I do that with awk, sed or anything else?
Example
input:
a seq1 anot1
b seq2 anot2
c seq3 anot3
d seq4 anot4
I need to have this:
output:
a$ seq1$ anot1$
b$ seq2$ anot2$
c$ seq3$ anot3$
d$ seq4$ anot4$
Any answer will be appreciated,
Thanks
In bash alone:
while IFS= read -r line; do echo "${line//$'\t'/\$$'\t'}\$"; done < file
This hackish solution relies on two "special" things: parameter expansion to do the replacement, and ANSI-C quoting ($'\t') to express the tab character. IFS= and -r keep read from trimming whitespace or interpreting backslashes.
In awk, you can process fields much more safely:
awk -F'\t' 'BEGIN{OFS=FS} {for(n=1;n<=NF;n++){$n=$n "$"}} 1' file
This works by stepping through each line of input and replacing each field with itself plus the dollar sign. The BEGIN block ensures that your output will use the same field separator as your input. The 1 at the end is awk shorthand for "print the current line".
late to the party...
another awk solution. Prefix field and record separators with "$"
$ awk -F'\t' 'BEGIN{OFS="$"FS; ORS="$"RS} {$1=$1}1' file
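The $1=$1 assignment in that one-liner is doing real work: awk only rebuilds the record with OFS when a field is assigned. The sketch below contrasts the two cases; the function names no_rebuild and rebuild are made up for illustration.

```shell
# No field is assigned, so $0 is printed untouched and OFS is never applied.
no_rebuild() { awk -F'\t' 'BEGIN { OFS = "$" FS; ORS = "$" RS } 1'; }

# Assigning $1 to itself forces awk to rebuild $0 using OFS.
rebuild() { awk -F'\t' 'BEGIN { OFS = "$" FS; ORS = "$" RS } { $1 = $1 } 1'; }

printf 'a\tseq1\tanot1\n' | no_rebuild   # only the last field gets a "$" (from ORS)
printf 'a\tseq1\tanot1\n' | rebuild      # every field gets a "$"
```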
With sed:
sed 's/[^ ]*/&$/g' filename
which replaces any non-space words with the word (&) followed by a $.
Oops! You said tabs. You can replace the above space with "\t" to use tab delimited.
sed 's/[^\t]*/&$/g' filename
Actually, even better, for tabs OR spaces:
sed 's/[^[:blank:]]*/&$/g' filename
awk is your friend :
awk '{for(i=1;i<=NF;i++)sub(/$/,"$",$i);print}' file
or
awk '{for(i=1;i<=NF;i++)sub(/$/,"$",$i);}1' file
Sample Output
a$ seq1$ anot1$
b$ seq2$ anot2$
c$ seq3$ anot3$
d$ seq4$ anot4$
What is happening here?
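Using a for loop, we iterate through all the fields in a record.
We use the awk sub function to replace the end of the string (/$/) with a literal $ ("$") in each field ($i).
The first version calls print explicitly; in the second, the pattern 1 is always true, so awk applies its default action of printing the record. Because a field was modified, the record is rebuilt with the default output separator (a single space), so tabs in the input become spaces.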
awk '{gsub(/ /,"$ ")}{print $0 "$"}' file
a$ seq1$ anot1$
b$ seq2$ anot2$
c$ seq3$ anot3$
d$ seq4$ anot4$
What happens?
First, replace each space with a dollar sign followed by a space.
Then append a dollar sign to the end of the line.
For tab-delimited input, match /\t/ and replace with "$\t" instead.

awk sub() of a substring by position

if I have the following:
>ID_10_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
(i.e. a fasta file!)
I want to be able to locate a substring based on the position (2nd element of the first line, i.e. 10) and take n positions around it, i.e. 5 positions
EFGHIJKLMNO
and then substitute the position of interest with the 4th element of line 1 - i.e. X:
EFGHIXKLMNO
I can locate the substring, which is fine...but I am having trouble using the elements of line 1 to make the substitution in line 2. I have the following code:
#!/bin/bash
awk '
/>/{split($0,M,"_")}
!/^>/{split($1,N,"")
print M[1]"_"M[2]"_"M[3]"_"M[4]"\n"substr($1,M[2]-5,10)}
' $1
which gets me my substring.
Could someone help with my logic here to make the substitution. I gather I can use the sub() function and call the substring directly. My thinking is to use:
sub(regex/position,replacement,target)
which in my example would translate as:
sub(M[2],M[4],substr($1,M[2]-5,10))
Trying this results in
awk: cmd. line:5: print sub(M[2],M[4],substr($1,M[2]-10,20))}
awk: cmd. line:5: ^ sub third parameter is not a changeable object
So it seems I cannot call the substring explicitly, and I also have doubts about being able to use the position elements in the regex parameter.
Could someone help me with my code to form a general solution? My input is
>ID_10_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
and desired output is:
EFGHIXKLMNO
where I will have many inputs in the same file.
It must also hold true that, although I am looking for a substring consisting of 5 positions either side of the position given in line 1, if the position in line 1 is < 5, the substitution must be made in the specified position i.e.
>ID_2_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
AXCDEFG
It would be nice (but not essential) if the final substring is always a certain length, i.e. if I have specified a substring of 10 but the substitution is in position 2 as above, 8 characters are selected after the substitution to complete a substring of length 10
Thanks
This awk script produces your desired output:
awk -F_ '/^>/{p=$2;s=$NF;next}{print substr($0,p-5,5) s substr($0,p+1,5)}' file
The first block saves your position p and replacement character s. The second prints the 5 characters before p, the replacement character, then the 5 characters after p.
Demo:
$ cat file
>ID_10_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
$ awk -F_ '/^>/{p=$2;s=$NF;next}{print substr($0,p-5,5) s substr($0,p+1,5)}' file
EFGHIXKLMNO
Here's an updated version of the code to deal with positions that are closer than 5 characters away from the start or end of the line. As it's slightly longer, I've used a script rather than a one-liner for clarity. You can run it like awk -f script.awk file:
BEGIN { FS="_" }
/^>/ {
p=$2; c=$NF; next
}
{
if (p-5<1) s=1
else if (p+5>length($0)) s=length($0)-10
else s=p-5
print substr($0,s,p-s) c substr($0,p+1,10-p+s)
}
Testing it out:
$ cat file
>ID_2_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
>ID_10_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
>ID_22_J_X
ABCDEFGHIJKLMNOPQRSTUVQXYZ
$ awk -f script.awk file
AXCDEFGHIJK
EFGHIXKLMNO
PQRSTUXQXYZ
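If the flank width needs to vary, the same windowing idea can be parameterized. The following is a sketch under stated assumptions: the function name window_sub and its width argument w are hypothetical, header lines look like >ID_10_J_X, each sequence is on a single line, and the character at the given position is replaced rather than kept.

```shell
# window_sub W: print W characters on each side of the position named in the
# header, with the character at that position replaced by the last header
# element. The window is shifted (not shortened) near either end of the line.
window_sub() {
  awk -F_ -v w="$1" '
    /^>/ { p = $2; c = $NF; next }          # remember position and replacement
    {
      len = length($0)
      s = p - w                             # ideal window start
      if (s + 2*w > len) s = len - 2*w      # shift left near the end
      if (s < 1) s = 1                      # shift right near the start
      print substr($0, s, p - s) c substr($0, p + 1, s + 2*w - p)
    }'
}

printf '%s\n' '>ID_10_J_X' 'ABCDEFGHIJKLMNOPQRSTUVQXYZ' | window_sub 5
```

For the sample above this prints EFGHIXKLMNO, the output requested in the question. Each output line has 2*w+1 characters as long as the sequence is at least that long.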

speed up my awk command? Answer must be awk :)

I have some awk code that is running really slow. The format of my file is tab delimited 5 column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I want to not count it, or the following character.
Then I want to find out how many characters are ">" or "<" or "*" and remove them from the count. I'm guessing using a gsub, and 3 splits is less than ideal, especially since column 5 can occasionally be a very very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Well as I see, your gsub pattern will not work, as the / was not closed. Anyway, if I get it correctly and you want the character count of $5 without some characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
List the characters you want to skip between [ and ], and do not put ^ first, because a leading ^ negates the bracket expression.
Do you need to use awk, or will this work instead?
cut -f 5 < "$file" | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Remove all lines whose fifth field starts with a capital letter.
Remove the characters <, >, *, and newlines.
Count the remaining characters.
Here's a guess:
awk '
BEGIN {FS = OFS = "\t"}
{
str = $5
gsub(/\^.|[><*]/, "", str)
l = length(str)
}
'
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
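Another option that stays close to the logic in the question is to use the return value of gsub, which is the number of substitutions made, instead of three split calls. This is only a sketch: the function name adjusted_count is made up, and it assumes, as the question's code does, that column 4 should be reduced by one for every <, > or * that is not escaped by a preceding ^.

```shell
# adjusted_count: print column 4 minus the number of unescaped <, > and *
# characters in column 5. Works on a copy so the input fields stay intact.
adjusted_count() {
  awk '{
    t = $5
    gsub(/\^./, "", t)                   # drop each ^ and the character after it
    print $4 - gsub(/[><*]/, "", t)      # gsub returns how many it removed
  }'
}

printf '10\t5134832\tN\t28\tAaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a\n' | adjusted_count
```

For the example line this prints 27, the value the question expects. Whether it is faster than the alternatives on very long fields would need to be measured on real data.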
