How can I remove duplicate lines in a whole file, ignoring the first n characters of each line? - bash

I have following data format:
123456786|data1
123456787|data2
123456788|data3
The first column is main_id. I need to remove all duplicate lines from the txt file, but ignoring the main_id number. How can I do that?
Normally I use the following AWK script, but it compares whole lines without ignoring the first field:
awk '!x[$0]++' $2 > "$filename"_no_doublets.txt #remove doublets
Thanks for any help.

If you have more columns, this line should do it:
awk '{a=$0;sub(/[^|]*\|/,"",a)}!x[a]++' file
example:
123456786|data1
12345676|data1
123456787|data2|foo
203948787|data2|foo
123456788|data3
kent$ awk '{a=$0;sub(/[^|]*\|/,"",a)}!x[a]++' f
123456786|data1
123456787|data2|foo
123456788|data3

You can use:
awk -F'|' '!x[$2]++'
This finds duplicates based only on field 2, delimited by |. Note that it ignores any fields after the second, so it only works when there are exactly two columns.
UPDATE:
awk '{line=$0; sub(/^[^|]+\|/, "", line)} !found[line]++'

awk '{key=$0; sub(/[^|]+/,"",key)} !seen[key]++' file
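All of these variants copy the line into a throwaway key, strip the leading field, and hash on what remains. A minimal sketch to verify the behavior (the file name data.txt is just for illustration):

```shell
# Sample data: lines 1 and 2 share the same payload after the main_id.
printf '123456786|data1\n12345676|data1\n123456787|data2\n' > data.txt

# Copy the line into "key", strip everything up to and including the
# first "|", and print only lines whose key has not been seen before.
awk '{key=$0; sub(/^[^|]*\|/, "", key)} !seen[key]++' data.txt
```

The first line with a given payload wins, so `123456786|data1` is kept and `12345676|data1` is dropped.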

Related

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
As you can see, sometimes there are empty lines between two records, which must be ignored. Here is the output I want to produce:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk '/a/ {print $1- "\t" $-2 "\t" $-3}' file.txt
It does not return what I want. Do you know how to correct the command?
The following awk may help you:
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding an explanation of the command as well. (NOTE: the following annotated version is for explanation purposes only; run the command above to get the results.)
awk 'NF                      ###The condition NF (a built-in awk variable holding the number of fields in the current line of the Input_file) is non-zero only for non-empty lines, so blank lines are skipped.
{
print $(NF-2),$(NF-1),$NF    ###Print the third-last field $(NF-2), the second-last field $(NF-1), and the last field $NF of the current line.
}
' OFS="\t" Input_file        ###Set OFS (the output field separator) to TAB and name the Input_file.
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file into tr to squeeze repeated \ns, getting rid of the empty lines. Then reverse each line, cut the first three fields, and reverse again. You could drop the useless cat and let the first rev read the file directly.
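A quick sanity check of the NF-based answer, assuming whitespace-separated input with a blank line in the middle (the file name genes.txt is just for illustration):

```shell
# Two data rows of different widths, separated by an empty line.
printf 'ENST00000000442 64073050 64074640 64073208 64074651 ESRRA\n\nENST00000000233 127228399 127228552 ARF5\n' > genes.txt

# NF is the number of fields on the current line; an empty line has
# NF == 0, so the bare "NF" pattern skips blank lines automatically.
awk 'NF{print $(NF-2), $(NF-1), $NF}' OFS='\t' genes.txt
```

Because NF is evaluated per line, rows with different column counts all yield their own last three fields.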

awk rounding the float numbers

My file a.txt contains the following, with fields separated by commas (,):
ab,b1,c,d,5.986627e738,e,5.986627e738
cd,g2,h,i,7.3423542344,j,7.3423542344
ef,l3,m,n,9.3124234323,o,9.3124234323
When I issue the command below,
awk -F"," 'NR>-1{OFS=",";gsub($5,$5+10);OFS=",";print }' a.txt
it is printing
ab,b1,c,d,inf,e,inf
cd,g2,h,i,17.3424,j,17.3424
ef,l3,m,n,19.3124,o,19.3124
Here I have two issues:
I asked awk to add 10 only to the 5th column, but it also added it to the 7th column because the entries are duplicated.
It is rounding the numbers; instead, I need the decimals printed as they are.
How can I fix this?
awk 'BEGIN {FS=OFS=","}{$5=sprintf("%.10f", $5+10)}7' file
In your data, the $5 on line #1 has an e, so it was turned into 10.0000... in the output.
You did the substitution with gsub, so every occurrence of the value was replaced, not just $5.
Use printf/sprintf when you need output in a specific format.
Tested with gawk.
If you want to set the format in printf dynamically:
kent$ cat f
ab,b1,c,d,5.9866,e,5.986627e738
cd,g2,h,i,7.34235,j,7.3423542344
ef,l3,m,n,9.312423,o,9.3124234323
kent$ awk 'BEGIN {FS=OFS=","}{split($5,p,".");$5=sprintf("%.*f",length(p[2]), $5+10)}7' f
ab,b1,c,d,15.9866,e,5.986627e738
cd,g2,h,i,17.34235,j,7.3423542344
ef,l3,m,n,19.312423,o,9.3124234323
What you did was a replacement on the whole record; what you really want to do is:
awk 'BEGIN {FS=OFS=","}
{$5+=10}1' a.txt
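One caveat: even with `$5+=10`, awk converts the modified number back to a string using `CONVFMT` (default `%.6g`), which still rounds to six significant digits. Widening `CONVFMT` preserves the decimals; a sketch under that assumption:

```shell
# CONVFMT controls number-to-string conversion of modified fields;
# %.12g keeps twelve significant digits instead of the default six.
printf 'cd,g2,h,i,7.3423542344,j,7.3423542344\n' |
awk 'BEGIN {FS=OFS=","; CONVFMT="%.12g"} {$5+=10}1'
```

Only $5 is touched, so the duplicate value in $7 is left exactly as it was.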

Bash: Converting 4 columns of text interleaved lines (tab-delimited columns to FASTQ file)

I need to convert a 4-column file to 4 lines per entry. The file is tab-delimited.
The file is currently arranged in the following format, with each line representing one record/sequence (and there are millions of such lines):
#SRR1012345.1 NCAATATCGTGG #4=DDFFFHDHH HWI-ST823:136:C24YTACXX
#SRR1012346.1 GATTACAGATCT #4=DDFFFHDHH HWI-ST823:136:C22YTAGXX
I need to rearrange this such that the four columns are presented as 4 lines:
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX
What would be the best way to go about doing this, preferably with a bash one-liner? Thank you for your assistance!
You can use tr:
< file tr '\t' '\n' > newfile
awk is also very clear here:
awk '{print $1; print $2; print $3; print $4}' file
$ awk -v OFS='\n' '{$1=$1}1' file
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX
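The `tr` answer is easy to verify, and `paste - - - -` is its inverse if the tabular form is ever needed again (a sketch, assuming tab-delimited input; the file name reads.tsv is my own):

```shell
# One tab-delimited record of four columns.
printf '#SRR1012345.1\tNCAATATCGTGG\t#4=DDFFFHDHH\tHWI-ST823:136:C24YTACXX\n' > reads.tsv

# Each tab becomes a newline: four lines per record.
tr '\t' '\n' < reads.tsv

# paste with four "-" arguments consumes four input lines per output
# row and rejoins them with tabs -- the inverse transformation.
tr '\t' '\n' < reads.tsv | paste - - - -
```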

cut out fields that matched a regex from a delimited string

Example file:
35=A|11=ABC|55=AAA|20=DEF
35=B|66=ABC|755=AAA|800=DEF|11=ZZ|55=YYY
35=C|66=ABC|11=CC|755=AAA|800=DEF|55=UUU
35=C|66=ABC|11=XX|755=AAA|800=DEF
I want the output to print like the following, with only the 11= and 55= fields printed. (They are not at fixed positions.)
11=ABC|55=AAA
11=ZZ|55=YYY
11=CC|55=UUU
Thanks.
sed might be easier here:
sed -nr '/(^|\|)11=[^|]*.*\|55=/s~^.*(11=[^|]*).*(\|55=[^|]*).*$~\1\2~p' file
11=ABC|55=AAA
11=ZZ|55=YYY
11=CC|55=UUU
Try this:
$ awk -F'|' '{f=0;for (i=1;i<=NF;i++)if ($i~/^(11|55)=/){printf "%s",(f?"|":"")$i;f=1};print""}' file
11=ABC|55=AAA
11=ZZ|55=YYY
11=CC|55=UUU
11=XX
To only show lines that have both a 11 field and a 55 field:
$ awk -F'|' '/(^|\|)11=/ && /\|55=/{f=0;for (i=1;i<=NF;i++)if ($i~/^(11|55)=/){printf "%s",(f?"|":"")$i;f=1};print""}' file
11=ABC|55=AAA
11=ZZ|55=YYY
11=CC|55=UUU
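Walking one line through the field loop shows how the `f` flag works: it stays 0 until the first match, so the `|` separator is printed only before the second and later matches. A minimal sketch:

```shell
# Fields 11= and 55= sit at different positions on this line.
printf '35=B|66=ABC|755=AAA|800=DEF|11=ZZ|55=YYY\n' |
awk -F'|' '{
  f = 0
  for (i = 1; i <= NF; i++)
    # ^ anchors the regex, so 755=AAA does not match /^(11|55)=/.
    if ($i ~ /^(11|55)=/) { printf "%s", (f ? "|" : "") $i; f = 1 }
  print ""
}'
```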

Cut and replace bash

I have to process a file with data organized like this
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
etc
Columns can have different length but lines always have the same number of columns.
I want to be able to cut a specific column of a given line and change it to the value I want.
For example I'd apply my command and change the file to
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
I know how to select a specific line with sed and then cut the field but I have no idea on how to replace the field with the value I have.
Thanks
Here's a way to do it with awk:
Going with your example, if you wanted to replace the 3rd field of the 1st line:
awk 'BEGIN{FS=OFS=":"} {if (NR==1) {$3 = "XXXX"}; print}' input_file
Input:
AAAAA:BB:CCC:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Output:
AAAAA:BB:XXXX:EEEE:DDDD
FF:III:JJJ:KK:LLL
MMMM:NN:OOO:PP
Explanation:
awk: invoke the awk command
'...': everything enclosed by single-quotes are instructions to awk
BEGIN{FS=OFS=":"}: Use : as delimiters for both input and output. FS stands for Field Separator. OFS stands for Output Field Separator.
if (NR==1) {$3 = "XXXX"};: If Number of Records (NR) read so far is 1, then set the 3rd field ($3) to "XXXX".
print: print the current line
input_file: name of your input file.
If instead you are simply trying to replace all occurrences of CCC with XXXX in your file, just do:
sed -i 's/CCC/XXXX/g' input_file
Note that this will also replace partial matches, such as ABCCCDD -> ABXXXXDD
This might work for you (GNU sed):
sed -r 's/^(([^:]*:?){2})CCC/\1XXXX/' file
or
awk -F: -vOFS=: '$3=="CCC"{$3="XXXX"};1' file
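Since the question asks for a given line and a specific column, the line and field numbers can also be passed in as awk variables. A generalized sketch (the variable names line, col, and val are my own):

```shell
# Replace field "col" on line "line" with "val"; all other lines pass through.
printf 'AAAAA:BB:CCC:EEEE:DDDD\nFF:III:JJJ:KK:LLL\n' |
awk -v line=1 -v col=3 -v val='XXXX' 'BEGIN{FS=OFS=":"} NR==line{$col=val}1'
```

Assigning to `$col` makes awk rebuild the record with OFS, which is why OFS must be set to `:` as well.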
