Awk: match a TSV column and replace all rows with a prefix in bash

I have a TSV file with the following format:
HAPPY today I feel good
SAD this is a bad day
UPSET Hey please leave me alone!
I have to replace the first column value with a prefix like __label__ plus the value in lower case, so that the output is:
__label__happy today I feel good
__label__sad this is a bad day
__label__upset Hey please leave me alone!
I want to do this in the shell (using awk, sed, etc.).

awk 'BEGIN{FS=OFS="\t"}{ $1 = "__label__" tolower($1) }1' infile
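The trailing 1 is an always-true pattern whose default action is to print the (now modified) line, and setting FS and OFS to a tab keeps the remaining columns intact. A quick run on the sample (assuming it is saved as infile with tab separators):
$ awk 'BEGIN{FS=OFS="\t"}{ $1 = "__label__" tolower($1) }1' infile
__label__happy today I feel good
__label__sad this is a bad day
__label__upset Hey please leave me alone!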

The following awk may also help:
awk -F"\t" '{$1=tolower($1);printf("__label__%s\n",$0)}' OFS="\t" Input_file

another awk
$ awk 'sub($1,"__label__"tolower($1))' file
with GNU sed
$ sed -r 's/[^\t]+/__label__\L&/' file
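Here \L is a GNU sed extension that lowercases the remainder of the replacement (the matched text &), and -r enables extended regular expressions. A quick check on one sample line:
$ printf 'HAPPY\ttoday I feel good\n' | sed -r 's/[^\t]+/__label__\L&/'
__label__happy today I feel good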

Related

Find Everything between 2 strings -- Sed

I have a file which has data in the below format.
{"default":true,"groupADG":["ABC","XYZ:mno"],"groupAPR":true}
{"default":true,"groupADG":["PQR"],"groupAPR":true}
I am trying to get output as
"ABC","XYZ:mno"
"PQR"
I tried doing it using sed but I am going wrong somewhere.
sed -e 's/groupADG":[\(.*\)],"groupAPR"/\1/ file.txt
Regards.
Note: if anyone is voting the question down, I would request that you also give a reason. I tried to fix it myself, and since I was unable to do so, I posted it here. I also gave my attempt as an example.
Here is one potential solution:
sed -n 's/.*\([[].*[]]\).*/\1/p' file.txt
To exclude the brackets:
sed -n 's/.*\([[]\)\(.*\)\([]]\).*/\2/p'
Also, this would work using AWK:
awk -F'[][]' '{print $2}' file.txt
Just watch out for edge cases (e.g. if there are multiple fields with square brackets in the same line you may need a different strategy)
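For instance, with a hypothetical line that contains two bracketed groups, $2 only returns the first one:
$ echo '{"a":["X"],"b":["Y"]}' | awk -F'[][]' '{print $2}'
"X"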
With your shown samples, the following may also help:
awk 'match($0,/\[[^]]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
Or, following OP's attempt, matching the "groupADG" key explicitly:
awk 'match($0,/"groupADG":\[[^]]*/){print substr($0,RSTART+12,RLENGTH-12)}' Input_file
With awk, setting FS to [][] and adding the condition /groupADG/:
awk -F'[][]' '/groupADG/ {print $2}' file
"ABC","XYZ:mno"
"PQR"

Grabbing only text/substring between 4th and 7th underscores in all lines of a file

I have a list.txt which contains the following lines.
Primer_Adapter_clean_KL01_BOLD1_100_KL01_BOLD1_100_N701_S507_L001_merged.fasta
Primer_Adapt_clean_KL01_BOLD1_500_KL01_BOLD1_500_N704_S507_L001_merged.fasta
Primer_Adapt_clean_LD03_BOLD2_Sessile_LD03_BOLD2_Sessile_N710_S506_L001_merged.fasta
Now I would like to grab only the substring between the 4th underscore and the 7th underscore, so that it appears as below:
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
I tried the below awk command but I guess I've got it wrong. Any help here would be appreciated. If this can be achieved via sed, I would be interested in that solution too.
awk -v FPAT="[^__]*" '$4=$7' list.txt
I feel like awk is overkill for this. You can just use cut to select the fields you want:
$ cut -d_ -f5-7 list.txt
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
awk 'BEGIN{FS=OFS="_"} {print $5,$6,$7}' file
Output:
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03

Replacing newline with comma separator

I have a text file with records in the following format. Please note that there are no empty fields within the NAME, ID and RANK sections.
"NAME","STUDENT1"
"ID","123"
"RANK","10"
"NAME","STUDENT2"
"ID","124"
"RANK","11"
I have to convert the above file to the below format
"STUDENT1","123","10"
"STUDENT2","124","11"
I understand that this can be achieved with a shell script by reading the records and writing them to another output file. But can this be done using awk or sed?
$ awk -F, '{ORS=(NR%3?FS:RS); print $2}' file
"STUDENT1","123","10"
"STUDENT2","124","11"
With awk:
awk -F, '$1=="\"RANK\""{print $2;next}{printf "%s,",$2}' file
With awk, printing a newline after every 3 lines:
awk -F, '{printf "%s",$2;if (NR%3){printf ","}else{print""};}'
The following awk may also help:
awk -F, '{ORS=$0~/^"RANK/?"\n":FS;print $NF}' Input_file
With sed:
sed -E 'N;N;y/\n/ /;s/([^,]*)(,[^ ]*)/\2/g;s/,//' infile
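A step-by-step reading of the same GNU sed command, split into separate expressions:
# N;N                        - read two more lines, so the pattern space holds one full record
# y/\n/ /                    - turn the embedded newlines into spaces
# s/([^,]*)(,[^ ]*)/\2/g     - drop the quoted label before each comma, keeping ,"value"
# s/,//                      - remove the leading comma
sed -E -e 'N;N' -e 'y/\n/ /' -e 's/([^,]*)(,[^ ]*)/\2/g' -e 's/,//' infile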

Delete every other row in CSV file using AWK or grep

I have a file like this:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,
1000_Tv178.tif,34.66987165
1000_Tv178.tif,
1001_Tv180.tif,65.51335742
1001_Tv180.tif,
1002_Tv184.tif,33.83784863
1002_Tv184.tif,
1002_Tv184.tif,22.82542442
1002_Tv184.tif,
How can I make it like this using a simple Bash command?
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
In other words, I need to delete every other row, starting with the second.
Thanks!
hek2mgl's (deleted) answer was on the right track, given the output you actually desire.
awk -F, '$2'
This says, print every row where the second field has a value.
If the second field might be non-empty but contain only whitespace, and you want to exclude those rows too, try this:
awk -F, '$2~/.*[^[:space:]].*/'
You could also do this with sed:
sed '/,$/d'
Which says, delete every line that ends with a comma. I'm sure there's a better way, I avoid sed.
If you really want to explicitly delete every other row:
awk 'NR%2'
This says, print every row where the row number modulo 2 is not zero. If you really want to delete every even row it doesn't actually matter that it's a comma-delimited file.
awk provides a simple way
awk 'NR % 2' file.txt
This might work for you (GNU sed):
sed '2~2d' file
or:
sed 'n;d' file
Here's the gnu sed equivalent of the awk answers provided. Now you can safely use sed's -i flag, by specifying a backup extension:
sed -n -i.bak 'N;P' file.txt
Note that gawk4 can do this too:
gawk -i inplace -v INPLACE_SUFFIX=".bak" 'NR%2==1' file.txt
Results:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
If OP's input does not contain a space after the last number or the trailing comma, this awk can be used:
awk '!/,$/'
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
But it's not robust at all; any space after the comma breaks it.
This should fix a trailing space:
awk '!/,[ ]*$/'
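If the trailing whitespace could also be a tab, the POSIX character class covers that as well:
awk '!/,[[:space:]]*$/'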
Thanks for your help guys, but I also made a workaround:
I read it into R and then wrote it out again. Then I installed the GNU version of awk and used gawk '{if ((FNR % 2) != 0) {print $0}}'. So if anyone else has the same problem, try it!

Replace everything between two characters

All.
I am a newbie to sed.
I want something like
Input:
ABC,DEF,GHI,JKL,MNO
Output:
ABC,,,,MNO
That is, I want to remove all the content between the ',' separators.
This might work for you (GNU sed):
sed 's/[^,]*,/,/2g' file
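The numeric flag combined with g (replace from the second match onward) is a GNU sed feature. With the sample line:
$ echo 'ABC,DEF,GHI,JKL,MNO' | sed 's/[^,]*,/,/2g'
ABC,,,,MNO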
you could set all fields between 1 and last to empty with awk:
awk -F, -v OFS="," '{for(i=2;i<NF;i++)$i=""}7'
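The trailing 7 is just an always-true pattern (like the usual 1) that triggers awk's default print. With the sample line:
$ echo 'ABC,DEF,GHI,JKL,MNO' | awk -F, -v OFS="," '{for(i=2;i<NF;i++)$i=""}7'
ABC,,,,MNO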
