Grabbing only text/substring between 4th and 7th underscores in all lines of a file - bash

I have a list.txt which contains the following lines.
Primer_Adapter_clean_KL01_BOLD1_100_KL01_BOLD1_100_N701_S507_L001_merged.fasta
Primer_Adapt_clean_KL01_BOLD1_500_KL01_BOLD1_500_N704_S507_L001_merged.fasta
Primer_Adapt_clean_LD03_BOLD2_Sessile_LD03_BOLD2_Sessile_N710_S506_L001_merged.fasta
Now I would like to grab only the substring between the 4th and the 7th underscore, so that it appears as below:
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
I tried the below awk command but I guess I've got it wrong. Any help here would be appreciated. If this can be achieved via sed, I would be interested in that solution too.
awk -v FPAT="[^__]*" '$4=$7' list.txt
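As an aside, that attempt fails for two reasons: [^__] in a bracket expression is the same as [^_], and $4=$7 assigns field 7 to field 4 rather than selecting a range. A corrected sketch along the same lines, assuming GNU awk since FPAT is a gawk extension:
gawk -v FPAT='[^_]+' -v OFS=_ '{print $5, $6, $7}' list.txt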

I feel like awk is overkill for this. You can just use cut to select the fields you want:
$ cut -d_ -f5-7 list.txt
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03

awk 'BEGIN{FS=OFS="_"} {print $5,$6,$7}' file
Output:
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
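Since the question also asked for sed: a sketch using extended regexes (-E works in GNU and BSD sed) that skips the first four underscore-delimited fields, captures the next three, and drops the rest:
sed -E 's/^([^_]*_){4}(([^_]*_){2}[^_]*).*/\2/' list.txt
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03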

Related

Unix substring file name by 3rd symbol - "_" from right side

I have a list of files like:
File_one_1_11_1
File_two_v1_22_1
File_three_2_31_5
My question is how to trim the file names to get a list like this:
File_one
File_two
File_three
Thank you.
echo "File_one_1_11_1" | rev | cut -d"_" -f4- | rev
This can be done easily with sed's back-reference capability. Try the following, written and tested with the shown samples (from the OP's posted answer it looks like these values come either from a variable or from an Input_file):
sed 's/\([^_]*\)\(_[^_]*\)_.*/\1\2/' Input_file
Explanation: simply put, this uses sed's substitution capability to replace the whole line with the 1st and 2nd captured groups (defined in the regex part, before the substitution part).
OR with simple awk it will be just:
awk 'BEGIN{FS=OFS="_"} {print $1,$2}' Input_file
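With the sample names above, either version prints:
File_one
File_two
File_three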
Assuming that you have one of those filenames stored in a variable filename, i.e.
filename=File_one_1_11_1
you can do a
residual_name=${filename%_*_*_*}
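To apply that expansion to every name in a list, a small sketch (files.txt is a hypothetical file holding one name per line):
while IFS= read -r filename; do
    echo "${filename%_*_*_*}"
done < files.txt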

Awk Match a TSV column and replace all rows with a prefix in bash

I have a TSV file with the following format:
HAPPY today I feel good
SAD this is a bad day
UPSET Hey please leave me alone!
I have to replace the first column value with a prefix like __label__ plus the value in lower case, so that the output is
__label__happy today I feel good
__label__sad this is a bad day
__label__upset Hey please leave me alone!
in the shell (using awk, sed, etc.).
awk 'BEGIN{FS=OFS="\t"}{ $1 = "__label__" tolower($1) }1' infile
The following awk may also help here:
awk -F"\t" '{$1=tolower($1);printf("__label__%s\n",$0)}' OFS="\t" Input_file
Another awk (note that sub() treats $1 as a regex here, so this is only safe for plain labels like these):
$ awk 'sub($1,"__label__"tolower($1))' file
with GNU sed
$ sed -r 's/[^\t]+/__label__\L&/' file

remove first text with shell script

Please, someone help me with this bash script. Let's say I have lots of file URLs like the ones below:
https://example.com/x/c-ark4TxjU8/mybook.zip
https://example.com/x/y9kZvVp1k_Q/myfilename.zip
My question is: how do I remove all the other text and leave only the file name?
I've tried the command described in this URL: How to delete first two lines and last four lines from a text file with bash?
But since the text is random, meaning it doesn't have fixed lengths to count, that code does not work.
You can use the sed utility to parse out just the filenames
sed 's_.*/__'
You can use awk:
The easiest way that I've found:
awk -F/ '{print $NF}' file.txt
or, hard-coding the field position (this only works because these URLs all have the same number of slashes):
awk -F/ '{print $6}' file.txt
You can also use sed:
sed 's;.*/;;' file.txt
You can use cut (again relying on the fixed number of slashes):
cut -d'/' -f6 file.txt
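If a URL is already in a shell variable rather than a file, bash parameter expansion does the same job without spawning a process (a sketch; url is an assumed variable name):
url=https://example.com/x/c-ark4TxjU8/mybook.zip
echo "${url##*/}"   # removes everything up to the last slash, leaving mybook.zip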

Delete every other row in CSV file using AWK or grep

I have a file like this:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,
1000_Tv178.tif,34.66987165
1000_Tv178.tif,
1001_Tv180.tif,65.51335742
1001_Tv180.tif,
1002_Tv184.tif,33.83784863
1002_Tv184.tif,
1002_Tv184.tif,22.82542442
1002_Tv184.tif,
How can I make it like this using a simple Bash command?
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
In other words, I need to delete every other row, starting with the second.
Thanks!
hek2mgl's (deleted) answer was on the right track, given the output you actually desire.
awk -F, '$2'
This says, print every row where the second field has a value.
If the second field has a value, but is nothing but whitespace you want to exclude, try this:
awk -F, '$2~/.*[^[:space:]].*/'
You could also do this with sed:
sed '/,$/d'
Which says, delete every line that ends with a comma. I'm sure there's a better way; I tend to avoid sed.
If you really want to explicitly delete every other row:
awk 'NR%2'
This says, print every row where the row number modulo 2 is not zero. If you really want to delete every even row it doesn't actually matter that it's a comma-delimited file.
awk provides a simple way
awk 'NR % 2' file.txt
This might work for you (GNU sed):
sed '2~2d' file
or:
sed 'n;d' file
Here n prints the current line and loads the next one, which d then deletes, so every second line disappears.
Here's the GNU sed equivalent of the awk answers provided. Now you can safely use sed's -i flag by specifying a backup extension:
sed -n -i.bak 'N;P' file.txt
Note that gawk4 can do this too:
gawk -i inplace -v INPLACE_SUFFIX=".bak" 'NR%2==1' file.txt
Results:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
If the OP's input does not contain a space after the last number or comma, this awk can be used:
awk '!/,$/'
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
But it's not robust at all; any space after the comma breaks it.
This handles a trailing space:
awk '!/,[ ]*$/'
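A slightly broader variation (not from the original answer) also tolerates tabs by using a POSIX character class:
awk '!/,[[:space:]]*$/' file.txt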
Thanks for your help guys, but I also had to make a workaround:
I read it into R and then wrote it out again. Then I installed a GNU version of awk and used gawk '{if ((FNR % 2) != 0) {print $0}}'. So if anyone else has the same problem, try it!

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why):
ram,doc,doc,
shaym,eng,eng,
I am using the cut command:
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shaym,eng
It seems cut can print a field only once. I need to print the same field twice or n times.
Why do I need this? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
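To see the two substitutions at work, here is the first sample line pushed through each step separately (just a walk-through of the command above):
$ echo 'ram,33,professional,doc' | sed 's/,.*,/,/'
ram,doc
$ echo 'ram,doc' | sed 's/\(,.*\)/\1\1,/'
ram,doc,doc,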
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
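The same variable repetition can be done in awk alone, without the cut/sed pipeline (a sketch; n is the repeat count):
awk -F, -v OFS=, -v n=10 '{s = $1; for (i = 0; i < n; i++) s = s OFS $4; print s OFS}' a.csv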
I had the same problem, but instead of adding all the columns to awk, I just used (to duplicate the 2nd column):
awk -F'\t' -v OFS='\t' '$2=$2"\t"$2'   # for tab-delimited files; -F'\t' makes fields split on tabs only
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
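Applied to the a.csv from this question, that duplicates the second field in place:
$ awk -F , -v OFS=, '$2=$2","$2' a.csv
ram,33,33,professional,doc
shaym,23,23,salaried,eng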
