Unix substring file name by 3rd symbol - "_" from right side - shell

i have a list of files like :
File_one_1_11_1
File_two_v1_22_1
File_three_2_31_5
My questions is how to substring file name to have a list like this:
File_one
File_two
File_three
Thank you.

echo "File_one_1_11_1" | rev | cut -d"_" -f4- | rev

This could be done easily by sed's back reference capability, try following, written and tested with shown samples(by OP's posted answer could see these are the values which OP reading from either a variable OR from an Input_file).
sed 's/\([^_]*\)\(_[^_]*\)_.*/\1\2/' Input_file
Explanation: Simple explanation will be, using sed's substitution capability to substitute everything with 1st and 2nd captured groups(which is done in regex part before substitution part).
OR with simple awk it will be just:
awk 'BEGIN{FS=OFS="_"} {print $1,$2}' Input_file

Assuming that you have one of those filenames stored in a variable filename, i.e.
filename=File_one_1_11_1
you can do a
residual_name=${filename%_*_*_*}

Related

sed extract part of string from a file

I've ben trying to extract only part of string from a file looking like this:
str1=USER_NAME
str2=justAstring
str3=https://product.org/v-4.5-bin.zip
str4=USER_HOME
I need to extract ONLY the version - in this case: 4.5
I did it by grep and then sed but now the output is 4.5-bin.zip
-> grep str3 file.txt
str3=https://product.org/v-4.5-bin.zip
-> echo str3=https://product.org/v-4.5-bin.zip | sed -n "s/^.*v-\(\S*\)/\1/p"
4.5-bin.zip
What should I do in order to remove also the -bin.zip at the end?
Thanks.
1st solution: With your shown samples, please try following sed code.
sed -n '/^str3=/s/.*-\([^-]*\)-.*/\1/p' Input_file
Explanation: Using sed's -n option which will STOP printing of values by default, to only print matched part. In main program checking condition if line starts from str3= then perform substitution there. In substitution catching everything between 1st - and next - in a capturing group and substituting whole line with it by using \1 and printing the matched portion only by using p option.
2nd solution: Using GNU grep you could try following grep program.
grep -oP '^str3=.*?-\K([^-]*)' Input_file
3rd solution: Using awk program for getting expected output as per shown smaples.
awk -F'-' '/^str3=/{print $2}' Input_file
4th solution: Using awk's match function to get expected results with help of using RSTART and RLENGTH variables which get set once a TRUE match is found by match function.
awk 'match($0,/^str3=.*-/){split(substr($0,RSTART,RLENGTH),arr,"-");print arr[2]}' Input_file
If you know the version contains just digits and dots, replace \S by [0-9.]. Also, match the remaining characters outside of the capture group to get it removed.
sed -n 's/^.*v-\([0-9.]*\).*/\1/p'

Grabbing only text/substring between 4th and 7th underscores in all lines of a file

I have a list.txt which contains the following lines.
Primer_Adapter_clean_KL01_BOLD1_100_KL01_BOLD1_100_N701_S507_L001_merged.fasta
Primer_Adapt_clean_KL01_BOLD1_500_KL01_BOLD1_500_N704_S507_L001_merged.fasta
Primer_Adapt_clean_LD03_BOLD2_Sessile_LD03_BOLD2_Sessile_N710_S506_L001_merged.fasta
Now I would like to grab only the substring between the 4th underscore and 7th underscore such that it will appear as below
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
I tried the below awk command but I guess I've got it wrong. Any help here would be appreciated. If this can be achieved via sed, I would be interested in that solution too.
awk -v FPAT="[^__]*" '$4=$7' list.txt
I feel like awk is overkill for this. You can just use cut to select just the fields you want:
$ cut -d_ -f5-7 list.txt
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03
awk 'BEGIN{FS=OFS="_"} {print $5,$6,$7}' file
Output:
BOLD1_100_KL01
BOLD1_500_KL01
BOLD2_Sessile_LD03

Find multiple strings between values and replace with newline in bash

I need to write a bash script to list values from an sql database.
I've got so far but now I need to get the rest of the way.
The string so far is
10.255.200.0/24";i:1;s:15:"10.255.207.0/24";i:2;s:14:"192.168.0.0/21
I now need to delete everything between the speech marks and send it to a new line.
desired output:
10.255.200.0/24
10.255.207.0/24
192.168.0.0/21
any help would be greatly appreciated.
$ tr '"' '\n' <<< $string | awk 'NR%2'
10.255.200.0/24
10.255.207.0/24
192.168.0.0/21
You could use :
echo 'INPUT STRING HERE' | sed $'s/"[^"]*"/\\\n/g'
Explanation :
sed 's/<PATTERN1>/<PATTERN2/g' : we substitute every occurrence of PATTERN1 by PATTERN2
[^"]*: any character that is not a ", any number of time
\\\n: syntax for newline in sed (reference here)
Considering that your Input_file is same as shown sample then could you please try following.
awk '
{
while(match($0,/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+/)){
print substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}' Input_file
This might work for you (GNU sed):
sed 's/"[^"]*"/\n/g' file
Or using along side Bash:
sed $'/"[^"]*"/\\n/g' file
Or using most other sed's:
sed ':a;/"[^"]*"\(.*\)\(.\|$\)/{G;s//\2\1/;ba}' file
This uses the feature that an unadulterated hold space contains a newline.

How to extract specific string in a file using awk, sed or other methods in bash?

I have a file with the following text (multiple lines with different values):
TokenRange(start_token:8050285221437500528,end_token:8051783269940793406,...
I want to extract the value of start_token and end_token. I tried awk and cut, but I am not able to figure out the best way to extract the targeted values.
Something like:
cat filename| get the values of start_token and end_token
grep -oP '(?<=token:)\d+' filename
Explanation:
-o: print only part that matches, not complete line
-P: use Perl regex engine (for look-around)
(?<=token:): positive look-behind – zero-width pattern
\d+: one or more digits
Result:
8050285221437500528
8051783269940793406
A (potentially more efficient) variant of this, as pointed out by hek2mgl in his comment, uses \K, the variable-width look-behind:
grep -oP 'token:\K\d+'
\K keeps everything that has been matched to the left of it, but does not include it in the match (see perlre).
Using awk:
awk -F '[(:,]' '{print $3, $5}' file
8050285221437500528 8051783269940793406
First value is start_token and last value is end_token.
a sed version
sed -e '/^TokenRange(/!d' -e 's/.*:\([0-9]*\),.*:\([0-9]*\),.*/\1 \2/' YourFile

Delete every other row in CSV file using AWK or grep

I have a file like this:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,
1000_Tv178.tif,34.66987165
1000_Tv178.tif,
1001_Tv180.tif,65.51335742
1001_Tv180.tif,
1002_Tv184.tif,33.83784863
1002_Tv184.tif,
1002_Tv184.tif,22.82542442
1002_Tv184.tif,
How can I make it like this using a simple Bash command? :
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
Im other words, I need to delete every other row, starting with the second.
Thanks!
hek2mgl's (deleted) answer was on the right track, given the output you actually desire.
awk -F, '$2'
This says, print every row where the second field has a value.
If the second field has a value, but is nothing but whitespace you want to exclude, try this:
awk -F, '$2~/.*[^[:space:]].*/'`
You could also do this with sed:
sed '/,$/d'
Which says, delete every line that ends with a comma. I'm sure there's a better way, I avoid sed.
If you really want to explicitly delete every other row:
awk 'NR%2'
This says, print every row where the row number modulo 2 is not zero. If you really want to delete every even row it doesn't actually matter that it's a comma-delimited file.
awk provides a simple way
awk 'NR % 2' file.txt
This might work for you (GNU sed):
sed '2~2d' file
or:
sed 'n;d' file
Here's the gnu sed equivalent of the awk answers provided. Now you can safely use sed's -i flag, by specifying a backup extension:
sed -n -i.bak 'N;P' file.txt
Note that gawk4 can do this too:
gawk -i inplace -v INPLACE_SUFFIX=".bak" 'NR%2==1' file.txt
Results:
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
If OPs input does not contain space after last number or , this awk can be used.
awk '!/,$/'
1000_Tv178.tif,34.88552709
1000_Tv178.tif,34.66987165
1001_Tv180.tif,65.51335742
1002_Tv184.tif,33.83784863
1002_Tv184.tif,22.82542442
But its not robust at all, any space after , brakes it.
This should fix the last space:
awk '!/,[ ]*$/'
Thank for your help guys, but I also had to make a workaround:
Read it into R and then wrote it out again. Then I installed GNU versions of awk and used gawk '{if ((FNR % 2) != 0) {print $0}}'. So if anyone else have the same problem, try it!

Resources