Extracting all but a certain sequence of characters in Bash - bash

In bash I need to extract a certain sequence of letters and numbers from a filename. In the example below I need to extract just the S??E?? section of the filenames. This must work with both upper/lowercase.
my.show.s01e02.h264.aac.subs.mkv
great.s03e12.h264.Dolby.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
Expected output would be:
s01e02
s03e12
S05E11
I've been trying to do this with SED but can't get it to work. This is what I have tried, without success:
sed 's/.*s[0-9][0-9]e[0-9][0-9].*//'
Many thanks for any help.

With sed we can match the desired string in a capture group, and use the I suffix for case-insensitive matching, to accomplish the desired result.
For the sake of this answer I'm assuming the filenames are in a file:
$ cat fnames
my.show.s01e02.h264.aac.subs.mkv
great.s03e12.h264.Dolby.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
One sed solution:
$ sed -E 's/.*\.(s[0-9][0-9]e[0-9][0-9])\..*/\1/I' fnames
s01e02
s03e12
S05E11
Where:
-E - enable extended regex support
\.(s[0-9][0-9]e[0-9][0-9])\. - match s??e?? with a pair of literal periods as bookends; the s??e?? (wrapped in parens) will be stored in capture group #1
\1 - print out capture group #1
/I - use case-insensitive matching

I think your pattern is OK; the problem is that the substitution replaces the whole matching line with nothing. With grep -o you get only the matched part of a line instead of the whole matching line. So
grep -Eio 'S[0-9]{2}E[0-9]{2}'
solves your problem (-E is needed for the {2} repetition to work, -i ignores case). Compared to your pattern, only numbers will be matched. Maybe you can put it in an if, so that lines without a match show that something is wrong with the filename.
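The idea of putting the grep in an if could be sketched roughly as below. The function name episode_tag and the sample name notes.txt are made up for illustration, and -E is used so that {2} is read as a repetition count.

```shell
# Hypothetical helper: print the first SxxExx tag found in a filename.
# Returns non-zero when the name contains no such tag.
episode_tag() {
    # -E: extended regex, -i: ignore case, -o: print only the matched text
    tag=$(printf '%s\n' "$1" | grep -Eio 's[0-9]{2}e[0-9]{2}' | head -n 1)
    [ -n "$tag" ] && printf '%s\n' "$tag"
}

for f in my.show.s01e02.h264.aac.subs.mkv \
         what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4 \
         notes.txt; do
    if ! episode_tag "$f"; then
        # No tag found: report the suspicious name on stderr
        printf 'no episode tag in: %s\n' "$f" >&2
    fi
done
```

The tag is printed with its original capitalisation, since -i only affects matching.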

Suppose you have those file names:
$ ls -1
great.s03e12.h264.Dolby.mkv
my.show.s01e02.h264.aac.subs.mkv
what.a.fab.title.S05E11.Atmos.h265.subs.eng.mp4
You can extract the substring this way:
$ printf "%s\n" * | sed -E 's/^.*([sS][0-9][0-9][eE][0-9][0-9]).*/\1/'
Or with grep:
$ printf "%s\n" *.m* | grep -o '[sS][0-9][0-9][eE][0-9][0-9]'
Either prints:
s03e12
s01e02
S05E11
You could use that same sed or grep on a file (with filenames in it) as well.

Related

Extracting a value from a same file from multiple directories

Directory name F1 F2 F3……F120
Inside each directory, a file with a common name ‘xyz.txt’
File xyz.txt has a value
Example:
F1
Xyz.txt
3.345e-2
F2
Xyz.txt
2.345e-2
F3
Xyz.txt
1.345e-2
--
F120
Xyz.txt
0.345e-2
I want to extract these values and paste them in a single file say ‘new.txt’ in a column like
New.txt
3.345e-2
2.345e-2
1.345e-2
---
0.345e-2
Any help please? Thank you so much.
If your files look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general example in which every digit can be anything. The regular expression says that the whole line must consist of: any digit [0-9], a dot character [.], three digits [0-9]{3}, the letter 'e', a hyphen, and any digit [0-9].
If your data is more regular you can also try a simpler solution:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can be anything.
If your files might contain something else than the line, but the line you want to extract can be unambiguously extracted with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to separately tell it to not print the file name. The -x option can be used as a convenience to simplify the regex.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If you have some files which match the wildcard which should be excluded, try a more complex wildcard, or multiple wildcards which only match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
Process files in parallel but output results in order.
If the files contain just one line, and you want the whole thing, you can use bash range expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines, and you need to actually extract the value, use grep -o (-o is not POSIX, but your grep probably has it). Add -h so that grep does not prefix each match with the file name.
grep -oh '[0-9]\.345e-2' /path/to/F{1..120}/Xyz.txt > output.txt
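One reason the order matters: a glob such as F*/Xyz.txt expands in sorted order, so F10 comes before F2. A counting loop avoids that without relying on brace expansion. The following is only a sketch; the function name collect_values, its two parameters, and the three-directory demo tree are assumptions made for illustration.

```shell
# Hypothetical helper: print the value line from F1/Xyz.txt .. FN/Xyz.txt
# in numeric order. $1 = parent directory, $2 = highest directory number.
collect_values() {
    i=1
    while [ "$i" -le "$2" ]; do
        # -h drops the filename prefix, -x requires the whole line to match
        grep -hx '[0-9]\.[0-9]*e-*[0-9]*' "$1/F$i/Xyz.txt"
        i=$((i + 1))
    done
}

# Small demo tree with three directories instead of 120.
tmp=$(mktemp -d)
for d in 1 2 3; do mkdir "$tmp/F$d"; done
printf '3.345e-2\n' > "$tmp/F1/Xyz.txt"
printf '2.345e-2\n' > "$tmp/F2/Xyz.txt"
printf 'header line\n1.345e-2\n' > "$tmp/F3/Xyz.txt"

collect_values "$tmp" 3 > "$tmp/new.txt"
cat "$tmp/new.txt"
rm -rf "$tmp"
```

Lines that do not look like a value (the "header line" in F3) are skipped by the -x match.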

How using sed one can find and replace a pattern with multiple strings?

I have this filename, x.xx.xxx.xxxx.api-6.8.25-SNAPSHOT.jar, which I would like to change to x.xx.xxx.xxxx.api_6.8.25.SNAPSHOT.jar. Using sed I came up with this:
FILENAME=$(sed 's/-(?=[\w])/_/g' <<< "$FILENAME")
The regex pattern seems to correctly target the dashes, but when my script runs no change is applied to my string. What am I missing here? And how can I have multiple substitutions, changing the first dash to an underscore and the second to a dot?
I suggest:
echo 'x.xx.xxx.xxxx.api-6.8.25-SNAPSHOT.jar' | sed 's/-/_/; s/-/./'
Output:
x.xx.xxx.xxxx.api_6.8.25.SNAPSHOT.jar
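A short sketch of why this works: sed's regular expressions have no lookahead, so in the original attempt the (?=...) part is treated as literal text and never matches the filename. A plain s command without the g flag replaces only the first match that is still present, so chaining two of them handles the first and second dash in turn. The function name fix_name is made up for illustration.

```shell
# Hypothetical helper: turn the first dash into "_" and the second into ".".
fix_name() {
    # Neither s command has the g flag, so each rewrites only the
    # first dash that is still present when it runs.
    printf '%s\n' "$1" | sed 's/-/_/; s/-/./'
}

fix_name 'x.xx.xxx.xxxx.api-6.8.25-SNAPSHOT.jar'
```

A name with a third dash would keep that dash, since only two substitutions are chained.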
Pure bash solution without calling any external utility:
fn='xx.xxx.xxxx.api-6.8.25-SNAPSHOT.jar'
fn="${fn/-/_}" # replace first - by _
fn="${fn/-/.}" # replace next - by .
echo "$fn"
xx.xxx.xxxx.api_6.8.25.SNAPSHOT.jar
You can use
FILENAME=$(sed -E 's/(.*)-([0-9.]+)-/\1_\2./' <<< "$FILENAME")
See the online demo.
Details:
-E enables POSIX ERE syntax
(.*)-([0-9.]+)- - a regex that matches and captures into Group 1 any zero or more chars, then -, then one or more digits or dots captured into Group 2 and then a -
\1_\2. is the replacement, Group 1, _, Group 2 and a ..

Insert character after pattern with character exclusion using sed

I have this string of file names.
FileNames="FileName1.txtStrange-File-Name2.txt.zipAnother-FileName.txt"
What I would like to do is separate the file names with semicolons so I can iterate over them. For the .zip extension I have a working command.
I tried the following:
FileNames="${FileNames//.zip/.zip;}"
echo "$FileNames" | sed 's|.txt[^.zip]|.txt;|g'
Which works partially. It add a semicolon to the .zip as expected, but where sed matches the .txt I got the output:
FileName1.txt;trange-File-Name2.txt.zip;Another-FileName.txt
I think because of the character exclusion sed replaces the following character after the match.
I would like to have an output like this:
FileName1.txt;Strange-File-Name2.txt.zip;Another-FileName.txt
I'm not tied to sed, but it would be fine to use it.
There might be a better way, but you can do it with sed like this:
$ echo "FileName1.txtStrange-File-Name2.txt.zipAnother-FileName.txt" | sed 's/\(zip\|txt\)\([^.]\)/\1;\2/g'
FileName1.txt;Strange-File-Name2.txt.zip;Another-FileName.txt
Beware that [^.zip] matches 'one char that is not ., nor z, nor i nor p'. It does not match 'a word that is not .zip'
Note the less verbose solution by @sundeep:
sed -E 's/(zip|txt)([^.])/\1;\2/g'
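To make the bracket-expression point concrete, the sketch below runs the capture-group substitution and then splits the result on the semicolons. The function name split_names is made up, and -E assumes a sed with extended-regex support (GNU and BSD sed both have it).

```shell
# Hypothetical helper: insert ";" after each "zip" or "txt" that is
# followed by a character other than a dot.
split_names() {
    # Group 2 holds the character after the extension, so it is
    # written back instead of being swallowed by the replacement.
    printf '%s\n' "$1" | sed -E 's/(zip|txt)([^.])/\1;\2/g'
}

names=$(split_names 'FileName1.txtStrange-File-Name2.txt.zipAnother-FileName.txt')

# Iterate over the pieces by splitting on ";" (IFS is restored afterwards).
old_ifs=$IFS
IFS=';'
for name in $names; do
    printf '%s\n' "$name"
done
IFS=$old_ifs
```

The unquoted $names in the loop is deliberate, since word splitting on IFS is what separates the names; this assumes the names contain no glob characters.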
sed -r 's/(\.[a-z]{3})(.)/\1;\2/g'
would be a more generic expression.

how to grep the following

I have an input file
RAKESH_ONE
RAKESH-TWO
RAKESH123
RAKESHTHREE
/RAKESH/
FIVERAKESH
456RAKESH
WELCOME123
This is RAKESH
I would like to get the output
RAKESH_ONE
RAKESH-TWO
/RAKESH/
This is RAKESH
I want to print the line matching the pattern RAKESH. If the pattern is prefixed or suffixed with alphanumeric we should avoid it.
([^a-zA-Z0-9]+|^)RAKESH([^a-zA-Z0-9]+|$)
This will match patterns on the lines without alphanumeric prefixes or suffixes. It will not match the whole line, but if used with grep or sed you can output just the lines you need.
UPDATE
As requested, here's the full grep command. Use the -E option to use extended regex:
grep -E "([^a-zA-Z0-9]+|^)RAKESH([^a-zA-Z0-9]+|$)" file.txt
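As a quick check, the command can be run against the sample input from the question. This is a sketch only: the function name match_standalone and the temporary file are made up, and the word passed in is assumed to contain no regex metacharacters.

```shell
# Hypothetical helper: print lines of file $2 where word $1 stands alone,
# i.e. is not directly preceded or followed by a letter or digit.
match_standalone() {
    grep -E "([^a-zA-Z0-9]+|^)$1([^a-zA-Z0-9]+|\$)" "$2"
}

sample=$(mktemp)
printf '%s\n' RAKESH_ONE RAKESH-TWO RAKESH123 RAKESHTHREE /RAKESH/ \
    FIVERAKESH 456RAKESH WELCOME123 'This is RAKESH' > "$sample"

# Prints RAKESH_ONE, RAKESH-TWO, /RAKESH/ and "This is RAKESH"
match_standalone RAKESH "$sample"
rm -f "$sample"
```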

Using BASH, how to increment a number that uniquely only occurs once in most lines of an HTML file?

The target is always going to be between two characters, 'E' and '/' and there will never be but one occurrence of this combination, e.g. 'E01/' in most lines in the HTML file and will always be between '01' and '90'.
So, I need to programmatically read the file and replace each occurrence of 'Enn/' where 'nn' in 'Enn/' will be between '01' and '90' and must maintain the '0' for numbers '01' to '09' in 'Enn/' while incrementing the existing number by 1 throughout the HTML file.
Is this doable and if so how best to go about it?
Edit: Target lines will be in one or the other formats:
<DT>ProgramName
<DT>Program Name
You can use sed inside BASH as a fantastic one-liner, either:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+(10#\2>=90?0:1)))/ge' FILENAME
or if you are guaranteed the number is lower than 100:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+1))/ge' FILENAME
Basically, you'll be doing an in-place search and replace. The first command will not increment past 90 (since you didn't specify the exact nature of the overflow condition). So E89/ -> E90/, E90/ -> E90/, and if by chance you have E91/, it will remain E91/. Put this line inside a loop to process multiple files.
A small explanation of the above command:
-r enables extended regular expressions
-i states to write back to the same file (be careful with overwriting!)
s/search/replace/ge this is the regex command you'll be using
s/ states you'll be doing a substitution
(.*E) first grouping of all characters up to and including the E (case sensitive)
([0-9]{2}) second grouping of numbers 0 through 9, repeated twice (fixed width)
(\/.*) third grouping getting the escaped trailing slash and everything after that
/ (slash separator) denotes end of search pattern and beginning of replacement pattern
printf "format" var this is the expression used for each replacement
\1 place first grouping found here
%02u the replace format for the var
\3 place third grouping found here
$((expression)) BASH arithmetic expression to use in printf format
10#\2 force second grouping as a base 10 number
+(10#\2>=90?0:1) add 0 or 1 to the second grouping based on if it is >= 90 (as used in first command)
+1 add 1 to the second grouping (see second command)
/ge flags: g for global replacement, e to execute the replacement as a shell command and use its output (GNU sed)
GNU sed and awk are very powerful tools to do this sort of thing.
You can use the following perl one-liner to increment the numbers while maintaining the ones with leading 0s.
perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
$ cat file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
$ perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
You can add the -i option to make changes in-place. I would recommend creating backup before doing so.
Not as elegant as one line sed!
Break the commands used into multiple commands and you can debug your bash or grep or sed.
# find the number
# use -o to grep to just return pattern
# use head -n1 for safety to just get 1 number
n=$(grep -o "E[0-9][0-9]\/" file.html |grep -o "[0-9][0-9]"|head -n1)
# 08 and 09 are invalid as octal, so force base 10
n1=$(( 10#$n ))
echo "Debug n1=$n1 n=$n"
n2=$n1
# bash arithmetic done inside (( ))
# as ever with bash bracketing whitespace is needed
(( n2++ ))
echo debug n2=$n2
# use sed with -i -e for inline edit to replace number
sed -i -e "s/E$n\//E$(printf '%02d' "$n2")\//" file.html
grep "E[0-9][0-9]" file.html
awk might be better. Maybe could do it in one awk command also.
The sed one-liner in other answer is awesome :-)
This works in bash (the (( )) and 10# syntax are not available in plain sh).
http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
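The remark that awk could do this in one command might look something like the sketch below. It is not taken from any of the answers: the function name bump_episode is made up, and leaving numbers of 90 and above unchanged is an assumption that mirrors the first sed command.

```shell
# Hypothetical helper: increment the two-digit number in the first "Enn/"
# on each line. Numbers already at 90 or above are left alone (assumed cap).
bump_episode() {
    awk '{
        if (match($0, /E[0-9][0-9]\//)) {
            # "08" + 0 is 8 in awk, so there is no octal problem
            n = substr($0, RSTART + 1, 2) + 0
            if (n < 90) n++
            # Rebuild the line: text through E, padded number, rest from "/"
            $0 = substr($0, 1, RSTART) sprintf("%02d", n) substr($0, RSTART + 3)
        }
        print
    }' "$@"
}

printf '%s\n' 'a/S01E08/x' 'b/S01E09/x' 'c/S01E90/x' 'no number here' | bump_episode
```

Called with file names it reads those files; with no arguments it reads standard input. Output goes to standard output, so the original file is not modified.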
