Bash / Sed / Grep : Parsing / Capturing a Substring

I have generated a set of filepaths as strings in a bash script, all of this form:
./foo/bar/filename.proto
There can be any number of subfolders/slashes, but they all have the .proto extension.
I want to trim the leading ./ and trailing filename.proto to transform them to look like this:
foo/bar
I have had a surprising amount of difficulty adapting this from other solutions and debugging it. I have tried:
grep -Po "\.\/(.*)\/[^\/]+\.proto"
and
sed -n 's/\.\/\(.*\)\/[^\/]+\.proto/\1/p'
I have tried sed with both escaped and unescaped parentheses. For reference, I am currently working on a mac, and would like the most cross-platform-compatible solution.
I could do this fairly easily in Python, but I want to avoid the complexity of calling another script to do this.
To give you an idea of how this is working, my full script looks like this (so far):
#!/bin/bash
consume_single_folder () {
do_stuff $1
}
find . -name \*.proto|while read fname; do
echo "$fname" |sed -n 's/\.\/\(.*\)\/[^\/]+\.proto/\1/p' | consume_single_folder
done
Any help is appreciated. Thanks!
EDIT:
To be clear, I have tested my regex on regex101.com and it seems to look alright:
\.\/(.*)\/[^\/]+\.proto
It should be greedy, capturing everything between the first and last slash.

Looks like dirname could help you:
$ dirname "./foo/bar/filename.proto"
./foo/bar
With leading ./ removal:
$ dirname "./foo/bar/filename.proto" | sed 's|^\./||'
foo/bar
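You could also pipe the results through sort | uniq to avoid duplicates. In your loop, pass the directory as an argument, since consume_single_folder reads $1 rather than standard input:
find . -name '*.proto' | while IFS= read -r fname; do
dir=$(dirname "$fname" | sed 's|^\./||')
consume_single_folder "$dir"
done
This works on macOS and Linux.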

Please do not use sites like regex101 to test sed regular expressions: syntax and features vary a lot between tools, and between implementations of the same tool. See Why does my regular expression work in X but not in Y? and differences between various sed implementations.
For your given example, changing + to * will work. sed uses basic regular expressions (BRE) by default, where an unescaped + is a literal plus sign rather than a quantifier (look up the differences between BRE and ERE).
$ fname='./foo/bar/filename.proto'
$ echo "$fname" | sed -n 's/\.\/\(.*\)\/[^\/]*\.proto/\1/p'
foo/bar
$ # or use a different delimiter
$ echo "$fname" | sed 's|\./\(.*\)/[^/]*\.proto|\1|'
foo/bar
$ # further simplification as find already filters by extension
$ echo "$fname" | sed 's|\./\(.*\)/.*|\1|'
foo/bar
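If you would rather keep the + quantifier from the original attempt, one option is to ask sed for extended regular expressions with -E, which both GNU sed and the BSD sed shipped with macOS accept. A minimal sketch; the ^ and $ anchors are an added assumption about the shape of the input:

```shell
fname='./foo/bar/filename.proto'

# -E switches sed to ERE, so + is a quantifier and ( ) group without backslashes.
trimmed=$(printf '%s\n' "$fname" | sed -E 's|^\./(.*)/[^/]+\.proto$|\1|')
echo "$trimmed"    # prints: foo/bar
```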
Also, I would suggest reading Why is looping over find's output bad practice? and changing your find syntax accordingly.
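One way to follow that advice is to let find run a small inline script through -exec and do the trimming with parameter expansion, so neither sed nor a read loop is involved. This is only a sketch: list_proto_dirs is a hypothetical helper name, and the sort -u at the end is an optional way to drop duplicate directories.

```shell
# list_proto_dirs is a hypothetical helper, not part of the original script.
list_proto_dirs () {
    (
        cd "$1" || exit 1
        # find passes batches of paths to the inline sh script.
        find . -name '*.proto' -exec sh -c '
            for path do
                dir=${path%/*}              # drop /filename.proto
                printf "%s\n" "${dir#./}"   # drop the leading ./
            done' sh {} +
    ) | sort -u
}

# Demo on a throwaway tree.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/foo/bar" "$tmpdir/baz"
touch "$tmpdir/foo/bar/a.proto" "$tmpdir/foo/bar/b.proto" "$tmpdir/baz/c.proto"
list_proto_dirs "$tmpdir"
```

On the demo tree this prints baz and foo/bar, one per line. A .proto file sitting directly in the starting directory would be reported as a single dot.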

Related

Rename files named foobar(12345).txt to 12345.txt

All:
Quickly and succinctly, I have many many files named as such:
lorem(12312315).txt
ipsum(578938-12-315-13-416-4).txt
amet(ran-dom-guid).txt
And I want to rename them to what's inside the parentheses dot text, like so:
12312315.txt
578938-12-315-13-416-4.txt
randomguid.txt
I'm sure a mix of sed, awk, grep, etc. will do it, but escaping the parentheses in the shell is throwing me. I can't come up with a string that will do it.
If anyone is kind enough to share a few thought cycles and help me, it would be a lovely Karma gesture!
Thanks for reading!
-Jim

Another flavor:
find . -type f -name \*\(\*\).txt -print0 | xargs -0 sh -c '
for filename ; do
basename_core="${filename##*(}"
basename_core="${basename_core%%)*}"
mv "${filename}" "${basename_core}".txt
done' dummy

This might work for you (GNU sed and shell), feeding the filenames to sed on standard input:
printf '%s\n' *.txt | sed -n 's/.*(\(.*\)).*/mv '\''&'\'' '\''\1.txt'\''/p'
This prints a list of mv commands. Once you have checked that they are correct, pipe them to the shell:
printf '%s\n' *.txt | sed -n 's/.*(\(.*\)).*/mv '\''&'\'' '\''\1.txt'\''/p' | sh

find and mv can handle this, with a bash regex match (BASH_REMATCH) to extract the new names:
#!/bin/bash
touch 'lorem(12312315).txt'
touch 'ipsum(578938-12-315-13-416-4).txt'
touch 'amet(ran-dom-guid).txt'
pat='.*\((.*)\)\.txt$'
# Read NUL-delimited names so paths with spaces or newlines survive
while IFS= read -r -d '' f; do
if [[ $f =~ $pat ]]; then
mv -- "$f" "${BASH_REMATCH[1]}.txt"
fi
done < <(find . -type f -name '*.txt' -print0)
ls *.txt

A for loop and Parameter Expansion.
#!/usr/bin/env bash
for f in *\(*\)*.txt; do
temp=${f#*\(}
value=${temp%\)*}
case $value in
*[!0-9-]*) value="${value//-}";;
esac
echo mv -v "$f" "$value.txt"
done
Remove the echo if you're satisfied with the output, so mv can rename/move the files.

Thank you everyone for the responses! I ended up using a mishmash of your suggestions and doing something else entirely, but I'm posting here for posterity...
The files all had one thing in common: the GUID in the filename was also always on line 2 of the accompanying file. So I yank line two, strip out everything that is NOT the GUID, and rename the file to that string, for every .xml file in the directory where the script is run.
Like so:
for i in ./*xml
do
GUID=`cat "$i" | sed -n '2p' | awk '{print $1}' | sed 's/<id>//g' | sed 's/<\/id>//'`
echo File "|" $i "|" is "|" $GUID
done
In the actual script, I do a mv instead of an echo and the files are renamed to the GUID.
Hopefully this helps someone else in the future, and yes, I know it's wasteful to call sed three times. If I were better with regular expressions, I'm sure I could get that down to one! :)
Thanks!
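For what it's worth, the three sed calls and the awk step can probably be folded into one sed invocation. The sketch below assumes line 2 looks like <id>GUID</id>, possibly indented, which is a guess from the description above; extract_guid and the sample file are made up for illustration.

```shell
# Hypothetical sample file; assumes line 2 holds the GUID in an <id> element.
tmpdir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '  <id>ran-dom-guid</id>' '<body/>' > "$tmpdir/sample.xml"

extract_guid () {
    # On line 2 only: keep the text between <id> and </id>, print it, then quit.
    sed -n '2{s|.*<id>\(.*\)</id>.*|\1|p;q;}' "$1"
}

guid=$(extract_guid "$tmpdir/sample.xml")
echo "$guid"    # prints: ran-dom-guid
```

Using | as the delimiter avoids escaping the slash in </id>, and the q stops sed from reading the rest of each file.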

grep ".*" does not match valid matches?

Information and Problems
I am learning Linux commands and was practicing the grep command in bash.
I want to match every file whose name begins with the character "a", which seems like a simple requirement. As I understand it, the regex should be something like a.*, but it doesn't behave as I expected.
Some of the filenames that should match do not.
My Command
I typed commands in a Ubuntu Mate 16.04 VirtualBox terminal.
I have created a directory called test. In the test directory, I have three files:
a.txt
a1.txt
a2.txt
Here the following is my command using grep.
ls -a | grep -E -e a.*
But the output is simply
a.txt
I think .* should mean any number of any character, so a1.txt and a2.txt should match the regex, but they don't.
However, if I try
ls -a | grep -E -e ^a.*
ls -a | grep -E -e a.+
both commands work as I expected, and all the filenames match:
a.txt
a1.txt
a2.txt
I cannot figure out what is going wrong.
What I have tried
I have searched through existing questions. There is one very similar to mine, but it is about the difference between extended and basic grep, which definitely isn't my situation.

Use more quotes!
With the literal command you ran in your question:
ls -a | grep -E -e a.*
...your shell will replace a.* with a list of filenames in the current directory matching a.* as a glob pattern before grep is started at all. (See also the full bash-hackers page on globbing).
If a.* is placed inside quotes, as in:
ls -a | grep -E 'a.*'
...then this string will no longer be evaluated as a glob. You might also want to anchor the regex with ^, to search only at the beginning:
ls -a | grep -E '^a.*'
That said, ls is not a tool built for programmatic use -- it isn't guaranteed to emit filenames in unmodified literal form, so it's not certain that all possible names will be emitted in such a way that grep or other tools will parse them correctly (indeed, ls can't emit all possible names in literal form, since it uses newline delimiters between names, whereas newline literals are actually possible within names themselves). Consider using find for this kind of processing:
while IFS= read -r -d '' filename; do
printf 'Found file: %q\n' "$filename"
done < <(find . -regex '.*/a[^/]*' -print0)
...will work even with files having intentionally difficult-to-process names; consider, for example, mkdir -p $'\n/etc/passwd\n' && touch $'\n/etc/passwd\n/a.txt'.

You are misunderstanding how the shell is parsing your command. When you do this:
ls -a | grep -E -e a.*
The shell globs the command before it is passed to ls or grep. The result of the glob is this:
ls -a | grep -E -e a.txt
Because in globbing, a.* only matches a.txt.
You need to put the regexes in quotes, e.g.
ls -a | grep -E -e 'a.*'
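The expansion described in both answers is easy to observe in a scratch directory. A small sketch, using a temporary directory so nothing in your real files is touched:

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/a1.txt" "$tmpdir/a2.txt"

# Unquoted a.* is a glob: the shell replaces it with matching filenames
# before grep starts. Only a.txt begins with the literal characters "a.".
expanded=$(cd "$tmpdir" && echo a.*)
echo "unquoted a.* becomes: $expanded"

# Quoted, the regex reaches grep unchanged and all three names match.
count=$(cd "$tmpdir" && printf '%s\n' * | grep -c -E '^a.*')
echo "quoted '^a.*' matches $count names"
```

The first echo reports a.txt and the second reports 3, which is why the unquoted command in the question printed only a.txt.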

How to stop this script from moving renamed files out of source folder?

The script works as far as renaming the files but it moves the renamed files out of their respective folders.
I would like it to not move them but only rename them and I have failed after a few days of trying. I know this code is a mess and there is unneeded code in it but it nearly works.
Also the renamed file isn’t getting an extension of .txt but that isn't really an issue for me. I just want to see the "Dynamic Range Value" that is taken from inside the file as the file name so I don’t have to open every file (a couple thousand albums worth) to see what the DR is. Here is the code:
#!/bin/bash
cd /media/Storage/MusicWorks/Processing
find . -name 'dr14.txt' | while IFS=$'\n' read -r i
do mv -n "$i" `egrep -m1 -e 'Official DR value:' "$i" | sed -e 's/Official DR value://'`;
echo "Done"
done
I run this script from the terminal with a bash alias.

I have reservations about the egrep | sed part of your script, but if they work for you, so be it. You need to preserve the pathname of the file, for example like this:
find . -name 'dr14.txt' |
while IFS=$'\n' read -r i
do
newname="${i%/*}"/$(egrep -m1 -e 'Official DR value:' "$i" | sed -e 's/Official DR value://');
mv -n "$i" "$newname"
echo "Done $i ($newname)"
done
The ${i%/*} notation removes everything from the last slash to the end of the name in $i. Since all the names from find will start with ./, this is safe enough. It would not work well on absolute names such as / and /unix, where the result would be the empty string, although /usr/bin/sh would be fine.
Under a little prompting by tripleee in a comment, it is possible to simplify the egrep | sed part of the code to:
newname="${i%/*}"/$(sed -n -e '/Official DR value:/{s///p;q;}' "$i");
The second semicolon is needed with BSD sed but not with GNU sed.
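To make the behaviour of ${i%/*} concrete, here is a tiny sketch. The path and the DR12 value are invented for illustration; in the real script the value comes from the sed command.

```shell
# A path as find would print it; the album name is made up.
i='./Some Artist/Some Album/dr14.txt'
dir=${i%/*}            # remove the shortest suffix matching /*
newname="$dir/DR12"    # DR12 stands in for the value extracted by sed
echo "$newname"        # prints: ./Some Artist/Some Album/DR12

# A name whose only slash is the leading one loses everything:
j='/unix'
echo "[${j%/*}]"       # prints: []
```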

select nth file in folder (using sed)?

I am trying to select the nth file in a folder of which the filename matches a certain pattern:
I've tried using this with sed, e.g.,
sed -n 3p /path/to/files/pattern.txt
but it appears to return the 3rd line of the first matching file.
I've also tried
sed -n 3p ls /path/to/files/*pattern*.txt
which doesn't work either.
Thanks!

Why sed, when bash is so much better at it?
Assuming a variable n holds the zero-based index you want:
Bash
files=(path/to/files/*pattern*.txt)
echo "${files[n]}"
Posix sh
i=0
for file in path/to/files/*pattern*.txt; do
if [ "$i" -eq "$n" ]; then
break
fi
i=$((i + 1))
done
echo "$file"
What's wrong with sed is that you would have to jump through many hoops to make it safe for the entire set of possible characters that can occur in a filename, and even if that doesn't matter to you, you end up with a double layer of subshells to get the answer (note that sed counts lines from 1, while the versions above count from 0):
file=$(printf '%s\n' path/to/files/*pattern*.txt | sed -n "$n"p)
Please, never parse ls.
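Another POSIX-friendly variant, swapped in here as an alternative to the counting loop, is to let the glob fill the positional parameters and shift to the one you want. nth_file is a hypothetical helper name, and this version counts from 1 like sed does:

```shell
# nth_file N FILE... prints the Nth file argument (1-based); hypothetical helper.
nth_file () {
    n=$1
    shift
    # Refuse indexes outside the list instead of printing an empty line.
    [ "$n" -ge 1 ] && [ "$n" -le "$#" ] || return 1
    shift $((n - 1))
    printf '%s\n' "$1"
}

# Demo with names containing spaces, which would trip up ls-based parsing.
tmpdir=$(mktemp -d)
touch "$tmpdir/a pattern.txt" "$tmpdir/b pattern.txt" "$tmpdir/c pattern.txt"
third=$(nth_file 3 "$tmpdir"/*pattern*.txt)
echo "$third"
```

Because the filenames travel as arguments rather than as lines of text, spaces and other unusual characters in them are not a problem.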

ls -1 /path/to/files/*pattern*.txt | sed -n '3p'
or, if pattern is a regex:
ls -1 /path/to/files/ | grep -E 'pattern' | sed -n '3p'
There are many other possibilities, depending on whether you care more about performance or simplicity.

Remove hyphens from filename with Bash

I am trying to create a small Bash script to remove hyphens from a filename. For example, I want to rename:
CropDamageVO-041412.mpg
to
CropDamageVO041412.mpg
I'm new to Bash, so be gentle :] Thank you for any help

Try this:
for file in $(find dirWithDashedFiles -type f -iname '*-*'); do
mv $file ${file//-/}
done
That's assuming that your directories don't have dashes in the name. That would break this.
The ${varname//pattern/replacementText} syntax is Bash's substring replacement; the pattern is a glob, not a regular expression.
Also, this would break if your directories or filenames have spaces in them. If you have spaces in your filenames, you should use this:
for file in *-*; do
mv "$file" "${file//-/}"
done
This has the disadvantage of having to be run in every directory that contains files you want to change, but, like I said, it's a little more robust.

FN=CropDamageVO-041412.mpg
mv "$FN" "$(echo "$FN" | sed -e 's/-//g')"
The $(...) command substitution (like the older backtick form) tells bash to run the command inside it and use that command's output in the expression. The sed part applies a regular expression to remove the hyphens from the filename.
Or to do this to all files in the current directory matching a certain pattern:
for i in *VO-*.mpg
do
mv "$i" "$(echo "$i" | sed -e 's/-//g')"
done

A general solution for removing hyphens from any string:
$ echo "remove-all-hyphens" | tr -d '-'
removeallhyphens
$

f=CropDamageVO-041412.mpg
echo "${f//-}"
or, of course,
mv "$f" "${f//-}"
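If the files live in a tree whose directories may themselves contain hyphens or spaces, one possible approach is to rename only the last path component. This is a sketch under that assumption; strip_hyphens_in_tree is a hypothetical helper name, and it uses tr so that it also runs in plain sh.

```shell
# strip_hyphens_in_tree DIR: remove hyphens from file names under DIR,
# leaving directory names untouched. Hypothetical helper for illustration.
strip_hyphens_in_tree () {
    find "$1" -type f -name '*-*' -exec sh -c '
        for path do
            dir=${path%/*}      # everything before the last slash
            base=${path##*/}    # the file name itself
            new=$(printf "%s" "$base" | tr -d "-")
            mv -- "$path" "$dir/$new"
        done' sh {} +
}

# Demo: the directory keeps its hyphen, the file loses its hyphens.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/my-dir"
touch "$tmpdir/my-dir/Crop Damage-VO-041412.mpg"
strip_hyphens_in_tree "$tmpdir"
ls "$tmpdir/my-dir"
```

Note that mv will overwrite an existing file with the new name; adding -n to mv is one way to avoid that where it is supported.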
