Using egrep and regular expression together - bash

I want to search the text file below for words that end in _letter and extract the whole word, starting right after the "::". There are no spaces anywhere in the lines.
blahblah:/blahblah::abc_letter:/blahblah/blahblah
blahblah:/blahblah::cd_123_letter:/blahblah/blahblah
blahblah:::/blahblah::24_cde_letter:/blahblah/blahblah
blahblah::/blahblah::45a6_letter:/blahblah/blahblah
blahblah:/blahblah::fgh_letter:/blahblah/blahblah
blahblah:/blahblah::789_letter:/blahblah/blahblah
I tried
egrep -o '*_letter'
and
egrep -o "*_letter"
But it only returns the word _letter.
Then I want to feed the output into a shell script for loop. The script will look like the following:
for i in [grep command]
mkdir $i
end
It will create the following directories
abc_letter/
cd_123_letter/
24_cde_letter/
45a6_letter/
fgh_letter/
789_letter/
PS: The text between :: and _letter doesn't contain any special characters, only alphanumeric characters.
Also, my system doesn't have Perl.

Assuming no spaces or new-lines:
for i in $(sed 's/^.*:\([^/]*_letter\):.*$/\1/g' infile); do
mkdir "$i"
done

To extract the strings that run from just after a : up to _letter in file.txt and use them in your for loop, you can use the following egrep command and revise your script.sh like this:
#!/bin/bash
for i in $(egrep -o "[^:]+_letter" file.txt); do
mkdir -p "$i"
done
Then you run ./script.sh, and when you check with ls afterwards, you see:
$ ls -1
24_cde_letter
45a6_letter
789_letter
abc_letter
cd_123_letter
fgh_letter
file.txt
script.sh
Explanation
Your original egrep -o '*_letter' probably confused bash filename expansion (globbing) with regular expressions.
In bash, *something uses the star as a globbing character: * matches anything, followed by something.
In a regular expression, however, the star * means "the preceding character, zero or more times". Since * is at the beginning of what you wrote, there is nothing before it, so it does not match anything there.
The only thing egrep can match is _letter, and since the -o option displays only the matched text, one match per line, you originally saw only lines of _letter.
Our new changes:
The egrep pattern starts with [^...], a negated bracket expression, which matches any character except the ones you put inside. We put : inside.
The + says to match the preceding item one or more times.
Combined, this says: look for anything but :, one or more times.
So it starts matching after a : and keeps matching until the next part of the pattern.
The next part of the pattern is just _letter.
egrep -o shows only the matched text, one match per line.
So in this way, from lines such as:
blahblah:/blahblah::abc_letter:/blahblah/blahblah
It successfully extracts:
abc_letter
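To see the difference the bracket expression makes, here is a small sketch that runs two patterns against the sample line above. It uses grep -E, the current spelling of egrep; the variable names are only for illustration.

```bash
#!/bin/bash
line='blahblah:/blahblah::abc_letter:/blahblah/blahblah'

# ".*" is greedy and may cross colons, so the match starts at the beginning of the line.
greedy=$(printf '%s\n' "$line" | grep -oE '.*_letter')

# "[^:]+" cannot cross a colon, so the match starts right after the "::".
bounded=$(printf '%s\n' "$line" | grep -oE '[^:]+_letter')

printf 'greedy:  %s\n' "$greedy"    # greedy:  blahblah:/blahblah::abc_letter
printf 'bounded: %s\n' "$bounded"   # bounded: abc_letter
```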
Then, changes to your bash script:
Bash command substitution $() sends the results of the egrep command to the for loop.
The loop uses the for i in value...; do ... done syntax.
mkdir -p is just a convenience in case you are re-testing: it does not error if the directory already exists.
Altogether, this extracts the pattern you wanted and creates directories with those names.
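If the names could ever contain spaces or glob characters, a while read loop is a slightly more defensive variant than word-splitting $(...). The sketch below assumes a scratch directory created with mktemp and three sample lines copied from the question; workdir and name are illustrative variable names.

```bash
#!/bin/bash
# Scratch directory and sample data (three lines copied from the question).
workdir=$(mktemp -d)
cat > "$workdir/file.txt" <<'EOF'
blahblah:/blahblah::abc_letter:/blahblah/blahblah
blahblah:/blahblah::cd_123_letter:/blahblah/blahblah
blahblah:::/blahblah::24_cde_letter:/blahblah/blahblah
EOF

# Read one match per line instead of word-splitting $(...), and quote the name.
grep -oE '[^:]+_letter' "$workdir/file.txt" |
while IFS= read -r name; do
    mkdir -p -- "$workdir/$name"
done

ls "$workdir"
```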

Related

How to remove all file extensions in bash?

x=./gandalf.tar.gz
noext=${x%.*}
echo $noext
This prints ./gandalf.tar, but I need just ./gandalf.
I might have even files like ./gandalf.tar.a.b.c which have many more extensions.
I just need the part before the first .
If you want to give sed a chance then:
x='./gandalf.tar.a.b.c'
sed -E 's~(.)\..*~\1~g' <<< "$x"
./gandalf
Or a two-step process in bash:
y="${x#./}"
echo "./${y%%.*}"
./gandalf
Using extglob shell option of bash:
shopt -s extglob
x=./gandalf.tar.a.b.c
noext=${x%%.*([!/])}
echo "$noext"
This deletes the longest suffix that starts with a . and contains no / character, that is, everything from the first . in the file name onward. It also works for x=/pq.12/r/gandalf.tar.a.b.c
Perhaps a regexp is the best way to go if your bash version supports it, as it doesn't fork new processes.
This regexp works with any prefix path and takes into account files with a dot as first char in the name (hidden files):
[[ "$x" =~ ^(.*/|)(.[^.]*).*$ ]] && \
noext="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
Regexp explained
The first group captures everything up to and including the last / (regexps are greedy in bash), or nothing if there is no / in the string.
Then the second group captures the first character of the name (so a leading . is kept) and everything after it up to, but excluding, the next . character.
The rest of the string is not captured, as we want to get rid of it.
Finally, we concatenate the path and the stripped name.
Note
It's not clear what you want to do with files beginning with a . (hidden files). I modified the regexp to preserve that . if present, as it seemed the most reasonable thing to do. E.g.
x="/foo/bar/.myinitfile.sh"
becomes /foo/bar/.myinitfile.
If performance is not an issue, you could use something like this, for instance:
fil=$(basename "$x")
noext="$(dirname "$x")"/${fil%%.*}
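The same split into directory and base name can also be done with parameter expansion alone, without forking basename or dirname. This is only a sketch: strip_ext is a hypothetical helper name, and keeping the leading dot of hidden files is an assumption borrowed from the regexp answer above.

```bash
#!/bin/bash
# strip_ext is a hypothetical helper, not part of any answer above.
strip_ext() {
    local path=$1 dir='' base
    case $path in
        */*) dir=${path%/*}/ base=${path##*/} ;;   # split at the last slash
        *)   base=$path ;;
    esac
    if [[ $base == .* ]]; then
        # Hidden file: keep the leading dot, then cut at the next dot.
        base=${base#.}
        base=.${base%%.*}
    else
        base=${base%%.*}
    fi
    printf '%s%s\n' "$dir" "$base"
}

strip_ext ./gandalf.tar.a.b.c          # ./gandalf
strip_ext /pq.12/r/gandalf.tar.a.b.c   # /pq.12/r/gandalf
strip_ext /foo/bar/.myinitfile.sh      # /foo/bar/.myinitfile
strip_ext gandalf.tar.gz               # gandalf
```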

Search a directory for files with fixed name pattern

I want to search a directory (let's call it "testDir") for files whose names start with the letter "a", have the letter "z" in the fourth position, and have the file extension .html.
Is there any way to use grep for this? How can I search for a character at fixed index?
You can use native Bash pattern matching: a??z*.html. This pattern means exactly what you're asking for:
Start with the letter "a"
Followed by any two characters
Followed by the letter "z" (4th position)
Followed by 0 or more characters
Ending with ".html"
You can get the matching filenames with any shell tool that prints filenames when passed as arguments.
Some examples:
ls testDir/a??z*.html or echo testDir/a??z*.html. Note that these will print with the testDir/ prefix.
(cd testDir && echo a??z*.html) will print just the filenames without the testDir/ prefix.
Note that the ls command will produce an error when there are no matching files, while the echo command will print the pattern (a??z*.html).
For more details on pattern matching, see the Pattern Matching section in man bash.
If you are looking for an alternative that produces no output when there are no matches, grep will be easier to use. Note that grep uses a different pattern syntax: regular expressions.
The same pattern written in regular expressions is ^a..z.*\.html$.
This breaks down to:
^ means start of line, so ^a means to start with "a"
. is any character, precisely one
.* is 0 or more of any character
\. is a "."
$ means end of line, so html$ means to end with "html"
Here's one way to apply it to your example:
(cd testDir && ls | grep '^a..z.*\.html$')
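Staying with the shell pattern a??z*.html, bash's nullglob option is another way to get no output when nothing matches, because an unmatched pattern then expands to nothing instead of itself. The sketch below assumes bash and uses a scratch directory as a stand-in for testDir; the sample file names are made up.

```bash
#!/bin/bash
testDir=$(mktemp -d)   # stand-in for the real testDir
touch "$testDir"/abcz.html "$testDir"/a12zfoo.html "$testDir"/abz.html "$testDir"/abcz.txt

shopt -s nullglob                    # unmatched patterns expand to nothing
matches=( "$testDir"/a??z*.html )
shopt -u nullglob

for f in "${matches[@]}"; do
    printf '%s\n' "${f##*/}"         # strip the directory prefix
done
printf 'matched %d file(s)\n' "${#matches[@]}"
```

Only abcz.html and a12zfoo.html have a "z" in the fourth position and end in .html, so the array holds two entries.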
How about this:
ls -d testDir/a??z* | grep -e '\.html$'

KSH Shell script Matching file pattern

I am new to shell script. I want to iterate a directory for the below specific pattern.
Ad_sf_03041500000.dat
SF_AD_0304150.DEL
SF_AD_0404141.EXP
The number of digits should exactly match these patterns.
I am using a KSH shell script. Could you please help me iterate over only those files in a for loop?
The patterns you are looking for are
Ad_sf_{11}([[:digit:]]).dat
SF_AD_{7}([[:digit:]]).DEL
SF_AD_{7}([[:digit:]]).EXP
Note that the {n}(...) pattern, to match exactly n occurrences of the following pattern, is an extension unique to ksh (as far as I know, not even zsh provides an equivalent).
To iterate over matching files, you can use
for f in Ad_sf_{11}(\d).dat SF_AD_{7}(\d).@(DEL|EXP); do
where I've used the "pick one" operator @(...) to combine the two shorter patterns into a single pattern, and I've used \d, which ksh supports as a shorter version of [[:digit:]] when inside parentheses.
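Where the {n}(...) extension is not available, for example in bash, one possible sketch is to loop over every entry and filter with an extended regular expression, in which {11} and {7} are ordinary repetition counts. This is a different technique from the ksh patterns above, shown under the assumption that bash's [[ =~ ]] is available; the scratch directory and the two non-matching file names are made up for the demonstration.

```bash
#!/bin/bash
dir=$(mktemp -d)   # scratch directory standing in for the real one
touch "$dir"/Ad_sf_03041500000.dat "$dir"/SF_AD_0304150.DEL "$dir"/SF_AD_0404141.EXP \
      "$dir"/Ad_sf_123.dat "$dir"/SF_AD_03041500.DEL

# Keep the regex in a variable so the shell does not try to parse it.
re='^(Ad_sf_[0-9]{11}\.dat|SF_AD_[0-9]{7}\.(DEL|EXP))$'

found=()
for path in "$dir"/*; do
    f=${path##*/}                 # file name without the directory
    [[ $f =~ $re ]] || continue   # skip names with the wrong digit count
    found+=( "$f" )
    echo "Here's $f"
done
```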
Automatic wildcard generation method. Print the filenames with leading text and line numbers...
POSIX shell:
2> /dev/null find \
$(echo Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP |
sed 's/[0-9]/[0-9]/g' ) |
while read f ; do
echo "Here's $f";
done | nl
ksh (with a spot borrowed from Chepner):
set - Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP
for f in ${*//[0-9]/[0-9]} ; do [ -f "$f" ] || continue
echo "Here's $f";
done | nl
Output of either method:
1 Here's Ad_sf_03041500000.dat
2 Here's SF_AD_0304150.DEL
3 Here's SF_AD_0404141.EXP
If the line numbers aren't wanted, omit the | nl. echo can be replaced with whatever command needs to be run on the files.
How the POSIX code works. The OP spec is simple enough to churn out the correct wildcard with a little tweaking. Example:
echo Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP |
sed 's/[0-9]/[0-9]/g'
Which outputs exactly the patterns needed (line feeds added for clarity):
Ad_sf_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].dat
SF_AD_[0-9][0-9][0-9][0-9][0-9][0-9][0-9].DEL
SF_AD_[0-9][0-9][0-9][0-9][0-9][0-9][0-9].EXP
The patterns above go to find, which prints only the matching filenames (not the pattern itself when there are no matching files); the filenames then go to a while loop.
(The ksh variant is the same method but uses pattern substitution, set, and test -f in place of sed and find.)

how to count the number of lines in a variable in a shell script

Having some trouble here. I want to capture output from ls command into variable. Then later use that variable and count the number of lines in it. I've tried a few variations
This works, but then if there are NO .txt files, it says the count is 1:
testVar=`ls -1 *.txt`
count=`wc -l <<< $testVar`
echo "$count"
This works for when there are no .txt files, but the count comes up short by 1 when there are .txt files:
testVar=`ls -1 *.txt`
count=`printf "$testVar" | wc -l`
echo "$count"
This variation also says the count is 1 when NO .txt files exist:
testVar=`ls -1 *.txt`
count=`echo "$testVar" | wc -l`
echo "$count"
Edit: I should mention this is korn shell.
The correct approach is to use an array.
# Use ~(N) so that if the match fails, the array is empty instead
# of containing the pattern itself as the single entry.
testVar=( ~(N)*.txt )
count=${#testVar[@]}
This little question actually includes the result of three standard shell gotchas (both bash and korn shell):
Here-strings (<<<...) have a newline added to them if they don't end with a newline. That makes it impossible to send a completely empty input to a command with a here-string.
All trailing newlines are removed from the output of a command used in command substitution (`cmd` or preferably $(cmd)). So you have no way to know how many blank lines were at the end of the output.
(Not really a shell gotcha, but it comes up a lot). wc -l counts the number of newline characters, not the number of lines. So the last "line" is not counted if it is not terminated with a newline. (A non-empty file which does not end with a newline character is not a Posix-conformant text file. So weird results like this are not unexpected.)
So, when you do:
var=$(cmd)
utility <<<"$var"
The command substitution in the first line removes all trailing newlines, and then the here-string expansion in the second line puts exactly one trailing newline back. That converts an empty output to a single blank line, and otherwise removes blank lines from the end of the output.
So if utility is wc -l, then you will get the correct count unless the output was empty, in which case it will be 1 instead of 0.
On the other hand, with
var=$(cmd)
printf %s "$var" | utility
The trailing newline(s) are removed, as before, by the command substitution, so the printf leaves the last line (if any) unterminated. Now if utility is wc -l, you'll end up with 0 if the output was empty, but for non-empty files, the count will not include the last line of the output.
One possible shell-independent work-around is to use the second option, but with grep '' as a filter:
var=$(cmd)
printf %s "${var}" | grep '' | utility
The empty pattern '' will match every line, and grep always terminates every line of output. (Of course, this still won't count blank lines at the end of the output.)
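The three variants discussed above can be compared side by side. The sketch below assumes bash (for the here-string) and uses made-up helper names; empty stands for a command that printed nothing, and two for a command that printed two lines.

```bash
#!/bin/bash
# Hypothetical helpers, one per variant discussed above.
count_herestring() { wc -l <<<"$1"; }
count_printf()     { printf %s "$1" | wc -l; }
count_grep()       { printf %s "$1" | grep '' | wc -l; }

empty=$(printf '')            # command substitution with no output
two=$(printf 'a\nb\n')        # the trailing newline is stripped, leaving a<newline>b

for f in count_herestring count_printf count_grep; do
    printf '%-17s empty=%d two=%d\n' "$f" "$("$f" "$empty")" "$("$f" "$two")"
done
```

The here-string variant reports 1 and 2, the plain printf variant 0 and 1, and the grep-filtered variant 0 and 2, which is the only pair that matches the real line counts.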
Having said all that, it is always a bad idea to try to parse the output of ls, even just to count the number of files. (A filename might include a newline character, for example.) So it would be better to use a glob expansion combined with some shell-specific way of counting the number of objects in the glob expansion (and some other shell-specific way of detecting when no file matches the glob).
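One way to count glob matches without arrays, which should work in any POSIX-style shell, is to load the expansion into the positional parameters and check whether the first one exists. This is a sketch under that assumption; count_txt is a hypothetical helper and the scratch directories exist only for the demonstration.

```bash
#!/bin/sh
# count_txt is a hypothetical helper; it prints how many *.txt files a directory holds.
# The body runs in a subshell, so the cd and set do not leak into the caller.
count_txt() (
    cd "$1" || exit 1
    set -- *.txt                      # the matches become $1, $2, ...
    if [ -e "$1" ] || [ -L "$1" ]; then
        echo "$#"
    else
        echo 0                        # the pattern stayed unexpanded: no matches
    fi
)

demo=$(mktemp -d)
mkdir "$demo/empty" "$demo/full"
touch "$demo/full/a.txt" "$demo/full/b c.txt"

count_txt "$demo/empty"   # 0
count_txt "$demo/full"    # 2
```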
I was going to suggest this, which is a construct I've used in bash:
f=($(</path/to/file))
echo ${#f[@]}
To handle multiple files, you'd just add files:
f=($(</path/to/file))
f+=($(</path/to/otherfile))
or
f=($(</path/to/file) $(</path/to/otherfile))
To handle lots of files, you could loop:
f=()
for file in *.txt; do
f+=($(<"$file"))
done
Then I saw chepner's response, which I gather is more korn-y than mine.
NOTE: loops are better than parsing ls.
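One caveat with f=($(<file)) is that the shell splits on any whitespace, so the array length is a word count rather than a line count. A sketch of the difference, assuming bash 4 or newer for mapfile; the temporary file and its contents are made up.

```bash
#!/bin/bash
tmp=$(mktemp)
printf 'one two\nthree\n' > "$tmp"

words=( $(<"$tmp") )        # word splitting: one, two, three
mapfile -t lines < "$tmp"   # one array element per line (bash 4 or newer)

echo "words: ${#words[@]}"  # words: 3
echo "lines: ${#lines[@]}"  # lines: 2
rm -f "$tmp"
```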
You can also do it like this:
#!/bin/bash
testVar=`ls -1 *.txt`
if [ -z "$testVar" ]; then
# Empty
count=0
else
# Not Empty
count=`wc -l <<< "$testVar"`
fi
echo "Count : $count"

How to escape a previously unknown string in regular expression?

I need to egrep a string that isn't known before runtime and that I'll get via shell variable (shell is bash, if that matters). Problem is, that string will contain special characters like braces, spaces, dots, slashes, and so on.
If I know the string I can escape the special characters one at a time, but how can I do that for the whole string?
Running the string through a sed script to prefix each special character with \ could be an idea, I still need to rtfm how such a script should be written. I don't know if there are other, better, options.
I did read re_format(7) but it seems there is no such thing like "take the whole next string as literal"...
EDIT: to avoid false positives, I should also add newline detection to the pattern, eg. egrep '^myunknownstring'
If you need to embed the string into a larger expression, sed is how I would do it.
s_esc="$(echo "$s" | sed 's/[^-A-Za-z0-9_]/\\&/g')" # backslash special characters
inv_ent="$(egrep "^item [0-9]+ desc $s_esc loc .+$" inventory_list)"
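A narrower variation is to escape only the characters that are special in an extended regular expression, since a backslash in front of an ordinary character is not guaranteed to mean that character. The sketch below is an assumption-laden example rather than the answer's own code: the sample string and the two inventory lines are made up, and the character set covers ERE metacharacters outside bracket expressions only.

```bash
#!/bin/bash
s='(.*+[a-z]){3}'   # sample string containing several metacharacters

# Prefix each ERE metacharacter with a backslash; everything else is left alone.
s_esc=$(printf '%s\n' "$s" | sed 's/[\.[()*+?{|^$]/\\&/g')

hits=$(printf '%s\n' \
        'item 7 desc (.*+[a-z]){3} loc shelf' \
        'item 8 desc aaa loc bin' |
    grep -E "^item [0-9]+ desc $s_esc loc .+$")

printf '%s\n' "$s_esc"   # the escaped pattern
printf '%s\n' "$hits"    # only the first inventory line
```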
Use the -F flag to make the PATTERN a fixed literal string
$ var="(.*+[a-z]){3}"
$ echo 'foo bar (.*+[a-z]){3} baz' | grep -F "$var" -o
(.*+[a-z]){3}
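-F cannot be combined with ^, so anchoring needs another route. Two possibilities, sketched under the assumption that either a whole-line match or a literal prefix match is what the question's edit is after: grep -Fx requires the entire line to equal the string, and a quoted expansion in a shell case pattern gives a literal "starts with" test. starts_with is a hypothetical helper name, and the sample lines are made up.

```bash
#!/bin/bash
s='(.*+[a-z]){3}'
input=$(printf '%s\n' \
    '(.*+[a-z]){3} at line start' \
    'foo (.*+[a-z]){3} in the middle' \
    '(.*+[a-z]){3}')

# -x together with -F: the whole line must equal the literal string.
whole=$(printf '%s\n' "$input" | grep -Fx -- "$s")

# Hypothetical helper: print the stdin lines that begin with the literal string $1.
starts_with() {
    local line
    while IFS= read -r line; do
        case $line in
            "$1"*) printf '%s\n' "$line" ;;   # quoted "$1" is matched literally
        esac
    done
}
prefix=$(printf '%s\n' "$input" | starts_with "$s")

printf 'whole-line match:\n%s\n' "$whole"
printf 'prefix matches:\n%s\n' "$prefix"
```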
Are you trying to protect the string from being incorrectly interpreted as bash syntax or are you trying to protect parts of the string from being interpreted as regular expression syntax?
For bash protection:
grep supports the -f switch:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
No escaping is necessary inside the file. Just make it a file containing a single line (and thus one pattern) which can be produced from your shell variable if that's what you need to do.
# example trivial regex
var='^r[^{]*$'
pattern=/tmp/pattern.$$
rm -f "$pattern"
echo "$var" > "$pattern"
egrep -f "$pattern" /etc/passwd
rm -f "$pattern"
Just to illustrate the point.
Try it with -F instead as another poster suggested for regex protection.
