KSH Shell script Matching file pattern - shell

I am new to shell script. I want to iterate a directory for the below specific pattern.
Ad_sf_03041500000.dat
SF_AD_0304150.DEL
SF_AD_0404141.EXP
The number of digits must match these patterns exactly.
I am using a KSH shell script. Could you please help me iterate over only those files in a for loop?

The patterns you are looking for are
Ad_sf_{11}([[:digit:]]).dat
SF_AD_{7}([[:digit:]]).DEL
SF_AD_{7}([[:digit:]]).EXP
Note that the {n}(...) pattern, to match exactly n occurrences of the following pattern, is an extension unique to ksh (as far as I know, not even zsh provides an equivalent).
To iterate over matching files, you can use
for f in Ad_sf_{11}(\d).dat SF_AD_{7}(\d).@(DEL|EXP); do
where I've used the "pick one" operator @(...) to combine the two shorter patterns into a single pattern, and I've used \d, which ksh supports as a shorter version of [[:digit:]] when inside parentheses.
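For shells that lack {n}(...), the same fixed-width matching can be sketched with an ordinary case statement. This is only an illustration: is_wanted is a hypothetical helper name, and the digit counts (11 and 7) are taken from the sample file names in the question.

```shell
# Hypothetical helper: succeed only if "$1" has one of the three exact shapes.
is_wanted() {
  d='[0-9]'
  d7=$d$d$d$d$d$d$d        # seven digits
  d11=$d7$d$d$d$d          # eleven digits
  case $1 in
    Ad_sf_$d11.dat)                return 0 ;;
    SF_AD_$d7.DEL | SF_AD_$d7.EXP) return 0 ;;
    *)                             return 1 ;;
  esac
}

# Walk the current directory and keep only the matching names.
for f in *; do
  is_wanted "$f" || continue
  echo "Here's $f"
done
```

Because the variables are expanded unquoted inside the case patterns, their contents are treated as pattern text, so each [0-9] matches exactly one digit.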

Automatic wildcard generation method. Print the filenames with leading text and line numbers...
POSIX shell:
2> /dev/null find \
$(echo Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP |
sed 's/[0-9]/[0-9]/g' ) |
while read f ; do
echo "Here's $f";
done | nl
ksh (with a spot borrowed from Chepner):
set - Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP
for f in ${*//[0-9]/[0-9]} ; do [ -f "$f" ] || continue
echo "Here's $f";
done | nl
Output of either method:
1 Here's Ad_sf_03041500000.dat
2 Here's SF_AD_0304150.DEL
3 Here's SF_AD_0404141.EXP
If the line numbers aren't wanted, omit the | nl. echo can be replaced with whatever command needs to be run on the files.
How the POSIX code works. The OP spec is simple enough to churn out the correct wildcard with a little tweaking. Example:
echo Ad_sf_03041500000.dat SF_AD_0304150.DEL SF_AD_0404141.EXP |
sed 's/[0-9]/[0-9]/g'
Which outputs exactly the patterns needed (line feeds added for clarity):
Ad_sf_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].dat
SF_AD_[0-9][0-9][0-9][0-9][0-9][0-9][0-9].DEL
SF_AD_[0-9][0-9][0-9][0-9][0-9][0-9][0-9].EXP
The patterns above go to find, which prints only the matching filenames, (not the pattern itself when there are no files), then the filenames go to a while loop.
(The ksh variant is the same method but uses pattern substitution, set, and test -f in place of sed and find.)


Extracting part of a string bounded by special symbols

Hello I am passing strings for example /bin/bash/Xorg.tar.gz to my script which is
for i in $*; do
echo "$(expr match "$i" '\.*\.')"
done
I expect to return Xorg only but it returns 0,any ideas why?
It seems weird that your string would be /bin/bash/Xorg.tar.gz (kinda looks like /bin/bash is a directory or something) but either way, you can use standard parameter expansion to get the part you want:
i=${i##*/}
i=${i%%.*}
First remove everything up to the last /, then remove everything from the first . onward.
expr match anchors the regex at the start of the string; it does not search for a match elsewhere in the input.
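As a small sketch of those two expansions applied to each argument (stem is a hypothetical helper name, not something from the question):

```shell
# Hypothetical helper: print the file name up to its first dot.
stem() {
  name=${1##*/}     # drop the longest prefix ending in "/"  -> Xorg.tar.gz
  name=${name%%.*}  # drop the longest suffix starting at "." -> Xorg
  printf '%s\n' "$name"
}

# Quoting "$@" keeps arguments containing spaces intact.
for i in "$@"; do
  stem "$i"
done
```

Called as stem /bin/bash/Xorg.tar.gz, this prints Xorg without starting any external process.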
However, you can use builtin BASH regex for this:
[[ "$i" =~ .*/([^./]+)\. ]] && echo "${BASH_REMATCH[1]}"
This will print Xorg for your example argument.
The immediate fix (leaving the loop aside):
$ expr '/path/to/Xorg.tar.gz' : '.*/\([^.]*\)'
Xorg
Note:
: is needed after the input string to signal a regex-matching operation.
Note: expr <string> : <regex> is the POSIX-compliant syntax; GNU expr also accepts expr match <string> <regex>, as in your attempt.
expr implicitly matches from the start of the string, so .*/ must be used to match everything up to the last /
\([^.]*\) is used to match everything up to, but not including, the first . of the filename component; note the \-escaping of the ( and ) (the capture group delimiters), which is needed, because expr only supports (the obsolescent and limited) BREs.
Using a capture group ensures that the matched string is output, whereas by default the count of matching chars. is output.
As for the regex you used:
'\.*\.': \.* matches any (possibly empty) sequence (*) of literal . chars. (\.), implicitly at the start of the string, followed by exactly 1 literal . (\.).
In other words: you tried to match 1 or more consecutive . chars. at the start of the string, which is obviously not what you intended.
Because your regex doesn't contain a capture group, expr outputs the count of matching characters, which in this case is 0, since nothing matches.
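The difference between the count and the captured text can be seen side by side in this short sketch, which reuses the sample path from above:

```shell
path=/path/to/Xorg.tar.gz

# No capture group: expr prints the number of characters matched.
expr "$path" : '.*/'            # prints 9, the length of "/path/to/"

# The original pattern cannot match at the start of this string.
expr "$path" : '\.*\.' || true  # prints 0; expr also exits non-zero here

# With a capture group, expr prints the captured text instead.
expr "$path" : '.*/\([^.]*\)'   # prints Xorg
```

The || true is only there because expr returns a non-zero exit status when the result is 0 or empty, which matters in scripts run with set -e.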
That said, calling an external utility in every iteration of a shell loop is inefficient, so consider:
Tom Fenech's helpful answer, which only uses shell parameter expansions.
anubhava's helpful answer, which only uses Bash's built-in regex-matching operator, =~
If you don't actually need a shell loop and are fine with processing all paths with a single command using external utilities, consider this:
basename -a "$@" | cut -d'.' -f1
Note: basename -a, for processing multiple filename operands, is nonstandard, but both GNU and BSD/macOS basename support it.
To demonstrate it in action:
# Set positional parameters with `set`.
$ set -- '/path/to/Xorg.tar.gz' '/path/to/another/File.with.multiple.suffixes'
$ basename -a "$@" | cut -d'.' -f1
Xorg
File

Using egrep and regular expression together

I want to search the below text file for words that ends in _letter, and get the whole portion upto "::". There is no space between any letter
blahblah:/blahblah::abc_letter:/blahblah/blahblah
blahblah:/blahblah::cd_123_letter:/blahblah/blahblah
blahblah:::/blahblah::24_cde_letter:/blahblah/blahblah
blahblah::/blahblah::45a6_letter:/blahblah/blahblah
blahblah:/blahblah::fgh_letter:/blahblah/blahblah
blahblah:/blahblah::789_letter:/blahblah/blahblah
I tried
egrep -o '*_letter'
and
egrep -o "*_letter"
But it only returns the word _letter
then I want to feed the input to the parametre of a shell script for loop. So the script will look like following
for i in [grep command]
mkdir $i
end
It will create the following directories
abc_letter/
cd_123_letter/
24_cde_letter/
45a6_letter/
fgh_letter/
789_letter/
ps: The result between :: and _letter doesn't contain any special character, only alphanumeric character
also my system doesn't have perl
Assuming no spaces or new-lines:
for i in $(sed 's/^.*:\([^/]*_letter\):.*$/\1/g' infile); do
mkdir $i
done
To extract the strings that run from after a : up to _letter from file.txt and use them in your for loop, you can use the following egrep command and revise your script.sh like this:
#!/bin/bash
for i in $(egrep -o "[^:]+_letter" file.txt); do
mkdir -p $i
done
Then you run ./script.sh, and later you check with ls, you see:
$ ls -1
24_cde_letter
45a6_letter
789_letter
abc_letter
cd_123_letter
fgh_letter
file.txt
script.sh
Explanation
Your original egrep -o '*_letter' probably confused shell filename expansion (globbing) with regular expressions.
In a glob, * on its own matches any string, so *something means "anything, followed by something".
In a regular expression, however, * means "the preceding item, zero or more times". Since * is at the beginning of what you wrote, there is nothing before it to repeat, so it does not match anything there.
The only thing egrep can then match is _letter, and since the -o option displays only the matched text, one match per line, you originally saw nothing but lines of _letter.
Our new changes:
egrep pattern starts with [^ ... ], a negation, matches the opposite of what characters you put within. We put : within.
The + says to match the preceding one or more times.
So combined, it says look for anything-but-:, and do this one or more times.
Thus of course it matches anything after :, and keeps matching, until the next part of the pattern
The next part of the pattern is just _letter
egrep -o so only matched text will be shown, one per line
So in this way, from lines such as:
blahblah:/blahblah::abc_letter:/blahblah/blahblah
It successfully extracts:
abc_letter
Then, changes to your bash script:
Bash command substitution $() to have the results of the egrep command sent to the for-loop
for i value...; do ... done syntax
mkdir -p just a convenience in case you are re-testing, it will not error if directory was already made.
So altogether it helps to extract the pattern you wanted and generate directories with those names.
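A slightly more defensive variant of the same idea is sketched below. It reads the matches line by line instead of relying on word splitting in for i in $(...), and it spells egrep as grep -E, the standard form. The names extract_names and make_dirs and the target-directory argument are assumptions made for illustration.

```shell
# Hypothetical helper: print each "<name>_letter" token found in file "$1".
extract_names() {
  grep -E -o '[^:]+_letter' "$1"
}

# Create one directory per extracted name under directory "$2".
make_dirs() {
  extract_names "$1" | while IFS= read -r name; do
    mkdir -p -- "$2/$name"
  done
}
```

For example, make_dirs file.txt . creates the directories in the current directory, as the original script does.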

how to count the number of lines in a variable in a shell script

Having some trouble here. I want to capture output from ls command into variable. Then later use that variable and count the number of lines in it. I've tried a few variations
This works, but then if there are NO .txt files, it says the count is 1:
testVar=`ls -1 *.txt`
count=`wc -l <<< $testVar`
echo '$count'
This works for when there are no .txt files, but the count comes up short by 1 when there are .txt files:
testVar=`ls -1 *.txt`
count=`printf '$testVar' | wc -l`
echo '$count'
This variation also says the count is 1 when NO .txt files exist:
testVar=`ls -1 *.txt`
count=`echo '$testVar' | wc -l`
echo '$count'
Edit: I should mention this is korn shell.
The correct approach is to use an array.
# Use ~(N) so that if the match fails, the array is empty instead
# of containing the pattern itself as the single entry.
testVar=( ~(N)*.txt )
count=${#testVar[@]}
This little question actually includes the result of three standard shell gotchas (both bash and korn shell):
Here-strings (<<<...) have a newline added to them if they don't end with a newline. That makes it impossible to send a completely empty input to a command with a here-string.
All trailing newlines are removed from the output of a command used in command substitution (`cmd` or preferably $(cmd)). So you have no way to know how many blank lines were at the end of the output.
(Not really a shell gotcha, but it comes up a lot). wc -l counts the number of newline characters, not the number of lines. So the last "line" is not counted if it is not terminated with a newline. (A non-empty file which does not end with a newline character is not a Posix-conformant text file. So weird results like this are not unexpected.)
So, when you do:
var=$(cmd)
utility <<<"$var"
The command substitution in the first line removes all trailing newlines, and then the here-string expansion in the second line puts exactly one trailing newline back. That converts an empty output to a single blank line, and otherwise removes blank lines from the end of the output.
So if utility is wc -l, then you will get the correct count unless the output was empty, in which case it will be 1 instead of 0.
On the other hand, with
var=$(cmd)
printf %s "$var" | utility
The trailing newline(s) are removed, as before, by the command substitution, so the printf leaves the last line (if any) unterminated. Now if utility is wc -l, you'll end up with 0 if the output was empty, but for non-empty files, the count will not include the last line of the output.
One possible shell-independent work-around is to use the second option, but with grep '' as a filter:
var=$(cmd)
printf %s "${var}" | grep '' | utility
The empty pattern '' will match every line, and grep always terminates every line of output. (Of course, this still won't count blank lines at the end of the output.)
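As a compact sketch of that workaround: grep -c counts matching lines, so it can stand in for grep '' | wc -l. The function name count_lines is hypothetical.

```shell
# Hypothetical helper: count the lines held in the string "$1".
# An unterminated last line is counted; an empty string yields 0.
count_lines() {
  # grep exits non-zero when it counts 0 lines, hence the "|| true".
  printf %s "$1" | grep -c '' || true
}

var=$(printf 'a.txt\nb.txt\n')   # command substitution strips the final newline
count_lines "$var"               # prints 2
count_lines ""                   # prints 0
```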
Having said all that, it is always a bad idea to try to parse the output of ls, even just to count the number of files. (A filename might include a newline character, for example.) So it would be better to use a glob expansion combined with some shell-specific way of counting the number of objects in the glob expansion (and some other shell-specific way of detecting when no file matches the glob).
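One portable way to sketch the glob-based count is to load the expansion into the positional parameters inside a function. This is an illustration rather than the only approach; count_txt is a hypothetical name, and the directory argument defaults to the current directory.

```shell
# Hypothetical helper: print how many *.txt entries directory "$1" contains.
count_txt() {
  set -- "${1:-.}"/*.txt
  # An unmatched glob is left as the literal pattern; discard it in that case.
  [ -e "$1" ] || [ -L "$1" ] || set --
  echo "$#"
}
```

Because the names never pass through ls or a pipe, file names containing spaces or newlines are still counted once each.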
I was going to suggest this, which is a construct I've used in bash:
f=($(</path/to/file))
echo ${#f[@]}
To handle multiple files, you'd just .. add files.
f=($(</path/to/file))
f+=($(</path/to/otherfile))
or
f=($(</path/to/file) $(</path/to/otherfile))
To handle lots of files, you could loop:
f=()
for file in *.txt; do
f+=($(<$file))
done
Then I saw chepner's response, which I gather is more korn-y than mine.
NOTE: loops are better than parsing ls.
You can also use like this:
#!/bin/bash
testVar=`ls -1 *.txt`
if [ -z "$testVar" ]; then
# Empty
count=0
else
# Not Empty
count=`wc -l <<< "$testVar"`
fi
echo "Count : $count"

In Bash, how to strip out all numbers in the file names in a directory while leaving the file extension intact

I have files in a directory like this:
asdfs54345gsdf.pdf
gsdf6456wer.pdf
oirt4534724wefd.pdf
I want to rename all the files to just the numbers + .pdf so the above files would be renamed to:
54345.pdf
6456.pdf
4534724.pdf
The best would be a native Bash command or script (OSX 10.6.8)
Some clues I picked up include
sed 's/[^0-9]*//g' input.txt
sed 's/[^0-9]*//g' input.txt > output.txt
sed -i 's/[^0-9]*//g' input.txt
echo ${A//[0-9]/} rename 's/[0-9] //' *.pdf
This should do it:
for f in *.pdf
do
mv "$f" "${f//[^0-9]/}.pdf"
done
but you had better do a dry run first:
for f in *.pdf
do
echo mv "$f" "${f//[^0-9]/}.pdf"
done
Note that abc4.pdf and zzz4.pdf will both be renamed to 4.pdf, so you may want to use mv -i instead of plain mv.
Update, explaining:
I guess the first part is clear; *.pdf is called globbing, and matches all files ending with .pdf. for f in ... just iterates over them, setting f to one of them each time.
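A sketch of a collision-safe variant follows. It uses tr -cd '0-9' to strip the non-digits so that it also runs in a plain POSIX shell; in Bash, ${f//[^0-9]/} does the same job. The function names and the directory argument are assumptions made for illustration.

```shell
# Hypothetical helper: print the digits-only name for one .pdf file name.
digits_name() {
  printf '%s.pdf\n' "$(printf '%s' "${1%.pdf}" | tr -cd '0-9')"
}

# Rename every .pdf in directory "$1", refusing to overwrite an existing target.
rename_to_digits() {
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue                 # glob matched nothing
    target="$1/$(digits_name "${f##*/}")"
    if [ "$f" = "$target" ]; then
      continue                              # already digits-only
    elif [ -e "$target" ]; then
      echo "skip: $f -> $target (target exists)" >&2
    else
      mv -- "$f" "$target"
    fi
  done
}
```

With abc4.pdf and zzz4.pdf in the same directory, the first is renamed to 4.pdf and the second is reported and left alone.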
for f in *.pdf
do
mv "$f" "${f//[^0-9]/}.pdf"
done
I guess
mv source target
is clear as well. If a file is named "Unnamed File1", you need to mask it with quotes, because else mv will read
mv Unnamed File1 1.pdf
which means, it has multiple files to move, Unnamed and File1, and will interpret 1.pdf to be a directory to move both files to.
Okay, I guess the real issue is here:
"${f//[^0-9]/}.pdf"
The outer part is simple concatenation of characters. Let
foo=bar
be some variable assignment. Then
$foo
${foo}
"$foo"
"${foo}"
are four legitimate ways to refer to it. The last two quote the value to protect blanks and the like, so in some cases this makes no difference and in others it does.
If we glue something together
$foo4
${foo}4
"$foo"4
"${foo}"4
the first form will not work - the shell will look for a variable foo4. All other 3 expressions refer to bar4 - first $foo is interpreted as bar, and then 4 is appended. For some characters the masking is not needed:
$foo/fool
${foo}/fool
"$foo"/fool
"${foo}"/fool
will all be interpreted in the same way. So whatever "${f//[^0-9]/}" is, "${f//[^0-9]/}.pdf" is ".pdf" appended to it.
Now we reach the heart of the mystery:
${f//[^0-9]/}
This is a substitution expression of the form
${variable//pattern/replacement}
variable is f, the $f from above (the $ is omitted inside the braces). That was easy!
replacement is empty - that was even easier.
But [^0-9] is something really complicated, isn't it?
[0-9]
is just the group of all digits from 0 to 9, other groups could be:
[0-4] digits below 5
[02468] even digits
[a-z] lower case letters
[a-zA-Z] all (common Latin) letters
[;:,/] semicolon, colon, comma, slash
The Caret ^ in front as first character is the negation of the group:
[^0-9]
means everything except 0 to 9 (including dot, comma, colon, ...) is in the group. Together:
${f//[^0-9]/}
remove all non-digits from $f, and
"${f//[^0-9]/}.pdf"
append .pdf - the whole thing quoted.
${v//p/r}
and its friends (there are many useful ones) are explained in man bash in the section Parameter Expansion. For the character groups I don't have a source for further reading at hand.

How do you escape a user-provided search term that you don't want evaluated for sed?

I'm trying to escape a user-provided search string that can contain any arbitrary character and give it to sed, but can't figure out how to make it safe for sed to use. In sed, we do s/search/replace/, and I want to search for exactly the characters in the search string without sed interpreting them (e.g., the '/' in 'my/path' would not close the sed expression).
I read this related question concerning how to escape the replace term. I would have thought you'd do the same thing to the search, but apparently not because sed complains.
Here's a sample program that creates a file called "my_searches". Then it reads each line of that file and performs a search and replace using sed.
#!/bin/bash
# The contents of this heredoc will be the lines of our file.
read -d '' SAMPLES << 'EOF'
/usr/include
P@$$W0RD$?
"I didn't", said Jane O'Brien.
`ls -l`
~!@#$%^&*()_+-=:'}{[]/.,`"\|
EOF
echo "$SAMPLES" > my_searches
# Now for each line in the file, do some search and replace
while read line
do
echo "------===[ BEGIN $line ]===------"
# Escape every character in $line (e.g., ab/c becomes \a\b\/\c). I got
# this solution from the accepted answer in the linked SO question.
ES=$(echo "$line" | awk '{gsub(".", "\\\\&");print}')
# Search for the line we read from the file and replace it with
# the text "replaced"
sed 's/'"$ES"'/replaced/' < my_searches # Does not work
# Search for the text "Jane" and replace it with the line we read.
sed 's/Jane/'"$ES"'/' < my_searches # Works
# Search for the line we read and replace it with itself.
sed 's/'"$ES"'/'"$ES"'/' < my_searches # Does not work
echo "------===[ END ]===------"
echo
done < my_searches
When you run the program, you get sed: xregcomp: Invalid content of \{\} for the last line of the file when it's used as the 'search' term, but not the 'replace' term. I've marked the lines that give this error with # Does not work above.
------===[ BEGIN ~!@#$%^&*()_+-=:'}{[]/.,`"| ]===------
sed: xregcomp: Invalid content of \{\}
------===[ END ]===------
If you don't escape the characters in $line (i.e., sed 's/'"$line"'/replaced/' < my_searches), you get this error instead because sed tries to interpret various characters:
------===[ BEGIN ~!@#$%^&*()_+-=:'}{[]/.,`"| ]===------
sed: bad format in substitution expression
sed: No previous regexp.
------===[ END ]===------
So how do I escape the search term for sed so that the user can provide any arbitrary text to search for? Or more precisely, what can I replace the ES= line in my code with so that the sed command works for arbitrary text from a file?
I'm using sed because I'm limited to a subset of utilities included in busybox. Although I can use another method (like a C program), it'd be nice to know for sure whether or not there's a solution to this problem.
This is a relatively famous problem—given a string, produce a pattern that matches only that string. It is easier in some languages than others, and sed is one of the annoying ones. My advice would be to avoid sed and to write a custom program in some other language.
You could write a custom C program, using the standard library function strstr. If this is not fast enough, you could use any of the Boyer-Moore string matchers you can find with Google—they will make search extremely fast (sublinear time).
You could write this easily enough in Lua:
local function quote(s) return (s:gsub('%W', '%%%1')) end
local function replace(first, second, s)
return (s:gsub(quote(first), second))
end
for l in io.lines() do io.write(replace(arg[1], arg[2], l), '\n') end
If that is not fast enough, speed things up by applying quote to arg[1] only once and inlining the function replace.
As ghostdog mentioned, awk '{gsub(".", "\\\\&");print}' is incorrect because it escapes out non-special characters. What you really want to do is perhaps something like:
awk 'gsub(/[^[:alpha:]]/, "\\\\&")'
This will escape out non-alpha characters. For some reason I have yet to determine, I still can't replace "I didn't", said Jane O'Brien. even though my code above correctly escapes it to
\"I\ didn\'t\"\,\ said\ Jane\ O\'Brien\.
It's quite odd because this works perfectly fine
$ echo "\"I didn't\", said Jane O'Brien." | sed s/\"I\ didn\'t\"\,\ said\ Jane\ O\'Brien\./replaced/
replaced
The command echo "$line" | awk '{gsub(".", "\\\\&");print}' escapes every character in $line, which is wrong. If you echo $ES after that, it turns out to be \/\u\s\r\/\i\n\c\l\u\d\e. Then when you pass it to the next sed (below)
sed 's/'"$ES"'/replaced/' my_searches
, it will not work because there is no line that has pattern \/\u\s\r\/\i\n\c\l\u\d\e. The correct way is something like:
$ sed 's|\([#$@^&*!~+-={}/]\)|\\\1|g' file
\/usr\/include
P\@\$\$W0RD\$?
"I didn't", said Jane O'Brien.
\`ls -l\`
\~\!\@\#\$%\^\&\*()_\+-\=:'\}\{[]\/.,\`"\|
You put all the characters you want escaped inside [], and choose a suitable delimiter for sed that is not in your character class; e.g. I chose "|". Then use the "g" (global) flag.
Also, tell us what you are actually trying to do, i.e. the actual problem you are trying to solve.
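The error in the question comes from escaping everything: in a basic regular expression \{ and \} are interval operators, so putting a backslash in front of a literal brace changes its meaning. Below is a minimal sketch of escaping only what is special. It assumes the final command is s/search/replace/ with / as the delimiter, basic (not extended) regular expressions, and text without embedded newlines; escape_bre and escape_repl are hypothetical names.

```shell
# Hypothetical helper: make "$1" safe as the search side of s/.../.../ (BRE).
# Escapes ] [ \ . * ^ $ and the / delimiter; everything else is literal in a BRE.
escape_bre() {
  printf '%s\n' "$1" | sed 's,[][\.*^$/],\\&,g'
}

# Hypothetical helper: make "$1" safe as the replacement side of s/.../.../.
# Escapes \ and & plus the / delimiter.
escape_repl() {
  printf '%s\n' "$1" | sed 's,[\/&],\\&,g'
}

# Usage sketch: replace each line's first literal occurrence of $search.
search='/usr/include'
repl='a&b\c'
printf '%s\n' 'path=/usr/include;' |
  sed "s/$(escape_bre "$search")/$(escape_repl "$repl")/"
```

The helper commands use , as their own delimiter so that / can sit inside the bracket expression without ending the pattern early.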
This seems to work for FreeBSD sed:
# using FreeBSD & Mac OS X sed
ES="$(printf "%q" "${line}")"
ES="${ES//+/\\+}"
sed -E s$'\777'"${ES}"$'\777'replaced$'\777' < my_searches
sed -E s$'\777'Jane$'\777'"${line}"$'\777' < my_searches
sed -E s$'\777'"${ES}"$'\777'"${line}"$'\777' < my_searches
The -E option of FreeBSD sed is used to turn on extended regular expressions.
The same is available for GNU sed via the -r or --regexp-extended options respectively.
For the differences between basic and extended regular expressions see, for example:
http://www.gnu.org/software/sed/manual/sed.html#Extended-regexps
Maybe you can use FreeBSD-compatible minised instead of GNU sed?
# example using FreeBSD-compatible minised,
# http://www.exactcode.de/site/open_source/minised/
# escape some punctuation characters with printf
help printf
printf "%s\n" '!"#$%&'"'"'()*+,-./:;<=>?@[\]^_`{|}~'
printf "%q\n" '!"#$%&'"'"'()*+,-./:;<=>?@[\]^_`{|}~'
# example line
line='!"#$%&'"'"'()*+,-./:;<=>?@[\]^_`{|}~ ... and Jane ...'
# escapes in regular expression
ES="$(printf "%q" "${line}")" # escape some punctuation characters
ES="${ES//./\\.}" # . -> \.
ES="${ES//\\\\(/(}" # \( -> (
ES="${ES//\\\\)/)}" # \) -> )
# escapes in replacement string
lineEscaped="${line//&/\&}" # & -> \&
minised s$'\777'"${ES}"$'\777'REPLACED$'\777' <<< "${line}"
minised s$'\777'Jane$'\777'"${lineEscaped}"$'\777' <<< "${line}"
minised s$'\777'"${ES}"$'\777'"${lineEscaped}"$'\777' <<< "${line}"
To avoid potential backslash confusion, we could (or rather should) use a backslash variable like so:
backSlash='\\'
ES="${ES//${backSlash}(/(}" # \( -> (
ES="${ES//${backSlash})/)}" # \) -> )
(By the way using variables in such a way seems like a good approach for tackling parameter expansion issues ...)
... or to complete the backslash confusion ...
backSlash='\\'
lineEscaped="${line//${backSlash}/${backSlash}}" # double backslashes
lineEscaped="${lineEscaped//&/\&}" # & -> \&
If you have bash, and you're just doing a pattern replacement, just do it natively in bash. The ${parameter/pattern/string} expansion in Bash will work very well for you, since you can just use a variable in place of the "pattern" and replacement "string" and the variable's contents will be safe from word expansion. And it's that word expansion which makes piping to sed such a hassle. :)
It'll be faster than forking a child process and piping to sed anyway. You already know how to do the whole while read line thing, so creatively applying the capabilities in Bash's existing parameter expansion documentation can help you reproduce pretty much anything you can do with sed. Check out the bash man page to start...
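A minimal Bash-only sketch of that idea follows. One caveat worth stating: an unquoted variable in the pattern position is still interpreted as a glob pattern, so the sketch quotes it to force a literal match, and it quotes the replacement as well so that & stays literal in newer Bash versions. replace_first is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: replace the first literal occurrence of "$2" in "$1" with "$3".
replace_first() {
  local line=$1 search=$2 repl=$3 out
  # Quoting "$search" disables glob matching; quoting "$repl" keeps & literal.
  out=${line/"$search"/"$repl"}
  printf '%s\n' "$out"
}

# Same loop shape as in the question, without calling sed.
while IFS= read -r line; do
  replace_first "$line" 'Jane' 'replaced'
done <<'EOF'
"I didn't", said Jane O'Brien.
EOF
```

Using ${line//"$search"/"$repl"} instead would replace every occurrence rather than only the first.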
