Shell: extract words matching pattern, but ignore circumventing expression - shell

I am currently trying to extract ALL matching expressions from a text which e.g. looks like this and put them into an array.
aaaaaaaaa${bbbbbbb}ccccccc${dddd}eeeee
ssssssssssssssssss${TTTTTT}efhsekfh ej
348653jlk3jß1094utß43t59ßgöelfl,-s-fko
The matching expressions are similar to this: ${}. Beware that I need the full expression, not only the word in between this expression! So in this case the result should be an array which contains:
${bbbbbbb}
${dddd}
${TTTTTTT}
Problems I have stumbled upon and couldn't solve:
It should NOT recognizes this as a whole
${bbbbbbb}ccccccc${dddd} but each for its own
grep -o is not installed on the old machine, Perl is not allowed either!
Many commands e.g. BASH_REMATCH only deliver the whole line or the first occurrence of the expression, instead of all matching expressions in the line!
The mentioned pattern \${[^}]*} seems to work partly, as it can extract the first occurrence of the expression, however it always omitts the ones following after that, if it's in the same text line. What I need is ALL matching expressions found in the line, not only the first one.

You could split the string on any of the characters $,{,}:
$ s='...blaaaaa${blabla}bloooo${bla}bluuuuu...'
$ echo "$s"
...blaaaaa${blabla}bloooo${bla}bluuuuu...
$ IFS='${}' read -ra words <<< "$s"
$ for ((i=0; i<${#words[#]}; i++)); do printf "%d %s\n" $i "${words[i]}"; done
0 ...blaaaaa
1
2 blabla
3 bloooo
4
5 bla
6 bluuuuu...
So if you're trying to extract the words inside the braces:
$ for ((i=2; i<${#words[#]}; i+=3)); do printf "%d %s\n" $i "${words[i]}"; done
2 blabla
5 bla
If the above doesn't suit you, grep will work:
$ echo '...blaaaaa${blabla}bloooo${bla}bluuuuu...' | grep -o '\${[^}]\+}'
${blabla}
${bla}
You still haven't told us exactly what output you want.

Since it bugged me a lot I have asked directly on www.unix.com and was kindly provided with a solution which fits for my ancient shell. So if anyone got the same problem here is the solution:
line='aaaa$aa{yyy}aaa${important}xxxxxxxx${important2}oo{o$}oo$oo${importantstring3}'
IFS=\$ read -a words <<< "$line"
regex='^(\{[^}]+})'
for e in "${words[#]}"; do
if [[ $e =~ $regex ]]; then
echo "\$${BASH_REMATCH[0]}";
fi;
done
which prints then the following - without even getting disturbed by random occurrences of $ and { or } between the syntactically correct expressions:
${important}
${important2}
${importantstring3}
I have updated the full solution after I got another update from the forums: now it also ignores this: aaa$aa{yyy}aaaa - which it previously printed as ${yyy} - but which it should completely ignore as there are characters between $ and {. Now with the additional anchoring on the beginning of the regexp it works as expected.
I just found another issue: theoretically using the above approach I would still get a wrong output if the read line looks like this line='{ccc}aaaa${important}aaa'. The IFS would split it and the REGEX would match {ccc} although this hadn't the $ sign in front. This is suboptimal.
However following approach could solve it: after getting the BASH_REMATCH I would need to do a search in the original line - the one I gave to the IFS - for this exact expression ${ccc} - with the difference, that the $ is included! And only if it finds this exact match, only then, it counts as a valid match; otherwise it should be ignored. Kind of a reverse search method...
Updated - add this reverse search to ignore the trap on the beginning of the line:
pattern="\$${BASH_REMATCH[0]}";
searchresult="";
searchresult=`echo "$line" | grep "$pattern"`;
if [ "$searchresult" != "" ]; then echo "It was found!"; fi;
Neglectable issue: If the line looks like this line='{ccc}aaaaaa${ccc}bbbbb' it would recognize the first {ccc} as a valid match (although it isn't) and print it, because the reverse search found the second ${ccc}. Although this is not intended it's irrelevant for my specific purpose as it implies that this pattern does in fact exist at least once in the same line.

Related

Bash - binary operator expected

I've got bash script for counting rows in the reports. I have one array where all reports names are stored and in the loop I'm counting rows. However for some files my script receives binary operator expected error. Do anyone have a solution?
for i in ${ARRAY[#]}; do
if [ ! -f "$BASE_DIR/$i"* ];
then
echo "File not generated yet"
else
ARRAY2=$(wc -l < "$BASE_DIR/$i"*.tab | awk '{print $1-2}')
echo ${ARRAY2[$i]} $i
fi
Use double straight braces instead of ones as follows since you r using extended expressions.
if [[ ! -f "$BASE_DIR/$i"* ]];
Need to check with array contents. Special characters as ' ' (spaces) in file names must be escaped.
-f takes just one argument, so the error occurs when the pattern matches more than one file.
It seems to work with [[, although I can't find any documentation as to why it does.
The bigger problem is you can also only use one file with the < operator; if the pattern matches multiple files, you'll get an ambiguous redirect error. To fix that, you'll need to use cat:
cat "$BASE_DIR/$i"*.tab | wc -l
However, it's not clear what you are expecting from the output; ARRAY2 will not actually be an array.

Extracting a string between last two slashes in Bash

I know this can be easily done using regex like I answered on https://stackoverflow.com/a/33379831/3962126, however I need to do this in bash.
So the closest question on Stackoverflow I found is this one bash: extracting last two dirs for a pathname, however the difference is that if
DIRNAME = /a/b/c/d/e
then I need to extract
d
This may be relatively long, but it's also much faster to execute than most preceding answers (other than the zsh-only one and that by j.a.), since it uses only string manipulations built into bash and uses no subshell expansions:
string='/a/b/c/d/e' # initial data
dir=${string%/*} # trim everything past the last /
dir=${dir##*/} # ...then remove everything before the last / remaining
printf '%s\n' "$dir" # demonstrate output
printf is used in the above because echo doesn't work reliably for all values (think about what it would do on a GNU system with /a/b/c/-n/e).
Here a pure bash solution:
[[ $DIRNAME =~ /([^/]+)/[^/]*$ ]] && printf '%s\n' "${BASH_REMATCH[1]}"
Compared to some of the other answers:
It matches the string between the last two slashes. So, for example, it doesn't match d if DIRNAME=d/e.
It's shorter and fast (just uses built-ins and doesn't create subprocesses).
Support any character between last two slashes (see Charles Duffy's answer for more on this).
Also notice that is not the way to assign a variable in bash:
DIRNAME = /a/b/c/d/e
^ ^
Those spaces are wrong, so remove them:
DIRNAME=/a/b/c/d/e
Using awk:
echo "/a/b/c/d/e" | awk -F / '{ print $(NF-1) }' # d
Edit: This does not work when the path contains newlines, and still gives output when there are less than two slashes, see comments below.
Using sed
if you want to get the fourth element
DIRNAME="/a/b/c/d/e"
echo "$DIRNAME" | sed -r 's_^(/[^/]*){3}/([^/]*)/.*$_\2_g'
if you want to get the before last element
DIRNAME="/a/b/c/d/e"
echo "$DIRNAME" | sed -r 's_^.*/([^/]*)/[^/]*$_\1_g'
OMG, maybe this was obvious, but not to me initially. I got the right result with:
dir=$(basename -- "$(dirname -- "$str")")
echo "$dir"
Using zsh parameter substitution is pretty cool too
echo ${${DIRNAME%/*}##*/}
I think it's faster than the double $() as well, because it won't need any subprocesses.
Basically it slices off the right side first, and then all the remaining left side second.

How to read a space in bash - read will not

Many people have shown how to keep spaces when reading a line in bash. But I have a character based algorithm which need to process each end every character separately - spaces included. Unfortunately I am unable to get bash read to read a single space character from input.
while read -r -n 1 c; do
printf "[%c]" "$c"
done <<< "mark spitz"
printf "[ ]\n"
yields
[m][a][r][k][][s][p][i][t][z][][ ]
I've hacked my way around this, but it would be nice to figure out how to read a single any single character.
Yep, tried setting IFS, etc.
Just set the input field separator(a) so that it doesn't treat space (or any character) as a delimiter, that works just fine:
printf 'mark spitz' | while IFS="" read -r -n 1 c; do
printf "[%c]" "$c"
done
echo
That gives you:
[m][a][r][k][ ][s][p][i][t][z]
You'll notice I've also slightly changed how you're getting the input there, <<< appears to provide a extraneous character at the end and, while it's not important to the input method itself, I though it best to change that to avoid any confusion.
(a) Yes, I'm aware that you said you've tried setting IFS but, since you didn't actually show how you'd tried this, and it appears to work fine the way I do it, I have to assume you may have just done something wrong.

Find negative numbers in file with grep

i have this script that reads a file, the file looks like this:
711324865,438918283,2
-333308476,886548365,2
1378685449,-911401007,2
-435117907,560922996,2
259073357,714183955,2
...
the script:
#!/bin/bash
while IFS=, read childId parentId parentLevel
do
grep "\$parentId" parent_child_output_level2.csv
resul=$?
echo "child is $childId, parent is $parentId parentLevel is $parentLevel resul is $resul"
done < parent_child_output_level1.csv
but it is not working, resul is allways returning me 1, which is a false positive.
I know that because i can launch the next command, equivalent, i think:
[core#dub-vcd-vms165 generated-and-saved-to-hdfs]$
grep "\-911401007"parent_child_output_level2.csv
-911401007,-157143722,3
Please help.
grep command to print only the negative numbers.
$ grep -oP '(^|,)\K-\d+' file.csv
-333308476
-911401007
-435117907
(^|,) matches the start of a line or comma.
\K discards the previously matched characters.
-\d+ Matches - plus the following one or more numbers.
Your title is inconsistent with your question. Your title asks for how to grep negative numbers, which Avinash Raj answered well, although I'd suggest you don't even need the (Perl-style) look-behind positive assertion (^|,)\K to match start-of-field, because if the file is well-formed, then -\d+ would match all numbers just as well. So you could just run (edit: realized that with a leading - you need -- to prevent grep from taking the pattern as an option):
grep -oP -- '-\d+' file.csv;
Your question includes a script whose intention seems to be to grep for any number (positive or negative) in the first field (childId) of one file (parent_child_output_level2.csv) that occurs in the second field (parentId) of another file (parent_child_output_level1.csv). To accomplish this, I wouldn't use grep, because you're trying to do an exact numerical equality test, which can even be done as an exact string equality test assuming your numbers are always consistently represented (e.g. no redundant leading zeroes). Repeatedly grepping through the entire file just to search for a number in one column is also wasteful of CPU.
Here's what I would do:
parentIdList=($(cut -d, -f2 parent_child_output_level1.csv));
childIdList=($(cut -d, -f1 parent_child_output_level2.csv));
for parentId in "${parentIdList[#]}"; do
for childId in "${childIdList[#]}"; do
if [[ "$childId" == "$parentId" ]]; then
echo "$parentId";
fi;
done;
done;
With this approach, you precompute both the parent id list and the child id list just once, using cut to extract the appropriate field from each file. Then you can use the shell-builtin for loop, shell-builtin if conditional, and shell-builtin [[ test command to accomplish the check, and finally finish with a shell-builtin echo to print the matches. Everything is shell-builtin, after the initial command substitutions that run the cut external executable.
If you also want to filter these results on negative numbers, you could grep for ^- in the results of the above script, or grep for it in the results of each (or just the first) cut command, or add the following line just inside the outer for loop:
if [[ "${parentId:0:1}" != '-' ]]; then continue; fi;
Alternative approach:
if [[ "$parentId" != -* ]]; then continue; fi;
Either approach will skip non-negatives.

Match exact word in bash script, extract number from string

I'm trying to create a very simple bash script that will open new link base on the input command
Use case #1
$ ./myscript longname55445
It should take the number 55445 and then assign that to a variable which will later be use to open new link based on the given number.
Use case #2
$ ./myscript l55445
It should do the exact same thing as above by taking the number and then open the same link.
Use case #3
$ ./myscript 55445
If no prefix given then we just simply open that same link as a fallback.
So far this is what I have
#!/bin/sh
BASE_URL=http://api.domain.com
input=$1
command=${input:0:1}
if [ "$command" == "longname" ]; then
number=${input:1:${#input}}
url="$BASE_URL?id="$number
open $url
elseif [ "$command" == "l" ]; then
number=${input:1:${#input}}
url="$BASE_URL?id="$number
open $url
else
number=${input:1:${#input}}
url="$BASE_URL?id="$number
open $url
fi
But this will always fallback to the elseif there.
I'm using zsh at the moment.
input=$1
command=${input:0:1}
sets command to the first character of the first argument. It's not possible for a one character string to be equal to an eight-character string ("longname"), so the if condition must always fail.
Furthermore, both your elseif and your else clauses set
number=${input:1:${#input}}
Which you could have written more simply as
number=${input:1}
But in both cases, you're dropping the first character of input. Presumably in the else case, you wanted the entire first argument.
see whether this construct is helpful for your purpose:
#!/bin/bash
name="longname55445"
echo "${name##*[A-Za-z]}"
this assumes a letter adjacent to number.
The following is NOT another way to write the same, because it is wrong.
Please see comments below by mklement0, who noticed this. Mea culpa.
echo "${name##*[:letter:]}"
You have command=${input:0:1}
It takes the first single char, and you compare it to "longname", of course it will fail, and go to elseif.
The key problem is to check if the input is beginning with l or longnameor nothing. If in one of the 3 cases, take the trailing numbers.
One grep line could do it, you can just grep on input and get the returned text:
kent$ grep -Po '(?<=longname|l|^)\d+' <<<"l234"
234
kent$ grep -Po '(?<=longname|l|^)\d+' <<<"longname234"
234
kent$ grep -Po '(?<=longname|l|^)\d+' <<<"234"
234
kent$ grep -Po '(?<=longname|l|^)\d+' <<<"foobar234"
<we got nothing>
You can use regex matching in bash.
[[ $1 =~ [0-9]+ ]] && number=$BASH_REMATCH
You can also use regex matching in zsh.
[[ $1 =~ [0-9]+ ]] && number=$MATCH
Based on the OP's following clarification in a comment,
I'm only looking for the numbers [...] given in the input.
the solution can be simplified as follows:
#!/bin/bash
BASE_URL='http://api.domain.com'
# Strip all non-digits from the 1st argument to get the desired number.
number=$(tr -dC '[:digit:]' <<<"$1")
open "$BASE_URL?id=$number"
Note the use of a bash shebang, given the use of 'bashism' <<< (which could easily be restated in a POSIX-compliant manner).
Similarly, the OP's original code should use a bash shebang, too, due to use of non-POSIX substring extraction syntax.
However, judging by the use of open to open a URL, the OP appears to be on OSX, where sh is essentially bash (though invocation as sh does change behavior), so it'll still work there. Generally, though, it's safer to be explicit about the required shell.

Resources