grep between some characters (quotes, etc) of after (eg. hashtag) any content (text, numbers, emojis) [duplicate] - bash

This question already has answers here:
Grep string inside double quotes
(4 answers)
Closed 1 year ago.
Based on this question: Bash sed - find hashtags in string; with no solutions for this case (when you have special characters).
This question is well-researched and not a duplicate of this unrelated question as the referred doesn't covers all the asked topics (support to special characters and numbers; grep both between and after/before).
echo "Text and #hashtag" | grep -o '#[[:alpha:]]\+*' | tr -d '"' works successfully, returning #hashtag; that's still related to the mentioned question...
...About this new question with mine own needs (that can be useful to you), this is my version, parsing text between doublequotes instead of after hashtag:
echo '#first = "Yes"' | grep -o '"[[:alpha:]]\+*"' | tr -d '"' and it works, returning Yes.
However, when it have an emoji or other characters such as > and / (example: echo '#first = "✅ Yes"' | grep -o '"[[:alpha:]]\+*"' | tr -d '"') it returns an empty output.
It have to support any kind of character (emojis, html tags, numbers).
This should be useful not only for parsing between characters, but also after a character (such as parsing any #hashtag text) or before.

The way to extract text between double quotes is to match any character except double quote, as many as possible, between double quotes.
grep -o '"[^"]*"' | tr -d '"'
Some test cases:
grep -o '"[^"]*"' <<\___here | tr -d '"'
there is "text" between "double quotes"
just one "?" here, "test me!"
any unpaired double quote " will not match
___here
The second one of these will fail with the current code in your own answer.

Thanks to #Aserre's pointings, I could come up with an answer.
In order for the "get every text when it appear AFTER a charater" and "get every text when it appear BETWEEN quotes" (grep) to work with any character, we have to replace [[:alpha:]] in the block to ...
So, it is:
echo '#first = "✅ Yes"' | grep -o '"...\+"' | tr -d '"' (get anything which is between double quotes)
and:
echo "Text and #hashtag" | grep -o '#...\+' | tr -d '"' (get anything which is after a hashtag)
Update:
If you want to support things with only 1 character (such as numbers ranging from 0 to 9), replace ... to . (single dot)
It works, as in the question, for: emojis, letters, numbers and other special characters.

Related

In bash, why does it output a right bracket as a value in when replacing?

So I was looking into tr and I was playing around with the following command:
echo "test 123 new LINE" | tr -c 'A-Za-z' '[\n*]' vs echo "test 123 new LINE" | tr -c 'A-Za-z' '[\n]'. These are the different outputs:
> echo "test 123 new LINE" | tr -c 'A-Za-z' '[\n*]'
test
new
LINE
> echo "test 123 new LINE" | tr -c 'A-Za-z' '[\n]'
test]]]]]new]LINE]
Without the addition of the wildcard, it appears to be replacing each new line with a right bracket character. Taking a look at the man page (https://linuxcommand.org/lc3_man_pages/tr1.html), it says that for the argument [CHAR*], it "in SET2, copies of CHAR until length of SET1". So clearly it still replaces characters in SET1 based on SET2 but where is it getting the right bracket from?
Nothing to do with my job so no need to tell me about sed or awk, just came across this and was curious.
In your second command, the replacement set is not in one of the special formats
[CHAR*]
[CHAR*REPEAT]
[:<keyword>:]
[=CHAR=]
So it doesn't get any special treatment and the square brackets are treated literally. So the first two non-alphabetic characters are replaced with [ and \n, respectively, and all other characters are replaced with ] (because the replacement set is extended by repeating the last character).

tr command: strange behavior with | and \

Let's say I have a file test.txt with contents:
+-foo.bar:2.4
| bar.foo:1.1:test
\| hello.goobye:3.3.3
\|+- baz.yeah:4
I want to use the tr command to delete all instances of the following set of characters:
{' ', '+', '-', '|', '\'}
Done some pretty extensive research on this but found no clear/concise answers.
This is the command that works:
input:
cat test.txt | tr -d "[:blank:]|\\\+-"
output:
foo.bar:2.4
bar.foo:1.1:test
hello.goobye:3.3.3
baz.yeah:4
I experimented with many combinations of that set and I found out that the '-' was being treated as a range indicator (like... [a-z]) and therefore must be put at the end. But I have two main questions:
1) Why must the backslash be double escaped in order to be included in the set?
2) Why does putting the '|' at the end of the set string cause the tr program to delete everything in the file except for trailing new line characters?
Like this:
tr -d '\-|\\+[:blank:] ' < file
You have to escape the - because it is used for denoting ranges of characters like:
tr -d '1-5'
and must therefore being escaped if you mean a literal hyphen. You can also put it at the end. (learned that, thanks! :) )
Furthermore the \ must be escaped when you mean a literal \ because it has a special meaning needed for escape sequences.
The remaining characters must not being escaped.
Why must the \ being doubly escaped in your example?
It's because you are using a "" (double quoted) string to quote the char set. A double quoted string will be interpreted by the shell, a \\ in a double quoted string means a literal \. Try:
echo "\+"
echo "\\+"
echo "\\\+"
To avoid to doubly escape the \ you can just use single quotes as in my example above.
Why does putting the '|' at the end of the set string cause the tr program to delete everything in the file except for trailing new line characters?
Following CharlesDuffy's comment having the | at the end means also that you had the unescaped - not at the end, which means it was describing a range of characters where the actual range depends on the position you had it in the set.
another approach is to define the allowed chars
$ tr -cd '[:alnum:]:.\n' <file
foo.bar:2.4
bar.foo:1.1:test
hello.goobye:3.3.3
baz.yeah:4
or, perhaps delete all the prefix non-word chars
$ sed -E 's/\W+//' file

How to use sed to extract a string [duplicate]

This question already has answers here:
BASH extract value after string in variable Not file [duplicate]
(2 answers)
Closed last year.
I need to extract a number from the output of a command: cmd. The output is type: 1000
So my question is how to execute the command, store its output in a variable and extract 1000 in a shell script. Also how do you store the extracted string in a variable?
This question has been answered in pieces here before, it would be something like this:
line=$(sed -n '2p' myfile)
echo "$line"
if [ `echo $line || grep 'type: 1000' ` ] then;
echo "It's there!";
fi;
Store output of sed into a variable
String contains in Bash
EDIT: sed is very limited, you would need to use bash, perl or awk for what you need.
This is a typical use case for grep:
output=$(cmd | grep -o '[0-9]\+')
You can write the output of a command or even a pipeline of commands into a shell variable using so called command substitution:
variable=$(cmd);
In comments it appeared that the output of cmd contains more lines than the type : 1000. In this case I would suggest sed:
output=$(cmd | sed -n 's/type : \([0-9]\+\)/\1/p;q')
You tagged your question as sed but your question description does not restrict other tools, so here's a solution using awk.
output = `cmd | awk -F':' '/type: [0-9]+/{print $2}'`
Alternatively, you can use the newer $( ) syntax. Some find the newer syntax preferable and it can be conveniently nested, without the need for escaping backtics.
output = $(cmd | awk -F':' '/type: [0-9]+/{print $2}')
If the output is rigidly restricted to "type: " followed by a number, you can just use cut.
var=$(echo 'type: 1000' | cut -f 2 -d ' ')
Obviously you'll have to pipe the output of your command to cut, I'm using echo as a demo.
In addition, I'd use grep and then cut if the string you are searching is more complex. If we assume there can be all kind of numbers in the text, but only one occurrence of "type: " followed by a number, you can use the command:
>> var=$(echo "hello 12 type: 1000 foo 1001" | grep -oE "type: [0-9]+" | cut -f 2 -d ' ')
>> echo $var
1000
You can use the | operator to send the output of one command to another, like so:
echo " 1\n 2\n 3\n" | grep "2"
This sends the string " 1\n 2\n 3\n" to the grep command, which will search for the line containing 2. It sound like you might want to do something like:
cmd | grep "type"
Here is a plain sed solution that uses a regualar expression to find the number in your string:
cmd | sed 's/^.*type: \([0-9]\+\)/\1/g'
^ means from the start
.* can be any character (also none)
\([0-9]\+\) are numbers (minimum one character)
\1 means it takes the first pattern it finds (and only in this case) and uses it as replacement for the whole string

Get string between strings in bash

I want to get the string between <sometag param=' and '>
I tried to use the method from Get any string between 2 string and assign a variable in bash to get the "x":
echo "<sometag param='x'><irrelevant stuff='nonsense'>" | tr "'" _ | sed -n 's/.*<sometag param=_\(.*\)_>.*/\1/p'
The problem (apart from low efficiency because I just cannot manage to escape the apostrophe correctly for sed) is that sed matches the maximum, i.e. the output is:
x_><irrelevant stuff=_nonsense
but the correct output would be the minimum-match, in this example just "x"
Thanks for your help
You are probably looking for something like this:
sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p"
Test:
echo "<sometag param='x'><irrelevant stuff='nonsense'>" | sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p"
Results:
x
Explanation:
Instead of a greedy capture, use a non-greedy capture like: [^']* which means match anything except ' any number of times. To make the pattern stick, this is followed by: '>.
You can also use double quotes so that you don't need to escape the single quotes. If you wanted to escape the single quotes, you'd do this:
-
... | sed -n 's/.*<sometag param='\''\([^'\'']*\)'\''>.*/\1/p'
Notice how that the single quotes aren't really escaped. The sed expression is stopped, an escaped single quote is inserted and the sed expression is re-opened. Think of it like a four character escape sequence.
Personally, I'd use GNU grep. It would make for a slightly shorter solution. Run like:
... | grep -oP "(?<=<sometag param=').*?(?='>)"
Test:
echo "<sometag param='x'><irrelevant stuff='nonsense'>" | grep -oP "(?<=<sometag param=').*?(?='>)"
Results:
x
You don't have to assemble regexes in those cases, you can just use ' as the field separator
in="<sometag param='x'><irrelevant stuff='nonsense'>"
IFS="'" read x whatiwant y <<< "$in" # bash
echo "$whatiwant"
awk -F\' '{print $2}' <<< "$in" # awk

Print word between two characters by going backward in the line

I having problems in extracting the word from a line. What i want is that it picks the first word before the symbol # but after the /. Which is the only delimiter that stand out.
A line looks like this:
,["https://picasaweb.google.com/111560558537332305125/Programming#5743548966953176786",1,["https://lh6.googleusercontent.com/-Is8rb8G1sb8/T7UvWtVOTtI/AAAAAAAAG68/Cht3FzfHXNc/s0-d/Geek.jpg",1920,1200]
I want the word Programming.
To get that line i am using this which narrows it down.
sed -n '/.*picasa.*.jpg/p' 5743548866439293105
So i want it to pretty much find # and then go backward until it hit the first /. Then print it out. In this case the word should be Programming but could be anything.
I want it to be as short as possible and have experimented with
sed -n '/.*picasa.*.jpg/p' 5743548866439293105 | awk '$0=$2' FS="/" RS="[$#]"
You can do that with sed (slightly shortened for formatting but works on your original string as well):
pax> echo ',["https://p.g.com/111/Prog#574' | sed 's/^[^#]*\/\([^#]*\)#.*$/\1/'
Prog
pax>
Explaining in more detail:
/---+------------------> greedy capture up to '/'.
/ |
| | /------+---------> capture the stuff between '/' and '#'.
| |/ |
| || | /-+-----> everything from '#' to end of line.
| || |/ |
| || || |
's/^[^#]*\/\([^#]*\)#.*$/\1/'
||
\+---> replace with captured group.
It basically searches for an entire line that has the pattern you want (first # following a /), whilst capturing (with the \( and \) brackets) just the stuff between / and #.
The substitution then replaces the entire line with just that captured text you're interested in (via \1).
Using grep with some Perl regex extensions:
echo $string | grep -P -o "(?<=/)[^/]+(?=#)"
-P tells grep to use Perl extensions. -o tells grep to display only the matched text. To understand what gets matched, break the regex into three parts: (?<=/), [^/]+?, and (?=#). The first part says that the matched text must follow a '/', without including the '/' in the match. The second parts matches a string of non-'/' characters. The last part says that the matched text must be immediately followed by a '#', without including the '#' in the match.
Another grep, using the "\K" feature to "throw away" the match up to the last '/' before the '#':
# Match as much as possible up to a '/', but throw it away, then match as much as you can
# up to the first #
echo $string | grep -oP ".*/\K.+(?=#)"
Using cut and awk to get the first field (splitting on #) followed by the last field (splitting on /):
echo $string | cut -d# -f1 | awk -F/ '{print $NF}'
Using some temporary variables and bash's parameter expansion facilities:
$ FOO=["https://picasaweb.google.com/111560558537332305125/Programming#5743548966953176786",1,["https://lh6.googleusercontent.com/-Is8rb8G1sb8/T7UvWtVOTtI/AAAAAAAAG68/Cht3FzfHXNc/s0-d/Geek.jpg",1920,1200]
$ BAR=${FOO%#*} # Strip the last # and everything after
$ echo $BAR
[https://picasaweb.google.com/111560558537332305125/Programming
$ BAZ=${BAR##*/} # Strip everything up to and including the last /
$ echo $BAZ
Programming
This might work for you:
sed '/.*\/\([^#]*\)#.*/{s//\1/;q};d' file

Resources