Extract and list every occurrence of a string between two delimiters from a text body - bash

I would like to understand how to extract all links (starting with www and ending with .com) from a text body such as below. Multiple occurrences may or may not occur per line.
cat body.txt
text more-text url="http://www.link1.com">textblabla textbla=textblabla url="http://www.link2.com">textblabla textblabla=textblabla textblabla
url="http://www.link3.com"> textblabla textblablabla=bla
Desired output:
www.link1.com
www.link2.com
www.link3.com

Hope this helps:
myStr='text more-text url="http://www.link1.com">textblabla textbla=textblabla url="http://www.link2.com">textblabla textblabla=textblabla textblabla url="http://www.link3.com"> textblabla textblablabla=bla';
for aString in $myStr; do
    # Bash uses POSIX ERE, which has no lazy quantifier (*?), so match non-dot characters instead
    if [[ $aString =~ www\.[^.]*\.com ]]; then
        echo "${BASH_REMATCH[0]}"
    fi
done

Using grep
$ grep -o 'www\.[^.]*\.com' input_file
www.link1.com
www.link2.com
www.link3.com
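If you would rather stay in the shell than call grep, and a single word or line may hold several links, the same idea can be sketched as a loop that matches, prints, and then continues on the remainder of the string. The function name extract_links and the sample input are hypothetical, and the pattern assumes link names contain no dots, quotes, or whitespace:

```bash
# Hypothetical helper: print every www.<name>.com occurrence, one per line.
extract_links() {
    local rest=$1
    # Keep the regex in a variable so the shell does not reinterpret it.
    local re='www\.[^."[:space:]]*\.com'
    while [[ $rest =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[0]}"
        # Drop everything up to and including this match, then search again.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

extract_links 'x url="http://www.link1.com">y url="http://www.link2.com">z'
```

For a file you could call it as extract_links "$(cat body.txt)", though grep -o is likely the simpler choice there.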

Related

In bash how can I get the last part of a string after the last hyphen [duplicate]

I have this variable:
A="Some variable has value abc.123"
I need to extract this value, i.e. abc.123. Is this possible in bash?
Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.
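To see that $NF follows the field count rather than a fixed position, here is a small sketch that runs the same awk program on inputs of different lengths. The helper name last_field is hypothetical:

```bash
# Hypothetical helper: print the field count and the last field of a string.
last_field() {
    awk '{ print NF ": " $NF }' <<< "$1"
}

last_field "Some variable has value abc.123"   # five fields
last_field "one two"                           # two fields
```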
Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)
Some examples using parameter expansion
A="Some variable has value abc.123"
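Remove the longest leading match of "* " (everything through the last space)
echo "${A##* }"
abc.123
Remove the shortest trailing match of " *" (the last word)
echo "${A% *}"
Some variable has value
Remove the shortest trailing match of ".*" (everything from the last dot)
echo "${A%.*}"
Some variable has value abc
Remove the longest trailing match of " *" (everything from the first space)
echo "${A%% *}"
Some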
Read more Shell-Parameter-Expansion
The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove shortest trailing * (strip the last word)
${A%% *} - remove longest trailing * (strip the last words)
${A#* } - remove shortest leading * (strip the first word)
${A##* } - remove longest leading * (strip the first words)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
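Putting the four operators together, here is a minimal sketch that takes a path apart. The path and the variable names are made up for illustration:

```bash
path="/usr/local/share/archive.tar.gz"   # hypothetical example path

dir=${path%/*}         # shortest trailing "/*" removed: the directory
file=${path##*/}       # longest leading "*/" removed: the file name
stem=${file%%.*}       # longest trailing ".*" removed: name without any extension
ext=${file#*.}         # shortest leading "*." removed: the full extension
last_ext=${file##*.}   # longest leading "*." removed: only the final extension

printf '%s\n' "$dir" "$file" "$stem" "$ext" "$last_ext"
```

Note that a file name without a dot is left unchanged by all three of the dot patterns, so ext and last_ext would then equal the whole name.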
How do you know where the value begins? If it always starts at the 5th word, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.
As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123
More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces"; ie: "not the space char"
the * means "match 0 or more instances of the preceding pattern (which is [^ ])", and the $ means "match the end of the line." So, this matches the last word after the last space through to the end of the line; i.e., abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements separated by the default IFS (Internal Field Separator) characters (space, tab, and newline):
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
This pattern is also useful because it makes the opposite easy: obtaining all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
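Since array subscripts and the slice length are both evaluated as arithmetic, the last word and the remaining words can also be obtained without a helper variable for the length minus one. This is only a sketch; the names words, count, last, and all_but_last are hypothetical. It avoids the negative index, which older bash releases do not accept:

```bash
A="Some variable has value abc.123"
IFS=" " read -r -a words <<< "$A"

count=${#words[@]}
last=${words[count-1]}                    # subscript is an arithmetic expression
all_but_last=("${words[@]:0:count-1}")    # so is the slice length

echo "$last"
echo "${all_but_last[*]}"
```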
You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually [ \t]). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
With Perl:
echo "Some variable has value abc.123" | perl -nE 'say $1 if /(\S+)$/'

Print text between two strings on the same line

I've been searching for a long time, and have not been able to find a working answer for my problem.
I have a line from an HTML file extracted with sed '162!d' skinlist.html, which contains the text
<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard">.
I want to extract the text Dwarf Red Beard, but that text is modular (can be changed), so I would like to extract the text between title=" and ".
I cannot, for the life of me, figure out how to do this.
awk 'NR==162 {print $4}' FS='"' skinlist.html
set field separator to "
print only line 162
print field 4
Solution in sed
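sed -n '162 s/^.*title="\([^"]*\)".*$/\1/p' skinlist.html
Extracts line 162 of skinlist.html and captures the title attribute's contents in \1. Matching [^"]* rather than .* stops the capture at the first closing quote, even if more quotes follow on the line.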
The shell's variable expansion syntax allows you to trim prefixes and suffixes from a string:
line="$(sed '162!d' skinlist.html)" # extract the relevant line from the file
temp="${line#* title=\"}" # remove from the beginning through the first match of ' title="'
if [ "$temp" = "$line" ]; then
echo "title not found in '$line'" >&2
else
title="${temp%%\"*}" # remove from the first '"' through the end
fi
You can pass it through another sed or add expressions to that sed like -e 's/.*title="//g' -e 's/">.*$//g'
also sed
sed -n '162 s/.*"\([a-zA-Z ]*\)"./\1/p' skinlist.html
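If the line is already in a shell variable, a bash regex with a capture group is another option. This is a sketch under the assumption that the title value contains no double quotes; the function name get_title is hypothetical:

```bash
# Hypothetical helper: print the value of the title attribute found in a line.
# Returns non-zero when no title="..." is present.
get_title() {
    local re='title="([^"]*)"'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

line='<a href="/skin/dwarf-red-beard-734/" title="Dwarf Red Beard">'
get_title "$line"
```

With the file from the question, it could be called as get_title "$(sed '162!d' skinlist.html)".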

Bash - extracting a string between two points

For example:
((
extract everything here, ignore the rest
))
I know how to ignore everything within, but I don't know how to do the opposite. Basically, it'll be a file and it needs to extract the data between the two points and then output it to another file. I've tried countless approaches, and all seem to tell me the indentation I'm stating doesn't exist in the file, when it does.
If somebody could point me in the right direction, I'd be grateful.
If your data are "line oriented", so that each marker is alone on its own line (as in the example), you can try some of the following:
function getdata() {
cat - <<EOF
before
((
extract everything here, ignore the rest
someother text
))
after
EOF
}
echo "sed - with two seds"
getdata | sed -n '/((/,/))/p' | sed '1d;$d'
echo "Another sed solution"
getdata | sed -n '1,/((/d; /))/,$d;p'
echo "With GNU sed"
getdata | gsed -n '/((/{:a;n;/))/b;p;ba}'
echo "With perl"
getdata | perl -0777 -pe "s/.*\(\(\s*\\n(.*)?\)\).*/\$1/s"
PS: yes, it looks like a dance of crazy toothpicks.
Assuming you want to extract the string inside (( and )):
VAR="abc((def))ghi"
echo "$VAR"
VAR=${VAR##*((}
VAR=${VAR%%))*}
echo "$VAR"
## cuts away the longest match from the beginning; # cuts away the shortest match from the beginning; %% cuts away the longest match at the end; % cuts away the shortest match at the end.
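The difference between the shortest and longest variants only shows up when the markers occur more than once. A small sketch with a made-up input (the variable names first and last are hypothetical; the markers are quoted so the shell treats the parentheses literally):

```bash
VAR="a((b))c((d))e"      # hypothetical input with two marked sections

first=${VAR#*'(('}       # shortest leading match: drops "a(("
first=${first%%'))'*}    # longest trailing match: drops from the first "))"

last=${VAR##*'(('}       # longest leading match: drops through the last "(("
last=${last%%'))'*}      # longest trailing match: drops from the first "))"

echo "$first"
echo "$last"
```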
The file :
$ cat /tmp/l
((
extract everything here, ignore the rest
someother text
))
The script
$ awk '$1=="((" {p=1;next} $1=="))" {p=0;next} p' /tmp/l
extract everything here, ignore the rest
someother text
sed -n '/^((/,/^))/ { /^((/b; /^))/b; p }'
Brief explanation:
/^((/,/^))/: range addressing (inclusive)
{ /^((/b; /^))/b; p }: sequence of 3 commands
1. skip line with ^((
2. skip line with ^))
3. print
The line skipping is required to make the range selection exclusive.
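The same "print only between the markers" logic can be sketched in plain bash with a flag variable, which may be easier to adapt than the sed version. It assumes each marker sits alone on its line, as in the question; the function name extract_block is hypothetical:

```bash
# Hypothetical helper: copy the lines between "((" and "))" from stdin to stdout.
extract_block() {
    local inside=0 line
    while IFS= read -r line; do
        if [[ $line == '((' ]]; then
            inside=1                  # opening marker: start printing after it
        elif [[ $line == '))' ]]; then
            inside=0                  # closing marker: stop printing
        elif (( inside )); then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' before '((' 'extract everything here, ignore the rest' \
    'someother text' '))' after | extract_block
```

To write the result to another file, as the question asks, redirect both ends: extract_block < input.txt > output.txt (both file names are placeholders).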

List path with word match in bash script

I have a string that contains a list of lines. I want to search for a particular string and list all the paths that contain it.
The given string contains the following:
755677 myfile/Edited-WAV-Files
756876 orignalfile/videofile
758224 orignalfile/audiofile
758224 orignalfile/photos
758225 others/video
758267 others/photo
758268 orignalfile/videofile1
758780 others/photo1
I want to extract and list only the paths that start with orignalfile. My output should be like this:
756876 orignalfile/videofile
758224 orignalfile/audiofile
758224 orignalfile/photos
758268 orignalfile/videofile1
That looks easy enough...
echo "$string" | grep orignalfile/
or
grep orignalfile/ << eof
$string
eof
or, if it's in a file,
grep orignalfile/ sourcefile
A bash solution:
while read -r f1 f2
do
    [[ "$f2" =~ ^orignal ]] && echo "$f1 $f2"
done < file
If your string spans several lines like this:
755677 myfile/Edited-WAV-Files
756876 orignalfile/videofile
758224 orignalfile/audiofile
758224 orignalfile/photos
758225 others/video
758267 others/photo
758268 orignalfile/videofile1
758780 others/photo1
Then you can use this code:
echo "$S" | grep -F ' orignalfile/'
If the string is not separated by new lines then
echo $S | grep -oE "[0-9]+ orignalfile/[^ ]+"
Are you sure that your string contains linebreaks/newlines?
If it does then the solution of DigitalRoss will apply.
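If newlines seem to be missing, they are most likely being lost to word splitting when the variable is expanded without quotes. For example, if your code looks like
string=$(ls -l)
the newlines are stored in the variable, but an unquoted $string is split on the characters in IFS, which include the linefeed. Quoting the expansion ("$string") avoids this. Alternatively, set IFS to a field separator string without the linefeed:
IFS=$'\t| ' string=$(ls -l)
or to an empty IFS var:
IFS='' string=$(ls -l)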
Docs for IFS from the bash man page:
IFS The Internal Field Separator that is used for word splitting after
expansion and to split lines into words with the read builtin command. The
default value is ``<space><tab><newline>''.
grep -E '^[0-9]{6} orignalfile/' <<<"$string"
note:
the ^ matches the start of each line. You don't want to match things that happen to have orignalfile/ somewhere in the middle
[0-9]{6} matches the six digits at the start of each line
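If the number of digits can vary, an alternative is to test the second field directly instead of describing the whole line. This is a sketch using awk's index function, which returns 1 when the field begins with the given text; the function name filter_by_prefix and the shortened sample data are hypothetical:

```bash
# Hypothetical helper: keep lines whose second field starts with the given prefix.
filter_by_prefix() {
    awk -v prefix="$1" 'index($2, prefix) == 1'
}

string='755677 myfile/Edited-WAV-Files
756876 orignalfile/videofile
758225 others/video
758268 orignalfile/videofile1'

filter_by_prefix 'orignalfile/' <<< "$string"
```

Because index compares plain text, the slash in the prefix needs no regex escaping.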

How can I cut(1) camelcase words?

Is there an easy way in Bash to split a camelcased word into its constituent words?
For example, I want to split aCertainCamelCasedWord into 'a Certain Camel Cased Word' and be able to select those fields that interest me. This is trivially done with cut(1) when the word separator is the underscore, but how can I do this when the word is camelcased?
sed 's/\([A-Z]\)/ \1/g'
Captures each capital letter and replaces it with a space followed by the captured letter, throughout the input.
$ echo "aCertainCamelCasedWord" | sed 's/\([A-Z]\)/ \1/g'
a Certain Camel Cased Word
This solution works if you need to avoid splitting up words that are all caps. For example, using the top answer you'll get:
$ echo 'FAQPage' | sed 's/\([A-Z]\)/ \1/g'
F A Q Page
But instead with my solution, you'll get:
$ echo 'FAQPage' | sed 's/\([A-Z][^A-Z]\)/ \1/g'
FAQ Page
Note: This does not work correctly when there is a second instance of multiple uppercase words, for example:
$ echo 'FAQPageOneReplacedByFAQPageTwo' | sed 's|\([A-Z][^A-Z]\)| \1|g'
FAQ Page One Replaced ByFAQ Page Two
An additional expression is required to handle that case:
echo 'FAQPageOneReplacedByFAQPageTwo' | sed -e 's|\([A-Z][^A-Z]\)| \1|g' -e 's|\([a-z]\)\([A-Z]\)|\1 \2|g'
FAQ Page One Replaced By FAQ Page Two
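Since the question was about selecting fields with cut, the two expressions can be wrapped in a function and piped into cut. This is a sketch: the name split_camel is hypothetical, LC_ALL=C is set on the assumption that [A-Z] should mean ASCII capitals only, and the third expression strips the leading space produced when the word starts with a capital:

```bash
# Hypothetical helper: split a camel-cased word into space-separated words.
split_camel() {
    LC_ALL=C sed -e 's/\([A-Z][^A-Z]\)/ \1/g' \
                 -e 's/\([a-z]\)\([A-Z]\)/\1 \2/g' \
                 -e 's/^ //' <<< "$1"
}

split_camel aCertainCamelCasedWord | cut -d ' ' -f 2,3
split_camel FAQPageOneReplacedByFAQPageTwo
```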
Pure Bash:
name="aCertainCamelCasedWord"
declare -a word # the word array
counter1=0 # count characters
counter2=0 # count words
while [ $counter1 -lt ${#name} ] ; do
nextchar=${name:${counter1}:1}
if [[ $nextchar =~ [[:upper:]] ]] ; then
((counter2++))
word[${counter2}]=$nextchar
else
word[${counter2}]=${word[${counter2}]}$nextchar
fi
((counter1++))
done
echo -e "'${word[@]}'"

Resources