Extract first three elements from a URL with a regular expression

Given the following URL:
http://www.example.com/path1/path2/page
Is there a simple way to extract the first three blocks of it with a regular expression, that is:
http://www.example.com/path1/path2
I've found some examples of how to do it with some coding (perl/javascript); however, I'd really appreciate it if somebody pointed me to a sed/awk example which uses a regular expression to do it.
Thanks

Solution 1st: With simple parameter expansion.
echo "${val%/*}"
Solution 2nd: with awk.
echo "$val" | awk 'match($0,/.*\//){print substr($0,RSTART,RLENGTH-1)}'
Solution 3rd: With one more awk.
echo "$val" | awk -F"/" '{NF--} 1' OFS="/"
Solution 4th: With sed.
echo "$val" | sed 's/\(.*\/\).*/\1/;s/\/$//'
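The four commands above all drop the last path component, which coincides with the first three blocks only for this particular URL. A minimal sed sketch that keeps the scheme, the host and at most two path segments regardless of what follows might look like this; the function name first_three is purely illustrative, and -E (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed:

```shell
# Hypothetical helper: print the scheme, host and first two path segments of a URL.
first_three() {
  # Group 1 captures "http(s)://host" plus up to two "/segment" parts;
  # the trailing .* swallows whatever is left so it is discarded.
  printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+(/[^/]+){0,2}).*#\1#'
}

first_three 'http://www.example.com/path1/path2/page'
# prints: http://www.example.com/path1/path2
```

Unlike the "remove the last block" approaches, this keeps working when the URL has more than one trailing component.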

To extract the first three blocks (as opposed to, for example, removing the last block) with a Bash regex:
$ [[ "$var" =~ ^(https?://)?([^/]+/){0,3} ]] && echo "$BASH_REMATCH"
http://www.example.com/path1/path2/
Explained:
^(https?://)? optionally matches the scheme (http:// or https://), so it does not count as a block
([^/]+/){0,3} matches zero to three slash-terminated blocks, which form the output
It also supports, for example:
$ var=https://www.example.com/path1/path2/page
https://www.example.com/path1/path2/
$ var=www.example.com/path1/path2/page
www.example.com/path1/path2/
$ var=www.example.com/path1/
www.example.com/path1/
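Since the question asked for sed or awk, the same idea can also be sketched in awk by splitting on / and re-joining a fixed number of fields. This is only a sketch: first_three_awk is a made-up name, and the assumption is that a scheme, when present, shows up as a first field ending in a colon followed by an empty field.

```shell
# Hypothetical helper: keep the first three blocks of a URL, with or without a scheme.
first_three_awk() {
  printf '%s\n' "$1" | awk -F/ '{
    # With "/" as separator, "http://host/p1/p2/page" splits into
    # "http:", "", "host", "p1", "p2", "page": the scheme adds two fields.
    n = ($1 ~ /:$/ && $2 == "") ? 5 : 3
    if (n > NF) n = NF
    out = $1
    for (i = 2; i <= n; i++) out = out "/" $i
    print out
  }'
}

first_three_awk 'http://www.example.com/path1/path2/page'
# prints: http://www.example.com/path1/path2
first_three_awk 'www.example.com/path1/path2/page'
# prints: www.example.com/path1/path2
```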

Related

How to overcome greedy match everything when looking for a particular string later?

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.*([0-9]+) guys.*/\1/g'
The above command currently outputs just 5. Essentially I'd like to parse the number of "guys" in a random sentence that could have numbers (or not.. I'd also like to parse just echo "365 guys") preceding the number of guys. My .* is matching the 36 and preventing it from appearing in the \1. How can I write a sed command (or any other regex/perl/awk) to accomplish what I want?

Use the "frugal" quantifier *? in Perl:
perl -pe 's/.*?([0-9]+) guys.*/$1/'

With GNU grep:
$ grep -Po '\b[0-9]+(?= guys\b)' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
-P activates support for PCREs, which enables advanced regex features.
-o specifies that only the matching parts of input lines should be printed.
\b matches only on a word boundary, including at the start of a line;
this prevents matching numbers that aren't stand-alone numbers but part of other words, such as in foo365 guys, and words that start with guys, such as guysanddolls.
(?= guys) is a look-ahead assertion that matches the enclosed subexpression without including it in the matched string returned.
As demonstrated, this may match multiple patterns on a given line, with each number extracted printed on its own output line.
If that is undesired, grep cannot be used, because -o invariably returns all of a line's matches; see the perl command below for a solution.
Inspired by Sobrique's comment on choroba's answer, here is the perl equivalent of the above grep command:
$ perl -lne 'print for m/\b(\d+) guys\b/g' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
Simply omit the g to only match at most 1 number per line.

Since your number is preceded by a blank, you can make it a part of the regex:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.* ([0-9]+) guys.*/\1/g'
# => 365
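The blank-prefixed pattern above no longer matches when the number is the first thing on the line, as in the "365 guys" case from the question. One possible workaround, sketched here with an illustrative function name and assuming a sed that accepts -E, is to make the greedy prefix optional and force it to end in a non-digit:

```shell
# Hypothetical helper: print the number that directly precedes " guys".
guys_count() {
  # (.*[^0-9])? is greedy but must end in a non-digit, so it cannot
  # swallow the leading digits of the number captured by group 2.
  printf '%s\n' "$1" | sed -E 's/(.*[^0-9])?([0-9]+) guys.*/\2/'
}

guys_count "A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
# prints: 365
guys_count "365 guys"
# prints: 365
```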

In Bash:
$ s="A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
$ [[ $s =~ ([0-9]+)\ +guys.*$ ]] && echo ${BASH_REMATCH[1]}
365
Or, with awk:
$ echo "$s" | awk '/guys/{for (i=1;i<=NF;i++) if ($i=="guys" && $(i-1)+0==$(i-1)) print $(i-1)}'
365

With standard sed regex you can benefit from the greedy match if you reverse both the string and the pattern:
echo ... | rev | sed -E 's/.*syug ([0-9]+).*/\1/g' | rev
Obviously this is a hack, but desperate times...

@Andrew Cassidy, try:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" |
awk '/guys/{print VAL;exit} {VAL=$0}' RS=" "

This might work for you (GNU sed):
sed -r 's/.*\b([0-9]+) guys.*/\1/' file
or perhaps:
sed -r 's/.*\<([0-9]+) guys.*/\1/' file
Make the numeric part of the pattern match a word boundary.

How to use tab separators with grep in ash or dash script?

Task at hand:
I have a file with four tab separated values:
peter 123 five apples
jane 1234 four rubberducks
jimmy 01234 seven nicknames
I need to get a line out of this file based on second column, and the value is in a variable. Let's assume I have number 123 stored in a variable foo. In bash I can do
grep $'\s'$foo$'\s'
and I get peter's info and nothing else. Is there a way to achieve the same in dash or ash?

You can use awk here:
var='1234'
awk -v var="$var" '$2 == var ""' f
jane 1234 four rubberducks
PS: I am doing var "" to make sure var is treated as a string instead of as a number.

If your file is small enough that the inefficiency of doing iteration in a shell doesn't matter, you don't actually need grep for this at all. The following is valid in any POSIX-compliant shell, including ash or dash:
var=123
while read -r first second rest; do
  if [ "$second" = "$var" ]; then
    printf '%s\t' "$first" "$second"; printf '%s\n' "$rest"
  fi
done
(In practice, I'd probably use awk here; consider the demonstration just that).
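If grep itself is preferred, a literal tab can be produced portably with printf and stored in a variable, since $'...' quoting is not available in dash or ash. A minimal sketch follows; the function name match_second_column is illustrative, the sample rows are assumed to be tab-separated as in the question, and the value is treated as a regular expression, so it should not contain regex metacharacters:

```shell
# Command substitution strips only trailing newlines, so the tab survives.
tab=$(printf '\t')
foo=123

# Hypothetical helper: print lines from stdin whose second tab-separated
# column is exactly the value passed as $1.
match_second_column() {
  # ^[^<tab>]*<tab> skips the first column; the closing tab ends the second.
  grep "^[^${tab}]*${tab}$1${tab}"
}

printf 'peter\t123\tfive\tapples\njane\t1234\tfour\trubberducks\njimmy\t01234\tseven\tnicknames\n' |
  match_second_column "$foo"
# prints the peter line only
```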

How can I strip first X characters from string using sed?

I am writing a shell script for embedded Linux in a small industrial box. I have a variable containing the text pid: 1234 and I want to strip the first X characters from the line, so only 1234 stays. I have more variables I need to "clean", so I need to cut away the first X characters, and ${string:5} doesn't work for some reason on my system.
The only thing the box seems to have is sed.
I am trying to make the following work:
result=$(echo "$pid" | sed 's/^.\{4\}//g')
Any ideas?

The following should work:
var="pid: 1234"
var=${var:5}
Are you sure bash is the shell executing your script?
Even the POSIX-compliant
var=${var#?????}
would be preferable to using an external process, although this requires you to hard-code the 5 in the form of a fixed-length pattern.
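The hard-coded pattern can be avoided by removing one character at a time in a loop. This is only a sketch under the assumption that a POSIX shell is all that is available; strip_first is a made-up name:

```shell
# Hypothetical helper: drop the first N characters of a string.
# $1 = string, $2 = number of leading characters to drop
strip_first() {
  s=$1
  n=$2
  while [ "$n" -gt 0 ] && [ -n "$s" ]; do
    s=${s#?}          # remove exactly one leading character
    n=$((n - 1))
  done
  printf '%s\n' "$s"
}

strip_first "pid: 1234" 5
# prints: 1234
```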

Here's a concise method to cut the first X characters using cut(1). This example removes the first 4 characters by cutting a substring starting with 5th character.
echo "$pid" | cut -c 5-

Use the -r option ("use extended regular expressions in the script") to sed in order to use the {n} syntax:
$ echo 'pid: 1234'| sed -r 's/^.{5}//'
1234
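The \{n\} interval form from the question is already valid in basic regular expressions, so -r is not strictly required; the count simply has to be 5 rather than 4 so that the blank after the colon is removed too. A sketch with the count held in a variable (n is an illustrative name):

```shell
pid="pid: 1234"
n=5   # number of leading characters to drop ("pid: " is five characters)

# Double quotes let the shell expand $n; the backslashes reach sed unchanged.
result=$(printf '%s\n' "$pid" | sed "s/^.\{$n\}//")
printf '%s\n' "$result"
# prints: 1234
```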

Cut first two characters from string:
$ string="1234567890"; echo "${string:2}"
34567890

pipe it through awk '{print substr($0,42)}' where 42 is one more than the number of characters to drop. For example:
$ echo abcde| awk '{print substr($0,2)}'
bcde
$

Chances are, you'll have cut as well. If so:
[me@home]$ echo "pid: 1234" | cut -d" " -f2
1234

Well, there have been solutions here with sed, awk, cut and Bash syntax. I just want to throw in another POSIX-conformant variant:
$ echo "pid: 1234" | tail -c +6
1234
-c tells tail at which byte offset to start. By default the offset counts from the end of the input data, but if the number starts with a + sign, it counts from the beginning, and the output runs from there to the end.

Another way, using cut instead of sed.
result=`echo $pid | cut -c 5-`

I found the answer in pure sed supplied by this question (admittedly, posted after this question was posted). This does exactly what you asked, solely in sed:
result=`echo "$pid" | sed '/./ { s/pid:\ //g; }'`
The dot in sed '/./' is whatever you want to match. Your question is exactly what I was attempting to do, except in my case I wanted to match a specific line in a file and then uncomment it. In my case it was:
# Uncomment a line (edit the file in-place):
sed -i '/#\ COMMENTED_LINE_TO_MATCH/ { s/#\ //g; }' /path/to/target/file
The -i after sed is to edit the file in place (remove this switch if you want to test your matching expression prior to editing the file).
(I posted this because I wanted to do this entirely with sed, as this question asked, and none of the previous answers solved that problem.)

Rather than removing n characters from the start, perhaps you could just extract the digits directly. Like so...
$ echo "pid: 1234" | grep -Po "\d+"
This may be a more robust solution, and seems more intuitive.

This will do the job too:
echo "$pid"|awk '{print $2}'

Bash - Extract numbers from String

I got a string which looks like this:
"abcderwer 123123 10,200 asdfasdf iopjjop"
Now I want to extract numbers, following the scheme xx,xxx where x is a number between 0-9. E.g. 10,200. Has to be five digit, and has to contain ",".
How can I do that?
Thank you

You can use grep:
$ echo "abcderwer 123123 10,200 asdfasdf iopjjop" | egrep -o '[0-9]{2},[0-9]{3}'
10,200

In pure Bash:
pattern='([[:digit:]]{2},[[:digit:]]{3})'
[[ $string =~ $pattern ]]
echo "${BASH_REMATCH[1]}"

Simple pattern matching (glob patterns) is built into the shell. Assuming you have the strings in $* (that is, they are command-line arguments to your script, or you have used set on a string you have obtained otherwise), try this:
for token; do
  case $token in
    [0-9][0-9],[0-9][0-9][0-9] ) echo "$token" ;;
  esac
done
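As a usage sketch for the loop above, the string can be split into positional parameters with set before iterating. The function name extract_numbers is illustrative, and set -f is used on the assumption that the input should not undergo filename expansion:

```shell
# Hypothetical helper: print every whitespace-separated token of the form dd,ddd.
extract_numbers() {
  set -f          # keep characters such as * from being expanded as globs
  set -- $1       # split the string on whitespace into $1, $2, ...
  set +f
  for token in "$@"; do
    case $token in
      [0-9][0-9],[0-9][0-9][0-9]) echo "$token" ;;
    esac
  done
}

extract_numbers "abcderwer 123123 10,200 asdfasdf iopjjop"
# prints: 10,200
```

Because the glob pattern must match the whole token, longer numbers such as 110,2000 are rejected rather than partially matched.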

Check out pattern matching and regular expressions.
Links:
Bash regular expressions
Patterns and pattern matching
SO question
and as mentioned above, one way to utilize pattern matching is with grep.
Other uses: echo supports patterns (globbing) and find supports regular expressions.

A slightly non-typical solution:
< input tr -cd [0-9,\ ] | tr \ '\012' | grep '^..,...$'
(The first tr removes everything except commas, spaces, and digits. The
second tr replaces spaces with newlines, putting each "number" on a separate
line, and the grep discards everything except those that match your criterion.)

The following example using your input data string should solve the problem using sed.
$ echo abcderwer 123123 10,200 asdfasdf iopjjop | sed -ne 's/^.*\([0-9,]\{6\}\).*$/\1/p'
10,200

String Manipulation in Bash

I am a newbie in Bash and I am doing some string manipulation.
I have the following file among other files in my directory:
jdk-6u20-solaris-i586.sh
I am doing the following to get jdk-6u20 in my script:
myvar=`ls -la | awk '{print $9}' | egrep "i586" | cut -c1-8`
echo $myvar
but now I want to convert jdk-6u20 to jdk1.6.0_20. I can't seem to figure out how to do it.
It must be as generic as possible. For example, if I had jdk-6u25, I should be able to convert it in the same way to jdk1.6.0_25, and so on.
Any suggestions?

Depending on exactly how generic you want it, and how standard your inputs will be, you can probably use AWK to do everything. By using FS="regexp" to specify field separators, you can break down the original string by whatever tokens make the most sense, and put them back together in whatever order using printf.
For example, assuming both dashes and the letter 'u' are only used to separate fields:
myvar="jdk-6u20-solaris-i586.sh"
echo $myvar | awk 'BEGIN {FS="[-u]"}; {printf "%s1.%s.0_%s",$1,$2,$3}'
Flavour according to taste.

Using only Bash:
for file in jdk*i586*
do
  file="${file%*-solaris*}"
  file="${file/-/1.}"
  file="${file/u/.0_}"
  do_something_with "$file"
done
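The ${file/-/1.} style substitutions above are a Bash extension. Where only a POSIX shell is available, roughly the same conversion can be sketched with prefix and suffix removal alone; the function name to_version is made up for illustration, and the input is assumed to follow the jdk-<major>u<update>-solaris... pattern from the question:

```shell
# Hypothetical helper: turn jdk-6u20-solaris-i586.sh into jdk1.6.0_20.
to_version() {
  name=${1%%-solaris*}     # jdk-6u20  (drop everything from "-solaris" on)
  rest=${name#jdk-}        # 6u20      (drop the "jdk-" prefix)
  update=${rest#*u}        # 20        (drop up to and including the "u")
  major=${rest%%u*}        # 6         (drop the "u" and what follows)
  printf 'jdk1.%s.0_%s\n' "$major" "$update"
}

to_version jdk-6u20-solaris-i586.sh
# prints: jdk1.6.0_20
```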

I think that sed is the command for you.

You can try this snippet:
for fname in *; do
  newname=`echo "$fname" | sed 's,^jdk-\([0-9]\)u\([0-9][0-9]*\)-.*$,jdk1.\1.0_\2,'`
  if [ "$fname" != "$newname" ]; then
    echo "old $fname, new $newname"
  fi
done

awk '{if(match($9,"i586")){gsub("jdk-6u20","jdk1.6.0_20");print $9}}'
The if(match()) supersedes the egrep bit if you want to use it. You could use substr($9,1,8) instead of cut as well.

garph0 has a good idea with sed; you could do
myvar=`ls jdk*i586.sh | sed 's/jdk-\([0-9]\)u\([0-9]\+\).\+$/jdk1.\1.0_\2/'`

Needing awk in there is an artifact of the -l switch on ls. For pattern substitution on lines of text, sed is the long-time champion:
ls | sed -n '/^jdk/s/jdk-\([0-9][0-9]*\)u\([0-9][0-9]*\)$/jdk1.\1.0_\2/p'
This was written in "old-school" sed which should have greater portability across platforms. The expression says:
don't print lines unless they match (-n)
on lines beginning with 'jdk' do:
on a line that contains only "jdk-IntegerAuIntegerB"
change it to "jdk1.IntegerA.0_IntegerB"
and print it
Your sample becomes even simpler as:
myvar=`echo *solaris-i586.sh | sed 's/-solaris-i586\.sh//'`
