Linux bash parsing URL

How can I parse a URL, for example https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe,
so that only virtualbox.org/virtualbox/6.1.36 remains?
TEST_URLS=(
https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe
https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2
)
for url in "${TEST_URLS[@]}"; do
  without_proto="${url#*://}"
  without_auth="${without_proto##*@}"
  [[ $without_auth =~ ^([^:\/]+)(:[[:digit:]]+\/|:|\/)?(.*) ]]
  PROJECT_HOST="${BASH_REMATCH[1]}"
  PROJECT_PATH="${BASH_REMATCH[3]}"
  echo "given: $url"
  echo " -> host: $PROJECT_HOST path: $PROJECT_PATH"
done

Using sed to match whether a subdomain is present (no matter how deep) or not.
$ sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "${TEST_URLS[0]}"
virtualbox.org/virtualbox/6.1.36
Or in a loop
for url in "${TEST_URLS[@]}"; do
  sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "$url"
done
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
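For comparison, the same split can be sketched in plain bash parameter expansion, with no external tools. This is only an illustration under the question's assumptions (a scheme:// prefix, a path ending in a file name, and a two-label registered domain such as virtualbox.org); `parse_url` is a made-up helper name:

```shell
# sketch: pure-bash variant using parameter expansion
parse_url() {
  local no_scheme host path labels n
  no_scheme=${1#*://}              # drop "https://"
  host=${no_scheme%%/*}            # everything up to the first slash
  path=${no_scheme#*/}             # everything after the first slash
  path=${path%/*}                  # drop the trailing file name
  IFS=. read -ra labels <<< "$host"
  n=${#labels[@]}
  # keep only the last two host labels (drops any subdomain prefix)
  printf '%s/%s\n' "${labels[n-2]}.${labels[n-1]}" "$path"
}
parse_url https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
```

Note this keeps exactly the last two host labels, so it drops however many subdomain labels precede them, but it would also mangle multi-part TLDs like .co.uk.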

With the samples you have shown, here is an awk solution, written and tested in GNU awk.
awk '
match($0,/https?:\/\/([^/]*)(\/.*)\//,arr){
  firstVal=""
  num=split(arr[1],arr1,".")
  if(num>2){
    for(i=2;i<=num;i++){
      firstVal=(firstVal?firstVal ".":"") arr1[i]
    }
  }
  else{
    firstVal=arr[1]
  }
  print firstVal arr[2]
}
' Input_file
Explanation: this uses GNU awk's match() function, whose three-argument form stores the capturing groups in an array (arr here). The regex https?:\/\/([^/]*)(\/.*)\/ (which could equally be anchored as ^https?:\/\/([^/]*)(\/.*)\/) creates two capturing groups: the host in arr[1] and the path up to the last / in arr[2]. The host is then split on dots; if it has more than two labels, only the last two are kept (dropping the subdomain), otherwise it is used as-is. Finally the trimmed host and the path are printed together.

I thought about regex, but cut makes this work easy.
url=https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
echo "$url" | grep -Po '(?<=\/\/).*(?=\/)' | cut -d '.' -f 2-
Result
virtualbox.org/virtualbox/6.1.36

So, if I am correct in assuming that you need to extract a string of the form...
hostname.tld/dirname
...where tld is the top-level domain and dirname is the path to the file, then the task is to filter out any URL scheme and subdomains at the beginning, and any file basename at the end.
All solutions make assumptions. This one assumes one of the original three-letter top-level domains, i.e. .com, .org, .net, .int, .edu, .gov, .mil.
This possible solution uses sed with the -r option for the regular expressions extension.
It creates two filters and uses them to chop off the ends that you don't want (hopefully).
It also uses a capture group in filter_end, so as to keep the / in the filter.
test_urls=(
'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)
for url in "${test_urls[@]}"
do
filter_start=$(
echo "$url" | \
sed -r 's/([^.\/][a-z]+\.[a-z]{2,})\/.*//' )
filter_end=$(
echo "$url" | \
sed 's/.*\(\/\)/\1/g' )
out_string="${url#$filter_start}"
out_string="${out_string%$filter_end}"
echo "$out_string"
done
Output:
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1


Extract a word and `n` characters after it from a line?

I am trying to extract the JIRA Ticket number from a string.
The Jira ticket might be mentioned any where in the line like:
Merge pull request #1387 from Config-change/REL-12345
REL-12345: Enable XAPI at config level
I just want REL-12345 as the output.
grep -Eow 'REL-[0-9]+'
+ is one or more; to specify N numbers (e.g. 5):
grep -Eow 'REL-[0-9]{5}'
Ranges: {3,6} is 3 to 6, {5,} is 5 or more, etc.
On GNU/Linux: man grep -> /Repetition for more details.
-o prints only matching strings
-w matches full words only, ie. to avoid matching WREL-12345 (for example)
grep -Eow 'REL-[[:alnum:]]+' for both letters and numbers (after REL-).
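As a quick sanity check, the flags above can be exercised on both sample lines from the question:

```shell
# feed both sample lines to grep; -o prints just the match, -w guards word boundaries
matches=$(printf '%s\n' \
  'Merge pull request #1387 from Config-change/REL-12345' \
  'REL-12345: Enable XAPI at config level' |
  grep -Eow 'REL-[0-9]+')
echo "$matches"
# -> REL-12345 printed once per line
```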
If this is the standard.....
Input: Merge pull request #1387 from Config-change/REL-12345
echo "Merge pull request #1387 from Config-change/REL-12345" | cut -d/ -f2
Input: REL-12345: Enable XAPI at config level
echo "REL-12345: Enable XAPI at config level" | cut -d: -f1
You can pass a String to sed and use substitution with REGEX, like this:
myString="This is REL-12345 a test string "
sed -n 's/.*\(REL-[0-9]*\).*/\1/p' <<< "$myString"
this should return: REL-12345
Sample data:
$ cat jira.dat
Merge pull request #1387 from Config-change/REL-12345
REL-12346: Enable XAPI at config level
One idea using bash regex matching and the resulting BASH_REMATCH[]:
regex='(REL-[[:digit:]]+)'
while read -r line
do
printf '\n########## %s\n' "${line}"
[[ "${line}" =~ ${regex} ]] && echo "${BASH_REMATCH[1]}"
done < jira.dat
This generates:
REL-12345
REL-12346
Sample data:
$ cat jira.dat
Merge pull request #1387 from Config-change/REL-12345
REL-12346: Enable XAPI at config level
One idea using grep:
$ grep -Eo 'REL-[[:digit:]]+' jira.dat
REL-12345
REL-12346

shell script compare file with multiple line pattern

I have a file which is created after some manual configuration.
I need to check this file automatically with a shell script.
The file looks like this:
eth0;eth0;1c:98:ec:2a:1a:4c
eth1;eth1;1c:98:ec:2a:1a:4d
eth2;eth2;1c:98:ec:2a:1a:4e
eth3;eth3;1c:98:ec:2a:1a:4f
eth4;eth4;48:df:37:58:da:44
eth5;eth5;48:df:37:58:da:45
eth6;eth6;48:df:37:58:da:46
eth7;eth7;48:df:37:58:da:47
I want to compare it to a pattern like this:
eth0;eth0;*
eth1;eth1;*
eth2;eth2;*
eth3;eth3;*
eth4;eth4;*
eth5;eth5;*
eth6;eth6;*
eth7;eth7;*
If I would only have to check this pattern I could run this loop:
c=0
while [ $c -le 7 ]
do
if [ "$(grep "eth"${c}";eth"${c}";*" current_mapping)" ];
then
echo "eth$c ok"
fi
(( c++ ))
done
There are 6 or more different patterns possible. A pattern could also look like this for example (depending and specific configuration requests):
eth4;eth0;*
eth5;eth1;*
eth6;eth2;*
eth7;eth3;*
eth0;eth4;*
eth1;eth5;*
eth2;eth6;*
eth3;eth7;*
So I don't think I can run a standard grep per line command in a loop. The eth numbers are not consistently the same.
Is it possible somehow to compare the whole file to pattern like it would be possible with grep for a single line?
Assuming file is your data file and patt is the file that contains the above pattern, you can use grep -f in conjunction with sed in a process substitution that replaces * with .* and ? with . to turn each pattern line into a workable regex.
grep -f <(sed 's/\*/.*/g; s/?/./g' patt) file
eth0;eth0;1c:98:ec:2a:1a:4c
eth1;eth1;1c:98:ec:2a:1a:4d
eth2;eth2;1c:98:ec:2a:1a:4e
eth3;eth3;1c:98:ec:2a:1a:4f
eth4;eth4;48:df:37:58:da:44
eth5;eth5;48:df:37:58:da:45
eth6;eth6;48:df:37:58:da:46
eth7;eth7;48:df:37:58:da:47
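Building on that, one way to turn the match into a pass/fail check is to compare the number of matching lines against the number of pattern lines. This is just a sketch, with throwaway temp files standing in for current_mapping and the pattern file:

```shell
# sketch: declare the mapping OK only when the match count equals the pattern count
file=$(mktemp) patt=$(mktemp)
printf '%s\n' 'eth0;eth0;1c:98:ec:2a:1a:4c' 'eth1;eth1;1c:98:ec:2a:1a:4d' > "$file"
printf '%s\n' 'eth0;eth0;*' 'eth1;eth1;*' > "$patt"
if [ "$(grep -cf <(sed 's/\*/.*/g; s/?/./g' "$patt") "$file")" -eq "$(wc -l < "$patt")" ]; then
  result="mapping ok"
else
  result="somethings wrong"
fi
echo "$result"
rm -f "$file" "$patt"
```

Since grep -c counts matching data lines rather than satisfied patterns, this check assumes the data file and pattern file have the same number of lines.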
I wrote this loop now and it does the job (current_mapping being the file with the content from the first code block of the question). I would have to create arrays with different patterns and use a case for every pattern. I was just wondering if there is something like grep for multiple lines that could do the same without writing this loop.
array=("eth0;eth0;*" "eth1;eth1;*" "eth2;eth2;*" "eth3;eth3;*" "eth4;eth4;*" "eth5;eth5;*" "eth6;eth6;*" "eth7;eth7;*")
c=1
while [ $c -le 8 ]
do
if [ ! "$(sed -n "${c}"p current_mapping | grep "${array[$c-1]}")" ];
then
echo "somethings wrong"
fi
(( c++ ))
done
Try any:
grep -P '(eth[0-9]);\1'
grep -E '(eth[0-9]);\1'
sed -n '/\(eth[0-9]\);\1/p'
awk -F';' '$1 == $2'
These are commands only; apply them to a pipe or a file.
Updated the answer after the question was edited.
As we can see the task requirements are as follows:
a file (a set of lines) formatted like ethN;ethM;MAC
examine each line for equality ethN and ethM
if they are equal, output a string ethN ok
If I understand the task correctly we can achieve this using the following code without loops:
awk -F';' '$1 == $2 { print $1, "ok" }'
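For instance, fed a small here-document in which one line has mismatched interface names, the one-liner reports only the matching rows:

```shell
# run the awk one-liner above on inline sample data
out=$(awk -F';' '$1 == $2 { print $1, "ok" }' <<'EOF'
eth0;eth0;1c:98:ec:2a:1a:4c
eth4;eth0;48:df:37:58:da:44
eth1;eth1;1c:98:ec:2a:1a:4d
EOF
)
echo "$out"
# -> eth0 ok
# -> eth1 ok
```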

Extracting a part of lines matching a pattern

I have a configuration file and need to parse out some values using bash
Ex. Inside config.txt
some_var= Not_needed
tests= spec1.rb spec2.rb spec3.rb
some_other_var= Also_not_needed
Basically I just need to get "spec1.rb spec2.rb spec3.rb" WITHOUT all the other lines and "tests=" removed from the line.
I have this and it works, but I'm hoping there's a much more simple way to do this.
while read run_line; do
if [[ $run_line =~ ^tests=* ]]; then
echo "FOUND"
all_selected_specs=`echo ${run_line} | sed 's/^tests= /''/'`
fi
done <${config_file}
echo "${all_selected_specs}"
all_selected_specs=$(awk -F '= ' '$1=="tests" {print $2}' "$config_file")
Using a field separator of "= ", look for lines where the first field is tests and print the second field.
This should work too
grep "^tests" ${config_file} | sed -e "s/^tests= //"
How about grep and cut?
all_selected_specs=$(grep "^tests=" "$config_file" | cut -d= -f2-)
try:
all_selected_specs=$(awk '/^tests/{sub(/.*= /,"");print}' Input_file)
This searches for the string tests at the start of a line, then substitutes away everything up to and including "= ", leaving only the spec values, and prints that line. Finally the value is saved to the variable with $(awk ...).
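The awk and grep/sed answers above can be compared side by side on the sample config (written to a temp file here for the demo):

```shell
config=$(mktemp)
printf '%s\n' 'some_var= Not_needed' \
              'tests= spec1.rb spec2.rb spec3.rb' \
              'some_other_var= Also_not_needed' > "$config"
via_awk=$(awk -F '= ' '$1=="tests" {print $2}' "$config")
via_sed=$(grep '^tests' "$config" | sed -e 's/^tests= //')
echo "$via_awk"   # both variables hold: spec1.rb spec2.rb spec3.rb
rm -f "$config"
```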

How do I get all the text between the last two instances of a token in bash?

I’m using bash and running the following command to get all the file text between two tokens (including the tokens themselves):
cat /usr/java/jboss/standalone/log/server.log | sed -n \
'/Starting deployment of "myproject.war"/,/Registering web context: \/myproject/p'
However, sometimes the tokens occur multiple times in the file. How do I adjust the above so that only the text between the last two occurrences of the tokens (including the tokens themselves) will be returned?
How about some tic-tac-toe.
tac /usr/java/jboss/standalone/log/server.log |
awk '/Registering web context: \/myproject/{p=1;++cnt}/Starting deployment of "myproject.war"/{if(cnt==2){print $0;exit};print $0;p=0}p' |
tac
This solution is not efficient, but easier to understand:
file='/usr/java/jboss/standalone/log/server.log'
s1='Starting deployment of "myproject.war"'
s2='Registering web context: \/myproject'
sed -n '/'"$s1"'/,/'"$s2"'/p' "$file" |
tac |
awk '/'"$s1"'/ {print;exit} 1' |
tac
Lets sed report ALL ranges first.
Reverses the result using tac (on OSX, use tail -r).
Using awk, outputs everything up to and including the first occurrence of the first substring, which - in the reversed result - spans the end of the last range to the start of the last range.
Reverses the output from awk to render the last range in correct order.
Note: For consistency with the variable use in the sed command I've spliced a variable reference directly into the awk program, too, which is otherwise poor practice (use -v to pass variables instead).
You can do this in native bash -- no need for awk, tac, or any other external tool.
token1='Starting deployment of "myproject.war"'
token2='Registering web context: /myproject/'
writing=0
while read -r; do
(( ! writing )) && [[ $REPLY = $token1 ]] && {
# start collecting content, into an empty buffer, when we see token1
writing=1 # set flag to store lines we see
collected_content=() # clear the array of lines found so far
}
(( writing )) && {
# when the flag is set, collect content into an array
collected_content+=( "$REPLY" )
}
[[ $REPLY = $token2 ]] && {
# stop collecting content when we see token2
writing=0
}
done <server.log # redirect from the log into the loop
# print all collected lines
printf '%s\n' "${collected_content[@]}"
This awk can work:
awk '/Starting deployment of "myproject.war"/{i=0; s=1; delete a;}
s{a[++i]=$0}
/Registering web context: \/myproject/{s=0}
END {for (k=1; k<=i; k++) print a[k]}' file
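A quick check of the buffering awk above against a toy log containing two ranges (file name and line contents are made up for the demo); since each start token resets the buffer, only the last range survives to END:

```shell
log=$(mktemp)
printf '%s\n' \
  'Starting deployment of "myproject.war"' 'first try' \
  'Registering web context: /myproject' 'noise' \
  'Starting deployment of "myproject.war"' 'second try' \
  'Registering web context: /myproject' > "$log"
last_range=$(awk '/Starting deployment of "myproject.war"/{i=0; s=1; delete a;}
  s{a[++i]=$0}
  /Registering web context: \/myproject/{s=0}
  END {for (k=1; k<=i; k++) print a[k]}' "$log")
echo "$last_range"
rm -f "$log"
```

The output is the last three lines of the log: the start token, 'second try', and the end token.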
With perl:
perl -0xFF -nE '@x = /WWWW Starting deployment of "myproject.war"(.*?)Registering web context: \/myproject/sg; say $x[-1] ' file

Bash command to extract characters in a string

I want to write a small script to generate the location of a file in an NGINX cache directory.
The format of the path is:
/path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
Note that the last 6 characters, d8 40 32, are represented in the path.
As an input I give the md5 hash (13febd65d65112badd0aa90a15d84032) and I want to generate the output: d8/40/32/13febd65d65112badd0aa90a15d84032
I'm sure sed or awk will be handy, but I don't know yet how...
This awk can make it:
awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}'
Explanation
BEGIN{FS=""; OFS="/"}. FS="" sets the input field separator to be "", so that every char will be a different field. OFS="/" sets the output field separator as /, for print matters.
print ... $(NF-1)$NF, $0 prints the penultimate field and the last one all together; then, the whole string. The comma is "filled" with the OFS, which is /.
Test
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' <<< "13febd65d65112badd0aa90a15d84032"
d8/40/32/13febd65d65112badd0aa90a15d84032
Or with a file:
$ cat a
13febd65d65112badd0aa90a15d84032
13febd65d65112badd0aa90a15f1f2f3
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' a
d8/40/32/13febd65d65112badd0aa90a15d84032
f1/f2/f3/13febd65d65112badd0aa90a15f1f2f3
With sed:
echo '13febd65d65112badd0aa90a15d84032' | \
sed -n 's/\(.*\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\)$/\2\/\3\/\4\/\1/p;'
Having GNU sed you can even simplify the pattern using the -r option. Now you won't need to escape {} and () any more. Using ~ as the regex delimiter allows to use the path separator / without need to escape it:
sed -nr 's~(.*([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2}))$~\2/\3/\4/\1~p;'
Output:
d8/40/32/13febd65d65112badd0aa90a15d84032
Explained simply, the pattern matches:
(all (n-5 n-4) (n-3 n-2) (n-1 n))
and the replacement reorders the groups as:
\2/\3/\4/\1
You can use a regular expression to separate each of the last 3 bytes from the rest of the hash.
hash=13febd65d65112badd0aa90a15d84032
[[ $hash =~ (..)(..)(..)$ ]]
new_path="/path/to/nginx/cache/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$hash"
Base="/path/to/nginx/cache/"
echo '13febd65d65112badd0aa90a15d84032' | \
sed "s|\(.*\(..\)\(..\)\(..\)\)|${Base}\2/\3/\4/\1|"
# or
# sed "s|.*\(..\)\(..\)\(..\)$|${Base}\1/\2/\3/&|"
This assumes the input is a valid MD5 string (and nothing else).
First of all - thanks to all of the responders - this was extremely quick!
I also did my own scripting in the meantime, and came up with this solution:
Run this script with a parameter of the URL you're looking for (www.example.com/article/76232?q=hello for example)
#!/bin/bash
path=$1
md5=$(echo -n "$path" | md5sum | cut -f1 -d' ')
p3=${md5:0-2:2}
p2=${md5:0-4:2}
p1=${md5:0-6:2}
echo "/path/to/nginx/cache/$p1/$p2/$p3/$md5"
This assumes the NGINX cache has a key structure of 2:2:2.
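The three slicing steps can also be collapsed into a single substring expansion (the hash is hard-coded here for illustration):

```shell
hash=13febd65d65112badd0aa90a15d84032
# ${hash: -6:2} etc. take 2-character slices counting from the end
# (the space before the minus is required to disambiguate from ${var:-default})
cache_path="/path/to/nginx/cache/${hash: -6:2}/${hash: -4:2}/${hash: -2}/$hash"
echo "$cache_path"
# -> /path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
```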
