Extract substring from lines using grep, awk, sed, etc. - bash

I have a file with many lines, each an HTML link that renders like this:
lily weisy
I want to extract www.youtube.com/user/airuike and lily weisy, and then I also want to separate airuike from www.youtube.com/user/.
So I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy.
How can I achieve this? Thanks.

Do this:
sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data
This gives you the first part (the link and the name). I'm not sure what you want to do with the output after that.
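As a rough sketch of how all three strings could come out of one sed call plus shell parameter expansion: the sample line below is hypothetical (the real attributes are not known), and the | separator assumes that neither the link nor the name contains that character.

```shell
# Hypothetical sample line; adjust it to the real markup.
line='<a href="http://www.youtube.com/user/airuike" class="yt-user-name" dir="ltr">lily weisy</a>'

# Capture the href value without its scheme, and the link text, joined by "|".
parsed=$(printf '%s\n' "$line" |
  sed -e 's/.*href="[a-z]*:\/\/\([^"]*\)".*>\([^<]*\)<.*/\1|\2/')

link=${parsed%%'|'*}   # text before the first "|"
name=${parsed#*'|'}    # text after the first "|"
user=${link##*/}       # last path component of the link

printf '%s\n' "$link" "$user" "$name"
```

The sed part only isolates the two captured groups; the user name is then cut from the link in the shell itself, so no second sed call is needed.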

Since it is HTML, and HTML should be parsed with an HTML parser rather than with grep/sed/awk, you could use the pattern matching feature of my Xidel.
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'
Or if you want a CSV-like result:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names
It is kind of sad, that you also want to have the airuike string, otherwise it could be as simple as
xidel /yourfile.html -e '{$name}*'
(and you were supposed to be able to use xidel '{$name}*', but it seems I haven't thought the syntax through. Just one error check and it is breaking everything. )

$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy

I think something like this should work:
while IFS= read -r line
do
    href=$(echo "$line" | grep -o 'http[^"]*')
    user=$(echo "$href" | grep -o '[^/]*$')
    text=$(echo "$line" | grep -o '[^>]*<\/a>$' | grep -o '^[^<]*')
    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile
Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions
Upd: checked and fixed
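If starting three grep processes per line is too slow, the same splitting can be sketched with shell parameter expansion alone. This is only an illustration under the assumption that each line holds exactly one link; the sample line is hypothetical.

```shell
# Hypothetical sample line with a single <a> element.
line='<a href="http://www.youtube.com/user/airuike" dir="ltr">lily weisy</a>'

href=${line#*href=\"}    # drop everything up to and including href="
href=${href%%\"*}        # drop everything from the closing quote onwards
href=${href#*://}        # drop the scheme (http://)
user=${href##*/}         # keep only the last path component

text=${line%'</a>'*}     # drop the closing tag and anything after it
text=${text##*'>'}       # keep only what follows the last remaining ">"

echo "href: $href"
echo "user: $user"
echo "text: $text"
```

To process a whole file, the assignments would go inside a while IFS= read -r line loop reading from it.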

Related

Linux bash parsing URL

How to parse the url, for example: https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
So that only virtualbox.org/virtualbox/6.1.36 remains?
TEST_URLS=(
    https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
    https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe
    https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2
)
for url in "${TEST_URLS[@]}"; do
    without_proto="${url#*:\/\/}"
    without_auth="${without_proto##*@}"
    [[ $without_auth =~ ^([^:\/]+)(:[[:digit:]]+\/|:|\/)?(.*) ]]
    PROJECT_HOST="${BASH_REMATCH[1]}"
    PROJECT_PATH="${BASH_REMATCH[3]}"
    echo "given: $url"
    echo " -> host: $PROJECT_HOST path: $PROJECT_PATH"
done

Using sed to match whether a sub domain is present (no matter how deep) or not.
$ sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "${TEST_URLS[0]}"
virtualbox.org/virtualbox/6.1.36
Or in a loop
for url in "${TEST_URLS[@]}"; do
    sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "$url"
done
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1

With your shown samples, here is an awk solution. Written for GNU awk (it relies on the 3-argument form of match).
awk '
match($0,/https?:\/\/([^/]*)(\/.*)\//,arr){
  firstVal=""
  num=split(arr[1],arr1,".")
  if(num>2){
    ##Host has a subdomain: keep only its last 2 labels.
    firstVal=arr1[num-1] "." arr1[num]
  }
  else{
    firstVal=arr[1]
  }
  print firstVal arr[2]
}
' Input_file
Explanation: This uses GNU awk's match function, which can store capturing groups in an array. The regex https?:\/\/([^/]*)(\/.*)\/ (it could also be anchored as ^https?:\/\/([^/]*)(\/.*)\/) creates 2 capturing groups, stored in arr: the host name and the directory path. The host name is then split on dots. If it has more than 2 labels, only the last 2 are kept (the domain name); otherwise it is used as is. Finally the values are printed as required.

I thought about a regex, but cut makes this easy. (The last cut assumes exactly one subdomain label in front of the domain.)
url=https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
echo "$url" | grep -Po '.*(?=/)' | cut -d '/' -f 3- | cut -d '.' -f 2-
Result
virtualbox.org/virtualbox/6.1.36

So, if I am correct in assuming that you need to extract a string of the form...
hostname.tld/dirname
...where tld is the top-level domain and dirname is the path to the file.
That means filtering out any URL scheme and subdomains at the beginning, and also filtering out the file basename at the end?
Every solution makes assumptions. This one assumes one of the original three-letter top-level domains, i.e. .com, .org, .net, .int, .edu, .gov, .mil.
This possible solution uses sed with the -r option for extended regular expressions.
It creates two filters and uses them to chop off the ends that you don't want (hopefully).
It also uses a capture group in filter_end, so as to keep the / in the filter.
test_urls=(
'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)
for url in "${test_urls[@]}"
do
    filter_start=$(
        echo "$url" | \
        sed -r 's/([^.\/][a-z]+\.[a-z]{2,})\/.*//' )
    filter_end=$(
        echo "$url" | \
        sed 's/.*\(\/\)/\1/g' )
    out_string="${url#$filter_start}"
    out_string="${out_string%$filter_end}"
    echo "$out_string"
done
Output:
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
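For comparison, here is a sketch that does the same trimming with parameter expansion only, so no sed or grep is started. The function name trim_url is made up for this example, and it assumes the registrable domain is always the last two labels of the host (which is wrong for hosts such as example.co.uk).

```shell
# trim_url is a hypothetical helper: it prints domain.tld/dir for a URL.
# Its variables are global because plain POSIX sh has no "local".
trim_url() {
  rest=${1#*://}        # drop the scheme
  rest=${rest%/*}       # drop the file name after the last slash
  host=${rest%%/*}      # everything before the first slash
  path=${rest#"$host"}  # the directory part, including its leading slash
  tld=${host##*.}       # last label of the host
  domain=${host%.*}     # host without the last label
  domain=${domain##*.}  # last label of what is left
  printf '%s.%s%s\n' "$domain" "$tld" "$path"
}

trim_url 'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
trim_url 'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
```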

grep string containing `":"` patterns

This is a piece of my log file on the server:
"order_items_subtotal":"60.5100","order_final_due_amount":"0.0000","items":[{"product_id"
I need to grep the lines which contain "order_final_due_amount":"0.0000" in my whole log file.
For this, I did the following:
tail -f pp_create_shipment2018-12-05.log | grep "order_final_due_amount":"0.0000"
But I got zero results. What is wrong with my command?

" is interpreted by the shell (it's used to quote e.g. spaces).
grep "order_final_due_amount":"0.0000"
is equivalent to
grep order_final_due_amount:0.0000
To pass " to grep, you need to quote it:
grep '"order_final_due_amount":"0\.0000"'
(Also, . is special in regexes and should be escaped.)
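To see the quote removal for yourself, here is a small sketch that prints what the shell actually passes on, and then greps a sample line both ways. It uses grep -F (fixed-string matching) as an alternative to escaping the dot; the variable names are only for illustration.

```shell
# What the command receives after the shell has removed the double quotes.
unquoted=$(printf '%s\n' "order_final_due_amount":"0.0000")
echo "$unquoted"     # order_final_due_amount:0.0000

log_line='"order_items_subtotal":"60.5100","order_final_due_amount":"0.0000","items":[{"product_id"'

# Pattern without literal quotes: grep -c reports 0 matching lines
# (|| true only hides the non-zero exit status of a failed match).
miss_count=$(printf '%s\n' "$log_line" | grep -c "order_final_due_amount":"0.0000" || true)

# Single quotes keep the double quotes; -F treats the pattern as a plain
# string, so the dot needs no escaping.
match=$(printf '%s\n' "$log_line" | grep -F '"order_final_due_amount":"0.0000"')

echo "$miss_count"
echo "$match"
```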

Using Perl, you only need to escape the "."; qr// takes care of the rest.
Check this out:
> cat product.log
order items
items1
"order_items_subtotal":"60.5100","order_final_due_amount":"0.0000","items":[{"product_id"
item2
"order_items_subtotal":"60.5100","order_final_due_amount":"000000","items":[{"product_id"
items3
"order_items_subtotal":"60.5100",order_final_due_amount:"0.0000","items":[{"product_id"
items4
> perl -ne ' $pat=qr/"order_final_due_amount":"0\.0000"/; print if /$pat/ ' product.log
"order_items_subtotal":"60.5100","order_final_due_amount":"0.0000","items":[{"product_id"
>
Thanks to melpomene, the below also works
> perl -ne ' print if /"order_final_due_amount":"0\.0000"/ ' product.log
"order_items_subtotal":"60.5100","order_final_due_amount":"0.0000","items":[{"product_id"
>

Extract values from a property file using bash

I have a variable which contains key/values separated by space:
echo $PROPERTY
server_geo=BOS db.jdbc_url=jdbc\:mysql\://mysql-test.com\:3306/db02 db.name=db02 db.hostname=/mysql-test.com datasource.class.xa=com.mysql.jdbc.jdbc2.optional.MysqlXADataSource server_uid=BOS_mysql57 hibernate33.dialect=org.hibernate.dialect.MySQL5InnoDBDialect hibernate.connection.username=db02 server_labels=mysql57,mysql5,mysql db.jdbc_class=com.mysql.jdbc.Driver db.schema=db02 hibernate.connection.driver_class=com.mysql.jdbc.Driver uuid=a19ua19 db.primary_label=mysql57 db.port=3306 server_label_primary=mysql57 hibernate.dialect=org.hibernate.dialect.MySQL5InnoDBDialect
I need to extract the values of single keys, for example db.jdbc_url.
Using a code snippet I found:
echo $PROPERTY | sed -e 's/ db.jdbc_url=\(\S*\).*/\1/g'
but that also returns the other properties that come before my key.
Any help on how to fix it?
Thanks

If db.name always follows db.jdbc_url, then use grep lookaround,
$ echo "${PROPERTY}" | grep -oP '(?<=db.jdbc_url=).*(?=db.name)'
jdbc\:mysql\://mysql-test.com\:3306/db02
or add the VAR to an array,
$ myarr=($(echo $PROPERTY))
$ echo "${myarr[1]}" | grep -oP '(?<=db.jdbc_url=).*(?=$)'
jdbc\:mysql\://mysql-test.com\:3306/db02

This happens because you are using the substitute command (sed s/.../.../), which leaves any text outside your regex match as is. Adding .* before db\.jdbc_url, together with the start (^) and end ($) anchors, makes the regex match the whole content of the variable, so only the captured value remains.
To be totally safe, your regex should be:
sed -e 's/^.*db\.jdbc_url=\(\S*\).*$/\1/g'
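A quick way to check this is to run it against a shortened copy of the variable. In the sketch below [^ ]* stands in for \S*, because \S is a GNU sed extension; the second command is an alternative that puts each key on its own line first, so the key can be anchored at the start of a line.

```shell
# Shortened sample of the variable from the question.
PROPERTY='server_geo=BOS db.jdbc_url=jdbc\:mysql\://mysql-test.com\:3306/db02 db.name=db02 db.port=3306'

# Whole-string match: everything but the captured value is replaced.
value=$(printf '%s\n' "$PROPERTY" |
  sed -e 's/^.*db\.jdbc_url=\([^ ]*\).*$/\1/')

# Alternative: one key=value pair per line, then print only the value of
# the line that starts with the wanted key.
port=$(printf '%s\n' "$PROPERTY" | tr ' ' '\n' | sed -n 's/^db\.port=//p')

echo "$value"
echo "$port"
```

The second form prints nothing when the key is missing, whereas the first one would print the whole unchanged string.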

You can use grep for this, like so:
echo $PROPERTY | grep -oE "db.jdbc_url=\S+" | cut -d'=' -f2
The regex is very close to the one you used with sed.
The -o option is used to print the matched parts of the matching line.
Edit: if you want only the value, cut on the '='.
Edit 2: egrep is reported as deprecated, so use grep -oE instead; the result is the same. Just to cover all bases :-)

Iterate over string without space as separator in bash-shell

I have been trying to parse some output of xmllint for hours, but I can't get it to work the way I need.
Output of xmllint --xpath "//fub/@name" menu.xml:
name="Kessel" name="Lager" name="Puffer " name="Boiler.sen"
name="Boiler.jun" name="HK Senior" name="HK Junior" name="Fbh"
name="Solar" name="F.Wärme" name="Sys "
Now I need to separate all the names (including their spaces) and get them into separate variables.
My approach was this:
fubNames=$(xmllint --xpath "//fub/@name" menu.xml | sed 's/name=//g')
for name in $fubNames
do
    echo $name
done
But this does not work, because the for loop splits the string on spaces.
I need the names with their spaces. (Note: some names have a space at the end.)
Does anyone know how to do this properly?

I suggest:
xmllint --xpath "//fub/@name" menu.xml | grep -o '"[^"]*"' | while IFS= read -r name; do echo "$name"; done

grep approach:
xmllint --xpath "//fub/@name" menu.xml | grep -Po 'name=\K\"([^"]+)\"'
The output:
"Kessel"
"Lager"
"Puffer "
"Boiler.sen"
"Boiler.jun"
"HK Senior"
"HK Junior"
"Fbh"
"Solar"
"F.Wärme"
"Sys "
-P option, allows Perl regular expressions
-o option, tells to print only matched parts
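If the surrounding quotes are not wanted, they can be stripped inside a read loop. The sketch below uses a hard-coded sample string in place of the real xmllint call, and prints each name in brackets only to make the trailing spaces visible.

```shell
# Sample of the xmllint output (assumption: names contain no double quotes).
out='name="Kessel" name="Puffer " name="HK Senior" name="Sys "'

result=$(printf '%s\n' "$out" | grep -o '"[^"]*"' |
  while IFS= read -r name; do   # IFS= keeps leading and trailing spaces
    name=${name#\"}             # strip the opening quote
    name=${name%\"}             # strip the closing quote
    printf '[%s]\n' "$name"
  done)

printf '%s\n' "$result"
```

Keep in mind that the loop body runs in a subshell because of the pipe, so variables assigned inside it are gone once the loop ends.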

Parsing Numbers from a File and using those to calculate

I seem to be too stupid to parse some HTML files with Bash. We have some files which have lines like:
var A4_total = 2018 + 4730;
var Other1_total = 3242 + 3828;
(They tell us how many pages the printers have printed.)
I need to add the first two values together (2018 and 3242). My approach is:
hilfsvar1=$(echo `grep -F var\ A4_total StatCntMedia.htm | sed 's/var\ A4\_total\ \=\ //g'` | sed -ne "s/^[^=]\++//p" | sed 's/;//g'); hilfsvar2=$(echo `grep -F var\ Other1_total StatCntMedia.htm | sed 's/var\ Other1\_total\ \=\ //g'` | sed -ne "s/^[^=]\++//p" | sed 's/;//g'); echo "$hilfsvar1 + $hilfsvar2" | bc
This will fail. The two variables do have the right content:
[User]# echo $hilfsvar1
4730
[User]# echo $hilfsvar2
3828
But this is where I get stuck:
[User]# echo "$hilfsvar1 + $hilfsvar2"
+ 3828
(Sorry for my scripting, I don't have deeper knowledge of scripting languages :) ) - I would be happy to solve this another way if someone has a solution.
Thanks in advance, Jonas

It sounds like this might be what you're looking for:
$ awk '{sum[4]+=$4; sum[6]+=$6} END{print sum[4], sum[6]}' file
5260 8558
If not, update your question to show expected output.
It seems like you're probably on completely the wrong track though, trying to do something in shell that should be done entirely in awk.

Try something like:
awk '/A4_total|Other1_total/ {print $4+$6}' StatCntMedia.htm
Output:
6748
7070
This searches for lines containing either A4_total or Other1_total, adds the 4th and 6th space-separated fields, and prints the result.
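If only the sum of the two first values (2018 and 3242) is wanted, a sketch could look like the one below. The " + 3828" output in the question is consistent with Windows line endings in the file (a carriage return at the end of the first variable moves the cursor back to the start of the line), but that is a guess; the tr call simply strips any carriage returns before awk sees the data. The sample data is built inline instead of reading StatCntMedia.htm.

```shell
# Two sample lines with CRLF line endings, standing in for the real file.
sample=$(printf 'var A4_total = 2018 + 4730;\r\nvar Other1_total = 3242 + 3828;\r\n')

# Strip carriage returns, then add up the 4th field of both lines.
total=$(printf '%s\n' "$sample" | tr -d '\r' |
  awk '/var (A4|Other1)_total =/ { sum += $4 } END { print sum }')

echo "$total"
```

With the real file, the printf would be replaced by reading the file directly, for example tr -d '\r' < StatCntMedia.htm in front of the awk call.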
