I can't figure out how to extract a string in bash - bash

I am trying to make a bash script that will download a youtube page, see the latest video and find its url. I have the part to download the page except I can not figure out how to isolate the text with the url.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate the single part of a div.
The div is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.

With sed :
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To extract just the video hrefs:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines :
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt | sort | uniq
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe the curl output straight into sed:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' | sort | uniq
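Since the goal is the latest video, note that sort | uniq reorders the links alphabetically, so the first line of output is no longer the newest upload. Below is a minimal sketch of an order-preserving alternative. It assumes the newest video appears first in the page source, and the sample HTML is a hypothetical stand-in for YoutubePage.txt:

```shell
# Hypothetical sample standing in for the downloaded page (YoutubePage.txt).
page='<a class="thumb" href="/watch?v=q6TNODqcHWA">thumb</a><a class="title" href="/watch?v=q6TNODqcHWA">title</a>
<a href="/about">About</a>
<a class="thumb" href="/watch?v=ix4mTekl3MM">thumb</a>'

# grep -o prints every match on its own line; cut keeps the text between the
# quotes; awk drops repeats while keeping the first occurrence in page order.
links=$(printf '%s\n' "$page" \
  | grep -o 'href="/watch?v=[^"]*"' \
  | cut -d '"' -f 2 \
  | awk '!seen[$0]++')

# Assumption: the newest video is listed first in the page source.
latest=$(printf '%s\n' "$links" | head -n 1)

printf '%s\n' "$links"
printf 'latest: https://www.youtube.com%s\n' "$latest"
```

Because grep -o emits one match per line, no newline-insertion step is needed before the extraction.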

You can use lynx, a terminal browser whose -dump mode outputs the page as parsed text with the URLs extracted. This makes it easier to grep the URLs:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
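The filter can be tried without lynx or network access by feeding it text shaped like the dump's numbered reference list. The sample lines below are an assumed approximation of that format, not captured output:

```shell
# Hypothetical excerpt of the numbered "References" list printed by lynx -dump.
dump='References

   1. https://www.youtube.com/
   2. https://www.youtube.com/watch?v=EBbLPnQ-CEw
  13. https://www.youtube.com/watch?v=2QOx7vmjV2E'

# Same sed program as above: keep only /watch? lines and strip the "  N. " prefix.
urls=$(printf '%s\n' "$dump" | sed -n '/\/watch?/s/^ *[0-9]*\. *//p')
printf '%s\n' "$urls"
```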

Related

How to convert piped/awk output to string/variable

I'm trying to create a bash function that automatically updates a cli tool. So far I've managed to get this:
update_cli_tool () {
# the following will automatically be redirected to .../releases/tag/vX.X.X
# there I get the location from the header, and remove it to get the full url
latest_release_url=$(curl -i https://github.com/.../releases/latest | grep location: | awk -F 'location: ' '{print $2}')
# to get the version, I get the 8th element from the url .../releases/tag/vX.X.X
latest_release_version=$(echo "$latest_release_url" | awk -F '/' '{print $8}')
# this is where it breaks
# the first part just replaces the "tag" with "download" in the url
full_url="${latest_release_url/tag/download}/.../${latest_release_version}.zip"
echo "$full_url" # or curl $full_url, also fails
}
Expected output: https://github.com/.../download/vX.X.X/vX.X.X.zip
Actual output: -.zip-.../.../releases/download/vX.X.X
When I just echo "latest_release_url: $latest_release_url" (same for version), it prints it correctly, but not when I use the above mentioned flow. When I hardcode the ..._url and ..._version, the full_url works fine. So my guess is I have to somehow capture the output and convert it to a string? Or perhaps concatenate it another way?
Note: I've also used ..._url=`curl -i ...` (with backticks instead of $(...)), but this gave me the same results.
The curl output will use \r\n line endings. The stray carriage return in the url variable is tripping you up. Observe it with printf '%q\n' "$latest_release_url"
Try this:
latest_release_url=$(
curl --silent -i https://github.com/.../releases/latest \
| awk -v RS='\r\n' '$1 == "location:" {print $2}'
)
Then the rest of the script should look right.
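The problem can be reproduced offline. This sketch simulates the CRLF headers with printf and strips the carriage return with tr, which is a portable alternative in case the awk in use does not accept a multi-character RS. The release URL and the tool-<version>.zip archive name are hypothetical, since the question elides the real ones:

```shell
# Simulated "curl -i" response headers with CRLF line endings.
# The URL is hypothetical; the question elides the real repository path.
headers=$(printf 'HTTP/2 302\r\nlocation: https://example.com/owner/tool/releases/tag/v1.2.3\r\ncontent-length: 0\r\n')

# Field 2 of the location line still ends in an invisible carriage return.
raw_url=$(printf '%s\n' "$headers" | awk '$1 == "location:" {print $2}')
clean_url=$(printf '%s' "$raw_url" | tr -d '\r')

# The raw value is one character longer than the cleaned one.
printf 'raw length:   %s\n' "${#raw_url}"
printf 'clean length: %s\n' "${#clean_url}"

# With the carriage return gone, building the download URL behaves as expected.
version=${clean_url##*/}
download_url=$(printf '%s\n' "$clean_url" | sed 's|/tag/|/download/|')
full_url="${download_url}/tool-${version}.zip"   # archive name is made up
printf '%s\n' "$full_url"
```

When echoed to a terminal, the stray carriage return moves the cursor back to the start of the line, which is why the text appended after the variable appeared to overwrite the beginning of the URL.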

Bash regex: get value in conf file preceded by string with dot

I have to get my db credentials from this configuration file:
# Database settings
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb
Aisse.LocalPort=5432
Aisse.LocalUser=myuser
Aisse.LocalPasswd=mypwd
# My other app settings
Aisse.NumDir=../../data/Num
Aisse.NumMobil=3000
# Log settings
#Aisse.Trace_AppliTpv=blabla1.tra
#Aisse.Trace_AppliCmp=blabla2.tra
#Aisse.Trace_AppliClt=blabla3.tra
#Aisse.Trace_LocalDataBase=blabla4.tra
In particular, I want to get the value mydb from line
Aisse.LocalDataBase=mydb
So far, I have developed this
mydbname=$(echo "$my_conf_file.conf" | grep "LocalDataBase=" | sed "s/LocalDataBase=//g" )
that returns
mydb #Aisse.Trace_blabla4.tra
that would be ok if it did not return also the comment string.
Then I also tried
mydbname=$(echo "$my_conf_file.conf" | grep "Aisse.LocalDataBase=" | sed "s/LocalDataBase=//g" )
which returns an empty string.
How can I get only the value that is preceded by the string "Aisse.LocalDataBase=" ?
Using sed
$ mydbname=$(sed -n 's/Aisse\.LocalDataBase=//p' input_file)
$ echo $mydbname
mydb
I'm afraid your description is incomplete:
You mention you want the line containing "LocalDataBase", but not the one inside a comment. Let's start with that:
A line which contains "LocalDataBase":
grep "LocalDataBase" conf.conf.txt
A line which contains "LocalDataBase" but which does not start with a hash:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#"
What does grep -v "^ *#" do?
It means: don't show (-v) the lines that match:
^ : the start of the line
(space)* : zero or more space characters
# : a hash character
Once you have your line, you need to work with it:
You only need the part behind the equality sign, so let's use that sign as a delimiter and show the second column:
cut -d '=' -f 2
All together:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#" | cut -d '=' -f 2
Are we there yet?
No, because it's possible that somebody has put some comment behind your entry, something like:
LocalDataBase=mydb #some information
In order to prevent that, you need to cut that comment too, which you can do in a similar way as before: this time you use the hash character as a delimiter and you show the first column:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#" | cut -d '=' -f 2 | cut -d '#' -f 1
Have fun.
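One detail the pipeline leaves behind: with an entry such as LocalDataBase=mydb #some information, cutting at the hash keeps the space in front of it. A sketch that trims it, using a hypothetical sample file created with mktemp in place of conf.conf.txt:

```shell
# Hypothetical sample file standing in for conf.conf.txt.
conf=$(mktemp)
cat > "$conf" <<'EOF'
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb #some information
#Aisse.Trace_LocalDataBase=blabla4.tra
EOF

# Same pipeline as above, plus a final sed that strips trailing whitespace.
value=$(grep "LocalDataBase" "$conf" \
  | grep -v "^ *#" \
  | cut -d '=' -f 2 \
  | cut -d '#' -f 1 \
  | sed 's/[[:space:]]*$//')

# The brackets make any leftover whitespace visible.
printf '[%s]\n' "$value"
rm -f "$conf"
```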
You may use this sed:
mydbname=$(sed -n 's/^[^#][^=]*LocalDataBase=//p' file)
echo "$mydbname"
mydb
RegEx Details:
^: Start
[^#]: Matches any character other than #
[^=]*: Matches 0 or more of any character that is not =
LocalDataBase=: Matches text LocalDataBase=
You can use
mydbname=$(sed -n 's/^Aisse\.LocalDataBase=\(.*\)/\1/p' file)
If there can be leading whitespace you can add [[:blank:]]* after ^:
mydbname=$(sed -n 's/^[[:blank:]]*Aisse\.LocalDataBase=\(.*\)/\1/p' file)
See this online demo:
#!/bin/bash
s='# Database settings
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb
Aisse.LocalPort=5432
Aisse.LocalUser=myuser
Aisse.LocalPasswd=mypwd
# My other app settings
Aisse.NumDir=../../data/Num
Aisse.NumMobil=3000
# Log settings
#Aisse.Trace_AppliTpv=blabla1.tra
#Aisse.Trace_AppliCmp=blabla2.tra
#Aisse.Trace_AppliClt=blabla3.tra
#Aisse.Trace_LocalDataBase=blabla4.tra'
sed -n 's/^Aisse\.LocalDataBase=\(.*\)/\1/p' <<< "$s"
Output:
mydb
Details:
-n - suppresses default line output in sed
^[[:blank:]]*Aisse\.LocalDataBase=\(.*\) - a regex that matches the start of a string (^), then zero or more whitespace characters ([[:blank:]]*), then an Aisse.LocalDataBase= string, then captures the rest of the line into Group 1
\1 - replaces the whole match with the value of Group 1
p - prints the result of the successful substitution.
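If several keys need to be read, the same idea can be wrapped in a small helper. This is only a sketch: get_conf_value is a hypothetical name, it compares the key as a literal string rather than a regex (so the dot needs no escaping), and it assumes no leading whitespace and no # inside values.

```shell
# Hypothetical sample file with a subset of the settings from the question.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Database settings
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb
Aisse.LocalPort=5432
#Aisse.Trace_LocalDataBase=blabla4.tra
EOF

# get_conf_value KEY FILE: print the value of the first line whose key is exactly KEY.
get_conf_value() {
  awk -F '=' -v key="$1" '
    $1 == key {
      sub(/^[^=]*=/, "")      # drop everything up to and including the first "="
      sub(/[ \t]*#.*$/, "")   # drop a trailing comment, if any
      print
      exit
    }' "$2"
}

mydbname=$(get_conf_value "Aisse.LocalDataBase" "$conf")
dbport=$(get_conf_value "Aisse.LocalPort" "$conf")
printf '%s %s\n' "$mydbname" "$dbport"
rm -f "$conf"
```

Because the comparison is on the whole first field, the commented-out #Aisse.Trace_LocalDataBase line cannot match.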

replace string with exact match in bash script

I have a file with many repeated lines like the ones below. These are the only unique values:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="
I want to replace empty field with "Null" and need output as :
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="Null"
What I can think of as :
#First find the matching content
cat file.txt | egrep 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_contain.txt
# Find the content where given string are not there
cat file.txt | egrep -v 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_donot_contain.txt
# Replace the string in content not found file
sed -i 's/CHECKSUM="/CHECKSUM="Null"/g' file_donot_contain.txt
# Merge the files
cat file_contain.txt file_donot_contain.txt > output.txt
But this does not seem like an efficient way of doing it. Any other suggestions?
To achieve this you need to mark that this is the end of the line, not just part of it, using $ (And optionally ^ to mark the start of the line too):
sed -i 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt
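The effect of the anchors can be checked without touching a file by running the same substitution, minus -i, over the four sample lines from the question:

```shell
# The four unique lines from the question; the last one has no value.
input='CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="'

# Without -i, sed writes the result to stdout and leaves the input untouched.
# ^ and $ restrict the match to a line that is exactly CHECKSUM="
output=$(printf '%s\n' "$input" | sed 's/^CHECKSUM="$/CHECKSUM="Null"/')
printf '%s\n' "$output"
```

Only the last line changes, because the other three have characters between the opening quote and the end of the line.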

modify the contents of a file without a temp file

I have the following log file which contains lines like this
1345447800561|FINE|blah#13|txReq
1345447800561|FINE|blah#13|Req
1345447800561|FINE|blah#13|rxReq
1345447800561|FINE|blah#14|txReq
1345447800561|FINE|blah#15|Req
I am trying to extract the first field from each line and, depending on whether it belongs to blah#13, blah#14 or blah#15, create the corresponding files using the following script, which seems quite inefficient in terms of the number of temp files it creates. Any suggestions on how I can optimize it?
cat newLog | grep -i "org.arl.unet.maca.blah#13" >> maca13
cat newLog | grep -i "org.arl.unet.maca.blah#14" >> maca14
cat newLog | grep -i "org.arl.unet.maca.blah#15" >> maca15
cat maca10 | grep -i "txReq" >> maca10TxFrameNtf_temp
exec<blah10TxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
cat maca10 | grep -i "Req" >> maca10RxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
rm -rf *_temp
Something like this ?
for m in org.arl.unet.maca.blah#13 org.arl.unet.maca.blah#14 org.arl.unet.maca.blah#15
do
grep -i "$m" newLog | grep "txReq" | cut -d'|' -f1 > "log.$m"
done
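If the list of nodes is not known in advance, a single awk pass can route lines by field instead of running grep once per node. This sketch assumes the pipe-delimited layout shown in the question (timestamp, level, node, message). The log.<node> file names, the temporary directory and the distinct timestamps are illustrative:

```shell
# Hypothetical log in a scratch directory; timestamps are made distinct for clarity.
workdir=$(mktemp -d)
cat > "$workdir/newLog" <<'EOF'
1345447800561|FINE|blah#13|txReq
1345447800562|FINE|blah#13|Req
1345447800563|FINE|blah#13|rxReq
1345447800564|FINE|blah#14|txReq
1345447800565|FINE|blah#15|Req
EOF

# For every txReq line, write field 1 (the timestamp) to a file named after field 3.
awk -F '|' -v dir="$workdir" '
  $4 == "txReq" { print $1 > (dir "/log." $3) }
' "$workdir/newLog"

node13=$(cat "$workdir/log.blah#13")
node14=$(cat "$workdir/log.blah#14")
printf 'blah#13: %s\nblah#14: %s\n' "$node13" "$node14"
rm -rf "$workdir"
```

No intermediate files are created; each output file is opened once by awk and written directly.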
I've found it useful at times to use ex instead of grep/sed to modify text files in place without using temps ... saves the trouble of worrying about uniqueness and writability to the temp file and its directory etc. Plus it just seemed cleaner.
In ksh I would use a code block with the edit commands and just pipe that into ex ...
{
# Any edit command that would work at the colon prompt of a vi editor will work
# This one was just a text substitution that would replace all contents of the line
# at line number ${NUMBER} with the word DATABASE ... which strangely enough was
# necessary at one time lol
# The wq is the "write/quit" command as you would enter it at the vi colon prompt
# which are essentially ex commands.
print "${NUMBER}s/.*/DATABASE/"
print "wq"
} | ex filename > /dev/null 2>&1

Get random link from page using Shell

I'm trying to write a very basic benchmarking script which will load random pages from a website, starting with the home page.
I will be using curl to grab the contents of the page, but then I want to load a random next page from that as well. Could someone give me a bit of Shell code that will get the URL from a random a href from a the output of the curl command?
Here's what I came up with:
curl <url> 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
Replace <url> with the URL of the page you are trying to get a link from.
EDIT:
It might be easier to make a script called getrandomurl.sh containing:
#!/bin/bash
# $RANDOM is a bash feature, hence the bash shebang
curl "$1" 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
and run it like ./getrandomurl.sh http://stackoverflow.com or something.
Using both lynx and bash arrays:
hrefs=($(lynx -dump http://www.google.com |
sed -e '0,/^References/{d;n};s/.* \(http\)/\1/'))
echo ${hrefs[$(( $RANDOM % ${#hrefs[@]} ))]}
Not a curl solution, but I think more effective given the task.
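Both answers so far reduce to "collect the hrefs, then pick one at random". Here is a sketch that runs offline on a hypothetical HTML snippet and uses awk's rand() for the pick instead of $RANDOM:

```shell
# Hypothetical page content standing in for the curl output.
html='<p><a href="/questions">Questions</a> <a href="/tags">Tags</a></p>
<a href="/users">Users</a>'

# One href per line: grep -o isolates each attribute, cut keeps the quoted value.
hrefs=$(printf '%s\n' "$html" | grep -o 'href="[^"]*"' | cut -d '"' -f 2)

# Keep every line in memory, then print the one at a random index in 1..NR.
# srand() seeds from the clock, so runs within the same second may repeat the pick.
pick=$(printf '%s\n' "$hrefs" | awk '
  BEGIN { srand() }
  { line[NR] = $0 }
  END { if (NR > 0) print line[int(rand() * NR) + 1] }')

printf '%s\n' "$pick"
```

Unlike the sort-by-random-prefix approach, this picks uniformly regardless of how many links the page has.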
I would suggest using the perl WWW::Mechanize module for this. For example to dump all links from a page use something like this:
use WWW::Mechanize;
$mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 'absolute' => 1);
Note URL should be replaced with the wanted page.
Then either continue within Perl (the following follows a random link on the page at URL):
$number_of_links = "" . @{$mech->links()};
$mech->follow_link( n => 1 + int(rand($number_of_links)) ); # n counts links from 1
Or use the dump_links version above to get urls and process further within shell, e.g. to get random url (if the above script is called get_urls.pl):
./get_urls.pl | shuf | while read; do
# Url is now in the $REPLY variable
echo "$REPLY"
done
Using pup
A flexible solution to getting all the links on a page is to use pup to specify CSS selectors. For instance, I can grab all the links (<a> tags) from my blog using:
curl https://jlericson.com/ 2>/dev/null \
| pup 'a attr{href}'
The attr{href} at the end outputs only the href attribute. If you run that command, you'll notice several links aren't to posts on my site, but to my email address and Twitter account.
If I want to get just the blog post links, I can be a bit more choosy:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}'
That grabs only links with class='post-link', which are the links to my posts.
Now we can select a random line of output:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| shuf | head -1
The shuf command mixes the lines like a deck of cards and head -1 draws the top card off the deck. (Or the first line, if you prefer.)
My links are all relative, so I will want to append the domain using sed:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
The sed command replaces the first / with the rest of the URL.
I might also want to include the text of the link. That gets a bit tricky because pup doesn't support two output functions. But it does support outputting to JSON, which can be read with jq:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link json{}' \
| jq -r '.[] | [.text,.href] | @tsv' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
This is tab-separated value output, which may or may not be what you want.
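One caveat with the last pipeline: the sed expression replaces the first / on the line, which with TSV output could fall inside the link text rather than the href. This sketch prefixes only the second column, using hypothetical titles and paths in place of real pup and jq output:

```shell
# Hypothetical TSV rows (text<TAB>href) standing in for the jq output.
rows=$(printf 'Either/or thinking\t/2021/01/01/either-or.html\nPlain title\t/2021/02/02/plain.html\n')

# Split and rejoin on tabs; prepend the site root to the href column only.
absolute=$(printf '%s\n' "$rows" \
  | awk -F '\t' -v OFS='\t' -v base='https://jlericson.com' '{ $2 = base $2; print }')

printf '%s\n' "$absolute"

# Pull the columns of the first row back out to confirm the title was left alone.
first_text=$(printf '%s\n' "$absolute" | head -n 1 | cut -f 1)
first_href=$(printf '%s\n' "$absolute" | head -n 1 | cut -f 2)
```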

Resources