Bash - Read HTML & find div based on two different variables - bash

I am trying to get information from a div based on a date that I hold in a variable. I then want to filter the returned results by a second variable, narrowing the list down to a single match so that I can extract its URL.
Below is an example of the page's HTML. The page has about 10 more items like this with different information, and the same date may appear more than once.
<div class="bhangra-artist details ">
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Artist Name </div>
<time>
September 08, 2018 </time>
<div class="release-information">
<a class="date-of-release" href="releases-today" data-trackid="releases today" title="releases today">
<span class="label-left-box">releases today</span>
<span class="label-text">releases today</span>
</a>
<span class="label-hd "></span>
</div>
</div>
In my script I am running
DATE=$(cat html.txt | sed -n -e '/bhangra-artist card/,/<\/time>/ p' )
echo "${DATE}"
This returns the block below for every card, so there are about 10 matches. I am showing just 3 of them as an example.
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Artist Name </div>
<time>
September 08, 2018 </time>
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Name Artist </div>
<time>
September 08, 2018 </time>
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Artist1 Name & Artist2 Name </div>
<time>
September 05, 2018 </time>
With the returned results I am now attempting to narrow them down to one result.
I have a variable called $ReleaseDate, which will have the value September 08, 2018.
So now that ${DATE} holds 10 different divs with dates, I need to keep only the ones containing the date in $ReleaseDate. This is the part I am not sure how to do.
I'd expect the results to be narrowed down by the date variable, so in the above example I'd expect the 3 results to drop to 2.
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Artist Name </div>
<time>
September 08, 2018 </time>
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Name Artist </div>
<time>
September 08, 2018 </time>
Once I have narrowed the results from 10 down to those matching my date variable, there will be 1-3 results left. So now I need to filter these down to 1 result.
My final variable is $artistName. This unfortunately contains "Artist Name Name Of The Album", so what I am looking to do is simply match the first word, which will always be an artist name.
So I am looking to match $artistName against the line "Artist Name ". Once this has been done, I want to keep the containing div and remove all the other divs so that I am left with one result.
<div class="bhangra-artist card">
<div class="bhangra-artist-title" style="text-overflow: none;">
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
</div>
<div class="artist-names">
Artist Name </div>
<time>
September 08, 2018 </time>
Once I have only one result, I am attempting to get the link for this album. I believe I can target this already, but I am matching against the whole HTML, so I get every instance, because I cannot yet filter the divs based on the variables I have.
<a href="/bhangra/artist/album/id/123456/title-of-the-album/" data-trackid="title of the album" title="Title Of The Album" style="position: relative; left: 0px;">
Title Of The Album </a>
End Result: /bhangra/artist/album/id/123456/title-of-the-album/
I have certain elements done but I am totally stuck on how to piece it all together.
So to Recap:-
My Variables and Values are:-
$DATE=September 08, 2018
$artistName=Artist Name Name Of The Album
The code I have so far.
#!/bin/bash
echo "date : ${DATE}" # This has the value September 08, 2018
echo "artist: ${artistName}" # This has the value Artist Name Name Of The Album
# Get the HTML and find the divs containing the required information.
# GetContainer reads html.txt and uses sed to select each bhangra-artist panel up to its closing </time>. This returns 10 results.
GetContainer=$(sed -n -e '/bhangra-artist details/,/<\/time>/ p' html.txt)
# Copy GetContainer into filterDATE, then keep only the divs whose date matches $DATE and drop the rest.
# The date match should cut the 10 results down to 1-3.
filterDATE=$(echo -n "$GetContainer" )
filterDATE=$(echo -n "$filterDATE" ) # Blank for now: I am unsure how to check the date against $DATE and then get the whole outer div.
# Now that only 1-3 results are left, narrow them down to one.
# Using $artistName, keep the matching div and remove all the others, leaving 1 result.
# Once one result is left, strip the HTML so that only the link remains. This should eventually run on filterDATE, but for now I am checking that I can pull the link from the raw HTML.
GETURL=$(sed -n -e '/bhangra-artist-title/,/<\/a>/ p' html.txt | grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//')
echo "${DATE}"
echo "${filterDATE}"
echo "${GETURL}"
Any help would be appreciated.
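One possible way to piece the steps together is a single awk pass instead of chained sed calls. This is only a sketch and it assumes the page is laid out exactly like the sample: the href sits on the line after the bhangra-artist-title div, the artist text on the line after the artist-names div, and the date on the line after <time>. The function name find_album_url and the awk variables want_date and want_artist are made up for illustration. The artist check only compares the first word of $artistName against the start of the artist line, as described in the question.

```shell
#!/bin/bash
# find_album_url FILE RELEASE_DATE ARTIST_AND_ALBUM   (hypothetical helper)
# Prints the href of every card whose <time> text equals RELEASE_DATE and
# whose artist line starts with the first word of ARTIST_AND_ALBUM.
find_album_url() {
  local file="$1" release_date="$2" first_word="${3%% *}"
  awk -v want_date="$release_date" -v want_artist="$first_word" '
    # Remove HTML tags, then leading and trailing whitespace.
    function clean(s) {
      gsub(/<[^>]*>/, "", s)
      gsub(/^[ \t\r]+|[ \t\r]+$/, "", s)
      return s
    }
    # A new card starts: forget what the previous card held.
    /bhangra-artist card/          { href = ""; artist = ""; next }
    # Remember which value the following line is expected to carry.
    /class="bhangra-artist-title"/ { expect = "href"; next }
    /class="artist-names"/         { expect = "artist"; next }
    /<time>/                       { expect = "date"; next }
    expect == "href" && match($0, /href="[^"]*"/) {
      href = substr($0, RSTART + 6, RLENGTH - 7); expect = ""; next
    }
    expect == "artist" { artist = clean($0); expect = ""; next }
    expect == "date" {
      expect = ""
      n = length(want_artist)
      # Date must match exactly; artist must begin with the whole first word.
      if (clean($0) == want_date && substr(artist, 1, n) == want_artist &&
          (length(artist) == n || substr(artist, n + 1, 1) == " "))
        print href
    }
  ' "$file"
}
```

It could then be called as url=$(find_album_url html.txt "$ReleaseDate" "$artistName"). If two cards share both the date and the artist's first word, both links are printed, so a stricter artist comparison would still be needed in that case.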

Related

Get Element using XPath in Puppeteer

I am trying to scrape multiple elements with the same class names, but each has a different number of children. I am looking for a way to select specific elements using XPath (this would make it easiest for my loop).
const gameTimeElement = await page.$$('//*[@id="section-content"]/div[2]/div[1]/div/div['+ i + ']');
const gameTimeString = await gameTimeElement[j].$eval('h3', (h3) => h3.innerHTML);
This currently does not work.
After I select the element, I grab the h3 tag inside and evaluate it to get the innerHTML.
Is there a way to do this utilizing xpath?
<div id="section-content" style="display: block;">
</div>
<div class="matches">
<div class="day day-28-1" data-week="1" style="display: block;">
<h4>Sat, March 28, 2020</h4>
<div class="day-wrap">
<div class="match region-7-57d5ab4-9qs98v" data-week="1">
<h3 class="time">2:00PM
<span>(Central Daylight Time)</span>
<span class="fr">Best of 7</span>
</h3>
<div class="row ac ">
<div class="col-xs-3 ar">
<img class="team-logo" src="url"></div>
<div class="col-xs-2 al">
<h4 class="loss">(NA)<br>
<span class="team-name">Team1</span>
<br>
<span class="win spoiler-wrap">0</span>
</h4>
</div>
<div class="col-xs-2">
<img class="league-logo" src="url">
<h4> V.S.</h4>
</div>
<div class="col-xs-2 ar">
<h4 class="">(NA)<br>
<span class="team-name">Team2</span>
<br>
<span class="win spoiler-wrap">4</span>
</h4>
</div>
This is a sample of what I am working with for HTML on the website.
Yes, div class="day-wrap" could have a different number of children. But I don't think that's a problem.
You want to get the game times of all Rocket League matches. As you've noticed, game times are located within h3 elements. You can access them directly with one of the following XPaths:
//div[@id="section-content"]//h3
//div[@class="day-wrap"]//h3
//div[contains(@class,"match region")]//h3
If you want something for a loop, then you can try:
(//div[@class="day-wrap"]//h3)[i]
where i is the number to increment (from 1 to x).
Side notes: your sample data looks incorrect (according to your XPath). You have a closing div on line 2, and it seems you omitted div class="row middle-xs center-xs weeks" before div class="matches".

I am trying to figure out the correct way to use microdata on a <figure> and <figcaption>

I tried this but keep getting errors saying that the address needs a value. However, this is exactly how they had it on schema.org.
<section class="content col-sm-4 mx-auto" id="germination" itemscope
itemtype="http://schema.org/Event">
<a itemprop="url"
href="http://mevocals.weebly.com/germination.html">
<img itemprop="image" class="img-fluid rounded "
src="https://freedomfieldme.net/images/germinationPoster_02.png"
alt="Germination poster of Freedom Field Cannabis Friendly camping
and music festival" title="Germination">
<h4 itemprop="name">Germination Camping and Music Fesitval!
</h4>
</a>
<p>Our first festival of the year. Come and help us start the
season off with a bang!</p>
<meta itemprop="startDate" content="2018-04-17T1600">
Thu, 04/17/2018<br>
4:pm.
<div itemprop="location" itemscope
itemtype="http://schema.org/Place">Freedom Field</div>
<div itemprop="address" itemscope
itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality">Harmony</span>
<span itemprop="addressRegion">ME</span>
</div>
</section><!--germination-->

How to select child element of each li in an each.do list iteration?

I need to run through a list of users and open each in a new tab. The tab should close after performing an action, and the script should return to the first tab to select the next user in the list. I'm having trouble selecting the next user in the list in order. The problem is that each user can be selected only by clicking their photo, which is a child element of the list item.
HTML
<div class="_gs38e">
<ul class="_8q670 _b9n99">
<li class="_6e4x5">
<div class="_npuc5">
<a class="_pg23k _9irns _gvoze" style="width: 100px; height: 100px;">
<img class="_rewi8" src="https://pic.com/t51.2885-
_1930374081979351040.jpg"></a>
<div class="_eryrc">
<a class="_2g7d5 notranslate _o5iw8" title="username1_"
href="/username1_/">username1_</a>
</div>
</div>
</li>
<li class="_6e4x5">
<div class="_npuc5">
<a class="_pg23k _9irns _gvoze" style="width: 100px; height: 100px;">
<img class="_rewi8" src="https://pic.com/t51.2885-
_1930374081979351040.jpg"></a>
<div class="_eryrc">
<a class="_2g7d5 notranslate _o5iw8" title="username1_"
href="/username1_/">username1_</a>
</div>
</div>
</li>
<li class="_6e4x5">
<div class="_npuc5">
<a class="_pg23k _9irns _gvoze" style="width: 100px; height: 100px;">
<img class="_rewi8" src="https://pic.com/t51.2885-
_1930374081979351040.jpg"></a>
<div class="_eryrc">
<a class="_2g7d5 notranslate _o5iw8" title="username1_"
href="/username1_/">username1_</a>
</div>
</div>
</li>
<li class="_6e4x5">
<div class="_npuc5">
<a class="_pg23k _9irns _gvoze" style="width: 100px; height: 100px;">
<img class="_rewi8" src="https://pic.com/t51.2885-
_1930374081979351040.jpg"></a>
<div class="_eryrc">
<a class="_2g7d5 notranslate _o5iw8" title="username1_"
href="/username1_/">username1_</a>
</div>
</div>
</li>
<li class="_6e4x5">
<div class="_npuc5">
<a class="_pg23k _9irns _gvoze" style="width: 100px; height: 100px;">
<img class="_rewi8" src="https://pic.com/t51.2885-
_1930374081979351040.jpg"></a>
<div class="_eryrc">
<a class="_2g7d5 notranslate _o5iw8" title="username1_"
href="/username1_/">username1_</a>
</div>
</div>
</li>
</ul>
</div>
This results in the first user being selected each time, as it just finds the first picture on the page for all the list items. I've added an output of the username text to make sure the each do loop is running through the list items in order. It successfully prints the username of a different list item every time, while only opening the first user. I need a way to define the photo element as a child of each list item so that it is included in the each do loop.
Mylist = browser.ul(:class => "_8q670 _b9n99")
#Index the UL to open each user in order
Mylist.users.each_with_index do |user|
#open user in new tab by selecting photo
browser.a(:class => '_pg23k _9irns _gvoze').exists?
browser.a(:class => '_pg23k _9irns _gvoze').click(:control)
browser.windows.last.use
#performs action and closes tab
browser.windows.last.close
puts user
puts user.text
end
How do I define the element to be selected as the child of the current li in the list?
Inside the loop you need to use
user.a(:class => '_pg23k _9irns _gvoze').click(:control)
The reason is that the browser object searches for elements in the whole HTML page. If you use 'browser' inside the loop, it looks at the entire HTML, and where more than one element meets the search criteria, it returns the first one found. So even though it is inside the loop, it finds the same element on the page every time.
To use the current item of the list being iterated, you need to use the name (in this case user) that you told the loop to assign to each list element as it walks the list.

Unable to Select List Element looking for solution when element has no usable ID

The html page contains two containers. Each container has two columns, the left for selectable list items and the right for selected list items. So once you click on the list item it moves from the left column to the right column.
The first container is for associated clients.
The second container is for countries.
They both use similar code without a unique id or name.
HTML code for first container:
<div class="col-sm-12 col-md-6">
<div class="tab-section">
<h3 class="section-header"> Associated Client(s) </h3>
<div class="row">
<div class="col-sm-12">
<div id="ClientControlDiv">
<div style="margin: 0 auto; width: 450px;">
<select id="AssociatedClientList" class="multi-select" name="AssociatedClientList" multiple="multiple" style="position: absolute; left: -9999px;">
<div id="ms-AssociatedClientList" class="ms-container">
<div class="ms-selectable">
<div class="panel-heading ">
<ul class="ms-list" tabindex="-1" style="height: 250px; width: 200px;">
<li id="3ce0a0cc_378d_4477_8787_84033319940f-selectable" class="ms-elem-selectable ms-hover">
<span>(Test) 3M</span>
</li>
</ul>
</div>
<div class="ms-selection">
<div class="panel-heading ">
<div class="panel-title">Selected Client(s)</div>
</div>
<ul class="ms-list" tabindex="-1" style="height: 250px; width: 200px;">
<li id="3ce0a0cc_378d_4477_8787_84033319940f-selection" class="ms-elem-selection" style="display: none;">
<span>(Test) 3M</span>
</li>
HTML code for second container for countries:
<div class="col-sm-12">
<div id="DesignationControlDiv">
<div style="margin: 0 auto; width: 450px;">
<select id="AssociatedDesignationsList" class="multi-select" name="AssociatedDesignationsList" multiple="multiple" style="position: absolute; left: -9999px;">
<div id="ms-AssociatedDesignationsList" class="ms-container">
<div class="ms-selectable">
<div class="panel-heading ">
<ul class="ms-list" tabindex="-1" style="height: 250px; width: 200px;">
<li id="d86b9350_aa83_43c7_bc2b_5fc7f5c6ccae-selectable" class="ms-elem-selectable ms-hover">
<span>Afghanistan</span>
</li>
</div>
<div class="ms-selection">
<div class="panel-heading ">
<ul class="ms-list" tabindex="-1" style="height: 250px; width: 200px;">
<li id="d86b9350_aa83_43c7_bc2b_5fc7f5c6ccae-selection" class="ms-elem-selection" style="display: none;">
<span>Afghanistan</span>
</li>
Once selected the html code is:
<div class="ms-selection">
<div class="panel-heading ">
<div class="panel-title">Selected Client(s)</div>
</div>
<ul class="ms-list" tabindex="-1" style="height: 250px; width: 200px;">
<li id="3ce0a0cc_378d_4477_8787_84033319940f-selection" class="ms-elem-selection ms-selected ms-hover" style="">
<span>(Test) 3M</span>
</li>
Ruby code (I tried both):
#b.select_list(:class => "ms-list").li(:text => "(Test) 3M").when_present.select
#b.select_list(:class => "ms-list").li.span(:text => "(Test) 3M").select
Based on the supplied HTML, there's no select_list element. For example:
b.ul(class: "ms-list").li(class: "ms-elem-selectable").span(text: "(Test) 3M").exists?
#=> true
b.ul(class: "ms-list").li(id: /selectable/).span(text: "(Test) 3M").exists?
#=> true
b.select_list(:class => "ms-list").li(class: "ms-elem-selectable").span(text: "(Test) 3M").exists?
#=> false
If you want to click on "(Test) 3M", try something like:
b.ul(class: "ms-list").li(class: "ms-elem-selectable").span(text: "(Test) 3M").click
This is a preliminary answer, since your question needs more code before it is possible to tell what you are really trying to do:
#b.li(:id => /.*selectable/, :text => "(Test) 3M").hover
#b.li(:id => /.*selectable/, :text => "(Test) 3M").click
This assumes that the other list, which you do not show in your question, has an id like /.*selected/. Watir should then look through all of the li items whose id contains selectable and pick the first one with the text "(Test) 3M".

extract text without <div> and <p> with xpath

<tr><td class=term>1st param</td>
<td>PUTIN
<div class='info-icon'>
<a href='#' onmouseover='show_pd(351);' onmouseout='hide_pd(351);' id='info-icon-351'></a>
</div>
<div id='pd-351' style='display: none; position: absolute;'>
<b>СПРАВКА</b>
<br /><br />
<P align=justify><NOBR><STRONG>ABS</STRONG></NOBR>bla-bla-bla text</P>
<P align=justify>bla-bla-bla text 2</P>
<P align=justify>bla-bla-bla text 3</P>
<P align=justify>bla-bla-bla text 4</P>
</div>
</td>
I need to extract only "PUTIN".
So far I have
//td[@class="term"][contains(text(), "1st param")]/following-sibling::td/[not(self::p)]
With some adjustments to your XML, the following XPath
//td[@class="term"][contains(text(), "1st param")]/following-sibling::td/node()[1]
has the output PUTIN.
Adjustments were to change <td class=term> into <td class="term"> and all <P align=justify> into <P align="justify"> (maybe not necessary for your settings but was required for the XPath evaluator I just used).
