I have written a basic web scraper that pulls short sections of text from a webpage and puts them into a list. My problem is that dynamic ads appear on the page and mess up the lists.
The page I'm scraping is a Yelp restaurant listing page.
I pull out the biz-name (business name) and add it to the list, which works fine, but when ads appear the scraper pulls their biz-name as well.
This is the structure, but I can't figure out how to ignore the ad element and scrape only the normal business names. I've cut it down a lot and removed the unimportant elements.
This is a listing with an ad:
<li class="yloca-search-result">
...
...
<a class="biz-name"...><span>San Lorenzo’s</span></a>
</li>
This is a normal listing:
<li class="regular-search-result">
...
...
<a class="biz-name"...><span>BigGrill</span></a>
</li>
I've been trying to make Nokogiri ignore the business name inside the <li class="yloca-search-result"> and only select the others inside the regular-search-result class.
I can't figure it out. Can someone point me in the right direction at least? Is it possible?
I figured it out. Wasn't difficult but I just couldn't see the answer.
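# Remove every ad listing before selecting names.
# (at_css would only match the first ad, and returns nil when there is none.)
doc3.css("li.yloca-search-result").remove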
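For comparison, the opposite approach (collecting names only from the regular listings instead of deleting the ad) can be sketched without Nokogiri. The following is a minimal illustration using Python's standard library, not the poster's code: the BizNameParser class, the sample markup, and the href values are assumptions made for the example, and it assumes a listing never contains a nested <li>.

```python
from html.parser import HTMLParser


class BizNameParser(HTMLParser):
    """Collect a.biz-name text, but only inside li.regular-search-result."""

    def __init__(self):
        super().__init__()
        self.in_regular_li = False
        self.in_biz_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li":
            # Ad listings use the yloca-search-result class, so they fail this check.
            self.in_regular_li = "regular-search-result" in classes
        elif tag == "a" and self.in_regular_li and "biz-name" in classes:
            self.in_biz_name = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_biz_name = False
        elif tag == "li":
            self.in_regular_li = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current name.
        if self.in_biz_name:
            self.names[-1] += data


# Hypothetical sample modelled on the two snippets in the question.
SAMPLE = """
<ul>
  <li class="yloca-search-result"><a class="biz-name" href="#"><span>San Lorenzo’s</span></a></li>
  <li class="regular-search-result"><a class="biz-name" href="#"><span>BigGrill</span></a></li>
</ul>
"""

parser = BizNameParser()
parser.feed(SAMPLE)
names = parser.names
print(names)
```

With the sample above, only the regular listing's name ends up in the list. In Nokogiri terms this corresponds to scoping the selector to the regular listings rather than removing nodes first.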
I'm using Laravel, and have a view structure like such:
Each of my pages has a navbar at the top, and the navbar has a few links on it:
<ul class="uk-nav uk-navbar-dropdown-nav">
<li>KENNEL INFORMATION</li>
<li>SIRES & DAMS</li>
<li>LITTERS</li>
<li>HELP</li>
</ul>
If I click one of these links (let's say info) from the homepage (home.blade.php), it redirects to the following URL:
localhost:8080/mykennel/info
However, if I'm already on a view that is in the mykennel subfolder (again, let's say I'm on the info page), and I click a link from the navbar, it redirects to:
localhost:8080/mykennel/mykennel/info
Which throws a 404. I understand WHY this is happening but I can't seem to find how to fix it. How can I create an href in my anchor tag that knows to use only a single /mykennel/ prefix, regardless of where the user is currently situated on the site?
Any help is appreciated.
Try using an absolute path instead, by adding / at the beginning of each href, so that the links are always resolved from the site root:
<ul class="uk-nav uk-navbar-dropdown-nav">
<li>KENNEL INFORMATION</li>
<li>SIRES & DAMS</li>
<li>LITTERS</li>
<li>HELP</li>
</ul>
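The difference between the two kinds of href can be checked with a URL resolver. This is a small sketch using Python's urllib.parse.urljoin, which resolves relative references the same way a browser does in these simple cases. The info page URL comes from the question; the home page URL is an assumption (the site root).

```python
from urllib.parse import urljoin

HOME = "http://localhost:8080/"               # assumed URL of the home page
INFO = "http://localhost:8080/mykennel/info"  # a page inside the mykennel folder

# A relative href is resolved against the folder of the current page.
relative_from_home = urljoin(HOME, "mykennel/info")
relative_from_info = urljoin(INFO, "mykennel/info")

# An href with a leading slash is resolved against the site root.
absolute_from_info = urljoin(INFO, "/mykennel/info")

print(relative_from_home)
print(relative_from_info)
print(absolute_from_info)
```

The second result is the doubled /mykennel/mykennel/info path that produces the 404, while the leading-slash version resolves to /mykennel/info no matter which page the user is on.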
I'm not even sure I labeled this correctly. I am in the process of converting a site to Umbraco, and there are sections of the site that need to be editable using the CMS tools in the back end. Basically, it is a grid with pictures and description text.
Here is a sample of the HTML
<div class="hi-icon-effect-1 hi-icon-effect-1a">
<a class="hi-icon">
<img class="img-responsive " id="ImgSales" src="../../Images/sales_icon_circle_grey.png" alt="">
</a>
<p style="padding-left:5px;" id="lblSales" class="">Sales</p>
</div>
What I would like to be able to do is go to the content section of the admin and edit the list of items and configure the image and text for each item.
http://www2.strikemedia.co.za/
If you view the above link and scroll down there will be a grid of items (services) and it is this list that I want to be able to generate.
I am comfortable with all the technologies used in Umbraco; I just do not know the system well enough to make these kinds of modifications. Can someone please assist or point me to resources that will help me build this?
Thanks
You should take a look at the Archetype package: https://our.umbraco.org/projects/backoffice-extensions/archetype/
As far as I understand your question you are looking for a way to add X amount of similar items to the contents of a page - for this, Archetype is probably perfect :-)
Once you have your list of items added inside Umbraco, look here: https://github.com/kgiszewski/ArchetypeManual/blob/master/03%20-%20Template%20Usage.md
Use case #1 in this example will allow you to iterate through the items and output them with whatever "template" you want (i.e. the HTML sample you provided).
I tried this code:
{block:Posts}
<ul>
{block:HasTags}
{block:Tags}
<li>{Tag}</li>
{/block:Tags}
{/block:HasTags}
</ul>
{/block:Posts}
However, when you click a tag it takes you to another page. How can I make the tags sort the posts without going to another page?
Example of the sorting I want to achieve: http://purifytheme.tumblr.com/
I have a webpage that looks something like this:
<html>
...
<div id="menu">
...
<ul id="listOfItems">
<!--- repeated block start -->
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title</span></span>
...
</li>
<!-- repeated block end-->
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title something</span></span>
...
</li>
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title other thing</span></span>
...
</li>
</ul>
...
</div>
...
</html>
I would like to know the XPath of the titles ("title", "title something", "title other thing"). The point is that the order of the <li> elements is not specified; it could be different after every page load. Is there a way to discover a certain structure in the page with XPath? I have a notion of how to solve this, but before I write iterations in C# to walk the page, I am asking you.
Thanks in advance!
First of all, IDs should be unique, so the webpage as you've portrayed it would not work well when it comes to testing.
I did however test, and got some XPath locators to work for selecting specific titles (although I recommend you fix your webpage instead of actually using this):
//li[@id='item']/span/span
//li[@id='item'][1]/span/span
//li[@id='item'][3]/span/span
If you're after all three titles, you could try Dimitre Novatchev's suggestion:
//span[@class='title']
This should get all titles on the page.
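To see why this locator does not depend on the order of the <li> elements, here is a small sketch using Python's standard-library ElementTree, which supports this subset of XPath (written with a leading dot, since the search starts from the root element). The markup is a trimmed, well-formed version of the page in the question with the duplicate ids left out; it is an assumption for the example, not the real page.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the page from the question (hypothetical, well-formed).
PAGE = """
<ul id="listOfItems">
  <li class="itemClass"><span class="spanClass"><span class="title">title</span></span></li>
  <li class="itemClass"><span class="spanClass"><span class="title">title something</span></span></li>
  <li class="itemClass"><span class="spanClass"><span class="title">title other thing</span></span></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Match every span whose class attribute is exactly "title", at any depth.
titles = [span.text for span in root.findall(".//span[@class='title']")]
print(titles)
```

The outer spans are skipped because their class is "spanClass", and reordering the <li> elements only changes the order of the results, not which elements match.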
One more thing: if you're getting into Selenium, I recommend you download the Selenium IDE extension for Firefox. It's a great tool for beginners. It helps you build your Selenium tests by recording your clicks on a website, and it also helps you auto-generate and test your XPath locators and other locators.
And again: I urge you to not make a website with duplicate id elements :-)
Does Selenium support XPath expressions like:
//span[@class='title']
If yes, then use the above XPath expression. It selects every span element in the XML document whose class attribute has the string value "title".
I recommend to use a tool like the XPath Visualizer to play with different XPath expressions and see the selected nodes highlighted in the source XML document.
After switching from Firefox testing to Internet Explorer testing, some elements couldn't be found by Selenium anymore.
I tracked down one locator:
xpath=(//a[@class='someclass'])[2]
While it works as it should under Firefox, it could not find this element in IE.
What alternatives do I have now? JS DOM? CSS selector? What would this locator look like?
Update:
I will provide an example to make my point:
<ul>
<li>
<a class='someClass' href="http://www.google.com">BARF</a>
</li>
<li>
<a class='someClass' href="http://www.google.de">BARF2</a>
</li>
</ul>
<div>
<a class='someClass' href="http://www.google.ch">BARF3</a>
</div>
The following xpath won't work:
//a[@class='someclass'][2]
In my understanding this should be the same as:
//a[@class='someclass' and position()=2]
and I don't have any links that are the second child of any node. All I want is to address one link from the set of links of class 'someClass'.
Without knowing the rest of your HTML source it's difficult to give you alternatives that are guaranteed to work. Hopefully the following suggestions will help point you in the right direction:
//a[@class='someClass'][2] This is like your example, but without the parentheses.
//a[contains(@class, 'someClass')][2] This will work even if the link has other classes.
css=a.someClass:nth-child(2) This will only work if the link is the 2nd child element of its parent.
Update
Based on your update, try the following: //body/descendant::a[@class='someClass'][2]
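The difference between indexing the whole result set and indexing within each parent can be illustrated outside Selenium. This is a sketch in Python's standard-library ElementTree using the markup from the question's update, wrapped in a <body> element (an assumption). Because ElementTree's own positional predicates are limited, the two XPath meanings are reproduced by hand here rather than evaluated directly.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<body>
  <ul>
    <li><a class="someClass" href="http://www.google.com">BARF</a></li>
    <li><a class="someClass" href="http://www.google.de">BARF2</a></li>
  </ul>
  <div><a class="someClass" href="http://www.google.ch">BARF3</a></div>
</body>
""")

# All matching links, in document order.
matches = DOC.findall(".//a[@class='someClass']")

# Meaning of (//a[@class='someClass'])[2] and of
# //body/descendant::a[@class='someClass'][2]: the second link overall.
second_overall = matches[1].text

# Meaning of //a[@class='someClass'][2]: for each parent, the second
# matching child link. No parent here has two such children.
second_per_parent = []
for parent in DOC.iter():
    siblings = [c for c in parent if c.tag == "a" and c.get("class") == "someClass"]
    if len(siblings) >= 2:
        second_per_parent.append(siblings[1].text)

print(second_overall)
print(second_per_parent)
```

In this markup every link is the only matching child of its parent, so the per-parent form selects nothing, while counting across the whole document (or across all descendants of body) reaches the second link.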