I am new to web crawling as well as xpath. However, I am trying to crawl the following website: https://sabobic.blogabet.com/
Basically, I want to extract all "feed_pick_analysis", i.e., all text content which belongs to each post.
I cannot use the statement below, because the ID changes dynamically.
xpath('.//div[@class="feed-pick-title"]/div[@class="col-xs-12 _text-more feed-analysis"]/div[@id="feed_pick_analysis_27759116"]/p').extract()
Thus, I tried to use the following statement:
xpath('.//div[@class="col-xs-12 _text-more feed-analysis"]/div[@contenteditable="false"]/p').extract()
However, I am not getting any data or tags back from the website at all... What is my mistake?
[EDIT] This is the html I am working on:
<div class="col-xs-12 _text-more feed-analysis">
<div contenteditable="false" id="feed_pick_analysis_27759116">
<p>Cant verify asians because nothing is working on this site.<br>
<br>
Game is available in IBC,ISN,SBO<br>
<br>
Game on neutral ground.<br>
<br>
No home advantage for Persipura and thats big minus for them today.<br>
<br>
So Persija will have many fans on the stands, supporting them, so thats more home game for Persija.<br>
<br>
They sign some quality players(Aryanto) and foreigners Xandao and spanish playmaker Tomas who seems to be best player in the league.<br>
<br>
Big value on Persija +0.25 and DNB.<br>
<br>
Fair odds Persija ML @2.10 and dnb @1.50. GL!</p> </div>
<div class="col-xs-12 no-padding margin-top-10">
<small class="last-edit "><em>
last edited: Wed, Sep 11th, 2019, 09:47 </em></small>
</div>
</div>
To make your XPath expression more flexible, you can ignore the number at the end of the @id value. Also note that your expression was missing a space between _text-more and feed-analysis - it has to be _text-more feed-analysis.
.//div[@class="col-xs-12 _text-more feed-analysis"]/div[contains(@id,"feed_pick_analysis")]/p
I merely removed the first div because it was not part of the sample HTML. Add it again, if the expression is not specific enough.
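As a rough illustration of the same idea outside Scrapy, here is a minimal sketch using only Python's standard library. It assumes a shortened, XML-well-formed copy of the sample HTML (with <br/> instead of <br>). ElementTree's XPath subset has no contains(), so the substring test on the id is done in Python; in Scrapy the expression above does this in one step. The function name extract_analyses is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: shortened, well-formed version of the HTML from the question.
SAMPLE = """
<div class="col-xs-12 _text-more feed-analysis">
  <div contenteditable="false" id="feed_pick_analysis_27759116">
    <p>Game on neutral ground.<br/>Big value on Persija +0.25 and DNB.</p>
  </div>
  <div class="col-xs-12 no-padding margin-top-10">
    <small class="last-edit"><em>last edited: Wed, Sep 11th, 2019, 09:47</em></small>
  </div>
</div>
"""

def extract_analyses(markup):
    """Return the text of every <p> inside a div whose id contains
    'feed_pick_analysis', regardless of the numeric suffix."""
    root = ET.fromstring(markup)
    texts = []
    for outer in root.iter("div"):
        if outer.get("class") != "col-xs-12 _text-more feed-analysis":
            continue
        for inner in outer.findall("div"):
            # Same test as contains(@id, "feed_pick_analysis") in full XPath.
            if "feed_pick_analysis" in inner.get("id", ""):
                for p in inner.findall("p"):
                    pieces = [t.strip() for t in p.itertext() if t.strip()]
                    texts.append(" ".join(pieces))
    return texts

analyses = extract_analyses(SAMPLE)
print(analyses)
```

If this returns data for saved HTML but the live spider still returns nothing, the posts are probably not in the response the spider receives, which is worth checking separately.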
We have embedded rich snippets in a website that rates rental properties. Here is an example:
http://www.google.de/webmasters/tools/richsnippets?q=http%3A%2F%2Fwww.sonnenquartiere.de%2Fferienwohnungen%2F4-wohnung-8-boardinghaus-norderney-2-zimmer-apartment.html
Works fine when listed in Google search results.
Now we want to aggregate all ratings and post them on the homepage, so that the homepage itself gets a rating in the Google search results:
http://www.google.de/webmasters/tools/richsnippets?q=www.sonnenquartiere.de
We did that some time ago, but the result in Google Search is still not displayed with the aggregated rating. Here is an example:
https://www.google.de/search?q=Boardinghaus+Norderney (2nd place here)
Is there something we can do to get this working?
One thing that I did notice about your markup is that you are using the schema.org/WebPage markup for your aggregate review rating. So search engines are seeing that schema as a rating for your home page. You should be using the aggregate rating schema with a schema that better describes your type of business, perhaps something in the schema.org/LodgingBusiness category.
I hope this is not too late and that I will be able to help.
Following the standard procedure is the best practice here. Make sure you use the correct markup: for a review, use the Review schema, as in the example below.
Use this tool to generate the code and then modify your website according to this working sample:
http://www.microdatagenerator.com/reviews-schema/
For example:
<div itemscope itemtype="http://schema.org/Review">
<div itemprop="itemReviewed" itemscope itemtype="http://schema.org/Thing"><span itemprop="name">home mortgage</span> </div>
<div itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name"> Nick M.</span>
</div><meta itemprop="datePublished" content="2016-01-01">
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1"/><span itemprop="ratingValue">5</span>/<span itemprop="bestRating">5</span> stars </div>
<span itemprop="description">My experience with ABC Company was very good and I recommend it to everyone. </span>
</div>
Take that code and copy the parts you need into your webpage.
This video was great:
https://www.youtube.com/watch?v=N2PjWtybDOs
I'm trying to get this rich snippet working. So far there are no errors, but no image is shown:
<div itemscope itemtype="http://schema.org/Product" style="display: none;">
<span itemprop="name">Canon EOS 5D (prove)</span>
<span itemprop="description"><p>Canon's press material for the <strong>EOS 5D</strong> states that it 'defines (a) new D-SLR category', while we're not typically too concerned with marketing talk this particular statement is clearly pretty accurate. The EOS 5D is unlike any previous digital SLR in that it combines a full-frame (35 mm sized) high resolution sensor (12.8 megapixels) with a relatively compact body (slightly larger than the <strong>EOS 20D</strong>, although in your hand it feels noticeably 'chunkier').</p>
</span>
<img itemprop="image" src="http://wonna.it/image/cache/data/demo/canon_eos_5d_2-74x74.jpg"/>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">3</span>
<span itemprop="reviewCount">3 recensioni</span>,
</div>
</div>
This is how I check there is no image: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fwonna.it%2Fcamera-eos-canon&html=
I think you're getting things mixed up.
Microdata is designed to describe your website so software can interpret your page and have a much better understanding of what you are publishing to the world.
This standard is not about how you look on Google!
Google also uses this standard (and many more techniques) to decide how they want to represent your site in their results.
Describing a product with an image won't force them to display it in their results.
I have a "Review-aggregate" microdata snippet on my site, and Google has cached it, but it is not appearing in the Google search results with the rating stars.
The URL that has the microdata in is:
http://www.rnsalert.com/
And here is the snippet:
<div class="ui-corner-bottom" id="micro-data-reviews" itemscope="" itemtype="http://data-vocabulary.org/Review-aggregate">
<span itemprop="itemreviewed">RNSalert</span> is rated
<span itemprop="rating" itemscope="" itemtype="http://data-vocabulary.org/Rating">
<span itemprop="average">9.0</span>
out of <span itemprop="best">10</span>
</span>
based on <span itemprop="votes">16</span> independent ratings.
</div>
Using Google's structured data test tool, it shows that the microdata is being parsed correctly...
http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fwww.rnsalert.com%2F&html=
Yet the Google search results aren't showing it. The page has been cached.
Search Google for "RNS alert" and you will get the page listed as the first organic result.
Any thoughts?
Many thanks,
Dan
Google should recognise all types defined in schema.org, but it supports rich snippets in the search results for these content types only:
Reviews
People
Products
Businesses and organizations
Recipes
Events
Music
Source:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170&topic=1088472&ctx=topic
Also, it may take a while before you're able to see them in the search results - maybe even days.
I'm 95% sure that for your rating/review stars to appear in Google's web results you must also set up Google authorship with the site/page.
6+ months ago this step wasn't required.
Your page contains microdata for a review aggregate, but it is not related to a product or item.
Check the examples here https://schema.org/AggregateRating
The aggregate needs to be within the scope of another item:
<div itemscope itemtype="http://schema.org/Restaurant">
<span itemprop="name">GreatFood</span>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">4</span> stars -
based on <span itemprop="reviewCount">250</span> reviews
</div>
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">1901 Lemur Ave</span>
<span itemprop="addressLocality">Sunnyvale</span>,
<span itemprop="addressRegion">CA</span> <span itemprop="postalCode">94086</span>
</div>
<span itemprop="telephone">(408) 714-1489</span>
<a itemprop="url" href="http://www.dishdash.com">www.greatfood.com</a>
Hours:
<meta itemprop="openingHours" content="Mo-Sa 11:00-14:30">Mon-Sat 11am - 2:30pm
<meta itemprop="openingHours" content="Mo-Th 17:00-21:30">Mon-Thu 5pm - 9:30pm
<meta itemprop="openingHours" content="Fr-Sa 17:00-22:00">Fri-Sat 5pm - 10:00pm
Categories:
<span itemprop="servesCuisine">
Middle Eastern
</span>,
<span itemprop="servesCuisine">
Mediterranean
</span>
Price Range: <span itemprop="priceRange">$$</span>
Takes Reservations: Yes
</div>
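To make the nesting point concrete, here is a small sketch that reports whether an aggregate-rating item sits inside the scope of another item. It is only an illustration built on Python's html.parser, not Google's validator. The class name RatingScopeChecker, the helper check, and the two shortened snippets are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are not tracked as open.
VOID_TAGS = {"meta", "img", "br", "hr", "link", "input"}

class RatingScopeChecker(HTMLParser):
    """Records, for each aggregate-rating item, whether it is nested
    inside another itemscope and attached to it via itemprop."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # one flag per open element: has itemscope?
        self.results = []        # (itemtype, is_nested) per aggregate rating

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        has_scope = "itemscope" in attr_map
        itemtype = attr_map.get("itemtype") or ""
        if has_scope and "aggregate" in itemtype.lower():
            nested = any(self.open_elements) and "itemprop" in attr_map
            self.results.append((itemtype, nested))
        if tag not in VOID_TAGS:
            self.open_elements.append(has_scope)

    def handle_endtag(self, tag):
        # Naive: assumes tags are properly balanced.
        if tag not in VOID_TAGS and self.open_elements:
            self.open_elements.pop()

def check(markup):
    checker = RatingScopeChecker()
    checker.feed(markup)
    checker.close()
    return checker.results

# Shortened versions of the two snippets discussed in this thread.
STANDALONE = ('<div itemscope="" itemtype="http://data-vocabulary.org/Review-aggregate">'
              '<span itemprop="votes">16</span></div>')
NESTED = ('<div itemscope itemtype="http://schema.org/Restaurant">'
          '<span itemprop="name">GreatFood</span>'
          '<div itemprop="aggregateRating" itemscope '
          'itemtype="http://schema.org/AggregateRating">'
          '<span itemprop="ratingValue">4</span></div></div>')

print(check(STANDALONE))
print(check(NESTED))
```

The first snippet is flagged as not nested and the second as nested, which mirrors the difference between the markup in the question and the Restaurant example.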
Unfortunately, just because you put something on your site doesn't mean that Google has to show it in their search results.
I would assume that they need to have a certain amount of trust in your site before anything will show. If not then everyone would just mark their own products/posts with 5 stars so it looks good in a search listing page.
In my experience, it is a lot easier to get rating stars in listings when your site contains votes for other people's products rather than your own.
Google help:
The search preview is approximate.
Real Google Search results for your data might look different.
The search preview is illustrative.
We do not guarantee that the content you preview will be displayed in Search results. Your content will likely appear in Search results for relevant queries, but that content must first pass through our systems to be appropriately indexed and ranked before it appears in actual Search results. We reserve the right to filter out any results at our discretion if it violates our quality guidelines. Finally, any preview tool URL for your content is only valid for a limited duration (e.g. weeks).
I have been trying for over 2 hours to import the timestamps from a zap2it.com link into my Google spreadsheet.
Here is the link I am trying to importxml from:
http://affiliate.zap2it.com/tvlistings/ZCGrid.do?zipcode=78238&lineupId=DISH641:-
Here is what I am trying to import
Here is what I have tried so far:
=importxml("http://affiliate.zap2it.com/tvlistings/ZCGrid.do?aid=dish&pkg=8388608&fromProvider=true&zipcode=78238&x=52&y=18"&B1,"//body//div[3]/div/div/div[3]/div/div")
EDIT
I was able to improve the expression and get better results:
//body//div[3]/div/div/div[1]//*
but it shows timestamps from all over the page, which is not exactly what I need.
[The first complication is that the data stream returned from dereferencing that URI is not actually XML; it has several thousand well-formedness errors (unescaped ampersands in URIs, unescaped ampersands and less-than signs in scripts, some embedded HTML, some miscellaneous errors). Since you're not reporting problems from that, however, I'll assume that somewhere between the server and your XPath expression someone is doing some tidying.]
I think you'll get better results if you use the id and class attributes that are extensively used in the document. The material you want looks like this in the source (you can use any browser-based debugging tool to find it; I used the 'Web Inspector' in Safari); I have indented to make the structure more visible, and fixed some well-formedness errors in one of the a elements (missing whitespace between attribute-value pairs).
<div class="zc-tn" id="zc-tn-top">
<div class="zc-tn-i">
<a href="ZCGrid.do?fromTimeInMillis=1355781600000"
class="zc-tn-l"
title="Move the grid three hours earlier"></a>
<div class="zc-tn-c">
<span class="zc-tn-z"
title="Central Standard Time">CST</span>
<div class="zc-tn-t">7:00 PM</div>
<div class="zc-tn-t">7:30 PM</div>
<div class="zc-tn-t">8:00 PM</div>
<div class="zc-tn-t">8:30 PM</div>
<div class="zc-tn-t">9:00 PM</div>
<div class="zc-tn-t">9:30 PM</div>
</div>
<a href="ZCGrid.do?fromTimeInMillis=1355803200000"
class="zc-tn-r"
title="Advance the grid three hours"></a>
</div>
</div>
A simple search verifies that the value zc-tn-top is indeed unique as an ID value in the document. Given that, a simple XPath expression to retrieve all the elements whose display is circled in your image is (assuming xhtml is bound to the XHTML namespace):
//xhtml:div[@id='zc-tn-top']//xhtml:div[@class='zc-tn-t']
It looks from your question as if your XPath evaluator is namespace-challenged or namespace-oblivious, so you may need to write this as
//div[@id='zc-tn-top']//div[@class='zc-tn-t']
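For anyone who wants to try the namespace-free expression outside the spreadsheet, here is a minimal sketch using Python's standard-library ElementTree, which supports this subset of XPath. The markup is a shortened copy of the fragment above. The second block with id zc-tn-bottom is a made-up addition, there only to show that the id restriction excludes other time rows.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the grid header; "zc-tn-bottom" is hypothetical.
GRID = """
<body>
  <div class="zc-tn" id="zc-tn-top">
    <div class="zc-tn-i">
      <div class="zc-tn-c">
        <span class="zc-tn-z" title="Central Standard Time">CST</span>
        <div class="zc-tn-t">7:00 PM</div>
        <div class="zc-tn-t">7:30 PM</div>
        <div class="zc-tn-t">8:00 PM</div>
      </div>
    </div>
  </div>
  <div class="zc-tn" id="zc-tn-bottom">
    <div class="zc-tn-t">7:00 PM</div>
  </div>
</body>
"""

root = ET.fromstring(GRID)

# ElementTree paths are relative to the root element, hence the leading dot.
path = ".//div[@id='zc-tn-top']//div[@class='zc-tn-t']"
times = [div.text for div in root.findall(path)]
print(times)
```

Only the three entries under zc-tn-top are returned; the entry under the second block is skipped.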
I'm trying to automate testing of the code... well, written without testing in mind (no IDs on many elements, and a lot of elements with the same class names). I would appreciate any help (questions are below the code):
<div id="author-taxonomies" class="menu-opened menu-hover-opened-inactive" onmouseover="styleMenuElement(this)" onmouseout="styleMenuElement(this)" onclick="toggleSFGroup(this)">Author</div>
<div id="author-taxonomies-div" class="opened">
<div id="top-level-menu" class="opened">
<div id="top-level-menu-item-1" class="as-master">
<div class="filter-label"> Name</div>
</div>
<div id="top-level-menu-item-1" class="as-slave"
style="top: 525px; left: 34px; z-index: 100; display: none;"> </div>
<div id="top-level-menu-item-2" class="as-master">
<div class="filter-label">Title</div>
</div>
<div id="top-level-menu-item-2" class="as-slave">
<div id="top-level-menu-item-2" class="as-slave-title as-slave-title-subgroup"
>Title</div>
<div id="top-level-menu-item-2" class="as-slave-body"> </div>
<div class="as-slave-buffer"> </div>
</div>
<div id="top-level-menu-item-3" class="as-master">
<div class="filter-label">Location</div>
</div>
<div id="top-level-menu-item-3" class="as-slave"> </div>
</div>
</div>
The question is: how do I refer to particular labels of this menu and their properties with XPath expressions? For example, if I want to:
verify the "Location" label is there
check if "Title" with class "as-slave" is not visible at the moment
It would be something similar to:
//div[@id="top-level-menu-item-3"]/div[@class="filter-label"]
//div[@id="top-level-menu-item-2"][@class="as-slave"] --- and check in code for display: none ... assuming it is Selenium RC you are using
Update: also be sure to install the following Firefox add-on; it is really useful when trying different XPath expressions on a site: https://addons.mozilla.org/en-US/firefox/addon/1095
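The two checks from the question can be tried offline with a small sketch like the one below. It assumes a shortened, well-formed copy of the menu markup and uses Python's standard-library ElementTree rather than Selenium. The helper names label_text and slave_is_hidden are made up for this example. Note that it only looks at the inline style attribute; a real browser-driven test would ask the browser for visibility, since CSS rules can also hide an element.

```python
import xml.etree.ElementTree as ET

# Assumption: shortened, well-formed copy of the menu from the question.
MENU = """
<div id="top-level-menu" class="opened">
  <div id="top-level-menu-item-1" class="as-master">
    <div class="filter-label"> Name</div>
  </div>
  <div id="top-level-menu-item-1" class="as-slave"
       style="top: 525px; left: 34px; z-index: 100; display: none;"> </div>
  <div id="top-level-menu-item-2" class="as-master">
    <div class="filter-label">Title</div>
  </div>
  <div id="top-level-menu-item-2" class="as-slave"> </div>
  <div id="top-level-menu-item-3" class="as-master">
    <div class="filter-label">Location</div>
  </div>
</div>
"""

root = ET.fromstring(MENU)

def label_text(item_id):
    """Return the filter label under the given menu item, or None."""
    label = root.find(f".//div[@id='{item_id}']/div[@class='filter-label']")
    return label.text.strip() if label is not None else None

def slave_is_hidden(item_id):
    """True if the as-slave div for item_id has an inline display: none,
    False if it does not, None if there is no such div."""
    slave = root.find(f".//div[@id='{item_id}'][@class='as-slave']")
    if slave is None:
        return None
    style = slave.get("style", "").replace(" ", "")
    return "display:none" in style

print(label_text("top-level-menu-item-3"))
print(slave_is_hidden("top-level-menu-item-1"))
print(slave_is_hidden("top-level-menu-item-2"))
```

Because the ids are repeated in this markup, the class predicate is what distinguishes the as-master div from the as-slave div.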
As a side note: try to avoid using XPath locators in Selenium if possible. If you have a long XPath expression, it can be up to 20 times slower for Selenium to find the element than when identifying it by its unique ID. Of course, sometimes there is no alternative to using XPath. However, when you do use it, keep '//' expressions to a minimum - this is a real performance killer.
If you're just starting with Selenium, download the selenium add-on for Firefox. As you click on DOM elements, Selenium shows you the xpath to access it.
I am currently working on an open source library for generating xpath expressions through a fluent .Net API. The idea is to be able to generate xpath based selenium locators without having to know xpath.
Here's an example of how the library can be used in your case:
XPathFinder.Find.Tag("div").With.Attribute("id", "top-level-menu-item-3").And.Child("div").With.Attribute("class", "filter-label").ToXPathExpression();
This will produce the following xpath:
"//div[@id='top-level-menu-item-3']/div[@class='filter-label']"
Check it out at
http://code.google.com/p/xpathitup/
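To show the general idea of a fluent builder, here is a tiny Python sketch that produces the same expression. It is not the library's implementation or API. The class XPathBuilder and its method names are hypothetical, and attribute values containing single quotes are not escaped.

```python
class XPathBuilder:
    """Hypothetical fluent builder for simple XPath locators."""

    def __init__(self):
        self._parts = []

    def tag(self, name):
        # Start a step that matches the tag anywhere in the document.
        self._parts.append("//" + name)
        return self

    def child(self, name):
        # Add a step that matches a direct child of the previous step.
        self._parts.append("/" + name)
        return self

    def with_attribute(self, name, value):
        # Attach an attribute predicate to the most recent step.
        self._parts[-1] += "[@{}='{}']".format(name, value)
        return self

    def to_xpath(self):
        return "".join(self._parts)

expr = (
    XPathBuilder()
    .tag("div").with_attribute("id", "top-level-menu-item-3")
    .child("div").with_attribute("class", "filter-label")
    .to_xpath()
)
print(expr)
```

Each method returns the builder itself, which is what allows the calls to be chained.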
You can use FirePath, which can be installed on top of Firebug (both are Firefox plugins). When you get an XPath, don't forget to prepend // before using it, either in code or in Selenium IDE. You are not prepending it, which is why it is unusable. There are two types of XPath: absolute and relative. If you use absolute, it will take care of dynamic IDs, but if you use relative, it will break with each run.