Xpath syntax when scraping headlines from CNN homepage

I tried to scrape CNN homepage with scrapy.
I used the following xpath selectors, but all of them returned empty lists.
Current results: all of these return []
"//strong"
"//h2"
"//span[@class='cd__headline-text']"
Expected results :
[Headline_1, Headline_2, Headline_3, ...]
Can someone help me figure out why?
Is CNN doing something to stop people from scraping headlines?
I use Scrapy.

Before writing an XPath or CSS selector for any web page, first check the page source to confirm that the elements you want to select actually exist there. In this case, none of the selectors above match anything in the page source. CNN loads the page content through several separate requests; check the network tab of your browser's developer tools and find the requests that return the data you need. You need to make those requests in your spider in order to scrape news from CNN.
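One way to run that first check is to count the target elements in the raw HTML, before any JavaScript has run. The sketch below uses only Python's standard library. The sample markup and the `ClassCounter` and `count_in_source` names are made up for illustration; in a real check the string would be the body of the response your spider received.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one type whose class attribute contains a given name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.count += 1


def count_in_source(html, tag, class_name):
    """Return how many matching elements appear in the given HTML text."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    return parser.count


# Hypothetical page source: the headline container is empty until scripts fill it.
RAW_SOURCE = (
    '<html><body><div id="headlines"></div>'
    '<script src="app.js"></script></body></html>'
)
# Hypothetical DOM after rendering, as a browser inspector would show it.
RENDERED_DOM = (
    '<html><body><div id="headlines">'
    '<span class="cd__headline-text">Headline_1</span></div></body></html>'
)

print(count_in_source(RAW_SOURCE, "span", "cd__headline-text"))
print(count_in_source(RENDERED_DOM, "span", "cd__headline-text"))
```

If the count is zero for the raw source but the element is visible in the browser inspector, the content is most likely arriving through a later request.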

Related

Xpath is correct but Scrapy spider doesn't work

I'm trying to download from a webpage, I identify the XPath expression and then run the spider, but nothing is downloaded.
The webpage: https://octopart.com/electronic-parts/integrated-circuits-ics
Here is the code:
for product in response.xpath("//div[@class='serp-card-header media']/div[@class='media-body']"):
    yield {'name': product.xpath("//a/span[@class='part-card-manufacturer']/text()").extract_first()}

This website seems to be using some simple bot detection. You are most likely using the default scrapy user agent. So instead you need to set a real user agent in your settings.py:
USER_AGENT = '[replace with a real user agent]'
Refer to the documentation.
After doing this you will get some results. However, your XPath is incorrect as well. Inside the for loop, when you do a relative XPath, it needs to start with .//a/span.... See here for the reason why: https://docs.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
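The effect of the missing leading dot can be sketched without Scrapy. The example below is only a stand-in: it uses the standard library's ElementTree on a made-up, well-formed snippet (the manufacturer names are invented), and it imitates the document-wide `//a/...` query by searching from the root instead of from the current card.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the cards in the question.
SAMPLE = """
<html><body>
  <div class="serp-card-header media"><div class="media-body">
    <a><span class="part-card-manufacturer">Acme</span></a></div></div>
  <div class="serp-card-header media"><div class="media-body">
    <a><span class="part-card-manufacturer">Globex</span></a></div></div>
</body></html>
""".strip()

CARD_PATH = ".//div[@class='serp-card-header media']/div[@class='media-body']"
NAME_PATH = ".//a/span[@class='part-card-manufacturer']"


def names_scoped(root):
    """Query relative to each card, so every card yields its own name."""
    return [card.findtext(NAME_PATH) for card in root.findall(CARD_PATH)]


def names_unscoped(root):
    """Query the whole document on every iteration (what //a/... does in the loop)."""
    return [root.findtext(NAME_PATH) for _ in root.findall(CARD_PATH)]


root = ET.fromstring(SAMPLE)
print(names_scoped(root))
print(names_unscoped(root))
```

The unscoped version returns the first card's name once per card, which is the symptom to expect from the loop in the question once the user agent is fixed.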

Xpath to url for import.io

I'm getting list of offered jobs on this site: http://telekom.jobs/global-careers
I'm trying to get XPath of link to get more info about job.
Here is the whole XPath to the first link:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[2]/td/div/a/@href
and this is what I should paste to import.io:
tr[2]/td/div/a/@href
But it doesn't work, I don't know why.
The links to the pages with more info about each job offer have these XPaths:
tr[2]/td/div/a/@href
tr[4]/td/div/a/@href
tr[6]/td/div/a/@href
tr[8]/td/div/a/@href
and so on.
Maybe that's why it doesn't work? Because the numbers aren't 1, 2, 3, etc. but 2, 4, 6? Or am I doing something wrong?

If you create an API from URL 2.0 and reload the website with JS on but CSS off, you should be able to see the collapsible menu.
The DOM on this website is constructed in such a way that all the odd rows hold job titles, whereas more information about each job is hidden in the even rows. To select those rows we can use XPath's position() function, so you can use the following XPath in manual row training:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]
This highlights only the "more information" boxes and gives you access to the data inside them. From here you can simply target the specific attributes of the elements that hold the title and the link.
Link xpath: .//a[@class='forward jobadview']/@href
Title xpath: .//div[@class='info']//h3
Having said that, due to the heavy use of JS on the website, it may fail to publish, so we have created an API for you to query; you can retrieve the same data using that here.
https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a
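The even-row selection above can be tried out in plain Python. This is a rough sketch, not import.io's behaviour: the standard library's ElementTree supports only a limited XPath subset without `position() mod 2 = 0`, so the sketch takes every second row by list slicing instead. The table content, job titles, and hrefs are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: odd rows carry the title, even rows the details.
SAMPLE_TABLE = """
<table><tbody>
  <tr><td>Network Engineer</td></tr>
  <tr><td><div class="info"><h3>Network Engineer</h3>
    <a class="forward jobadview" href="/jobs/1">more</a></div></td></tr>
  <tr><td>Data Analyst</td></tr>
  <tr><td><div class="info"><h3>Data Analyst</h3>
    <a class="forward jobadview" href="/jobs/2">more</a></div></td></tr>
</tbody></table>
""".strip()


def extract_jobs(table):
    """Return (title, link) pairs taken from the even-numbered rows."""
    rows = table.findall("./tbody/tr")
    detail_rows = rows[1::2]  # 2nd, 4th, ... row, like tr[position() mod 2 = 0]
    jobs = []
    for row in detail_rows:
        title = row.findtext(".//div[@class='info']//h3")
        link = row.find(".//a[@class='forward jobadview']").get("href")
        jobs.append((title, link))
    return jobs


jobs = extract_jobs(ET.fromstring(SAMPLE_TABLE))
print(jobs)
```

Both inner paths start with `.//`, so each one is evaluated relative to its own row rather than against the whole table.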

Yahoo Pipes to loop through all pages

I am looking to pull job postings from a site that has multiple pages of postings. I can pull the content from one page.
On a simple example I can get it to iterate and grab page content (this is a simple example site base).
However, when I take the first example and try to clean the data, I can't use the XPath filter to grab the HTML id, and I can't seem to find a way to limit the scope elsewhere. Here is what I am trying (regex, rename...):
http://pipes.yahoo.com/pipes/pipe.edit?_id=3619ea93d66e47442659a1976746ba6c
Any thoughts?

Use of Mechanize

I want to get responses from websites that take a simple input, which is also reflected in a parameter of the URL. Is it better to simply get the result by conventional methods, for example OpenURI.open_uri(...) with some parameter set, or is it better to use mechanize, extract the form, and get the result through submit?
The mechanize page gives an example of extracting a form and submitting it to get the search result from Google search. However, this much can be done simply as OpenURI.open_uri("http://www.google.com/search?q=...").read. Is there any reason I should try to use one way or the other?

There are lots of sites where it turns out to be easiest to use mechanize. If you need to log in and set a cookie before accessing the data, then mechanize is a simple way of doing this. Similarly, if there are lots of hidden fields that need to be matched (such as a CSRF token), then fetching the page with mechanize and submitting the form with the data filled in is often a more foolproof method than crafting the URL yourself.
If it is a simple URI, like Google's search pages, then manually constructing it may be simpler.
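To show why hidden fields make hand-built requests fragile, here is a small sketch of the bookkeeping a form-aware tool does for you. It is written in Python with the standard library only, not with mechanize or Ruby, and the form markup, the `csrf_token` field name, and its value are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_data(form_html, user_fields):
    """Merge the page's hidden fields with the values the user fills in."""
    parser = HiddenFieldParser()
    parser.feed(form_html)
    return {**parser.fields, **user_fields}


# Hypothetical form as it might appear in a fetched page.
FORM_HTML = (
    '<form action="/search" method="post">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input type="text" name="q">'
    '</form>'
)

data = build_form_data(FORM_HTML, {"q": "ruby mechanize"})
body = urlencode(data)
print(body)
```

Because the token usually changes per session, the page has to be fetched again before each submission, and at that point a form-handling library tends to be less work than doing this by hand.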

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.

When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you need to analyze what happens when you enter that page and type a search query. Some JavaScript in the page takes your query and calls (via XMLHttpRequest or a similar AJAX technique) another script on Apple's server. That script is the one that actually searches a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or find some other way of seeing the actual requests that a page and its JavaScript components send and receive. You'll see that, for the search page you referred to, it fetches two parts. First, the "featured" results come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck fetching these URLs directly and parsing them with Hpricot.
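As a rough illustration of that approach (in Python rather than Ruby and Hpricot), the two request URLs can be rebuilt from a search string as below. The endpoints and parameter names are copied from the URLs above; the function names are made up, the meaning of `snum` as a result count is a guess, and the endpoints may no longer respond, so `fetch_xml` is defined but not called here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

FEATURED_ENDPOINT = "http://www.apple.com/global/scripts/search_featured.php"
RESULTS_ENDPOINT = "http://www.apple.com/search/service/nph-search10"


def featured_url(query, section="global", geo="uk"):
    """Build the URL that returned the "featured" results."""
    params = {"q": query, "section": section, "geo": geo}
    return FEATURED_ENDPOINT + "?" + urlencode(params)


def results_url(query, site="uk_www", count=50):
    """Build the URL that returned the long results list."""
    # Assumption: snum is the number of results requested.
    params = {"site": site, "filter": 1, "snum": count, "q": query}
    return RESULTS_ENDPOINT + "?" + urlencode(params)


def fetch_xml(url):
    """Download a URL and parse the body as XML (not called in this sketch)."""
    with urlopen(url) as response:
        return ET.fromstring(response.read())


print(featured_url("mac mini"))
print(results_url("mac mini"))
```

The element names inside the returned XML are not shown in the answer, so inspect one response before writing any extraction code.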
