Yahoo Pipes to loop through all pages - yahoo-pipes

I am looking to pull job postings from a site that has multiple pages of postings. I can pull the content from one page
On a simple example I can get it to iterate and grab page content (this is a simple example site base)
However when I take the first example and try to clean the data (I can't use the Xpath filter to grab the HTML id and I cand seem to find a way to limit the scope elsewhere. Here is what I am trying (regex, rename...):
http://pipes.yahoo.com/pipes/pipe.edit?_id=3619ea93d66e47442659a1976746ba6c
Any thoughts?

Related

Xpath syntax when scraping headlines from CNN homepage

I tried to scrape CNN homepage with scrapy.
I used the following xpath selectors, but all of them returned empty lists.
Current results : all of these returns []
"//strong"
"//h2"
"//span[#class='cd__headline-text']"
Expected results :
[Headline_1, Headline_2, Headline_3, ...]
Can someone help me figure out why?
Is CNN doing something to stop people from scraping headlines?
I use Scrapy.
In order to write XPath/CSS selector or any web page, first of all, check page source that whether the selectors which you are looking for exists or not. In the current case none of the above selectors are found in page source. They are getting page content in various requests, try checking the network and find appropriate requests for your case. You need to make those requests in your spider in order to scrape news from CNN.

Best way to output the content i load from ajax

I want to load objects with ajax and then for every object make with options.
What is better to do when i load content from server via ajax:
put everything in a string variable (including html tags) named output with += and put it on a loop for each object,then append it.
append an output for every object i load in a div
or a better solution
if there is a better solution is there anyone who can help me ?
Well that really depends on a technology stack that you are using.
But in essence it is exactly what is going on.
You have some array of products coming back from an ajax request.
You want to clear out the contents of the area where you are inserting those.
Then you wanna iterate over your collection of items and using some kind of template generate an html for each one. the simplest one in plain java script would be string concatenation.
then you concatenate them all and insert the result as innerHTML inside your container.
It may be ugly for a start but as you will learn more - you will improve.

Xpath to url for import.io

I'm getting list of offered jobs on this site: http://telekom.jobs/global-careers
I'm trying to get XPath of link to get more info about job.
Here is the whole XPath to the first link:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[2]/td/div/a/#href
and this is what I should paste to import.io:
tr[2]/td/div/a/#href
But it doesn't work, I don't know why.
Links to more info about job offer pages are having XPath:
tr[2]/td/div/a/#href
tr[4]/td/div/a/#href
tr[6]/td/div/a/#href
tr[8]/td/div/a/#href
and so on.
Maybe that's why it doesn't work? Because the numbers arent 1,2,3 etc but 2,4,6? Or do I do something wrong?
If you create an API from from URL 2.0 and reload the website with JS on but CSS off you should be able to see the collapsible menu:
DOM is constructed in such a way on this website that all the odd rows have job titles whereas more information about the job is hidden in the even rows. For that we can use position() property of XPath, so you can use the following XPath on manual row training:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]
Which highlights the more information boxes only giving you access to the data inside. From here you can simply target the specific attributes of the elements that have title and link available.
Link xpath: .//a[#class=’forward jobadview’]/#href
Title xpath: .//div[#class=’info’]//h3
Having said that due to the heavy use of JS on the website, it may fail to publish so we have created an API for you to query and you can retrieve the same data using that here.
https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a

Disqus Comment Count Not Always Passing URL with AJAX

Basically in a nut shell I'm using AJAX to pass through say 30 items in an infinite scroll/masonry setup. In my AJAX file, one of the lines basically is requesting the WEB URL + the Unique Key for the page + the Disqus Hash Tag at the end to get the comment count for the page and to display this on the index. E.g. it looks like this
http://url.com?id=".$row['pkey']."#disqus_thread\">
For some reason, I'm getting mixed results. Sometimes, I will get results for all of the items as I scroll down and they load with masonry. Other times I will get no comment count and it will simply be blank for all of the items loaded via masonry. It's very sporadic and doesn't always pass the value through.
I know I could use the API and do some more complicated things, but for my purpose I'm simply looking to get a post count for the relevant URL and display it, that's all.
Any tips on why this passes and other times it does not? Everything else in my page seems in order.

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or some other way of seeing the actual requests a page and its javascript components send and / or receive. You'll see that, for the search page you referred, it fetches two parts: First, the "featured" results that come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
These both are XML documents; you might have better luck parsing these URLs with Hpricot.

Resources