Xpath to url for import.io - xpath

I'm getting a list of job offers from this site: http://telekom.jobs/global-careers
I'm trying to get the XPath of the link that leads to more info about each job.
Here is the whole XPath to the first link:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[2]/td/div/a/@href
and this is what I should paste into import.io:
tr[2]/td/div/a/@href
But it doesn't work, and I don't know why.
The links to the job offer detail pages have these XPaths:
tr[2]/td/div/a/@href
tr[4]/td/div/a/@href
tr[6]/td/div/a/@href
tr[8]/td/div/a/@href
and so on.
Maybe that's why it doesn't work? Because the numbers aren't 1, 2, 3, etc. but 2, 4, 6? Or am I doing something wrong?

If you create an API from URL 2.0 and reload the website with JS on but CSS off, you should be able to see the collapsible menu.
The DOM on this website is constructed so that the odd rows hold the job titles, while more information about each job is hidden in the even rows. To select only those rows we can use XPath's position() function, so you can use the following XPath in manual row training:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]
This highlights only the more-information boxes, giving you access to the data inside. From here you can simply target the specific attributes of the elements that hold the title and the link.
Link xpath: .//a[@class='forward jobadview']/@href
Title xpath: .//div[@class='info']//h3
Having said that, due to the heavy use of JS on the website, it may fail to publish, so we have created an API for you to query. You can retrieve the same data using it here:
https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a
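
The even-row selection above can also be tried outside import.io. Below is a minimal sketch in Python using only the standard library. The SAMPLE markup is a hypothetical, simplified stand-in for the job table (the real page's markup may differ), and extract_jobs is an illustrative name. Because ElementTree's XPath subset does not support position() mod 2 = 0, the even-row filter is done in Python instead, while the link and title lookups reuse the relative XPaths from the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the job table; not the real page markup.
SAMPLE = """<table><tbody>
  <tr><td>Job A</td></tr>
  <tr><td><div class="info"><h3>Job A</h3>
      <a class="forward jobadview" href="/job/a">More</a></div></td></tr>
  <tr><td>Job B</td></tr>
  <tr><td><div class="info"><h3>Job B</h3>
      <a class="forward jobadview" href="/job/b">More</a></div></td></tr>
</tbody></table>"""


def extract_jobs(table_xml):
    """Return (title, href) pairs taken from the even rows of the table."""
    root = ET.fromstring(table_xml)
    rows = root.findall("./tbody/tr")
    jobs = []
    # XPath positions are 1-based, so tr[position() mod 2 = 0] means rows 2, 4, 6, ...
    for position, row in enumerate(rows, start=1):
        if position % 2 != 0:
            continue  # odd rows only hold the collapsed job title
        link = row.find(".//a[@class='forward jobadview']")
        title = row.find(".//div[@class='info']//h3")
        if link is not None and title is not None:
            jobs.append((title.text, link.get("href")))
    return jobs


print(extract_jobs(SAMPLE))
```

With a full XPath engine the row filter could stay inside the expression itself; the Python-side filter here is only a workaround for the standard library's limited XPath support.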

Related

Xpath syntax when scraping headlines from CNN homepage

I tried to scrape CNN homepage with scrapy.
I used the following xpath selectors, but all of them returned empty lists.
Current results: all of these return []
"//strong"
"//h2"
"//span[@class='cd__headline-text']"
Expected results :
[Headline_1, Headline_2, Headline_3, ...]
Can someone help me figure out why?
Is CNN doing something to stop people from scraping headlines?
I use Scrapy.
Before writing an XPath or CSS selector for any web page, first check the page source to see whether the elements you are targeting actually exist there. In this case none of the above selectors match anything in the page source. CNN loads the page content through several separate requests, so check the network tab of your browser's developer tools and find the requests that return the data you need. You then have to make those requests in your spider in order to scrape news from CNN.
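
The "check the page source first" step can be automated before any spider is written. The following is a rough sketch using only Python's standard library rather than Scrapy itself. ClassCounter, count_class_in_source and both sample strings are hypothetical names and data made up for illustration; in practice the input would be the raw response body, not the DOM shown in the browser after JavaScript has run.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class_in_source(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical raw responses: one with server-rendered headlines,
# one that is only a shell filled in later by JavaScript.
STATIC_PAGE = '<div><span class="cd__headline-text">Headline</span></div>'
JS_SHELL = '<div id="root"></div><script>loadHeadlines()</script>'

print(count_class_in_source(STATIC_PAGE, "cd__headline-text"))
print(count_class_in_source(JS_SHELL, "cd__headline-text"))
```

A count of zero on the raw response suggests the content arrives through a later request, which is the situation described in the answer.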

Extracting a link with jmeter

So I need to delete an "onclick" dynamic link using jmeter.
Here is the sample of one of the links:
"Delete"
What I need is to extract the number and post it in order to perform the delete action. Every link is the same except for the number.
I have tried to implement some of the solutions I've found on this site but they didn't work.
Thanks in advance
Peace
If you need to do it with XPath, you could try the substring-after function, like:
substring-after(//a[text()='Delete']/@href,'param=')
The above expression returns everything that comes after the param= text in the href attribute of the a (anchor) tag whose text is Delete.
You can test your XPath expressions against actual server response using XPath Tester tab of the View Results Tree listener.
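
To make the behaviour of substring-after concrete, here is a small Python sketch of its XPath 1.0 semantics. It is not JMeter code. The sample_href value is a hypothetical example, since the actual link markup is not shown in the question, and the param= marker is taken from the expression above.

```python
def substring_after(value, marker):
    """Mimic XPath 1.0 substring-after(value, marker).

    Returns the text following the first occurrence of marker,
    or an empty string when marker does not occur in value.
    """
    if marker == "":
        # XPath returns the whole string for an empty marker;
        # str.partition would raise ValueError instead.
        return value
    return value.partition(marker)[2]


# Hypothetical href; the real link format is not given in the question.
sample_href = "delete.php?param=12345"

print(substring_after(sample_href, "param="))
```

Note that if the marker is missing the result is an empty string rather than an error, so an empty extracted value in JMeter usually means the expression did not match the expected href.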
References:
substring-after Function Reference
XPath 1.0 Language Reference
Using the XPath Extractor in JMeter
XPath Tutorial

Yahoo Pipes to loop through all pages

I am looking to pull job postings from a site that has multiple pages of postings. I can pull the content from one page.
On a simple example I can get it to iterate and grab page content (this is a simple example site base).
However, when I take the first example and try to clean the data, I can't use the XPath filter to grab the HTML id, and I can't seem to find a way to limit the scope elsewhere. Here is what I am trying (regex, rename...):
http://pipes.yahoo.com/pipes/pipe.edit?_id=3619ea93d66e47442659a1976746ba6c
Any thoughts?

How to load the jqgrid in a selector with context

In general we call jqGrid as in $("#grid_loc").jqGrid({});
But I want to specify the context, like $("#grid_loc",context).jqGrid({}). This is not working. Can somebody help with this?
I have to load server-side data using the url option.
In fact, I ran into this because I have tabs on my page.
In each tab I have to have a jqGrid: not different grids, but the same grid with different data.
Here I am getting the tab context using var tabset = $("div.tabset");
newdivid = $("div[class*='active_tab']",tabset).attr("id");
var newmenudivid = $("#"+newdivid);
And
the grid code as
$("#grid_workflow", newmenudivid).jqGrid({....});
I have been trying to find a way to do this. You can find some of my efforts in the comments section of the link
how to develop same jqgrid in multiple tabs
I was successful with id overwriting for the same purpose, but that is not a good way. So I am forced to take another approach, i.e. context.
I suppose that you misunderstand some important things about the id attribute. The most important one is that every element on the page that has an id attribute must have a unique value for it. In other words, ids have to be unique across the whole HTML page.
So if you need to create, for example, three grids inside three tabs, you have to define a different id attribute for every grid, for example grid_workflow1, grid_workflow2, grid_workflow3. If you create the tabs and grids dynamically, you can keep a variable in the outer scope (for example a global variable) and increase its value each time. You can then construct the id of the grid from a prefix (like "grid_workflow") and the value of the variable. This way you can create multiple grids with unique ids. Many JavaScript libraries generate unique id attributes this way. If you want, you can use the $.jgrid.randId() method, which returns unique strings that can be used as ids.
Regarding the syntax $("#grid_workflow", newmenudivid), you should understand one important thing: I would recommend never using it. The reason is simple. It could help only if you have duplicate ids. In all other cases it will work exactly like $("#grid_workflow"), only more slowly. The reason is easy to understand. The web browser internally holds a list of all ids on the page, and if you use the getElementById method directly or indirectly (as in $("#grid_workflow")), finding the element with the required id is like looking it up in a database index. So you will get the best performance. If you use $("#grid_workflow", newmenudivid), you don't allow the web browser to use its index of elements by id, so using the context leads to a slow search through all child elements of newmenudivid. You should therefore avoid using a jQuery context with id selectors.
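
The prefix-plus-counter scheme described above is independent of the language. As a rough illustration only, here it is sketched in Python rather than JavaScript. make_id_generator is a hypothetical helper name, and this is not how $.jgrid.randId() is implemented.

```python
import itertools


def make_id_generator(prefix):
    """Return a function that yields prefix1, prefix2, ... on each call."""
    counter = itertools.count(1)  # counter kept in the enclosing scope

    def next_id():
        return f"{prefix}{next(counter)}"

    return next_id


# One generator shared by all tabs, so every grid gets a distinct id.
next_grid_id = make_id_generator("grid_workflow")
print(next_grid_id(), next_grid_id(), next_grid_id())
```

In the page itself the same idea would be a JavaScript variable incremented each time a tab's grid is created, with the resulting string used as the id of the grid's table element.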

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or use some other way of seeing the actual requests a page and its javascript components send and receive. You'll see that, for the search page you referred to, it fetches two parts. First, the "featured" results, which come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck parsing these URLs with Hpricot.
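
The approach of calling the underlying XML endpoint directly can be sketched as follows. This is a Python sketch rather than Ruby/Hpricot, using only the standard library. The endpoint and parameter names are copied from the results URL above and may no longer be served by Apple; build_results_url and fetch_results are hypothetical helper names, and the structure of the returned XML is not assumed here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint taken from the answer above; it may have changed or been removed.
RESULTS_ENDPOINT = "http://www.apple.com/search/service/nph-search10"


def build_results_url(query, site="uk_www", count=50):
    """Build the results URL, putting the search string in the q parameter."""
    params = {"site": site, "filter": 1, "snum": count, "q": query}
    return RESULTS_ENDPOINT + "?" + urlencode(params)


def fetch_results(query):
    """Fetch and parse the XML response (performs a network request)."""
    with urlopen(build_results_url(query)) as response:
        return ET.fromstring(response.read())


# Only the URL is built here; fetch_results is not called, to avoid network access.
print(build_results_url("mac mini"))
```

The same two steps apply in Ruby: build the URL with the query in the q parameter, fetch it, and hand the response body to the XML parser.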
