Extract value from javascript object in site using xpath and import.io - xpath

I want extract a number provided by javascript object in site, but I really don't understand that I am doing.
I tried different versions using alike examples and guidelines in import.io site and other tutorial sites, but I got only 1 of two results: extracted all numbers on given page or nothing at all.
I tried e.g. //[contains(.,"Unikālo apmeklējumu skaits:")]#type ; //[contains(.,"Unikālo apmeklējumu skaits:")] . Most likely it's necessary to add there something else, but I just don't know that.
Link I am interested in to extract from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and information necessary is a number after text "Unikālo apmeklējumu skaits:" which is given by javascript.
Hopefully someone will be able to help me with this problem.

For someone who is new in web-scraping this should be a hard task, I'll ty to explain it. First of all, the xpath to get to that location could be something like this:
'//td[#class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check it doesn't contain the number you need, that content is being dynamically loaded with a javascript, and the javascript is this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which could be gotten from the response with this xpath:
'//script[contains(text(), "contacts_js")]/text()'
from that string, you should replicate the url that comes in src, so this url for example:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and add to the end the current date, as javascript creates it with new Date(). Then you should make a request to that url (adding the previous response domain), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
check that the date is urlencoded. it should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value inside SHOW_CNT is the number you want.
If you want to know how I figured out which request and which script was populating that response tag, well that I did using firebug, searching for SHOW_CNT inside all of the responses that involve calling to your URL, which pointed to the request I specified, and then trying to check who was requesting that.
Hope it helped.

support#import.io are the guys to speak to, they give free advice and help trouble shoot problems just like this all the time.
There are all kinds of tips and tricks you can use... for example import.io provide (an undocumented beta) JavaScript Pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS, this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.

Related

GET a slide, edit it, and POST it again with minor changes

I was hoping someone could help me with something i've been stuck on. I'm not even sure if it's possible to do.
So i basically have a huge Json file which includes all objects used for a certain slide Specifically i used this GET command:
GET https://slides.googleapis.com/v1/presentations/{presentationId}
I then got a huge 200.000 line Json response which has alot of stuff like colors for each thing, position of every element on each slide ect. I save this as a JSON file on my pc. I only need it once as a form of template.
Then my golang code dynamically edits some of its values (after converting it to structs ofc).
Now i want to POST it back up. It has a new name now, new ID, but 99,9% of the values are the same.
Is it possible to do this?
And sorry in advance. I know people here tend to get mad at "stupid questions" or if i forget to add something, but i'm new here, and i hope I can get some help. Been stuck for a long while.
Yes, you can. There is a Go client library that allows you to do this if you're not set on using a REST API. If you are set on using a REST API, you should be able to post this endpoint:
POST https://slides.googleapis.com/v1/presentations/{presentationId}:batchUpdate
Side note, the Google documentation (imo) is fantastic, a little googling goes a long ways :)

Web Scraping Return Empty Value Using Xpath in Scrapy

Really need the help from this community.
My question is that when I used the code
=========================================================================
response.xpath("//div[contains(#class,'check-prices-widget-not-sponsored')]/a/div[contains(#class,'check-prices-widget-not-sponsored-link')]").extract()
enter image description here
to extract the vendor name in scrapy shell, the output is empty. I really did not know why that happened, and it seems to me that the problem might be the website info is updating dynamically?
The url for this web scraping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the Vendor name and Price for each vendor. Besides the attached pic is the screenshot of "the inspect".
Really appreciate the help!
You need to always check HTML source code in your browser (usually with Ctrl+U).
This way you'll find that information you want is embedded inside Javascript variables using JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, parse response.body (using re or another method) and next json.loads() parsed JSON strings to iterate through two arrays.

Adding a UNIQUE ID to the URL parameter if the XMLHttp Open Method - What does this mean and how is it done?

I need help understanding AJAX. I am going through the tutorial on W3C schools ( creating a button that opens text file on the server and displays the result in a div)
One part of the tutorials seems abstract to me, without sufficient explanation. I am sure its a pre-requisite that I have missed or not aware of, detailing below
To avoid getting a cached result in response to an XMLHttpRequest made to the server, the tutorial says one needs to ADD A UNIQUE ID to the URL parameter in the XMLHttp Open Method which has been done (in the tutorial) by adding a ?, another character (t) and an = after the file extention followed by joining a random number to the URL (using Math.random()). See code below.
A simple GET request would be like:
xmlhttp.open("GET","demo_get.asp,true); \\I can understand this
Unique ID added to URL
xmlhttp.open("GET","demo_get.asp?t=" + Math.random(),true); \\ I can't undersatnd this
'?' , 't' & a random number generator added to demo_get.asp - Why T, why not P Q R Z ?? Why "?" after .asp
Should the compiler not go bonkers and report an error if arbitary characters are added to the file location. How is the part of the URL after the file extention handled as in this case ?t= + Math.random()
This has been a case of much agony and frustration for the last 3 days cause I don't get which part of JS i have missed here, what do you call this concept and where can I read it ??
This apart, specifying message headers while sending data - What are HTTP headers and what do they mean. How do I decide what the parameters of the setRequestHeader() method shall be ?
Please help. Rest of Ajax is clear to me.
(I haven't read on the second part - the message headers. I have asked that query here to avoiding posting another question later, just in case it turns out to be as eluding and enigmatic as the UNIQUE ID concept - Apologies in case its a direct simple question I ought to read up myself)
Cache compares the requested URL with those present with it, if a unique id is added to the URL, it does not match and the browser treats it as a fresh GET request, which then is forwarded to the server. This is a standard way to bypass / disable browser caching.
Please refer this document for a better understanding of browser Caching.
See Page No 4, which explains the same thing as stated above.
http://www.f5.com/pdf/white-papers/browser-behavior-wp.pdf

How do I parse a POST to my Rails 3.1 server manually?

Scenario:
I have a Board model in my Rails server side, and an Android device is trying to post some content to a specific board via a POST. Finally, the server needs to send back a response to the Android device.
How do I parse the POST manually (or do I need to)? I am not sure how to handle this kind of external request. I looked into Metal, Middleware, HttpParty; but none of them seems to fit what I am trying to do. The reason I want to parse it manually is because some of the information I want will not be part of the parameters.
Does anyone know a way to approach this problem?
I am also thinking about using SSL later on, how might this affect the problem?
Thank you in advance!! :)
I was trying to make a cross-domain request from ie9 to my rails app, and I needed to parse the body of a POST manually because ie9's XDR object restricts the contentType that we can send to text/plain, rather than application/x-www-urlencoded (see this post). Originally I had just been using the params hash provided by the controller, but once I restricted the contentType and dataType in my ajax request, that hash no longer contained the right information.
Following the URL in the comment above (link), I learned the how to recover that information. The author mentions that in a rails controller we always have access to a request variable that gives us an instance of the ActionDispatch::Request object. I tried to use request.query_string to get at the request body, but that just returned an empty string. A bit of snooping in the API, though, uncovered the raw_post method. That method returned exactly what I needed!
To "parse it manually" you could iterate over the string returned by request.raw_post and do whatever you want, but I don't recommend it. I used Rack::Utils.parse_nested_query, as suggested in Arthur Gunn's answer to this question, to parse the raw_post into a hash. Once it is in hash form, you can shove whatever else you need in there, and then merge it with the params hash. Doing this meant I didn't have to change much else in my controller!
params.merge!(Rack::Utils.parse_nested_query(request.raw_post))
Hope that helps someone!
Not sure exactly what you mean by "manually", posts are normally handled by the "create" or "update" methods in the controller. Check out the controller for your Board model, and you can add code to the appropriate method. You can access the params with the params hash.
You should be more specific about what you are trying to do. :)

interpreting ajax code

I have traced some ajax stuff and I am trying to figure out what it means. I was hoping to translate it into a url but it seems to involve some Get request based on searched I did.
An help appreciated.
**new Ajax.Request(fspring.baseURL+"search/getProfileResults",
{parameters:{ajax:1,q:_4,page:_5},onStatOK:function(_6){ var _7=new Element("div");
I thought it would be the baseURL in this case
http://helloworld.com/search/getProfileResults and I know it needs two parameters
fspring.baseURL could be anything, so I can't really help you there. It doesn't ring a bell to me for any particular Javascript library.
The parameters object will be converted by Prototype into a querystring, in this case it'll look something like this:
http://helloworld.com/search/getProfileResults?ajax=1&q=_4&page=_5
Except _4 and _5 will be replaced with the variable contents.
An easier way to figure out what's going on would be to just open up the page in Firebug and look in the Console to see what the AJAX query was.

Resources