How to use Nokogiri in Ruby to parse a link containing the # character - ruby

I use Nokogiri in Ruby to parse a link like this:
require 'nokogiri'
require 'open-uri'

link = 'http://vnreview.vn/danh-gia-di-dong#cur=2'
doc = Nokogiri::HTML(open(link, 'User-Agent' => 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31').read, nil, 'UTF-8')
but Nokogiri returns doc as the source of http://vnreview.vn/danh-gia-di-dong, i.e. without the fragment.
How can I parse the link with #cur=1, #cur=2, and so on?

The fragment is not sent to the server with the HTTP request, i.e. if you open http://www.example.com/#fragment in a browser, the following request will be made:
GET / HTTP/1.1
Host: example.com
Then, after receiving the response, the browser appends the fragment to the URL and performs some action (for example, scrolling to the element with id="fragment", or running JavaScript callbacks).
If the page content differs based on the fragment, that is done via JavaScript. Nokogiri cannot run JavaScript, so you need some other tool, such as selenium-webdriver or capybara-webkit.
Another option is to inspect the AJAX requests on the page you are trying to parse; you will probably find JSON with the data you need, which you can then download directly. It is also possible that the content is already on the page and is just hidden via CSS (like the tabs in Twitter Bootstrap).
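For illustration, here is a minimal Ruby sketch (not from the original answer) of the selenium-webdriver route. It assumes Chrome and chromedriver are installed locally, and the fixed sleep is a crude stand-in for a proper explicit wait:
require 'selenium-webdriver'
require 'nokogiri'

driver = Selenium::WebDriver.for :chrome
driver.get 'http://vnreview.vn/danh-gia-di-dong#cur=2'
sleep 5                                   # crude wait for the page's JavaScript to render page 2
doc = Nokogiri::HTML(driver.page_source)  # parse the JavaScript-rendered markup
driver.quit
puts doc.css('a').length                  # do something with the parsed document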

Related

How to detect webp file behind .png link?

I have a tricky link:
https://www.pwc.com.tr/tr/sektorler/Perakende-T%C3%BCketici/kuresel-tuketicileri-tanima-arastirmasi/kuresel-tuketici-gorusleri-arastirmasi-info-5en.png
The last four characters of the link imply that we will get an image in PNG format, and even an HTTP GET request to that link returns the Content-Type 'image/png'.
But if you try to save it in a browser, you will end up with a file in WebP format.
So, the question is: how can one detect that there is really a WebP image 'hidden' behind a link that looks and acts (remember the headers!) like a PNG file, using a program that can only speak HTTP?
Update: I want to point out that I made the HTTP GET request from different environments and got 'image/png' in the Content-Type header, for example using Node.js and axios:
https://youtu.be/KiRrAVl67uQ
Update: The server detects the client type from the request's User-Agent header and returns a different Content-Type accordingly. This makes sense, because not all clients support WebP.
Thus, to get the image/webp version of the resource, you can send a custom User-Agent header and pose as Chrome, etc. For example, in Node.js with axios:
const axios = require('axios');

axios.request({
  url: 'https://www.pwc.com.tr/tr/sektorler/Perakende-T%C3%BCketici/kuresel-tuketicileri-tanima-arastirmasi/kuresel-tuketici-gorusleri-arastirmasi-info-5en.png',
  method: 'get',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
  }
}).then(function(res) {
  console.log(res.headers); // content-type header is image/webp now.
}).catch(function(err) {
  console.log(err);
});
The browser tries to save this picture in .webp format because, in the HTTP response headers, the Content-Type header's value is image/webp.
how can one detect that there is really a WebP image 'hidden' behind a link that looks and acts like a PNG file...?
You can check the HTTP response headers and see what the Content-Type is.
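For readers coming from the Ruby questions on this page, here is a comparable sketch using only the standard library, reusing the Chrome User-Agent string from the axios example above. The last line is an extra, header-independent check based on the fact that WebP files start with a RIFF....WEBP signature in their first 12 bytes:
require 'net/http'
require 'uri'

uri = URI('https://www.pwc.com.tr/tr/sektorler/Perakende-T%C3%BCketici/kuresel-tuketicileri-tanima-arastirmasi/kuresel-tuketici-gorusleri-arastirmasi-info-5en.png')
req = Net::HTTP::Get.new(uri)
req['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
puts res['Content-Type']                                   # image/webp when the server decides this client accepts WebP
puts res.body[0, 4] == 'RIFF' && res.body[8, 4] == 'WEBP'  # true for a WebP payload, regardless of what the headers claim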

Ruby rest-client "Accept" Header issue

I was experimenting with the Ruby rest-client gem and ran into an "issue", so to speak. I noticed that when I hit a certain URL that should just return HTML, I would get a 404 error unless I explicitly specified:
RestClient.get('http://www.example.com/path/path', accept: 'text/html')
However, pretty much any other page that I would hit without specifying the Accept header explicitly would return HTML just fine.
I looked at the source for the Request object located here, and in the default_headers method around line 486 it appears that the default Accept header is */*. I also found the relevant pull request here.
I'm not quite sure why, on this particular site (not all), I have to explicitly specify Accept: text/html when any other site that returns HTML does so without any extra work. I should note that other pages on this same site work fine without explicitly specifying text/html.
It's not a huge issue and I can easily work around it using text/html, but I just thought it was a bit odd.
I should also note that when I use another REST client, such as IntelliJ's built-in one, and specify Accept: */*, it returns HTML with no problem...
EDIT: Ok, this is a bit strange...when I do this:
RestClient.get('http://www.example.com/path/path', accept: '*/*')
Then it returns HTML as I expect it to, but leaving off that accept: '*/*' parameter doesn't work, even though that header should default to */* according to the source code...
I wonder if, because my URL has /path/path in it, RestClient thinks it's an endpoint of some API and so tries to retrieve XML instead...
EDIT 2: Doing a bit more experimenting...I was able to pass a block to the GET request as follows:
RestClient.get('http://example.com/path/path') { |response, request, result|
  puts response.code
  puts request.processed_headers
}
And I get a 404 error and the processed_headers returns:
{"Accept"=>"*/*; q=0.5, application/xml", "Accept-Encoding"=>"gzip, deflate"}
The response body is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<hash>
<errors>Not Found</errors>
</hash>
So, it is sending a */* header, but for some reason it looks like application/xml gets priority. Maybe this is just something on the server side and out of my control? I guess I'm just not sure how that application/xml is even being added to the Accept header; I can't find anything skimming through the source code.
Found the "problem". It looks like the PR I mentioned in my original post wasn't actually released until rest-client2.0.0.rc1 which is still a release candidate so it isn't actually out yet or at least obtainable via my gem update rest-client.
I used the following command to install 2.0.0rc2:
gem install rest-client -v 2.0.0.rc2 --pre
Then referenced it in my code and it works now:
require 'rest-client'

request = RestClient::Request.new(:method => :get, :url => 'http://some/resource')
puts request.default_headers[:accept]
Prints...
*/*
As expected now.
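As a final sanity check, one can reuse the block form from earlier in the question (same placeholder URL) to confirm what is actually sent on the wire; a small sketch:
require 'rest-client'

RestClient.get('http://www.example.com/path/path') { |response, request, result|
  puts response.code
  puts request.processed_headers['Accept']  # expected to be "*/*" on rest-client 2.x
}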

Google crawler does not translate #! to _escaped_fragment_ mapping in ajax application

I have a single-page application that is supposed to use #! (hash bang) for navigation. I have now read Google's specification on Making AJAX Applications Crawlable. How can I test that my application works in the required way?
I entered my application into the Google Plus debugger, e.g. http://www.mysite.org/de#!foo=bar. However, Apache's access log tells me that the Google crawler does not translate #! to _escaped_fragment_, hence the debugger still retrieves /de without the hash bang:
66.249.81.165 - - [06/Mar/2014:15:54:06 +0100] "GET /de HTTP/1.1" 200 177381 "Mozilla/5.0 (compatible; X11; Linux x86_64; Google-StructuredDataTestingTool; +http://www.google.com/webmasters/tools/richsnippets)"
(Note well: still GET /de, without any _escaped_fragment_ query parameter.) I'd expect Google to retrieve something like this instead:
... "GET /de?_escaped_fragment_=foo=bar HTTP/1.1" ...
As far as I know, it is _escaped_fragment_= with the = at the end.
Have you tried with:
<meta name="fragment" content="!" />
in your HTML head?
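Independently of the meta tag, one way to see what a crawler following the (since retired) AJAX crawling scheme would have requested is to fetch the escaped-fragment form of the URL yourself. A minimal Ruby sketch, assuming the http://www.mysite.org/de#!foo=bar example from the question (for #!foo=bar the crawler requests ?_escaped_fragment_=foo=bar):
require 'open-uri'

html = URI.open('http://www.mysite.org/de?_escaped_fragment_=foo=bar').read
puts html[0, 300]  # the server should answer with the pre-rendered HTML snapshot here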

Why does requests library fail on this URL?

I have a URL. When I try to access it programmatically, the backend server fails (I don't run the server):
import requests
r = requests.get('http://www.courts.wa.gov/index.cfm?fa=controller.managefiles&filePath=Opinions&fileName=875146.pdf')
r.status_code # 200
print(r.content)
When I look at the content, it's an error page, though the status code is 200. If you click the link, it'll work in your browser -- you'll get a PDF -- which is what I expect in r.content. So it works in my browser, but fails in Requests.
To diagnose this, I'm trying to eliminate the differences between my browser and the Requests library. So far I've:
Disabled JavaScript
Disabled (and deleted) cookies
Set the User-Agent to be the same in each
But I can't get the request to work properly in Requests, or to fail in my browser by disabling something. Can somebody with a better idea of browser magic help me diagnose and solve this?
Does the request work in Chrome? If so, you can open the web inspector and right-click the request to copy it as a curl command. Then you'll have access to all the headers, params, and request body, which you can play around with to see which are triggering the failure you're seeing with the requests library.
You're probably running into a server that discriminates based on User-Agent. This works:
import requests
S = requests.Session()
S.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'})
r = S.get('http://www.courts.wa.gov/index.cfm?fa=controller.managefiles&filePath=Opinions&fileName=875146.pdf')
with open('dl.pdf', 'wb') as f:
    f.write(r.content)

Does jsoup.connect().get() return cached Document?

I use jsoup and the following code to get the HTML content of a website: Document doc = Jsoup.connect(this.getUrl()).get();
Do I get a cached version of the website? Is it possible to request a non-cached version? I know I could set request headers, something like:
header("Cache-control", "no-cache");
header("Cache-store", "no-store");
But I'm not sure whether that works. I only know that these headers are used by the client browser.
It would be awesome if someone could clarify. Greetings.
Any headers that you specify correctly (per the HTTP spec) will be sent to the target host via java.net.URLConnection.addRequestProperty(String, String). Whether you then get a cached or a fresh copy of the page depends on the servers along the way supporting those headers, end to end. jsoup just supplies the headers with the request it makes; when I looked through the source, it does not make any explicit effort to cache the response content.
