I have a single-page application that is supposed to use #! (hash bang) for navigation. I have now read Google's specification on Making AJAX Applications Crawlable. How can I test that my application works in the required way?
I entered my application in the Google+ snippet debugger, e.g. http://www.mysite.org/de#!foo=bar. However, Apache's access log tells me that the Google crawler does not translate #! to _escaped_fragment_, so the debugger still retrieves /de without the hash bang:
66.249.81.165 - - [06/Mar/2014:15:54:06 +0100] "GET /de HTTP/1.1" 200 177381 "Mozilla/5.0 (compatible; X11; Linux x86_64; Google-StructuredDataTestingTool; +http://www.google.com/webmasters/tools/richsnippets)"
(Note well: still GET /de, without any _escaped_fragment_ parameter.) I'd expect Google to retrieve something like this instead:
... "GET /de?_escaped_fragment_ mapping HTTP/1.1" ...
As far as I know, the parameter is _escaped_fragment_= with an = at the end.
Have you tried with:
<meta name="fragment" content="!" />
in your HTML head?
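Independently of any Google tool, you can check the server side of the scheme yourself by requesting the _escaped_fragment_ URL that the crawler is supposed to derive from your #! URL. A minimal sketch with Python's requests (the URL is the one from the question; the mapping follows Google's AJAX-crawling specification):
import requests
# The crawler should translate http://www.mysite.org/de#!foo=bar into the URL below;
# requesting it directly shows what the crawler would be served.
r = requests.get('http://www.mysite.org/de?_escaped_fragment_=foo=bar')
print(r.status_code)   # should be 200
print(r.text[:500])    # should be the pre-rendered HTML snapshot, not the empty JS shell
If that request returns the static snapshot you expect, your application handles the mapping correctly, and the remaining question is only whether the testing tool applies it.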
I have a login page in an Oracle APEX application that works fine with a normal web browser like Chrome. However, when I try to perform the same operation using curl (a command-line HTTP client), an HTTP 404 error is returned:
Request:
curl -i -d "P9999_USERNAME=MOIZ&P9999_PASSWORD=xxxx" -X POST http://localhost:8080/apex/f?p=101:9999:0:
Response:
HTTP/1.1 404 Not found
Server: Oracle XML DB/Oracle Database
Content-Type: text/html
Transfer-Encoding: chunked
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not found</TITLE>
</HEAD><BODY><H1>Not found</H1>
The requested URL /apex/f was not found on this server</BODY></HTML>
Using a normal browser, there are two server requests: one GET and one POST. However, when using curl I am just making a single POST request.
Is that difference the cause of the problem?
Is it possible to POST to an APEX page without calling GET first?
If yes, will this solution also work for file uploads?
Based on the post Python web scraping with BS using correct url? I was able to answer my questions above. (Although that post uses Python's requests and BeautifulSoup libraries for demonstration purposes, I think with some shell/Windows scripting it should also be achievable with curl.)
Is that difference the cause of the problem?
For a POST request in APEX, we first need to perform a GET request, because APEX expects the following hidden items to be sent in the payload of the POST request (see the sketch after the list):
p_flow_id # application id
p_flow_step_id # page id
p_instance # session id
p_page_submission_id # page submission id
p_request # request
p_md5_checksum # md5 checksum
p_page_checksum # page checksum
p_arg_names # list of arguments
In addition, you may also need to add your page specific input items. For example,
p_t01 # username
p_t02 # password
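A minimal sketch of that GET-then-POST flow, using Python's requests and BeautifulSoup as in the referenced post. The page URL and item names are taken from this question; the form id, the p_request value, and the wwv_flow.accept fallback are assumptions, so check them against your own page source:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
# 1) GET the login page so APEX creates a session and renders the hidden items.
r = session.get('http://localhost:8080/apex/f?p=101:9999:0:')
soup = BeautifulSoup(r.text, 'html.parser')
form = soup.find('form', id='wwvFlowForm') or soup.find('form')  # form id is an assumption
# 2) Collect every hidden input (p_flow_id, p_instance, checksums, ...).
payload = {i.get('name'): i.get('value', '')
           for i in form.find_all('input', type='hidden') if i.get('name')}
# 3) Add the page-specific items (names vary per page; these are from the question).
payload['p_t01'] = 'MOIZ'       # username
payload['p_t02'] = 'xxxx'       # password
payload['p_request'] = 'LOGIN'  # hypothetical request value; check your page's submit button
# 4) POST to the form's action with the same session, so the cookies from step 1 are reused.
action = form.get('action', 'wwv_flow.accept')  # fallback name is an assumption
r = session.post(urljoin(r.url, action), data=payload)
print(r.status_code)
The checksums and cookies obtained in step 1 are what make the POST in step 4 acceptable to APEX, which is why the single curl POST above is rejected.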
Is it possible to POST to an APEX page without calling GET first?
For the reasons mentioned in the answer above, a GET request is required before the POST request can be performed.
I use Nokogiri in Ruby to parse a link like this:
require 'nokogiri'
require 'open-uri'
link = 'http://vnreview.vn/danh-gia-di-dong#cur=2'
doc = Nokogiri::HTML(open(link, 'User-Agent' => 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31').read, nil, 'UTF-8')
but the doc Nokogiri returns is the source of http://vnreview.vn/danh-gia-di-dong (without the fragment).
How can I parse the link with #cur=1, #cur=2, and so on?
The fragment is not sent to the server with the HTTP request, i.e. if you open http://www.example.com/#fragment in a browser, the following request will be made:
GET / HTTP/1.1
Host: example.com
Then, after receiving the response, the browser appends the fragment to the URL and performs some actions (for example, scrolling to the element with id="fragment", or executing JavaScript callbacks).
If the page content differs based on the fragment, that is done via JavaScript. Nokogiri is not capable of running JavaScript, so you need some other tool, like selenium-webdriver or capybara-webkit.
Another option is to inspect the AJAX requests on the page you are trying to parse; you'll probably find JSON with the data you need, which you can then download directly. It is also possible that the content is already on the page and just hidden via CSS (like tabs in Twitter Bootstrap).
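For the AJAX route, the idea is sketched below in Python (the same approach works with Ruby's open-uri or Net::HTTP). The endpoint URL is purely hypothetical; find the real one in your browser's developer tools while the #cur=2 content loads:
import json
import urllib.request
# Hypothetical XHR endpoint for page 2; replace it with the request you see
# in the browser's network tab.
url = 'http://vnreview.vn/some/ajax/endpoint?page=2'
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode('utf-8'))
# The JSON structure is site-specific; inspect it to locate the article list.
print(data)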
I have a program that uses an XMLHTTPRequest to gather contents from another web page.
The problem is that the web page has cloaking-style custom errors set up (i.e. /thisurl doesn't literally exist as a file on their web server; it is generated by the custom 404 error handler). So instead of the page it shows in the browser, my HTTPRequest response contains the default output of that custom 404 error page.
By using this website http://web-sniffer.net/ I have narrowed down what the problem may be, but I don't know how to fix it.
Web-sniffer has three different "HTTP version" options for submitting the request:
HTTP/1.1
HTTP/1.0 (with Host header)
HTTP/1.0 (without Host header)
When I use HTTP/1.1 or HTTP/1.0 (with Host header) I get the correct response (html) from the page. But when I use HTTP/1.0 (without Host header) it does not return the content, instead it returns a 404 error script (showing the custom error page).
So I have concluded that the problem may be due to the Host header not being present in the request.
But I am using MSXML2.XMLHTTP.3.0 and haven't been able to read the page using HTTP/1.1 or HTTP/1.0 (with Host header). The code looks like this:
Set objXML = Server.CreateObject("MSXML2.XMLHTTP.3.0")
objXML.Open "GET", URL, False                 ' synchronous GET
objXML.setRequestHeader "Host", MyDomain      ' doesn't work with or without this line
objXML.Send
Even after adding a Host header to the request, I still get the custom 404 error template in the response, the same as the HTTP/1.0 (without Host header) option on the web-sniffer site. It should be returning 200 OK, as it does with the first two options on web-sniffer and in a web browser.
So I guess my question is: how is that website (web-sniffer.net) able to get the proper response with its first two HTTP version options, and how can I emulate this in my app? I want to get the right page, but I only get the 404 error from their custom error template.
In response to an answerer, I have provided screenshots from two separate cURL requests below, one from each of my servers.
I executed the same cURL command against the same URL (which points to a site on the main host): curl -v -I www.site.com/cloakedfile. It looks like it's not working on the main server, where it needs to be. It can't be an issue with self-residing requests, because from secondary to secondary it works fine, and these are identical applications/sites, just with different IPs/host names. It appears to be an internal issue that may not be about the application side of things.
I don't have any experience with MSXML2.XMLHTTP.3.0, but from your problem statement I understand that the issue is almost certainly due to some HTTP header field that is wrongly set or missing in your request.
By default, HTTP/1.1 clients set the Host header. For example, if you are connecting to google.com, the request will look like this:
GET / HTTP/1.1
Host: google.com
The "Host" header should have the domain name of the server in which the requested resource is residing. Severs that has virtual hosting will get confused if "Host:" header is not present. This is what happens with groups.yahoo.com if you havent specified Host header
$ nc groups.yahoo.com 80
GET / HTTP/1.1
HTTP/1.1 400 Host Header Required
Date: Fri, 06 Dec 2013 05:40:26 GMT
Connection: close
Via: http/1.1 r08.ycpi.inc.yahoo.net (ApacheTrafficServer/4.0.2 [c s f ])
Server: ATS/4.0.2
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Content-Language: en
Content-Length: 447
This should be the same issue you are facing. Make sure that you are sending the domain name of the server from which you are trying to fetch the resource, and that the Host header has a colon ":" to delimit the value, like "Host: www.example.com".
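If you want to reproduce the nc experiment above from a script, here is a small sketch in Python that sends the same request once without and once with a Host header and prints the status line of each reply (www.example.com is a placeholder; substitute the host you are testing):
import socket

def raw_get(host, send_host_header):
    # Build a bare HTTP request by hand so we control exactly which headers go out.
    request = 'GET / HTTP/1.1\r\n'
    if send_host_header:
        request += 'Host: ' + host + '\r\n'
    request += 'Connection: close\r\n\r\n'
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(request.encode('ascii'))
        reply = b''
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
    # Return only the status line, e.g. "HTTP/1.1 400 Host Header Required".
    return reply.split(b'\r\n', 1)[0].decode('ascii', 'replace')

print(raw_get('www.example.com', send_host_header=False))
print(raw_get('www.example.com', send_host_header=True))
If the first call comes back with an error and the second with 200 OK, the missing Host header is indeed the culprit.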
I have a URL. When I try to access it programmatically, the backend server fails (I don't run the server):
import requests
r = requests.get('http://www.courts.wa.gov/index.cfm?fa=controller.managefiles&filePath=Opinions&fileName=875146.pdf')
r.status_code # 200
print r.content
When I look at the content, it's an error page, though the status code is 200. If you click the link, it'll work in your browser -- you'll get a PDF -- which is what I expect in r.content. So it works in my browser, but fails in Requests.
To diagnose, I'm trying to eliminate differences between my browser and the Requests library. So far I've:
Disabled Javascript
Disabled (and deleted) cookies
Set the User-Agent to be the same in each
But I can't get the thing to work properly in Requests or fail in my browser due to disabling something. Can somebody with a better idea of browser-magic help me diagnose and solve this?
Does the request work in Chrome? If so, you can open the web inspector and right-click the request to copy it as a curl command. Then you'll have access to all the headers, params, and request body, which you can play around with to see which are triggering the failure you're seeing with the requests library.
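If you'd rather iterate in Python than in a shell, one way to narrow it down is to start from the full set of headers copied from the browser and drop them one at a time until the download breaks. A sketch (the header values below are placeholders; paste your browser's actual ones):
import requests

URL = ('http://www.courts.wa.gov/index.cfm?fa=controller.managefiles'
       '&filePath=Opinions&fileName=875146.pdf')
# Placeholder values: replace these with the headers copied from the browser.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Accept': 'text/html,application/xhtml+xml,*/*',
    'Accept-Language': 'en-US,en;q=0.9',
}
for dropped in browser_headers:
    headers = {k: v for k, v in browser_headers.items() if k != dropped}
    r = requests.get(URL, headers=headers)
    # A real PDF comes back with an application/pdf content type, not an HTML error page.
    print(dropped, r.status_code, r.headers.get('Content-Type'))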
You're probably running into a server that discriminates based on User-Agent. This works:
import requests
S = requests.Session()
S.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'})
r = S.get('http://www.courts.wa.gov/index.cfm?fa=controller.managefiles&filePath=Opinions&fileName=875146.pdf')
with open('dl.pdf', 'wb') as f:
    f.write(r.content)
I have a Firefox quicksearch bookmark that runs a Maxmind query. This worked until recently. I type 'ip 82.176.230.15' (for example) into the URL bar and it queries Maxmind to retrieve the location of the IP:
http://www.maxmind.com/app/locate_demo_ip?ips=82.176.230.15
Within the past week, for reasons unknown, I now get a 403/Forbidden error when I try to access Maxmind.
"You don't have permission to access /app/locate_demo_ip on this server"
Strangely, the same URL is accessible in Chrome and Safari. I can also access the same URL with Firefox, Chrome, or Safari on my Mac.
I've deleted all cookies, disabled all addons, and still can't get it to work. Any idea what could be happening? I know that the 403 has to come from the server, so I don't know why it would work in other browsers. And it's been going on for days, definitely not some glitch on their server.
Get an HTTP debugger like Firebug or Fiddler (not sure the latter will work with Firefox, but it probably will if you set it up right).
Look at the difference between using your quicksearch bookmark and just typing the URL. The server can return 403 whenever it feels like it -- see if there's any difference in the requests, and what it is.
I recently had the same issue and was able to fix it.
In my case the problem was in the headers that Firefox sent.
In particular, it was because of this header:
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
What makes the web site refuse the connection is this part of the string: "(X11; Ubuntu; Linux x86_64; rv:100.0)", and I have no idea why.
I found a workable solution: you can change Firefox's settings so that this header also mentions other browsers (Chrome and Safari), which can make sites with this problem work.
Here is how to do it:
Type about:config into the URL bar. Press Enter.
Create a new entry with the key general.useragent.override and set its value to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36. I found that Google Chrome uses this string as its User-Agent header, probably to prevent such issues.
Now save this setting and reload your page; it should work now.
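To confirm that the User-Agent string really is what triggers the 403, you can replay the request outside the browser. A small check with Python's requests, using the Maxmind URL and the two User-Agent strings from this thread:
import requests

URL = 'http://www.maxmind.com/app/locate_demo_ip?ips=82.176.230.15'
# The Firefox/Linux string that was rejected and the Chrome-style override suggested above.
agents = {
    'firefox-linux': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0',
    'chrome-style': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
}
for name, user_agent in agents.items():
    r = requests.get(URL, headers={'User-Agent': user_agent})
    print(name, r.status_code)  # a 403 for the first and a 200 for the second points at the User-Agent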