How can curl retrieve a URL with # and ! symbols in it? - ajax

I was considering using curl to retrieve a page from a URL (http://bbs.byr.cn/#!board/JobInfo?p=3) but ended up getting a notice from bash like:
$ curl bbs.byr.cn/#!article/JobInfo/102321
bash: !article/JobInfo/102321: event not found
This URL is accessible in my browser window, so how can I write a curl command line that works on it?

In general this is not possible: the part after the hash (#) is a fragment, which is handled by JavaScript on the client side, and curl cannot execute JavaScript. You can put the URL in quotes so bash stops complaining about the ! (history expansion), but that only gets you the static part of the page, which is surely not what you want.
If you observe the traffic of that page in Firebug, you will see that the URL http://bbs.byr.cn/board/JobInfo?p=3 is what actually gets downloaded. Fetch that URL instead to get your results.
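A minimal sketch of both options, using only the URLs already mentioned above (single quotes keep bash from expanding ! and treating # as a comment; curl does not send the fragment part to the server anyway):

$ curl 'http://bbs.byr.cn/#!article/JobInfo/102321'
$ curl 'http://bbs.byr.cn/board/JobInfo?p=3'

The first command only returns the static shell of the page; the second is the request the page's JavaScript actually makes, as seen in Firebug.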

Related

curl 1020 error when trying to scrape page using bash script

I'm trying to write a bash script to access a journal overview page on SSRN.
I'm trying to use curl for this, which works for me on other webpages, but it returns error code 1020 when I run the following command:
curl https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128
I thought it might have to do with the question mark in the URL, but I got it to work with other pages that contained question marks.
It probably has something to do with what the page allows. However, I can also access the page using R's rvest package, so I think it should also work in general from bash.
Looks like the site has blocked access via curl. Change the user agent and it should work fine, i.e.
curl --user-agent 'Chrome/79' "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128"
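To confirm the block is gone, one quick check (just a sketch, reusing the same user-agent trick) is to request only the response headers and look at the status line:

$ curl -sI --user-agent 'Chrome/79' "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128" | head -n 1

A 200 here means the user-agent change got past the block; the 1020 error page comes back otherwise.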

GET request with curl works but Net::HTTP and OpenURI fail

My Rails application regularly polls partners' ICS files and sometimes it fails for no reason whatsoever. When I do:
curl https://www.airbnb.es/calendar/ical/234892374.ics?s=23412342323
(params #'s faked here)
I get output matching the content of the ICS file. Just opening it in the browser works fine as well.
When I use:
Net::HTTP.get(URI(a.ics_link))
I get a "503 Service Temporarily Unavailable" response. I also tried the same with OpenURI with similar results.
Why is the server treating requests from curl or a browser differently from the ones my Ruby code makes?
Is there some way to get around this in Ruby?
It's an HTTPS issue... not sure why, but switch your URL in Ruby to https and it should work.
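A quick way to check that theory from the command line (a sketch only; the parameters are faked, as noted above, and the plain-http variant is assumed to be what ics_link currently stores):

$ curl -sI "http://www.airbnb.es/calendar/ical/234892374.ics?s=23412342323" | head -n 1
$ curl -sI "https://www.airbnb.es/calendar/ical/234892374.ics?s=23412342323" | head -n 1

If the first status line is the 503 and the second is a 200, pointing ics_link at the https URL in the Rails app should be enough.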

Wget does not fetch google search results

I noticed that when running wget https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo and similar queries, I don't get the search results, but the Google homepage.
There seems to be some redirect within the google page. Does anyone know a fix to wget so it would work?
You can use these curl commands to pull Google query results:
curl -sA "Chrome" -L 'http://www.google.com/search?hl=en&q=time' -o search.html
For an https URL:
curl -k -sA "Chrome" -L 'https://www.google.com/search?hl=en&q=time' -o ssearch.html
The -A option sets a custom user agent ("Chrome") in the request to Google.
#q=foo is your hint, as that's a fragment ID, which never gets sent to the server. I'm guessing you just took this URL from your browser URL-bar when using the live-search function. Since it is implemented with a lot of client-side magic, you cannot rely on it to work; try using Google with live search disabled instead. A URL pattern that seems to work looks like this: http://www.google.com/search?hl=en&q=foo.
However, I do notice that Google returns 403 Forbidden when called naïvely with wget, indicating that they don't want that. You can easily get past it by setting some other user-agent string, but do consider all the implications before doing so on a regular basis.
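For completeness, the same user-agent workaround with wget (a sketch; -U sets the user-agent string, and the quotes keep the shell from interpreting the & in the query):

$ wget -U "Chrome" -O search.html "https://www.google.com/search?hl=en&q=foo"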

ajax request and robots.txt

A website has a URL http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME. When I click the URL, it executes the ajax function.
When I put the URL in the address bar, it gives a redirect error because the URL doesn't actually take you anywhere, but it definitely still executes the ajax function.
When I use the command line bash call firefox -new-window http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME, it opens an empty page except for the line "Bad user...". After some digging I found that the robots.txt file has "Disallow: /wp-admin/". I am assuming this is why it isn't working from the command line. I have used wget -e robots=off URL before, but there isn't anything to download here, so it doesn't apply.
What type of URL is this? (I believe it's dynamic or formula, but not sure)
I want to get the same results with the command line as when I plug the URL into the address bar. Ideas?
It's nothing special; the server just returns that HTML no matter what. HTTP servers don't have to serve files from disk; the endpoint could be written in C++, Java, Python or Node.js (probably not, in this case).
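If the goal is simply to hit that endpoint from the command line, curl is the easier tool here: curl never reads robots.txt at all (robots.txt is only advisory, for crawlers that choose to honour it), so no robots=off equivalent is needed. A sketch, quoting the URL so the shell leaves the ? alone:

$ curl -i "http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME"

The -i flag prints the response headers too, which helps tell a server-side "Bad user..." page apart from a redirect.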

ruby/bash: How do I download a large file using the "If-Range" and "Range" headers?

I've been trying to use mechanize to download mp3 files, but the server always returns a 404.
Looking at the headers my browser sends (checked on Chrome and FF), I noticed that the If-Range and Range headers are used to initiate a successful download, so I'm guessing the server is rejecting any request that doesn't specify them.
What is the right way to download files in this way, using ruby (Net::HTTP) or bash (curl or wget)?
404 means "file not found". Are you sure your URL is correct? If it is, you should be able to test it with wget <full url and file name>.
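If the server really does insist on those headers, both curl and wget can send them explicitly. A minimal sketch (the URL and file name are placeholders, not the real ones):

$ curl -H 'Range: bytes=0-' -o track.mp3 'http://example.com/files/track.mp3'
$ wget --header='Range: bytes=0-' -O track.mp3 'http://example.com/files/track.mp3'

curl's -r 0- is shorthand for the same Range header; an If-Range header (whose value is an ETag or date taken from a previous response) can be sent the same way with another -H / --header option.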
