curl 1020 error when trying to scrape page using bash script

I'm trying to write a bash script to access a journal overview page on SSRN.
I'm trying to use curl for this, which works for me on other webpages, but it returns error code 1020 when I run the following command:
curl https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128
I thought it might have to do with the question mark in the URL, but I got it to work with other pages that contained question marks.
It probably has something to do with what the page allows. However, I can also access the page using R's rvest package, so I think it should also work from bash in general.

Looks like the site has blocked access via curl's default user agent (error 1020 is Cloudflare's "Access Denied" code). Change the user agent and it should work fine, e.g.
curl --user-agent 'Chrome/79' "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128"
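For example, to save the page to a file (the URL is quoted so the shell does not try to expand the ?):
curl --user-agent 'Chrome/79' 'https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1925128' -o paper.html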

Related

Creating a script to automate submitting something on a webpage

I want to create a script which accesses a website behind a login (with 2FA) and presses the submit button every x seconds.
Unfortunately, I am a total shell noob. I already automated the process with the Chrome extension "Kantu Browser Automation", but the extension has limits on looping and a looping timeout.
Use the curl command for this and put it in crontab; you have to use the POST method (see the sketch below).
curl: https://curl.haxx.se/
crontab: https://crontab.guru/
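A minimal sketch of that approach, assuming a hypothetical endpoint and form field, and assuming you can copy a valid session cookie out of your browser (the site sits behind a login with 2FA, so curl cannot log in for you):
#!/usr/bin/env bash
# submit.sh - POST the form, reusing a session cookie copied from the browser.
# The URL, field name, and cookie value below are placeholders.
curl -s -X POST 'https://example.com/path/to/form' \
     -H 'Cookie: session=YOUR_SESSION_COOKIE' \
     --data 'submit=1' \
     -o /dev/null
A crontab entry (edit it with crontab -e) like */5 * * * * /path/to/submit.sh then runs the script every five minutes; note that cron cannot schedule anything more fine-grained than one minute, so "every x seconds" needs a loop with sleep inside the script instead.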

Pull info from website

I'm looking to pull the timer from this site: http://invasiontimer.com/
But it looks like the timer isn't in the HTML, so plain curl or wget doesn't get it for me.
Is there any way to get this in a bash script and print it to a text file?
Thanks.
I think what you want is the content loaded by JavaScript. Check out this answer for more details: How to get webcontent that is loaded by JavaScript using cURL?
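If you cannot find the underlying request that delivers the timer, one workaround is to render the page with a headless browser and scrape the resulting DOM. A rough sketch, assuming Chromium is installed and the timer actually appears in the rendered markup (the grep pattern is a placeholder you would have to adjust):
chromium --headless --disable-gpu --dump-dom 'http://invasiontimer.com/' | grep -o 'TIMER_PATTERN' > timer.txt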

Wget does not fetch Google search results

I noticed when running wget https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=foo and similar queries, I don't get the search results, but the Google homepage.
There seems to be some redirect within the Google page. Does anyone know a fix so wget would work?
You can use these curl commands to pull Google query results:
curl -sA "Chrome" -L 'http://www.google.com/search?hl=en&q=time' -o search.html
For an https URL:
curl -k -sA "Chrome" -L 'https://www.google.com/search?hl=en&q=time' -o ssearch.html
The -A option sets a custom user agent ("Chrome") in the request to Google.
#q=foo is your hint, as that's a fragment ID, which never gets sent to the server. I'm guessing you just took this URL from your browser URL-bar when using the live-search function. Since it is implemented with a lot of client-side magic, you cannot rely on it to work; try using Google with live search disabled instead. A URL pattern that seems to work looks like this: http://www.google.com/search?hl=en&q=foo.
However, I do notice that Google returns 403 Forbidden when called naïvely with wget, indicating that they don't want that. You can easily get past it by setting some other user-agent string, but do consider all the implications before doing so on a regular basis.
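If you want to stick with wget, the same trick works there; a minimal example (the user-agent string is only an example, and the URL must be quoted so the shell does not treat & as a command separator):
wget --user-agent='Chrome' 'http://www.google.com/search?hl=en&q=foo' -O search.html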

ajax request and robots.txt

A website has a URL http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME. When I click the URL, it executes the ajax function.
When I put the URL in the address bar, it gives a redirect error because the URL doesn't actually take you anywhere, but it definitely still executes the ajax function.
When I use the command line bash call firefox -new-window http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME, it opens an empty page except for the line "Bad user...". After some digging I found that the robots.txt file has "Disallow: /wp-admin/". I am assuming this is why it isn't working from the command line. I have used wget -e robots=off URL before, but there isn't anything to download, so it doesn't apply here.
What type of URL is this? (I believe it's dynamic or formula, but not sure)
I want to get the same results with the command line as when I plug the URL into the address bar. Ideas?
It's nothing special; the server just returns that HTML no matter what. HTTP servers don't have to map URLs onto files; the endpoint behind it could be written in C++, Java, Python, or Node.js (probably not).
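To call it from the shell, a curl invocation along these lines may be enough (just a sketch; the assumption, going by the "Bad user..." message, is that the server only checks the user agent; robots.txt is purely advisory and curl never reads it):
curl -A 'Mozilla/5.0' 'http://example.com/wp-admin/admin-ajax.php?action=FUNCTIOn_NAME'
Quoting the URL keeps the shell from trying to expand the ?.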

How does curl retrieve a URL with # and ! symbols in it?

I was considering using curl to retrieve a page from a URL (http://bbs.byr.cn/#!board/JobInfo?p=3) but ended up getting a notice from bash like
$ curl bbs.byr.cn/#!article/JobInfo/102321
bash: !article/JobInfo/102321: event not found
This URL is accessible in my browser window; how can I write a curl command line that works on this URL?
In general this is not possible: the part after the hash (#) is handled by JavaScript on the client side, and curl cannot execute JavaScript. You can put the URL in quotes to get the static part of the page, but that is surely not what you want.
If you observe the traffic of that page in Firebug you will see that the URL http://bbs.byr.cn/board/JobInfo?p=3 is requested. That is the file you can download to get your results.
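For example (quoting the URL also stops bash from treating the ! as history expansion):
curl 'http://bbs.byr.cn/board/JobInfo?p=3' -o JobInfo.html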
