Get source code of this website - bash

I would like to get some data about books I want to buy, but for that I need the source code of the page and I cannot get it.
An example URL is:
http://www.mcu.es/webISBN/tituloDetalle.do?sidTitul=793927&action=busquedaInicial&noValidating=true&POS=0&MAX=50&TOTAL=0&prev_layout=busquedaisbn&layout=busquedaisbn&language=es
I'm testing various possibilities with curl, wget, lynx, accepting cookies, etc.
# curl http://www.mcu.es/webISBN/tituloDetalle.do?sidTitul=793927&action=busquedaInicial&noValidating=true&POS=0&MAX=50&TOTAL=0&prev_layout=busquedaisbn&layout=busquedaisbn&language=es
[1] 1680
[2] 1681
[3] 1682
[4] 1683
[5] 1684
[6] 1685
[7] 1686
[8] 1687
If I look at the headers, I see a 302:
curl -I 'http://www.mcu.es/webISBN/tituloDetalle.do?sidTitul=793927&action=busquedaInicial&noValidating=true&POS=0&MAX=50&TOTAL=0&prev_layout=busquedaisbn&layout=busquedaisbn&language=es'
**HTTP/1.1 302 Movido temporalmente**
Date: Fri, 08 Jul 2016 09:31:07 GMT
Server: Apache
X-Powered-By: Servlet 2.4; JBoss-4.2.1.GA (build: SVNTag=JBoss_4_2_1_GA date=200707131605)/Tomcat-5.5
Location: http://www.mcu.es/paginaError.html
Vary: Accept-Encoding,User-Agent
Content-Type: text/plain; charset=ISO-8859-1
The same thing happens if I use '', "", \? \&, wget, lynx -source, accepting cookies, etc. The only thing I manage to download is the error page (the one the 302 sends me to).
Do you know how I can download the source code of the example URL above? (bash, PHP, Python, Perl, ...)
Thank you very much.

The page you are looking for isn't available any more; if you visit the URL in a browser you still won't get the information you need. If you want the source of whatever the server serves instead, pass the -L flag and curl will follow the redirect and fetch that page's source code.
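A minimal sketch of that invocation: quoting the URL keeps the shell from treating each & as a background operator (which is why the unquoted command above only printed job numbers), and -L makes curl follow the 302 to its final location:
$ curl -L 'http://www.mcu.es/webISBN/tituloDetalle.do?sidTitul=793927&action=busquedaInicial&noValidating=true&POS=0&MAX=50&TOTAL=0&prev_layout=busquedaisbn&layout=busquedaisbn&language=es' -o page.html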

Related

Shell script - Force HTTP 1.1 pipelining in curl, nc or telnet

I need to force HTTP/1.1 pipelining for several GET requests with curl, telnet or netcat in a bash script. I've already tried with curl, but as far as I know the tool dropped HTTP pipelining support in version 7.65.0, and I wasn't able to find much information about how to do it. Still, if it isn't possible with telnet or netcat, I have access to curl version 7.29.0 on another computer.
From Wikipedia:
HTTP pipelining is a technique in which multiple HTTP requests are sent on a single TCP (transmission control protocol) connection without waiting for the corresponding responses.
To send multiple GET requests with netcat, something like this should do the trick:
echo -en "GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\nGET /other.html HTTP/1.1\r\nHost: example.com\r\n\r\n" | nc example.com 80
This will send two HTTP GET requests to example.com, one for http://example.com/index.html and another for http://example.com/other.html.
The -e flag means interpret escape sequences (the carriage returns and line feeds, or \r and \n). The -n means don't print a newline at the end (it would probably work without the -n).
I just ran the above command and got two responses from this, one was a 200 OK, the other was a 404 Not Found.
It might be easier to see the multiple requests and responses if you do a HEAD request instead of a GET request. That way, example.com's server will only respond with the headers.
echo -en "HEAD /index.html HTTP/1.1\r\nHost: example.com\r\n\r\nHEAD /other.html HTTP/1.1\r\nHost: example.com\r\n\r\n" | nc example.com 80
This is the output I get from the above command:
$ echo -en "HEAD /index.html HTTP/1.1\r\nHost: example.com\r\n\r\nHEAD /other.html HTTP/1.1\r\nHost: example.com\r\n\r\n" | nc example.com 80
HTTP/1.1 200 OK
Content-Encoding: gzip
Accept-Ranges: bytes
Age: 355429
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 24 Feb 2020 14:48:39 GMT
Etag: "3147526947"
Expires: Mon, 02 Mar 2020 14:48:39 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (dna/63B3)
X-Cache: HIT
Content-Length: 648
HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Age: 162256
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 24 Feb 2020 14:48:39 GMT
Expires: Mon, 02 Mar 2020 14:48:39 GMT
Last-Modified: Sat, 22 Feb 2020 17:44:23 GMT
Server: ECS (dna/63AD)
X-Cache: 404-HIT
Content-Length: 1256
If you want to see more details, run one of these commands while Wireshark is capturing. In my capture, the pipelined request is sent in a single packet (No. 7, highlighted) and the two responses arrive in packets No. 11 and No. 13.
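As an aside, printf is a more portable way to emit the request bytes than echo -en (not every shell's built-in echo honours -e); this is an equivalent sketch of the HEAD example above:
printf 'HEAD /index.html HTTP/1.1\r\nHost: example.com\r\n\r\nHEAD /other.html HTTP/1.1\r\nHost: example.com\r\n\r\n' | nc example.com 80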

Cannot background HTTPie http request with `&` in .sh script

How come this works from the BASH prompt:
/testproj> http http://localhost:5000/ping/ &
[1] 10733
(env)
/testproj> HTTP/1.0 200 OK
Content-Length: 2
Content-Type: application/json
Date: Sat, 17 Nov 2018 19:27:01 GMT
Server: Werkzeug/0.14.1 Python/3.6.4
{}
... but fails when executed from in a .sh:
/testproj> cat x.sh
http http://localhost:5000/ping/ &
(env)
/testproj> ./x.sh
(env)
/testproj> HTTP/1.0 405 METHOD NOT ALLOWED
Allow: GET, HEAD, OPTIONS
Content-Length: 178
Content-Type: text/html
Date: Sat, 17 Nov 2018 19:29:00 GMT
Server: Werkzeug/0.14.1 Python/3.6.4
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>
Why does this happen?
EDIT: http is HTTPie
EDIT: type http gives http is hashed (/testproj/env/bin/http)
EDIT: One can reproduce the error with just http http://www.google.com </dev/null & (Thanks #e36freak)
EDIT: from e36freak on IRC:
it appears to be an issue with stdin
i get the same error with just http http://www.google.com </dev/null
http wants stdin to be attached to a tty it looks like
for whatever reason
couldn't find it in the man page but i'm sure it's out there
You most likely need to include the --ignore-stdin option to prevent HTTPie from trying to read stdin. See: https://httpie.org/doc#scripting
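A minimal sketch of x.sh with that flag added (the /ping/ endpoint is the one from the question):
#!/usr/bin/env bash
# --ignore-stdin stops HTTPie from reading stdin, which is not a TTY when run from a script
http --ignore-stdin http://localhost:5000/ping/ &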

Is there a way to improve copying to the clipboard format from Bash on Ubuntu on Windows?

When using Bash on Windows, copying from the terminal results in terribly formatted content when I paste. Here is a curl request I made and copied directly from Bash on Ubuntu on Windows:
$ curl -IL aol.com HTTP/1.1 301 Moved Permanently Date: Fri, 27 Apr 2018 15:58:31 GMT Server: Apache Location: https://www.aol.com/ Content-Type: text/html; charset=iso-8859-1
It's basically all on one line, almost like the line breaks are not preserved. I would expect the pasted output to look like this:
$ curl -IL aol.com
HTTP/1.1 301 Moved Permanently
Date: Fri, 27 Apr 2018 15:19:49 GMT
Server: Apache
Location: https://www.aol.com/
Content-Type: text/html; charset=iso-8859-1
Is there a way to fix this? Right now I have to go into a text editor to format every copy/paste I need to make.
Edit: Here is a screenshot of the above code block copied to the clipboard directly from Bash on Win10: https://i.imgur.com/VUKRhLX.png

Download file with url redirection

I can download a file by its URL in a browser, but when I try from bash I get an HTML page instead of the file.
How can I download a file whose URL redirects (301 Moved Permanently) using curl, wget or something else?
Update:
Headers from the URL request:
curl -I http://www.somesite.com/data/file/file.rar
HTTP/1.1 301 Moved Permanently
Date: Sat, 07 Dec 2013 10:15:28 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3
Location: http://www.somesite.com/files/html/archive.html
Vary: Accept-Encoding
Content-Type: text/html
X-Pad: avoid browser bug
Use -L, --location to follow redirects:
$ curl -L http://httpbin.org/redirect/1
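To save the downloaded file instead of printing it to the terminal, combine -L with -O (keep the remote file name) or -o name (choose your own); the URL below is the placeholder from the question, so this is only a sketch:
$ curl -L -O 'http://www.somesite.com/data/file/file.rar'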

How to properly handle a gzipped page when using curl?

I wrote a bash script that gets output from a website using curl and does a bunch of string manipulation on the html output. The problem is when I run it against a site that is returning its output gzipped. Going to the site in a browser works fine.
When I run curl by hand, I get gzipped output:
$ curl "http://example.com"
Here are the headers from that particular site:
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
X-Powered-By: PHP/5.2.17
Last-Modified: Sat, 03 Dec 2011 00:07:57 GMT
ETag: "6c38e1154f32dbd9ba211db8ad189b27"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: must-revalidate
Content-Encoding: gzip
Content-Length: 7796
Date: Sat, 03 Dec 2011 00:46:22 GMT
X-Varnish: 1509870407 1509810501
Age: 504
Via: 1.1 varnish
Connection: keep-alive
X-Cache-Svr: p2137050.pubip.peer1.net
X-Cache: HIT
X-Cache-Hits: 425
I know the returned data is gzipped, because this returns html, as expected:
$ curl "http://example.com" | gunzip
I don't want to pipe the output through gunzip, because the script works as-is on other sites, and piping through gunzip would break that functionality.
What I've tried:
changing the user-agent (I tried the same string my browser sends, "Mozilla/4.0", etc.)
man curl
Google search
searching Stack Overflow
Everything came up empty.
Any ideas?
curl will automatically decompress the response if you set the --compressed flag:
curl --compressed "http://example.com"
--compressed
(HTTP) Request a compressed response using one of the algorithms libcurl supports, and save the uncompressed document. If this option is used and the server sends an unsupported encoding, curl will report an error.
gzip is most likely supported, but you can check this by running curl -V and looking for libz somewhere in the "Features" line:
$ curl -V
...
Protocols: ...
Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
Note that it's really the website in question that is at fault here. If curl did not pass an Accept-Encoding: gzip request header, the server should not have sent a compressed response.
In the relevant bug report, "Raw compressed output when not using --compressed but server returns gzip data" (#2836), one of the developers says:
The server shouldn't send content-encoding: gzip without the client having signaled that it is acceptable.
Besides, when you don't use --compressed with curl, you tell the command line tool you rather store the exact stream (compressed or not). I don't see a curl bug here...
So if the server could be sending gzipped content, use --compressed to let curl decompress it automatically.
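In the script itself, the only change needed is at the point where the page is fetched; a minimal sketch (the URL and variable name are placeholders):
# -s silences the progress meter; --compressed requests gzip and transparently decompresses it
html=$(curl -s --compressed "http://example.com")
# ...existing string manipulation on "$html"...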
