progressive http download to a file

I want to redirect the output of the command "wget" to a file, but I do not want the file to be filled only when the command finishes; I want it to be filled progressively. Thanks.

wget -qO- www.stackoverflow.com > file.html

Take a look at curl with the --no-buffer flag.
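A minimal sketch of that, reusing the question's example URL: --no-buffer (short form -N) disables curl's output buffering, so the file fills progressively as data arrives:
# stream the response into file.html as it is received
curl --no-buffer -o file.html www.stackoverflow.com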

Related

iterate through specific files using webHDFS in a bash script

I want to download specific files in an HDFS directory, with their names starting with "total_conn_data_". Since I've got many files, I want to write a bash script.
Here's what I do:
myPatternFile="total_conn_data_*.csv"
for filename in `curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/?OP=LISTSTATUS" -u username`; do
curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/$filename?OP=OPEN" -u username -L -o "./data/$filename" -k;
done
But it does not work, since curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/?OP=LISTSTATUS" -u username sends back JSON text and not file names.
How should I do this? Thanks
curl just returns the JSON that WebHDFS sends back. You will have to use another tool such as jq or sed to parse that output and get the list of file names.
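A hedged sketch of that approach, using jq to pull the file names out of the standard WebHDFS LISTSTATUS response; the host, path and credentials are the question's placeholders, and -i is dropped so only the JSON body reaches jq:
# list the directory, keep names matching the prefix, then fetch each file via OP=OPEN
for filename in $(curl -s -k -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/?OP=LISTSTATUS" -u username \
        | jq -r '.FileStatuses.FileStatus[].pathSuffix' \
        | grep '^total_conn_data_.*\.csv$'); do
    curl -k -L -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/$filename?OP=OPEN" -u username -o "./data/$filename";
done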

Using wget to download images from a webpage

I tried to download all the images from a given URL using wget. Below are some of the commands I used.
wget -A.jpg [URL]
wget -A .jpg [URL]
wget -A *.jpg [URL]
wget -nd -r -P /my/directory/ -A jpeg,jpg [URL]
None of the above commands worked. So to make sure, I checked the file extension of each image from the URL I specified and realized they are formatted like this:
URL/image.jpg?quality=85&strip=info&w=1200
How can I work around this issue, where there are query parameters after the file extension in the URLs I am trying to retrieve? Is there an option in wget that I am missing?
Thanks, any help will be appreciated.
You have to put an asterisk after jpg too, if you want to match those files that don't end in .jpg.
Like (quoted so the shell doesn't expand the pattern against local files):
wget -A "*.jpg*" [URL]
You can also use regex patterns with the --accept-regex argument if you want more complex filtering.
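For example, a hedged sketch of the regex route, assuming a wget new enough to support --accept-regex and that the query string follows the .jpg extension as in the question (the directory and [URL] are placeholders):
# accept any URL whose path ends in .jpg, optionally followed by a ?query=string
wget -nd -r -P /my/directory/ --accept-regex '\.jpg(\?.*)?$' [URL]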

wget to parse a webpage in shell

I am trying to extract URLs from a webpage using wget. I tried this
wget -r -l2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'
It is displaying FINISHED
Downloaded: 18,472 bytes in 1 files
But it is not displaying the web links. If I try to do it separately
wget -r -l2 --reject=gif -O out.html www.google.com
sed -n 's/.*href="\([^"]*\).*/\1/p' < out.html
Output
http://www.google.com/intl/en/options/
/intl/en/policies/terms/
It is not displaying all the links:
http://www.google.com
http://maps.google.com
https://play.google.com
http://www.youtube.com
http://news.google.com
https://mail.google.com
https://drive.google.com
http://www.google.com
http://www.google.com
http://www.google.com
https://www.google.com
https://plus.google.com
And moreover, I want to get links from the 2nd level and deeper. Can anyone give a solution for this?
Thanks in advance
The -O file option captures the output of wget and writes it to the specified file, so there is no output going through the pipe to sed.
You can say -O - to direct wget output to standard output.
If you don't want to use grep, you can try
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"
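Putting the two suggestions together, a sketch of the corrected pipeline for a single page (-q keeps wget's progress messages out of the pipe, so sed only sees the HTML):
# send the page to stdout with -O - and extract href values with sed
wget -q -O - www.google.com | sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"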

How to download a file using curl

I'm on Mac OS X and can't figure out how to download a file from a URL via the command line. It's from a static page, so I thought copying the download link and then using curl would do the trick, but it's not working.
I referenced this StackOverflow question but that didn't work. I also referenced this article which also didn't work.
What I've tried:
curl -o https://github.com/jdfwarrior/Workflows.git
curl: no URL specified!
curl: try 'curl --help' or 'curl --manual' for more information
wget -r -np -l 1 -A zip https://github.com/jdfwarrior/Workflows.git
zsh: command not found: wget
How can a file be downloaded through the command line?
The -o (--output) option means curl writes output to the file you specify instead of stdout. Your mistake was putting the URL after -o, so curl thought the URL was the file to write to and hence that no URL was specified. You need a file name after -o, then the URL:
curl -o ./filename https://github.com/jdfwarrior/Workflows.git
And wget is not available by default on OS X.
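If you do want wget on macOS, it is commonly installed through a package manager; for example, assuming Homebrew is already set up:
# install wget with Homebrew
brew install wget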
curl -OL https://github.com/jdfwarrior/Workflows.git
-O: This option is used to write the output to a local file named like the remote file we get. In this case that file would be Workflows.git.
-L: This option is used if the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code); it makes curl redo the request at the new location.
Ref: curl man page
The easiest solution for your question is to keep the original filename. In that case, you just need to use a capital o ("-O") as the option (not a zero!). So it looks like:
curl -O https://github.com/jdfwarrior/Workflows.git
There are several options to make curl output to a file:
# saves it to myfile.txt
curl http://www.example.com/data.txt -o myfile.txt -L
# The #1 gets substituted with the first glob match from the URL (a glob such as {} or [] is required), so the filename reflects the URL
curl "http://www.example.com/{data}.txt" -o "file_#1.txt" -L
# saves to data.txt, the filename extracted from the URL
curl http://www.example.com/data.txt -O -L
# saves to filename determined by the Content-Disposition header sent by the server.
curl http://www.example.com/data.txt -O -J -L
# -O Write output to a local file named like the remote file we get
# -o <file> Write output to <file> instead of stdout (variable replacement performed on <file>)
# -J Use the Content-Disposition filename instead of extracting filename from URL
# -L Follow redirects

wget command to download a file and save as a different filename

I am downloading a file using the wget command. But when it downloads to my local machine, I want it to be saved as a different filename.
For example: I am downloading a file from www.examplesite.com/textfile.txt
I want to use wget to save the file textfile.txt on my local directory as newfile.txt. I am using the wget command as follows:
wget www.examplesite.com/textfile.txt
Use the -O file option.
E.g.
wget google.com
...
16:07:52 (538.47 MB/s) - `index.html' saved [10728]
vs.
wget -O foo.html google.com
...
16:08:00 (1.57 MB/s) - `foo.html' saved [10728]
Also notice the order of parameters on the command line. At least on some systems (e.g. CentOS 6):
wget -O FILE URL
works. But:
wget URL -O FILE
does not work.
You would use the command Mechanical snail listed. Notice the uppercase O. The full command line to use could be:
wget www.examplesite.com/textfile.txt --output-document=newfile.txt
or
wget www.examplesite.com/textfile.txt -O newfile.txt
Hope that helps.
wget -O yourfilename.zip remote-storage.url/theirfilename.zip
will do the trick for you.
Note:
a) it's a capital O.
b) wget -O filename url is the form that works reliably; putting -O after the URL may not work on all systems.
Either curl or wget can be used in this case. All 3 of these commands do the same thing, downloading the file at http://path/to/file.txt and saving it locally into "my_file.txt":
wget http://path/to/file.txt -O my_file.txt # my favorite--it has a progress bar
curl http://path/to/file.txt -o my_file.txt
curl http://path/to/file.txt > my_file.txt
Notice the first one's -O is the capital letter "O".
The nice thing about the wget command is it shows a nice progress bar.
You can prove the files downloaded by each of the 3 techniques above are identical by comparing their sha512 hashes. Running sha512sum my_file.txt after each of the commands above and comparing the results reveals all 3 files to have the exact same sha sums, meaning the files are identical, byte for byte.
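As a quick sketch of that check (the URL and filenames are placeholders, and sha512sum is assumed to be available, e.g. from GNU coreutils):
# download the same file three ways under different names, then compare hashes
wget http://path/to/file.txt -O file_wget.txt
curl http://path/to/file.txt -o file_curl.txt
curl http://path/to/file.txt > file_redirect.txt
sha512sum file_wget.txt file_curl.txt file_redirect.txt   # identical hashes mean identical files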
See also: How to capture cURL output to a file?
Using CentOS Linux I found that the easiest syntax would be:
wget "link" -O file.ext
where "link" is the web address you want to save and "file.ext" is the filename and extension of your choice.
