curl script to extract tar files - bash

I have a curl script that I use to read files from a URL. I now need to extract the tar files after they are downloaded.
for file in $(/usr/bin/curl 'https://linktofile.com/yourfile.txt' );
do
  echo $file;
  /usr/bin/curl -L -J -O "$file";
done;
The output to the screen is:
curl: Saved to filename 'your_file_1.tar'
How do I extract the file that was saved?
I tried adding tar -xvf $file but nothing happens.
How do I get the name of the file that was just saved?

Using -J makes the saved filename depend on the Content-Disposition header, so you need to retrieve the filename curl actually used. You can do so through -w, which specifies what curl writes out after the transfer; %{filename_effective} expands to the name of the file that was saved:
for url in $(/usr/bin/curl 'https://linktofile.com/yourfile.txt' ); do
  filename=$(/usr/bin/curl -sLJOw '%{filename_effective}' "$url")
  tar -xvf "$filename"
done
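If the list in yourfile.txt has one URL per line, reading it with while read -r avoids relying on word splitting; a minimal sketch along the same lines (same flags as above):
/usr/bin/curl -s 'https://linktofile.com/yourfile.txt' | while read -r url; do
  # %{filename_effective} prints the name curl actually saved to
  filename=$(/usr/bin/curl -sLJOw '%{filename_effective}' "$url")
  tar -xvf "$filename"
done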

Related

return filename after downloading file with curl

I would like to capture, in a variable, the filename of a file that is downloaded using curl. I am using the --remote-name flag to preserve the filename, as shown below.
My code:
file1=$(curl -O --remote-name 'https://url.com/download_file.tgz')
echo $file1
You can use the -w|--write-out switch of curl:
file1="$(curl -O --remote-name -s \
-w "%{filename_effective}" "https://url.com/download_file.tgz")"
echo "$file1"
file1=download_file.tgz
url=https://url.com/$file1  # URL-encoding this might be necessary
curl -O --remote-name "$url"
echo "$file1"
If you need to know the filename in order to construct the URL in the first place, then you don't need curl to tell you which file it downloaded, unless there isn't a 1:1 relationship between the basename of the URL and the file that was saved.
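In that case the saved name can be derived from the URL itself with parameter expansion; a small sketch using the placeholder URL from the question:
url='https://url.com/download_file.tgz'  # placeholder URL
file1=${url##*/}                         # strip everything up to the last "/"
curl -s -O "$url"
echo "$file1"                            # prints download_file.tgz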

Have kafkacat produce both the name of a file and its contents

I want kafkacat to write both the name of a file and its contents in a single bash command, if possible. The following apparently does not work. What am I missing?
kafkacat -T -P -b ... -t ... <(echo $file) $file
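One way to keep the name and the contents together in a single message is to attach the name as metadata instead of sending it as a second message. A minimal sketch, assuming a kafkacat/kcat build that supports producer headers (-H); the broker, topic, and file name below are placeholders:
file=report.csv  # placeholder file name
kafkacat -P -b mybroker -t mytopic -H "filename=$file" "$file"
# Alternatively, -k "$file" would carry the name as the message key.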

Check if a remote file exists in bash

I am downloading files with this script:
parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
Would it be possible not to download the files, but just check them on the remote side and, if they exist, create a dummy file instead of downloading?
Something like:
if wget --spider "$url" 2>/dev/null; then
  touch img.file
fi
should work, but I don't know how to combine this code with GNU Parallel.
Edit:
Based on Ole's answer I wrote this piece of code:
#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
  # get the filename from $url
  url2=${url##*/}
  wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url
parallel --progress -a urls.txt do_url {}
It works, but it fails for some files. I cannot find any consistency in why it works for some files and fails for others. Maybe it has something to do with the last filename. The second wget tries to access the correct URL, but the touch command after it simply does not create the desired file. The first wget always (correctly) downloads the main image without the _001.jpg, _002.jpg suffixes.
Example urls.txt:
http://host.com/092401.jpg (works correctly, _001.jpg .. _005.jpg are downloaded)
http://host.com/HT11019.jpg (doesn't work, only the main image is downloaded)
It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.
I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:
http://example.com/dira/foo.jpg
becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.
If the URL exists I want an empty file created.
(Version 1): I want the empty file created in a similar directory structure in the dir images. This is needed because some of the images have the same name, but in different dirs.
So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.
So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images, named after the entry in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg
#!/bin/bash
do_url() {
  url="$1"
  # Version 1:
  # If you want to keep the folder structure from the server (similar to wget -m):
  wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"
  # Version 2:
  # If all the images have unique names and you want all images in a single dir:
  wget -q --method HEAD "$url" && touch images/"$3"
  # Version 3:
  # If all the images have unique names when _###.jpg is removed and you want all images in a single dir:
  wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url
parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
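If the replacement strings look opaque, --dry-run prints the commands GNU Parallel would run without executing them, which makes it easy to see how {1.}, {1//}, {1/.}, and {1/} expand:
# Preview the generated do_url invocations without running them
parallel --dry-run do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg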
GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100% you can run more jobs in parallel:
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
You can also "unroll" the loop. This will save 5 overheads per URL:
do_url() {
  url="$1"
  # Version 2:
  # If all the images have unique names and you want all images in a single dir:
  wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
}
export -f do_url
parallel -j0 do_url {.} :::: urls.txt
Finally you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
You can use curl instead to check whether the URLs you are processing exist, without downloading any file:
if curl --head --fail --silent "$url" >/dev/null; then
  touch ./images/"${url##*/}"
fi
Explanation:
--fail makes the exit status nonzero on a failed request.
--head avoids downloading the file contents.
--silent prevents status output and errors from being emitted by the check itself.
To solve the "looping" issue, you can do:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if curl --head --silent --fail "$url" > /dev/null; then
touch .images/${url##*/}
fi
done
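To combine this check with GNU Parallel as in the original question, the same test can be wrapped in an exported function; a sketch assuming urls.txt holds one URL per line:
do_check() {
  url="$1"
  if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
  fi
}
export -f do_check
parallel --progress do_check :::: urls.txt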
From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather about how to perform correct looping in a shell script.
Here is a simple solution for that:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[#]}"; do
if wget -q --method=HEAD "$url"; then
touch .images/${url##*/}
fi
done
What this does is invoke Wget with the --method=HEAD option. With a HEAD request, the server simply reports whether the file exists or not, without returning any data.
Of course, with a large data set this is pretty inefficient. You're creating a new connection to the server for every file you try. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2, you can test all of these in parallel, and use the --stats-site option to get a list of all the files and the specific return code that the server provided. For example:
$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:

  http://example.com:
    Status    No. of docs
       404           3
         http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
         http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
         http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200           1
         http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)
You can even get this data printed as CSV or JSON for easier parsing.
Just loop over the names?
for uname in "${url%.jpg}"_{001..005}.jpg
do
  if wget --spider "$uname" 2>/dev/null; then
    touch ./images/"${uname##*/}"
  fi
done
You could send a command via ssh to see if the remote file exists and cat it if it does:
ssh your_host 'test -e "somefile" && cat "somefile"' > somefile
You could also try scp, which supports glob expressions and recursion.
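For example, a sketch assuming SSH access to your_host and that the remote images live under a known directory (the path is a placeholder):
# The remote glob is expanded on the server; add -r to copy directories recursively
scp 'your_host:/remote/images/*.jpg' ./images/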

Content-Disposition in dropbox

Is there a way to rename a file while downloading it from Dropbox, without changing the filename itself?
For example:
Dropbox link: https://www.dropbox.com/s/uex431ou02h2m2b/300x50.gif?dl=1
I want the downloaded file to be NewNameImage.gif instead of 300x50.gif.
The Content-Disposition header didn't work for me.
Any ideas how to do that?
If you're downloading the file locally you should either be able to control the name given to the local file, or be able to rename it after the fact. For example, using curl, -JO downloads the file using the remote name specified in the Content-Disposition header, while -o lets you specify a name:
$ curl -L -JO "https://www.dropbox.com/s/uex431ou02h2m2b/300x50.gif?dl=1"
...
curl: Saved to filename '300x50.gif'
$ ls
300x50.gif
$ rm 300x50.gif
$ curl -L -o "NewNameImage.gif" "https://www.dropbox.com/s/uex431ou02h2m2b/300x50.gif?dl=1"
...
$ ls
NewNameImage.gif
Alternatively, renaming it after the fact:
$ curl -L -JO "https://www.dropbox.com/s/uex431ou02h2m2b/300x50.gif?dl=1"
...
curl: Saved to filename '300x50.gif'
$ ls
300x50.gif
$ mv 300x50.gif "NewNameImage.gif"
$ ls
NewNameImage.gif
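The download-and-rename steps can also be collapsed into one by capturing the saved name with -w '%{filename_effective}', as in the earlier answers; a sketch using the same Dropbox link:
saved=$(curl -sL -J -O -w '%{filename_effective}' \
  "https://www.dropbox.com/s/uex431ou02h2m2b/300x50.gif?dl=1")
mv "$saved" NewNameImage.gif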

wget to download images from retrieved file

I've got a file I need to retrieve, then I need to go through that file and download all the images listed. The format is xml, but I don't want to use an xml parser.
When I use
sudo wget --restrict-file-names=windows -nH -nd -r -i -P images \
  -A jpeg,jpg,gif,png https://url.com/api/ojgnvhy75hGvcf36dnJO0947bsh62gbs?_=1361842359357
I get the xml file downloaded, but I need the images which are referenced in that file.
What am I doing wrong here?
I ended up with the following code: it gets the XML file and saves it to a text file, then extracts the links from that file using sed and writes them into another file, and finally runs wget on that file to download the images.
#!/bin/dash
# Fetch the XML document
wget -O xml.txt 'https://url_to_download_from'
# Extract the image links; the "w" flag also writes them to links.txt
links=$(sed -n "/image>/s/^ .\([^>]*\)<\/image>.*/\1/gpw links.txt" xml.txt)
# Download every URL listed in links.txt
wget -N -P images -A png -i links.txt
Sadly, this results in a bunch of files which are not images, even though I'm requesting only images.
After this script has completed, I run the following commands to clean up the folder.
cd images
shopt -s extglob nocaseglob
rm !(*.png)
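An alternative that sidesteps the fragile sed expression is to pull anything that looks like an image URL out with grep and hand the list to wget; a sketch assuming the URLs appear verbatim in the XML and end in a known image extension:
#!/bin/dash
wget -O xml.txt 'https://url_to_download_from'
# Extract image-looking URLs (adjust the extension list as needed)
grep -oE 'https?://[^<>" ]+\.(png|jpe?g|gif)' xml.txt > links.txt
wget -N -P images -i links.txt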
