How to combine url with filename from file - bash

Text file (filename: listing.txt) with names of files as its contents:
ace.pdf
123.pdf
hello.pdf
I want to download these files from the URL http://www.myurl.com/
In bash, I tried to merge these together and download the files using wget, e.g.:
http://www.myurl.com/ace.pdf
http://www.myurl.com/123.pdf
http://www.myurl.com/hello.pdf
I tried variations of the following, but without success:
for i in $(cat listing.txt); do wget http://www.myurl.com/$i; done

There is no need to use cat and a loop. You can use xargs for this:
xargs -I {} wget http://www.myurl.com/{} < listing.txt
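If the for loop keeps failing even though its syntax looks fine, one common culprit (an assumption here, not something shown in the question) is Windows-style line endings in listing.txt. A small sketch that strips carriage returns and reads the file line by line:
# Strip any trailing CR from each line, then download each file individually.
tr -d '\r' < listing.txt | while IFS= read -r f; do
    wget "http://www.myurl.com/$f"
done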

Actually, wget has options which avoid loops and external programs completely.
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)
If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.
However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by specifying --base=url on the command line.
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
-B URL
--base=URL
Resolves relative links using URL as the point of reference, when reading links from an HTML file specified via the -i/--input-file option (together with --force-html, or when the input file was fetched remotely from a server describing it as HTML). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value for the "href" attribute.
For instance, if you specify http://foo/bar/a.html for URL, and Wget reads ../baz/b.html from the input file, it would be resolved to http://foo/baz/b.html.
Thus,
$ cat listing.txt
ace.pdf
123.pdf
hello.pdf
$ wget -B http://www.myurl.com/ -i listing.txt
This will download all three files.
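As an alternative that does not rely on -B, you can also build the full URLs yourself and feed them to wget on standard input (a small sketch using the same file and base URL from the question):
# Prepend the base URL to every line of listing.txt and hand the result to wget.
sed 's|^|http://www.myurl.com/|' listing.txt | wget -i -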

Related

wget: using wildcards in the middle of the path

I am trying to recursively download .nc files from: https://satdat.ngdc.noaa.gov/sem/goes/data/full/*/*/*/netcdf/*.nc
A target link looks like this one:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/netcdf/
and I need to exclude this:
https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/csv/
I do not understand how to use wildcards for defining path in wget.
Also, the following command (a test for year 1981 only) only downloads subfolders 10, 11 and 12, failing for the {01..09} subfolders:
for i in {01..12};do wget -r -nH -np -x --force-directories -e robots=off https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/${i}/goes02/netcdf/; done
"I do not understand how to use wildcards for defining path in wget."
According to the GNU Wget manual,
File name wildcard matching and recursive mirroring of directories are available when retrieving via FTP.
so you cannot use wildcards in the URL when working with an HTTP or HTTPS server.
You might combine -r with --accept-regex urlregex to "specify a regular expression to accept (...) the complete URL."
Note that the regex must match the whole URL. For example, if I wanted the pages linked from the GNU Package blurbs whose URLs contain auto, I could do:
wget -r --level=1 --accept-regex '.*auto.*' https://www.gnu.org/manual/blurbs.html
which results in downloading the main pages of autoconf, autoconf-archive, autogen and automake. Note: --level=1 is used to prevent going further down than the links shown in the blurbs.
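Applied to the question's GOES directory tree, one hedged possibility (untested here; the host and paths come from the question, and whether recursion descends exactly as intended depends on how the server's index pages link the subdirectories) is to reject the csv/ branches and accept only .nc files:
# Recurse one year, skip any csv/ branch, and keep only .nc files.
# HTML index pages are still fetched for traversal and then deleted because of -A '*.nc'.
wget -r -np -nH -x -e robots=off \
     --reject-regex '.*/csv/.*' -A '*.nc' \
     https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/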

pdftk update_info command raising a warning which I don't understand

I'm trying to use the update_info command in order to add some bookmarks to an existing pdf's metadata, using pdftk and PowerShell.
I first dump the metadata into a file as follows:
pdftk .\test.pdf dump_data > test.info
Then, I edit the test.info file by adding the bookmarks, I believe I am using the right syntax. I save the test.info file and attempt to write the metadata to a new pdf file using update_info:
pdftk test.pdf update_info test.info output out.pdf
Unfortunately, I get a warning as follows:
pdftk Warning: unexpected case 1 in LoadDataFile(); continuing
out.pdf is generated, but contains no bookmarks. Just to be sure it is not a syntax problem, I also ran it without editing the metadata file, by simply overwriting the same metadata. I still got the same warning.
Why is this warning occurring? Why are no bookmarks getting written to my resulting pdf?
Using redirection in that fashion,
pdftk .\test.pdf dump_data > test.info
is a known cause of this problem: the redirected output does not have the file structure pdftk expects. Change it to:
pdftk .\test.pdf dump_data output test.info
In addition, check that your alterations are correctly balanced (and contain no unusual characters), then save the edited output file in the same encoding.
Note: you may need to use dump_data_utf8 and update_info_utf8 in order to properly handle characters in scripts other than Latin (e.g. oriental CJK).
I used pdftk --help >pdftk-help.txt to find the answer.
With credit to the previous answer, the following creates a text file of the information parameters: pdftk aaa.pdf dump_data output info.txt
Edit the info.txt file as needed.
The pdftk update_info option creates a new pdf file, leaving the original pdf untouched. Use: pdftk aaa.pdf update_info info.txt output bbb.pdf
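For reference, a minimal sketch of what a bookmark entry looks like in the dumped data and how it can be fed back in (the title, level and page number below are made-up placeholders, not values from the question):
# Append one hypothetical bookmark entry to the dumped metadata, then rebuild the PDF.
cat >> info.txt <<'EOF'
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1
EOF
pdftk aaa.pdf update_info info.txt output bbb.pdf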

Save aria2 generated .torrent file with magnet's "dn" name?

Aria2 can take a magnet URI and save a torrent file from it. The file gets saved under the hex-encoded info hash as its name, with the suffix .torrent.
Magnet URIs have an option for ?dn=, which is a Display Name. Is it possible to use this name when saving the torrent, so that
aria2c -d . --bt-metadata-only=true --bt-save-metadata=true "magnet:?xt=urn:btih:cf7da7ab4d4e6125567bd979994f13bb1f23dddd&dn=ubuntu-18.04.2-desktop-amd64.iso"
outputs ubuntu-18.04.2-desktop-amd64.iso.torrent instead of cf7da7ab4d4e6125567bd979994f13bb1f23dddd.torrent?
I couldn't find any direct option, but there is a workaround.
aria2c -S, --show-files[=true|false]
Print file listing of .torrent, .meta4 and .metalink file and exit. More detailed
information will be listed in case of torrent file.
Using this and some grepping and cutting, you could do something like this:
mv hash.torrent "$(aria2c -S hash.torrent | grep Name | cut -c7-).torrent"
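Putting the two steps together, a rough sketch (the magnet URI is the one from the question; extracting the hash from the URI with sed and the exact "Name: " formatting of the -S output are assumptions):
# Fetch only the metadata, then rename the saved <infohash>.torrent using the
# name aria2c -S reports for it.
magnet='magnet:?xt=urn:btih:cf7da7ab4d4e6125567bd979994f13bb1f23dddd&dn=ubuntu-18.04.2-desktop-amd64.iso'
aria2c -d . --bt-metadata-only=true --bt-save-metadata=true "$magnet"
hash=$(printf '%s\n' "$magnet" | sed 's/.*btih:\([^&]*\).*/\1/')
name=$(aria2c -S "$hash.torrent" | grep Name | cut -c7-)
mv "$hash.torrent" "$name.torrent"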

Wanted to use results of find command in custom script that I am building

I want to validate my XMLs for well-formedness, but some of my files do not have a single root element, which is fine per the business requirements (e.g. <ri>...</ri><ri>..</ri> is valid XML in my context). xmlwf can do this check, but it flags a file if it does not have a single root, so I want to build a custom script which internally uses xmlwf. My custom script should:
iterate through the list of files passed as input (e.g. sample.xml or s*.xml or *.xml),
for each file, prepare a temporary file as <A> + contents of file + </A>,
and call xmlwf on that temp file.
Can someone help with this?
You could add text to the beginning and end of the file using cat and bash, so that your file has a root added to it for validation purposes.
cat <(echo '<root>') sample.xml <(echo '</root>') | xmlwf
This way you don't need to write temporary files out.
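If you do want the temporary-file flow described in the question, a minimal sketch might look like this (the <A> wrapper element comes from the question; the script name and structure are assumptions):
#!/bin/bash
# check.sh (hypothetical name): wrap each input file in a synthetic <A>...</A>
# root and validate the wrapped copy with xmlwf.
# Usage: ./check.sh sample.xml s*.xml
for f in "$@"; do
    tmp=$(mktemp)
    { echo '<A>'; cat "$f"; echo '</A>'; } > "$tmp"
    echo "checking $f"
    xmlwf "$tmp"   # xmlwf prints a message only if the wrapped file is not well-formed
    rm -f "$tmp"
done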

Edit file contents of folder before compressing it

I want to compress the contents of a folder. The catch is that I need to modify the content of a file before compressing it. The modification should not alter the contents of the original folder, but should be there in the compressed file.
So far I was able to figure out how to alter the file contents using the sed command:
sed 's:/site_media/folder1/::g' index.html >index.html1
where /site_media/folder1/ is the string I want to replace with an empty string. Currently this command creates another file named index.html1, since I don't want to make the changes in place in index.html.
I tried piping this command into the zip command as follows:
sed 's:/site_media/folder1/::g' folder1/index.html > index1.html |zip zips/folder1.zip folder1/
but I am not getting any contents when I unzip the file folder1.zip. Also, the modified file in the compressed folder should be named index.html (not index.html1).
You want to do two things in sequence, so use command1 && command2. In your case:
sed '...' folder1/index.html > index1.html && zip zips/folder1.zip folder1/
If you pipe commands, you use the output of the first to feed the second, which is something you don't want in this case.
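Note that this still leaves the edited copy named index1.html outside the archive. A hedged sketch of the complete workflow the question describes (the folder and path names come from the question; copying into a temporary directory and zip -r are assumptions):
# Work on a copy so the original folder1/ is untouched; inside the archive the
# edited file keeps the name folder1/index.html.
tmpdir=$(mktemp -d)
cp -r folder1 "$tmpdir/folder1"
sed 's:/site_media/folder1/::g' folder1/index.html > "$tmpdir/folder1/index.html"
(cd "$tmpdir" && zip -r folder1.zip folder1)
mv "$tmpdir/folder1.zip" zips/
rm -rf "$tmpdir"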
