How can I grab download links from a website?

Is there any way of getting download links from a website and putting those links in a text file, so I can download those files later with wget?

You need to download the source of the website. You can use wget link-of-the-website-you-want-to-grab-links-from for that. Then you can extract the links with sed like this:
sed -n 's/.*href="\([^"]*\).*/\1/p' file
See this question for details.
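Putting those two steps together, a minimal sketch might look like this (the URL and the page.html/links.txt file names are just placeholders):
# download the page source and pull out the href values
wget -O page.html 'https://example.com/downloads/'
sed -n 's/.*href="\([^"]*\).*/\1/p' page.html > links.txt
# later, fetch everything in the list
wget -i links.txt
Note that this simple sed prints at most one href per input line, so it can miss links when several appear on the same line.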

With this you can download jpg files; instead of jpg you can specify any file format that is present in source_file. Your list of download links will be in link.txt:
grep -Po 'href=\"\/.+\.jpg' source_file | sed -n 's/href="\([^"]*\)/\1/p' >link.txt; wget -i link.txt

Related

How to extract image urls using bash?

I would like to extract the image URLs from a page's HTML code using bash commands and then download all images from that page. I am not sure whether it is possible, as sometimes they are stored in folders which I wouldn't have access to.
But is it possible to download them from the source code?
I have written this so far:
wget -O plik.txt $1
grep *.jpg plik.txt > wget
grep *.png plik.txt > wget
grep *.gif plik.txt > wget
rm plik.txt
Using lynx (a text web browser) in non-interactive mode, and GNU xargs:
#!/bin/bash
lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- wget --no-verbose --
This will start downloading matching image URLs in the web page URL given in $1, straight away.
It will include both images in the page, and images that are linked. Removing -image_links will skip images on the page.
You can add/remove whichever extensions you want to download, following the pattern I provided for .jpg, .png, and .gif. (grep -i is case insensitive).
The reason for using null delimiters (via tr) is to use xargs -0, which will avoid problems with URLs which contain a single quote/apostrophe (').
The --no-verbose flag for wget just simplifies the log output. I find it easier to read if downloading a large list of files.
Note that regular GNU wget will handle any duplicate filenames, by appending a number (foo.jpg.1 etc). However, busybox wget for example just exits if a filename exists, abandoning further downloads.
You can also modify the xargs to just print a list of the files that would be downloaded, so you can review it first: xargs -0 -- sh -c 'printf "%s\n" "$@"' _
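Put together, a dry-run version is just the script above with wget swapped out for that printf (a sketch; adjust the extensions as before):
#!/bin/bash
# print the image URLs that would be downloaded from the page given in $1
lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- sh -c 'printf "%s\n" "$@"' _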

TAR-ing on-the-fly

I'm trying to fetch all files within all directories on our SAN. I'm starting with my local to test out how I want to do it. So, at my Documents directory:
ls -sR > documents_tree.txt
With just my local, that's fine. It gives the exact output I want. But since I'm doing it on our SAN, I'm going to have to compress on-the-fly, and I'm not sure the best way of doing this. So far I have:
ls -sR > documents_tree.txt | tar -cvzf documents_tree.tgz documents_tree.txt
When I try to check the output, it is impossible for me to un-tar the file using tar -xvf documents_tree.tar after I have gunzipped it.
So, what is the correct way to compress on-the-fly? How can I accurately check my work? Will this work when performing the same process on a SAN?
You don't need to use tar to compress a single file, just use gzip:
ls -sR | gzip > documents_tree.txt.gz
You can then use gunzip documents_tree.txt.gz to uncompress it, or tools like gzcat and zless to view it without having to uncompress it first.
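To sanity-check the result without uncompressing it, something like this should work (same file name as above; gzcat may be called zcat on some systems):
gzip -t documents_tree.txt.gz && echo "archive OK"   # test that the gzip data is intact
gzcat documents_tree.txt.gz | head                   # peek at the first few lines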
Building upon your comment on the OP and using your initial command, the following works for me:
ls -sR > documents_tree.txt && tar -cvzf documents_tree.tgz documents_tree.txt
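To check that work, tar can list the archive contents without extracting, and extraction does not need a separate gunzip step:
tar -tzf documents_tree.tgz    # should list documents_tree.txt
tar -xvzf documents_tree.tgz   # extract in one step (no prior gunzip needed)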

Configure pandoc to extract media to different folder

I use pandoc to convert docx to markdown with the following:
pandoc -f docx -t markdown --extract-media="pandoc-output/$filename/" -o "pandoc-output/$filename/full.md" "$fullfile"
Which works OK. However, the media is stored in:
pandoc-output/$filename/media/
I want the media to be stored in
/pandoc-output/media/$filename/
Is this possible?
UPDATE
I ended up with a sed command to search and replace the offending lines together with a mv to the proper directory.
gsed -i -r "s/([a-zA-Z0-9_-]+)\/pandoc-output\/media\/([a-zA-Z0-9]+)/\/public\/media\/\1\/\2/" $ROOTDIR"$d"_"$filename.html.md"
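For reference, the whole workflow can be sketched roughly like this, assuming $filename and $fullfile are set as in the original command; the mv destination and the sed pattern are assumptions and may need adjusting to the exact paths pandoc writes:
#!/bin/bash
# convert, then relocate the extracted media and fix up the references
pandoc -f docx -t markdown \
  --extract-media="pandoc-output/$filename/" \
  -o "pandoc-output/$filename/full.md" "$fullfile"

# move pandoc-output/$filename/media/ to pandoc-output/media/$filename/
mkdir -p "pandoc-output/media/$filename"
mv "pandoc-output/$filename/media/"* "pandoc-output/media/$filename/"

# rewrite the image paths inside the generated markdown accordingly
gsed -i "s|$filename/media/|media/$filename/|g" "pandoc-output/$filename/full.md"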

How to show SPECIFIC hidden files on mac?

Such as by file extension?
For instance, I don't want to see .DS_STORE files, but I want to see all .htaccess files. Is there a way to do this?
Open a terminal and type
ls -a | grep -G '\.YOUR_EXTENSION$'
See http://www.robelle.com/smugbook/regexpr.html for more information on regular expressions.
Also, man grep and man ls will be of use.
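For the example in the question, a quick sketch (ls covers the current directory only; find searches the whole tree):
ls -a | grep '^\.htaccess$'    # shows .htaccess but not .DS_Store
find . -name '.htaccess'       # same, but recursive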

how to untar the file and rename the folder in one command line operation?

I want to download a file, untar it and rename the folder.
I am able to download the file and untar it with
curl https://s3.amazonaws.com/sampletest/sample.tar.gz | tar xz
How can I rename the folder in the same command?
curl https://s3.amazonaws.com/sampletest/sample.tar.gz | tar xz | mv ???????
I do not want to use the folder name explicitly in the command.
It's possible, but not trivial. It's easier to create your own directory, cd into it, then pass --strip-components 1 or --strip-path 1 to tar if your tar (e.g. GNU Tar) supports it.
File name transformations:
  --strip-components=NUMBER   strip NUMBER leading components from file names on extraction
  --transform=EXPRESSION, --xform=EXPRESSION
                              use sed replace EXPRESSION to transform file names
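As a concrete sketch of the create-your-own-directory approach (the directory name new-name is just an example; --strip-components requires GNU tar or bsdtar):
mkdir new-name
curl https://s3.amazonaws.com/sampletest/sample.tar.gz | tar -xzf - --strip-components=1 -C new-name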
If your system doesn't have GNU tar installed, it might still have pax (a POSIX tool) available. The latter supports the -s option, which allows arbitrary changes to the path names of the processed files.
That would then be:
curl https://s3.amazonaws.com/sampletest/sample.tar.gz | gunzip | pax -r -s "/old/new/"
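For example, if the archive's top-level directory happens to be called sample (a guess; adjust to the real name) and you want it extracted as new-name:
curl https://s3.amazonaws.com/sampletest/sample.tar.gz | gunzip | pax -r -s '/^sample/new-name/'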
